Step 1: SeekDeep
Data were processed using SeekDeep, written by Nick Hathaway. Here is a link to a step-by-step guide for installation.
We used Illumina sequencing data and wanted to run all three pipelines (extractor, qluster and processClusters), so we used the global function setupTarAmpAnalysis. The data files were prepared in accordance with the Initial Set-Up Guidelines for Paired End Reads with MIDs.
STEP 1: Run this before SeekDeep can work (it adds the SeekDeep binaries to the PATH):
export PATH=/home/software/SeekDeep-3.0.1/bin/:$PATH
STEP 2: Run the SeekDeep command
SeekDeep setupTarAmpAnalysis --samples sampleNamesP1.tab.txt --outDir output --inputDir . --idFile multipleGenePairs.id.txt --overlapStatusFnp overlapStatuses.txt --refSeqsDir markerRefSeqs/forSeekDeep/refSeqs/ --extraExtractorCmds="--barcodeErrors 0 --midWithinStart 8 --primerWithinStart 8 --checkRevComplementForMids true --checkRevComplementForPrimers true" --extraProcessClusterCmds="--fracCutOff 0.078 --sampleMinTotalReadCutOff 100"
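In the SeekDeep version we used, setupTarAmpAnalysis only stages the analysis: it builds the directory structure under --outDir and writes a wrapper script that runs the extractor, qluster and processClusters steps in turn. If your install behaves the same way, the analysis itself is then launched from inside the output directory with something like (thread count is illustrative):
cd output
./runAnalysis.sh 8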
This command was developed after testing a wide range of the parameters outlined here in SeekDeep (Hathaway et al. 2017). We used the controls in our pools to parameterise.
setupTarAmpAnalysis
Sequencing Files: --inputDir Input directory of raw data files; required (an illustrative input-directory listing follows this list). These files include:
- --fastq1 = Input sequence filename, only need 1, fastq first mate text file; required
- --fastq1gz = Input sequence filename, only need 1, fastq first mate gzipped file; required
- --fastq2 = Input sequence filename, only needed with --fastq1, fastq second mate text file; required
- --fastq2gz = Input sequence filename, only needed with --fastq1gz, fastq second mate gzipped file; required
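For illustration only (file names here are hypothetical; your run will have its own naming), the --inputDir for a paired-end gzipped run contains the R1/R2 files alongside the setup files referenced in the command above:
ls .
multipleGenePairs.id.txt  overlapStatuses.txt  sampleNamesP1.tab.txt
sample01_S1_L001_R1_001.fastq.gz  sample01_S1_L001_R2_001.fastq.gz
sample02_S2_L001_R1_001.fastq.gz  sample02_S2_L001_R2_001.fastq.gz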
ID Files:
- --idFile = ID file containing primer and possible additional MIDs; required.
- --samples = A file containing the sample names; columns should go target, sample, pcr1, pcr2 (optional), etc.; required (an illustrative example follows this list).
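As a sketch only, using hypothetical sample and replicate names and the column order described above, sampleNamesP1.tab.txt might look like this (tab-delimited; check the SeekDeep samples-file documentation for whether a header line is expected):
target    sample      pcr1             pcr2
dhps-1    sample01    sample01-PCR1    sample01-PCR2
dhps-1    sample02    sample02-PCR1    sample02-PCR2
dhfr      sample01    sample01-PCR1    sample01-PCR2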
Illumina Stitching: --overlapStatusFnp A file with two columns, target and status; required (an example file follows this list). The status column should contain 1 of 3 values (capitalization doesn't matter):
- r1BegOverR2End = target size < read length (causes read through)
- r1EndOverR2Beg = target size > read length but less than 2 x read length
- NoOverlap = target size > 2 x read length
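As an illustration, with marker names borrowed from later in this protocol and statuses chosen arbitrarily, overlapStatuses.txt is simply a two-column, tab-delimited table (target, then status):
dhps-1    r1BegOverR2End
dhps-2    r1BegOverR2End
dhfr      r1EndOverR2Beg
mdr1-2    NoOverlap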
Output: --outDir Directory to set up for analysis; required.
Extra Commands
Step 1: Set up the reference genomes:
export PATH=/home/software/samtools-1.15.1/bin:$PATH
export PATH=/home/software/bowtie2-2.4.5-linux-x86_64:$PATH
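Before running the genome step, it is worth confirming that the intended samtools and bowtie2 builds are the ones picked up from the PATH; these are standard shell checks, not SeekDeep commands:
command -v samtools bowtie2
samtools --version | head -n 1
bowtie2 --version | head -n 1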
SeekDeep genTargetInfoFromGenomes --gffDir pfgenomes/gff/ --genomeDir pfgenomes/genomes/ --primers GeneID.tab.txt --pairedEndLength 250 --dout markerRefSeqs --numThreads 19 --overWriteDir true --verbose
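The two setup commands in this protocol point --refSeqsDir at subdirectories of markerRefSeqs (markerRefSeqs/forSeekDeep/refSeqs/ and markerRefSeqs/markerFasta/), so a quick sanity check after genTargetInfoFromGenomes finishes is to confirm those paths were created:
ls markerRefSeqs/
ls markerRefSeqs/forSeekDeep/refSeqs/ | head
ls markerRefSeqs/markerFasta/ | head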
Step 2: Run the command
SeekDeep setupTarAmpAnalysis --samples sampleNamesP8.tab.txt --outDir Run-Pool-8-WRefs --inputDir . --idFile multipleGenePairs.id.txt --overlapStatusFnp overlapStatuses.txt --refSeqsDir markerRefSeqs/markerFasta/ --extraExtractorCmds="--midWithinStart 100 --primerWithinStart 100 --qualCheckCutOff .75 --qualCheckLevel 30 --checkRevComplementForMids true --checkRevComplementForPrimers true"
All of these commands and their uses are discussed in the SeekDeep extractor Tutorial. The aim of this pipeline is to extract sequences from various input types (fastq, fasta, sff, etc.) using primers and barcodes, plus some filtering.
Barcodes:
--barcodeErrors 0 = To further reduce any ambiguity and specifically select for NO errors.
--midWithinStart 8 = By default the primer or barcodes are searched for at the very beginning of the seq; use this flag to extend the search. Should be kept low to cut down on false positives; default=0;
--primerWithinStart 8 = By default the primer or barcodes are searched for at the very beginning of the seq; use this flag to extend the search. Should be kept low to cut down on false positives; default=0;
Roche MID 47, TGTGAGTAGT, occurs within the dhps-1 sequence, so output was being preferentially labelled as "MID47". Because of this, we discovered that our midWithinStart and primerWithinStart parameters could be adjusted. Reducing the parameter to 0 produced no data, because there are 7 bp at the start of each sequence preceding the MID. By changing the midWithinStart and primerWithinStart parameters to a smaller value, we can still use MID 47 for dhps-2.
Our optimal Goldilocks value was --midWithinStart 8 --primerWithinStart 8.
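A schematic of the read layout this implies (the ~7 bp spacer length comes from our data; everything other than the MID is a placeholder):
[ ~7 bp spacer ][ MID, e.g. TGTGAGTAGT ][ forward primer ][ amplicon ... ]
With --midWithinStart 8 / --primerWithinStart 8 the search window is just wide enough to cover the spacer, whereas a window of 0 misses the MID entirely and a much larger window starts matching MID-like sequence inside the dhps-1 amplicon.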
--checkRevComplementForMids true = Check the Reverse Complement of the Seqs As Well For MIDs
--checkRevComplementForPrimers true = Check the Reverse Complement of the Seqs As Well For Primers
These commands and uses are discussed in the SeekDeep qluster Tutorial. The qluster pipeline aims to cluster input sequences by collapsing on set differences between the sequences. We did not modify anything from the default parameters.
The SeekDeep processClusters Tutorial provides help for these commands and their uses. Here, we adjusted some filtering parameters to remove low-frequency variants based on preliminary analysis of mock mixed infections of controls.
--fracCutOff 0.078 = Final cluster Fraction Cut off; default=0.005;
Instead of specifying a hard read count cut-off (e.g., --clusterCutOff), we opted to remove haplotypes that were represented below a given within-sample frequency using --fracCutOff. Cut-offs used in previous studies include:
- 5% (Boyce et al. 2019; Wamae, Ndwiga, et al. 2022; Wamae, Kimenyi, et al. 2022)
- 3.5% (Hemming-Schroeder et al. 2020)
- 1% (Brazeau et al. 2019; Pringle et al. 2019; Osoti et al. 2022)
- 0.5% (Aydemir et al. 2018; Balasubramanian et al. 2022; Topazian et al. 2022)
For our data, a cut-off of 0.078 or 7.8% was used.
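To make this concrete with a hypothetical example: in a sample with 2,000 reads, --fracCutOff 0.078 removes any within-sample haplotype supported by fewer than 0.078 x 2,000 = 156 reads.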
--sampleMinTotalReadCutOff 100 = Sample Minimum Total Read Cut Off, if the total read count for the sample is below this it will be thrown out; default=250;
This parameter is new in SeekDeep v3.0.1, which was used for the final analysis of the drug resistance dataset in 2023/2024. We realised that there was a new default that removed isolates with fewer than 250 total reads (i.e., 250X coverage) in the selected clusters output file.
We then amended this based on a visualisation of the total read density per marker, which showed that for some markers, such as dhps-1/2, dhfr and mdr1-2, many isolates had only ~100X-250X coverage. We therefore set the threshold at 100X coverage.
Output Interpretation
extractionProfile.tab.txt: This breaks down the filtering per final extraction sequence file
extractionStats.tab.txt: This has info on how the whole extraction went
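A quick way to eyeball these tab-delimited tables from the command line (run from the directory that contains them; column and less are standard tools, not part of SeekDeep):
column -t -s$'\t' extractionStats.tab.txt | less -S
column -t -s$'\t' extractionProfile.tab.txt | less -S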
All the data from the analysis done by processClusters is located in the file: selectedClustersInfo.tab.txt.
Our main variables of interest are:
- s_Sample = The name of the sample the haplotype appears in
- h_popUID = The population ID given to the haplotype. These are named PopUID.XX, where XX is the population rank number, determined by the number of samples the haplotype appears in
- c_clusterID = The ID given to the cluster in the sample; the lower the number, the higher the relative abundance in the sample
- c_ReadCnt = The total number of reads associated with this cluster summed over all replicates
- c_Consensus = The consensus sequence for this cluster
- c_bestExpected = A string with the name of the closest reference sequence, used to compare against expected references
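As a minimal sketch (assuming selectedClustersInfo.tab.txt is tab-delimited with the column names above, and looking the column positions up from the header rather than hard-coding them), per-sample totals of c_ReadCnt, which is roughly what --sampleMinTotalReadCutOff acts on, can be tallied with:
awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) {if ($i=="s_Sample") s=i; if ($i=="c_ReadCnt") r=i}; next}
  {tot[$s] += $r}
  END {for (smp in tot) print smp "\t" tot[smp]}' selectedClustersInfo.tab.txt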