| seq_name | sequence |
|---|---|
| PF3D7_0709000 | MKFASKKNNQKNSSKNDERYRELDNLVQEGNGSRLGGGSCLGKCAHVFKLIFKEIKDNIFIYILSIIYLSVCVMNKIFAKRTLNKIGNYSFVTSETHNFICMIMFFIVYSLFGNKKGNSKERHRSFNLQFFAISMLDACSVILAFIGLTRTTGNIQSFVLQLSIPINMFFCFLILRYRYHLYNYLGAVIIVVTIALVEMKLSFETQEENSIIFNLVLISALIPVCFSNMTREIVFKKYKIDILRLNAMVSFFQLFTSCLILPVYTLPFLKQLHLPYNEIWTNIKNGFACLFLGRNTVVENCGLGMAKLCDDCDGAWKTFALFSFFNICDNLITSYIIDKFSTMTYTIVSCIQGPAIAIAYYFKFLAGDVVREPRLLDFVTLFGYLFGSIIYRVGNIILERKKMRNEENEDSEGELTNVDSIITQ* |
| 3D7.MID108.Pool1.0_f1 | ---------------------------------------!LGKCAHVFKLIFKEIKDNIFIYILSIIYLSVCVMNKIFAKRTLNKIGNYSF---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| SAMPLE1.MID1.Pool1.0_f1 | ---------------------------------------!LGKCAHVFKLIFKEIKDNIFIYILSIIYLSVCVMNKIFAKRTLNKIGNYSF---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| SAMPLE2.MID2.Pool1.0_f1 | ---------------------------------------!LGKCAHVFKLIFKEIKDNIFIYILSIIYLSVCVMNKIFAKRTLNKIGNYSF---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| SAMPLE3.MID3.Pool1.0_f0.7328 | ---------------------------------------!LGKCAHVFKLIFKEIKDNIFIYILSIIYLSVCVMNKIFAKRTLNKIGNYSF---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| SAMPLE3.MID3.Pool1.1_f0.2672 | ---------------------------------------!LGKCAHVFKLIFKEIKDNIFIYILSIIYLSVCVIETIFAKRTLNKIGNYSF---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Step 3: Processing in R
SeekDeep and MACSE Outputs:
The final outputs from MASCE (“_NT.fasta” or “_AA.fasta”) and SeekDeep (“selectedClustersInfo.tab.txt”) are as follows:
Here you have the sample name (seq_name) and the sequence (sequence).
The key variables of interest from the SeekDeep outputs include:
- The suffix in the s_SampleID column:
.0_f1: 0 = Major Haplotype, f1 = 100% of the isolate.0_f0.73280 = Major Haplotype, f0.7328 = 73.28% of the isolate.1_f0.26721 = Minor Haplotype, f0.2672 = 26.72% of the isolate
- h_popUID: Same numbers as the major and minor haplotypes denoted in that suffix
- c_ReadCnt: Read Count per amplicon/haplotype
Note, “Marker” was a user-generated column to denote which marker these data are attributed to e.g., crt.
| Marker | s_Sample | h_popUID | c_ReadCnt |
|---|---|---|---|
| crt | 3D7.MID108.Pool1.0_f1 | 0 | 2205 |
| crt | SAMPLE1.MID1.Pool1.0_f1 | 0 | 1772 |
| crt | SAMPLE2.MID2.Pool1.0_f1 | 0 | 4368 |
| crt | SAMPLE3.MID3.Pool1.0_f0.7328 | 0 | 3789 |
| crt | SAMPLE3.MID3.Pool1.1_f0.2672 | 1 | 2682 |
R Processing Guide
Our goal now is to interpret these results. The following code was used to process the data.
Create the following folders in a repository:
%%{init: { 'logLevel': 'debug', 'theme': 'base' } }%%
gitGraph
commit id: "01_SeekDeep_Output"
commit id: "02_MACSE_Output_AA"
commit id: "02_MACSE_Output_NT"
commit id: "03_DrugR_AA_Sequences"
commit id: "03_DrugR_NT_Sequences"
commit id: "03_DrugR_AA_Changes"
commit id: "03_DrugR_NT_Changes"
commit id: "04_Compiled_AA_Sequences"
commit id: "04_Compiled_NT_Sequences"
commit id: "05_All_DrugR_AA_Sequences"
commit id: "05_All_DrugR_NT_Sequences"
commit id: "05_All_DrugR_Haplotypes"
- 01_SeekDeep_Output: Contains
extractionStats.tab.txt,selectedClustersInfo.tab.txt,extractionProfile.tab.txtfiles for each pool and marker output from SeekDeep. This can be used at your own discretion to match processed sequences with read count data etc. - 02_MACSE_Output_AA: Contains all
_AA.fastafiles output from MACSE. - 02_MACSE_Output_NT: Contains all
_NT.fastafiles output from MACSE. - 03_DrugR_AA_Sequences: Output of processed amino acid files created from
Step_01_MACSE_to_R.qmd. - 03_DrugR_NT_Sequences: Output of processed nucleotide files created from
Step_01_MACSE_to_R.qmd. - 03_DrugR_AA_Changes: Output of processed non-synonymous changes files created from
Step_01_MACSE_to_R.qmd. - 03_DrugR_NT_Changes: Output of processed synonymous and non-synonymous changes files created from
Step_01_MACSE_to_R.qmd. - 04_Compiled_AA_Sequences: Output of compiling all amino acid files together created from
Step_01_MACSE_to_R.qmd. - 04_Compiled_NT_Sequences: Output of compiling all nucleotide files together created from
Step_01_MACSE_to_R.qmd. - 05_All_DrugR_AA_Sequences: Output of cleaned amino acid files created from
Step_02_R_Cleaning.qmd. - 05_All_DrugR_NT_Sequences: Output of cleaned nucleotide files created from
Step_02_R_Cleaning.qmd. - 05_All_DrugR_Haplotypes: Output of cleaned condensed haplotype files created from
Step_02_R_Cleaning.qmd.
Step 01 : MACSE to R
First we wish to process the MACSE output data with the suffix “_NT.fasta” and “_AA.fasta”. This involves:
- Aligning to the reference genome e.g., 3D7, Dd2 and HB3 to gain the position of each codon.
- For nucleotide processing: Ensure columns are created for each codon: each 3 characters, i.e., nucleotides.
- For amino acid processing: Ensure columns are created for each codon: each 1 character, i.e., amino acid.
INPUTS:
- “_NT.fasta” files: MACSE output
- “_AA.fasta” files: MACSE output
OUTPUTS:
“_NT_Aligned.csv”: Data frame of aligned sequences to the 3D7 reference genome where each codon represented by a unique column.
“_NT_Aligned_YorN.csv”: Data frame of whether the codon MATCHES YES (Y) or NO (N) the 3D7 reference.
“_NT_Aligned_NTchange.csv”: Data frame of when codons DID NOT MATCH 3D7 reference and what the nature of the change is.
“_AA_Aligned.csv”: Data frame of aligned sequences to the 3D7 reference genome where each codon represented by a unique column.
“_AA_Aligned_YorN.csv”: Data frame of whether the codon MATCHES YES (Y) or NO (N) the 3D7 reference.
“_AA_Aligned_NTchange.csv”: Data frame of when codons DID NOT MATCH 3D7 reference and what the nature of the change is.
After we have processed each .fasta file to a .csv file, we can (1, 2) merge all .csv files per marker for NT (nucleotide) sequences and AA (amino acid) sequences and (3) write the .csv file for these merged files.
Step 02: R Cleaning
The aim of this .qmd is to manipulate the processed MACSE files into files that contain variables of interest (e.g., Survey, Marker) and separate files for Controls, Repeats and Isolates. Further, this notebook will calculate the number of synonymous and non-synonymous mutations per isolate/sample and contain a column with all of the non-synonymous changes (from 3D7 reference).
Process Sequence Data
AIMS: - Remove symbols: *!- - Separate SeqID code into “Isolate”, “MID”, “MID_N”, “Pool”, “Date”, “Haplotype” - Create haplotype specific to marker
INPUTS:
- *All_NT_Aligned.csv Use as the sequence file and create haplotypes (e.g., CVMNT)
OUTPUTS:
- 05_All_DrugR_NT_Sequences/Controls/*Controls_Haps.csv
- 05_All_DrugR_NT_Sequences/Isolates/*Isolates_Haps.csv
- 05_All_DrugR_NT_Sequences/Repeats/*Repeats_Haps.csv
- 05_All_DrugR_Haplotypes/temp/*Haps_only.csv
The Controls_Haps, Isolates_Haps and Repeats_Haps files will contain the haplotype (Haps), the entire sequence separated by codon number and other info (SeqID, Reference and Marker). The Haps_only file will only contain the haplotype information (for the corresponding SeqID, Reference and Marker).
Process Codon Match Data
INPUTS:
- *All_NT_Aligned_YorN.csv: Use to determine how many NT changes from 3D7 reference (e.g., 1 change)
- *All_AA_Aligned_YorN.csv: Use to determine how many AA changes from 3D7 reference (e.g., 1 change)
OUTPUTS:
- *All_NT_Aligned_YorN_Count.csv: Collapsed into a column of the total number of NT differences
- *All_AA_Aligned_YorN_Count.csv: Collapsed into a column of the total number of AA differences
Process Codon Differences
INPUTS:
- *All_NT_Aligned_NTchange.csv: Use to determine the actual NT change from 3D7 reference (e.g., AAA76ACA)
- *All_AA_Aligned_AAchange.csv: Use to determine the actual AA change from 3D7 reference (e.g., K76T)
OUTPUTS:
- *All_NT_Aligned_NTchange_Final.csv: Contains a collapsed column with all NT changes
- *All_NT_Aligned_AAchange_Final.csv: Contains a collapsed column with all AA changes
Join Data
INPUTS:
All/temp/files Includes _Haps.csv, YorN_Counts.csv, change_Final.csv for both NT and AA (separated for each Fraction cut-off to analyse and marker)
OUTPUTS:
- Isolates_Final_Haps
- Controls_Final_Haps
- Repeats_Final_Haps
All input files are merged at similar columns “SeqID”, “Marker” and “Reference”
Step 03: Specific Changes: dhps-2 and mdr1-2
Both dhps-2 and mdr1-2 contain isolates that appear to have SAME genotype but are said to be in different clusters.
CHALLENGE: For dhps-2 there is a string of A’s that are variable. PROPOSED SOLUTION: Remove codons 520 and 520.1 (corresponding to nucleotides 50 to 75 in the sequence files). Also NOTE: MID 47 (TGTGAGTAGT, Roche) exists within the dhps-2 amplicon sequence. Our current SeekDeep parameters account for this by only searching for the MID in the first 8 nucleotides of the amplicon.
| Nucleotides 50 to 75 | n |
|---|---|
| TTAAAAAAAAAAAAAAACAAATTCT | 2 |
| TTAAAAAAAAAAAAAACAAATTCTA | 204 |
| TTAAAAAAAAAAAAACAAATTCTAT | 18 |
| TTAAAAAAAAAAAACAAATTCTATA | 1790 |
| TTAAAAAAAAAACAAATTCTATAGT | 964 |
| TTAAAAAAAAAAACAAATTCTATAG | 1 |
CHALLENGE: I found that this is because for mdr1-2, either the last one or last six nucleotide differ. PROPOSED SOLUTION: Remove codons 1077, 1078, 1079.
| Last 6 nucleotides | n |
|---|---|
| AAAATG | 1088 |
| AAAATT | 981 |
| TTTCTG | 550 |