Tapestri DNA Pipeline calls FLT3-ITDs in two ways:
- GATK: Uses GATK’s standard genotype calling algorithm. Detected ITDs are represented in the standard variant nomenclature in Tapestri Insights, e.g., FLT3:chr13:28608314:T/TCCCCGAT.
- Custom algorithm: Uses a second calling pipeline specifically targeting ITD variants inside the FLT3 locus. Detected ITDs are represented without the gene name to clearly distinguish them from GATK-called variants, e.g., chr13:28608314:T/TCCCCGAT.
Both algorithms run in parallel during data processing.
This describes how Mission Bio’s (MB) script works when looking for FLT3-ITD in .bam files.
The script does not perform or work on an mpileup, so it is not a genotype caller. The basis for the script is the information on misalignments after BWA-MEM.
In the presence of an insertion, the caller will try to map, as best as it can, to the locus. If the read goes through the insertion, it is likely that the genotype caller will find it and report it correctly. If the read doesn’t go through the insertion, the mapper will soft clip the read at the end. Genotype callers avoid soft-clipped regions at the end of reads because that might reflect the presence of adapters. The MB script takes advantage of soft-clipped regions and reports them as potential insertions.
How it works
The script is called specifically on the two amplicons that are present in the panel, targeting both exon 14 and exon 15. For each region, each .bam file corresponding to a cell is scanned for insertions and soft clipping. The total number of reads in the .bam file is based on the unique read names, which avoids the problem of having unpaired forward and reverse reads. If there are reads that have insertions/clipping, the script considers the starting position of the majority of the reads with the insertion as the possible insertion. It is possible one read out of many disagrees in the starting position because of an error.
If the total number of reads is greater than a cutoff (default: 10) and the number and ratio of non-REF reads are greater than a cutoff (default: 4 and 0.1), then the cell is considered as having a non-REF allele.
If the ratio of non-REF reads is greater than a cutoff (default: 0.9), the cell is HOM ALT, otherwise it is HET. If the cell has enough total reads but not enough non-REF reads, it is considered a HOM REF. Otherwise, it is reported as “no call.”
For each amplicon, a file is generated where the information of every cell is reported in a pseudo-vcf format for ease of use in the later stages of the pipeline.
The Filter field of cells that have possible insertions is set to PASS, the QUAL field of the same cells is set to 10000, and the genotype quality field to 100. If the Filter field does not have possible insertions, then the genotype quality field is set to ‘.’. The PL is set to 0 for the assumed genotype and 1 otherwise. These values are not probabilities, unlike the values reported by genotype callers but are there to combine the information into the .vcf file in subsequent steps.
The REF allele for insertions is set to ‘.’. This makes it easy to distinguish the output of the MB script from the output of genotype callers. Also, for insertions where the read doesn’t go through the insert, it might not be possible to know the real point of insertion, and therefore, the REF allele would be unknown.
For reporting purposes, each of the two files looking at possible insertions in every cell for each amplicon is reported to the customer. For visualization purposes, the original .vcf file is merged with the output from the MB script before producing the .loom file.
To accomplish this, the files produced in the previous step are turned into .vcf format. Every position and insertion is reported on a separate line. If a cell does not have the insertion reported, the number of REF reads for that cell is used to report it as HOM REF (if greater than the total minimum number of reads) or as no call. Also, if the cell has an insertion in the same position but the insertion is not the same as the one being reported, the same logic applies.
Only insertions with an allele count (AC) greater than 0 and allele frequency (AF) greater than 1/2000 are reported. This avoids reporting positions where errors are generating a false insertion.
The two .vcf-like files from this step (one for exon 14 and another for exon 15) are then merged. If the genotype caller and the MB script call an insertion in the same position, the information from the genotype caller is retained, and the information from the MB script is ignored. To do this, the .vcf file from the genotype caller needs to have one variant per line, reduced to its allelic primitive.