FASTQ files generated from the Illumina sequencer are processed using the Tapestri DNA Pipeline. Adapter sequences are trimmed from the sequenced reads. Reads that are too short are discarded, and the Tapestri barcode structures are extracted from the reads.
The reads are then mapped to the reference genome using BWA-MEM. The barcodes extracted from the mapped reads are error-corrected against a whitelist of known barcodes using a Hamming distance approach. Reads that lack an insert sequence between the gene-specific primers, or that map to off-target loci, are discarded. The barcodes are identified as cells using a three-step process.
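As an illustration of whitelist-based barcode correction with Hamming distance, here is a minimal Python sketch. It is not the Tapestri implementation: the function names, the maximum distance of 1, and the rule of rejecting ambiguous matches are assumptions made for the example.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed: str,
                    whitelist: Iterable[str],
                    max_distance: int = 1) -> Optional[str]:
    """Return the whitelisted barcode matching `observed`, or None.

    max_distance=1 is an assumed setting, not a documented pipeline value.
    """
    known = set(whitelist)
    if observed in known:
        return observed  # exact match, nothing to correct
    candidates = [
        bc for bc in known
        if len(bc) == len(observed) and hamming(observed, bc) <= max_distance
    ]
    # Accept only an unambiguous correction; otherwise reject the barcode.
    return candidates[0] if len(candidates) == 1 else None
```

With a whitelist of `ACGTACGT` and `TTTTCCCC`, the observed barcode `ACGTACGA` would be corrected to `ACGTACGT`, while a barcode equally close to two whitelist entries would be rejected.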
The cells are genotyped with the Genome Analysis Toolkit (GATK), following the GATK Best Practices recommendations, via a joint calling approach. Each cell is haplotyped in reference confidence mode, which provides per-bp confidence estimates that a site is strictly homozygous-reference. The per-bp resolution is maintained while merging the genomic VCFs (gVCFs) for all cells using GATK's CombineGVCFs tool.
Finally, joint genotyping is performed for all cells using GATK's GenotypeGVCFs tool. Loci found to be non-variant are retained in the final output. Genotyping parameters were optimized for high sensitivity: a maximum of 2 alternate alleles was reported for each site, the minimum base quality for variant calling was 10, and heterozygosity was 0.001.
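The three GATK steps described above could be assembled roughly as in the following Python sketch, which only builds the command lines. It assumes GATK4 option names. The file paths, the function names, the choice of `BP_RESOLUTION`, and the assignment of each parameter to a particular tool are assumptions for illustration, not the pipeline's documented invocation.

```python
from typing import List


def haplotype_caller_cmd(reference: str, bam: str, out_gvcf: str) -> List[str]:
    """Per-cell calling in reference confidence mode (assumed BP_RESOLUTION)."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference, "-I", bam, "-O", out_gvcf,
        "--emit-ref-confidence", "BP_RESOLUTION",
        "--min-base-quality-score", "10",  # minimum base quality from the text
    ]


def combine_gvcfs_cmd(reference: str, gvcfs: List[str], out_gvcf: str) -> List[str]:
    """Merge the per-cell gVCFs into one multi-sample gVCF."""
    cmd = ["gatk", "CombineGVCFs", "-R", reference, "-O", out_gvcf]
    for path in gvcfs:
        cmd += ["-V", path]
    return cmd


def genotype_gvcfs_cmd(reference: str, combined: str, out_vcf: str) -> List[str]:
    """Joint genotyping that keeps non-variant loci in the output."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference, "-V", combined, "-O", out_vcf,
        "--include-non-variant-sites",
        "--max-alternate-alleles", "2",
        "--heterozygosity", "0.001",
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a system where GATK is installed.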
Internal tandem duplications (ITDs) in the FLT3 gene are identified using a custom genotyping method. Some panels, e.g., Acute Myeloid Leukemia and Myeloid, have 2 amplicons targeting exon 14 and/or exon 15 of the FLT3 gene. We look for insertions in reads from the FLT3 amplicons and map them to the FLT3 locus.
Multiallelic variants are decomposed into biallelic variants and then normalized to ensure each VCF entry is left-aligned and parsimonious. Blacklisted loci are filtered out, and all loci whose QUAL exceeds the threshold of 1000 are tagged for downstream processing. The positions that pass our filtering criteria are called variants. The genotypes and cell matrix are converted into an open-source .loom format, which allows efficient storage, data retrieval, and sharing of large omics datasets.
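The decomposition and normalization step can be sketched in Python as follows. This is an illustrative version of the standard left-align-and-trim procedure, not the pipeline's own code. The function names are hypothetical, `genome` stands for the reference sequence of a single contig, positions are 1-based as in VCF, and only alleles are handled (per-cell genotype fields are ignored).

```python
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (1-based position, REF, ALT)


def decompose(pos: int, ref: str, alts: List[str]) -> List[Variant]:
    """Split one multiallelic record into one biallelic record per ALT."""
    return [(pos, ref, alt) for alt in alts]


def normalize(pos: int, ref: str, alt: str, genome: str) -> Variant:
    """Left-align a biallelic variant and trim it to its shortest form."""
    if not ref or not alt or ref == alt:
        raise ValueError("REF and ALT must be non-empty and different")
    # Trim shared trailing bases; when an allele becomes empty, extend both
    # alleles one base to the left using the reference sequence.
    while ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the contig start")
            pos -= 1
            base = genome[pos - 1]
            ref, alt = base + ref, base + alt
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For example, on the contig `GGGCACACACAGGG`, a `CA` deletion written at position 8 as `CAC` to `C` would be shifted to the leftmost equivalent position, and a SNP padded with matching flanking bases would be reduced to a single-base record.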