The scMRD pipeline identifies somatic variants by combining simple ad hoc filters, algorithms optimized for detecting rare variants, and germline variant filtering. The first step applies the ad hoc filters: a variant must have a variant allele frequency ≥ 35%, depth of coverage ≥ 10, and genotype quality ≥ 30, and it must be mutated in ≥ 3 cells. The next step applies two algorithms optimized for detecting rare variants. The single-variant detection method uses a statistical model and a reference set of control samples to estimate the per-variant error distribution and compute a p-value. The co-occurring variant detection method looks for variants that co-occur more often than expected by chance; it leverages the single-cell nature of the data to reduce the false positive rate.
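The first-step thresholds can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the `CellCall` record, the function names, and the interpretation that the allele frequency, depth, and genotype quality thresholds apply to each cell's call before counting mutated cells are all assumptions made for the example.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
MIN_VAF = 35.0    # variant allele frequency, in percent
MIN_DEPTH = 10    # depth of coverage
MIN_GQ = 30       # genotype quality
MIN_CELLS = 3     # minimum number of mutated cells


@dataclass(frozen=True)
class CellCall:
    """One variant call in one cell (hypothetical record layout)."""
    variant: str
    cell: str
    vaf: float
    depth: int
    gq: int


def passes_cell_thresholds(call):
    """True if a single-cell call meets the VAF, depth, and GQ thresholds."""
    return call.vaf >= MIN_VAF and call.depth >= MIN_DEPTH and call.gq >= MIN_GQ


def first_pass_variants(calls):
    """Return the variants mutated in at least MIN_CELLS passing cells."""
    cells_by_variant = {}
    for call in calls:
        if passes_cell_thresholds(call):
            cells_by_variant.setdefault(call.variant, set()).add(call.cell)
    return {v for v, cells in cells_by_variant.items() if len(cells) >= MIN_CELLS}
```

Collecting cells in a set per variant means a cell is counted once even if it appears more than once in the input.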
After the single and co-occurring variants are identified, the two result sets are combined and filtered further with additional ad hoc filters:
- Remove if near a homopolymer.
- Remove if synonymous.
- Remove if variants are within 300 base pairs of each other.
- For single-variant filtering, remove if mutated in < 6 cells.
- For heterozygous variants, remove if no homozygous cells are observed. Because of allele dropout (ADO), a heterozygous variant is expected to appear homozygous in a small percentage of cells.
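The filters listed above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the `Variant` record and its field names are hypothetical, "within 300 base pairs" is read as a distance of at most 300 on the same chromosome, every variant in such a cluster is removed, and proximity is checked against the full combined list before any other filter is applied.

```python
from dataclasses import dataclass

PROXIMITY_BP = 300      # variants this close to each other are removed
MIN_CELLS_SINGLE = 6    # minimum mutated cells for single-variant calls


@dataclass(frozen=True)
class Variant:
    """A combined candidate variant (hypothetical record layout)."""
    chrom: str
    pos: int
    source: str             # "single" or "co-occurring"
    n_mutated_cells: int
    n_hom_cells: int        # cells observed as homozygous for the variant
    heterozygous: bool
    near_homopolymer: bool
    synonymous: bool


def has_close_neighbor(v, variants):
    """True if another variant lies within PROXIMITY_BP on the same chromosome."""
    return any(
        o is not v and o.chrom == v.chrom and abs(o.pos - v.pos) <= PROXIMITY_BP
        for o in variants
    )


def keep(v, variants):
    """Apply the post-detection filters to one variant."""
    if v.near_homopolymer or v.synonymous:
        return False
    if has_close_neighbor(v, variants):
        return False
    if v.source == "single" and v.n_mutated_cells < MIN_CELLS_SINGLE:
        return False
    # ADO check: a true heterozygous variant should look homozygous in some cells.
    if v.heterozygous and v.n_hom_cells == 0:
        return False
    return True


def post_filter(variants):
    """Return the variants that survive all filters, in input order."""
    return [v for v in variants if keep(v, variants)]
```

The neighbor check is quadratic in the number of candidates, which is acceptable for the short candidate lists expected at this stage; sorting by position would be the natural change for larger inputs.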
The final step removes variants that are likely germline. This step matters when calling somatic variants, because a germline variant mistaken for a somatic one may falsely indicate the presence of cancer. The algorithm labels a variant as germline if either of these conditions is met:
- The variant is present in the gnomAD database with a population allele frequency ≥ 0.1%.
- The variant was labeled as germline in the user-provided VCF file.
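The two germline conditions can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary of gnomAD allele frequencies, and the set of user-labeled germline variants are hypothetical stand-ins for whatever lookups the pipeline actually performs against gnomAD and the user-provided VCF file.

```python
GNOMAD_AF_THRESHOLD = 0.001  # 0.1% population allele frequency


def is_germline(variant_id, gnomad_af, user_germline):
    """Label a variant as germline if either condition from the text holds.

    gnomad_af:     dict mapping variant ID to gnomAD allele frequency
                   (hypothetical; variants absent from gnomAD are not keys).
    user_germline: set of variant IDs labeled germline in the user's VCF
                   (hypothetical; assumed to be parsed beforehand).
    """
    af = gnomad_af.get(variant_id)
    common_in_gnomad = af is not None and af >= GNOMAD_AF_THRESHOLD
    return common_in_gnomad or variant_id in user_germline


def remove_germline(variant_ids, gnomad_af, user_germline):
    """Return the variants not labeled as germline, in input order."""
    return [v for v in variant_ids if not is_germline(v, gnomad_af, user_germline)]
```

Under this reading, a variant that is present in gnomAD below the 0.1% threshold is kept unless the user's VCF labels it germline.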