Clonal Insights Software Algorithm Overview

The single-cell Clonal Insights Software (CIS) runs in two stages: (1) the standard DNA+Protein pipeline, followed by (2) the tertiary pipeline for detection of rare somatic variants and clones. The processing steps for the DNA+Protein pipeline can be found here. The H5 file created by the DNA+Protein pipeline is used by the CIS to predict and report the somatic mutations, clonal architecture, and protein expression profile of a given sample.

Step 1: Demultiplex Samples

The CIS product and pipeline allow multiplexing of up to three patient samples in the same Tapestri run. The demultiplexing algorithm leverages germline SNP genotypes to label each cell according to the patient it originated from. The customer must provide known germline SNP genotypes for each patient in a multi-sample CSV file; these genotypes can be acquired via bulk sequencing or with the Tapestri Bulk NGS ancillary kit. The pipeline fails if fewer than 5 germline SNPs are provided and issues a warning if fewer than 10 are provided. The algorithm constructs a database of the germline SNPs, including the expected genotype profile for cell doublets. Each cell is then assigned to the best-matching patient or cell doublet using a likelihood-based method. The final two steps of demultiplexing remove poor-quality cells to reduce misassignments: (i) filter out cells where < 30% of the germline SNPs are genotyped, and (ii) filter out cells whose number of genotype discrepancies with the best-matched patient is ≥ 10, or whose percentage of discrepant genotypes is ≥ 30%.
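The thresholds in this step can be sketched in a few lines of Python. This is an illustrative sketch, not the pipeline's code: the function names, the genotype encoding (0/1/2 with None for a missing call), and the choice to compute the discrepancy percentage over genotyped SNPs only are all assumptions made for the example.

```python
from typing import Optional, Sequence

MIN_SNPS_ERROR = 5     # the pipeline fails below this many germline SNPs
MIN_SNPS_WARNING = 10  # the pipeline warns below this many germline SNPs


def check_snp_count(n_snps: int) -> Optional[str]:
    """Raise if too few SNPs were provided; return a warning message or None."""
    if n_snps < MIN_SNPS_ERROR:
        raise ValueError(f"{n_snps} germline SNPs provided; fewer than {MIN_SNPS_ERROR}")
    if n_snps < MIN_SNPS_WARNING:
        return f"warning: {n_snps} germline SNPs provided; fewer than {MIN_SNPS_WARNING}"
    return None


def passes_demux_qc(
    cell: Sequence[Optional[int]],   # hypothetical encoding: 0/1/2, None = not genotyped
    patient: Sequence[int],          # expected genotypes of the best-matched patient
    min_genotyped_fraction: float = 0.30,
    max_discrepancies: int = 10,
    max_discrepancy_fraction: float = 0.30,
) -> bool:
    """Apply the two poor-quality-cell filters to one cell."""
    called = [(c, p) for c, p in zip(cell, patient) if c is not None]
    # (i) drop cells where fewer than 30% of the germline SNPs are genotyped
    if not patient or len(called) / len(patient) < min_genotyped_fraction:
        return False
    mismatches = sum(1 for c, p in called if c != p)
    # (ii) drop cells with >= 10 discrepancies, or >= 30% discrepant genotypes
    # (assumption: the percentage is taken over genotyped SNPs only)
    if mismatches >= max_discrepancies:
        return False
    return mismatches / len(called) < max_discrepancy_fraction
```

For example, a cell genotyped at only 2 of 10 germline SNPs would be dropped by filter (i) regardless of how well those two genotypes match.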

Step 2: Find Somatic Variants

The software leverages a combination of optimized algorithms, ad hoc filters, and variant annotations to identify somatic variants.

Step 1: Single and Co-occurring Variant Detection

The first step applies simple ad hoc filters, such as requiring depth of coverage ≥ 10 reads, genotype quality ≥ 30, and being mutated in ≥ 3 cells. The next step applies two algorithms optimized for detecting rare variants. The single variant detection method utilizes a statistical model and a reference set of control samples to estimate the per-variant error distribution and a p-value. The co-occurring variant detection method looks for variants that co-occur in cells in a statistically significant way, which leverages the single-cell nature of the data to reduce false positive rates.
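The ad hoc pre-filter described above can be illustrated as follows. This sketch assumes that the depth and genotype quality thresholds are applied per cell and that only calls passing both count toward the three-cell minimum; the source does not spell this out, and the `CellCall` type and function name are hypothetical. The statistical single-variant and co-occurrence models are not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CellCall:
    """One variant call in one cell (hypothetical record layout)."""
    depth: int             # reads covering the variant position
    genotype_quality: int  # genotype quality score
    mutated: bool          # True if the cell carries the variant


def passes_prefilter(
    calls: Iterable[CellCall],
    min_depth: int = 10,
    min_gq: int = 30,
    min_mutated_cells: int = 3,
) -> bool:
    """Keep a variant only if enough well-supported cells carry it."""
    supporting = sum(
        1
        for c in calls
        if c.mutated and c.depth >= min_depth and c.genotype_quality >= min_gq
    )
    return supporting >= min_mutated_cells
```

Variants that pass this pre-filter would then go on to the single-variant and co-occurring variant detection methods.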

Note: If no background error rate file is provided, the minimum percentage of mutated cells required for a variant to be considered somatic is set to 1% for all variants.

Step 2: Similar Clonality Filter (only relevant for multiplexed samples)

If samples were multiplexed in a Tapestri run, the percentage of cells carrying each variant is calculated for each sample. Variants that passed all Step 1 filters are then processed with the following filter:

  1. Variants found in all samples with mutation rate (clonality) within 10% of each other (or 5 times the size of the smallest clone, whichever is smaller) will be filtered out for all samples.
  2. For any given sample, the baseline error rate is assumed to be min(10, 1% * largest clonality + lowest clonality). If the variant's clonality in that sample is smaller than this baseline error rate, the variant is filtered out in that sample. The 1% * largest clonality term accounts for incorrect demultiplexing.
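The two rules above can be written out as a small worked sketch. Several readings are assumptions here: clonalities are expressed in percent, "within" is taken as the spread between the largest and smallest clonality being at most the window, "found in all samples" is taken as a clonality above zero in every sample, and the largest and lowest clonality are passed in explicitly because the text does not define exactly which values they refer to. The function names are hypothetical.

```python
from typing import Sequence


def similar_across_samples(
    clonalities: Sequence[float],  # % of cells carrying the variant, one value per sample
    abs_window: float = 10.0,      # "within 10% of each other"
    rel_factor: float = 5.0,       # "5 times the size of the smallest clone"
) -> bool:
    """Rule 1: True if the variant looks equally clonal in every sample."""
    if len(clonalities) < 2 or min(clonalities) <= 0:
        return False  # assumption: the variant must be present in all samples
    spread = max(clonalities) - min(clonalities)
    window = min(abs_window, rel_factor * min(clonalities))  # whichever is smaller
    return spread <= window


def baseline_error(largest_clonality: float, lowest_clonality: float) -> float:
    """Rule 2 baseline: min(10, 1% * largest clonality + lowest clonality)."""
    return min(10.0, 0.01 * largest_clonality + lowest_clonality)


def filtered_in_sample(sample_clonality: float, baseline: float) -> bool:
    """Rule 2: drop the variant in a sample whose clonality is below the baseline."""
    return sample_clonality < baseline
```

With a largest clonality of 60% and a lowest clonality of 0.2%, the baseline works out to min(10, 0.6 + 0.2) = 0.8%, so a sample in which the variant appears in 0.5% of cells would have it filtered out under this reading.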

Step 3: Ad Hoc Filters

After the variants are identified in Step 2, the variants are filtered using the following ad hoc filters:

  • Remove if synonymous.
  • For co-occurring variants, remove if variants are within 300 base pairs of each other.
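A minimal sketch of these two filters follows. The `Variant` record, its `consequence` field, and the function names are hypothetical; the sketch also assumes that the distance rule only applies to variants on the same chromosome and that "within 300 base pairs" includes a distance of exactly 300.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record."""
    chrom: str
    pos: int          # genomic position
    consequence: str  # annotation label, e.g. "missense" or "synonymous"


def keep_variant(v: Variant) -> bool:
    """Remove synonymous variants."""
    return v.consequence != "synonymous"


def keep_cooccurring_pair(a: Variant, b: Variant, max_distance: int = 300) -> bool:
    """Remove co-occurring variants that lie within 300 bp of each other."""
    too_close = a.chrom == b.chrom and abs(a.pos - b.pos) <= max_distance
    return not too_close
```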

Step 4: Germline Variant Filters

The final step removes variants that are likely germline. The algorithm labels a variant as germline if either of these conditions is met:

  • The variant is present in the gnomAD database with a population allele frequency ≥ 0.01.
  • The variant was labeled as germline in the user-provided CSV file.
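The germline rule is a simple either/or check, sketched below. The function and parameter names are hypothetical, and a variant absent from gnomAD is represented here as None; how the pipeline actually looks up gnomAD frequencies is not described in this article.

```python
from typing import Optional


def is_likely_germline(
    gnomad_af: Optional[float],   # population allele frequency, None if not in gnomAD
    user_labeled_germline: bool,  # True if labeled germline in the user-provided CSV
    af_threshold: float = 0.01,
) -> bool:
    """Label a variant as germline if either condition holds."""
    common_in_population = gnomad_af is not None and gnomad_af >= af_threshold
    return common_in_population or user_labeled_germline
```

A variant with a gnomAD frequency of 0.0004 and no user label would therefore be kept as a somatic candidate, while one at 0.02 would be removed.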

Step 3: Phylogeny Construction

Phylogeny construction uses COMPASS, an open-source third-party tool optimized using clinical AML samples from Tapestri. This algorithm uses DNA information, such as SNVs, to determine which cells are more similar to each other, while accounting for errors such as allele dropout (ADO) and for variability in sequencing coverage across the DNA panel. The algorithm further uses scoring algorithms that maximize the biological plausibility of the resulting phylogenetic tree and clonal architecture. The end result is a clustering of cells into clones that share the same zygosity for the variants.

Step 4: Reporting 

The final step in the scMRD pipeline constructs per-patient HTML reports that visualize the somatic alterations, clonal architecture, protein signatures, and quality control (QC) plots for the sample. The summary page visualizes the clonal architecture, somatic mutation zygosity per clone, protein expression per clone, and two tables providing detailed information and annotation for each clone and mutation. Additional advanced data visualizations are included, such as a phylogenetic tree, protein UMAP, protein co-expression plot, and QC plots for demultiplexing. 

Time Course Analysis

The H5 files from the single-sample analyses are used for time course analysis. Because each H5 file already contains the somatic variants, Step 2 is skipped. The samples are merged, and then Step 3 and Step 4 are rerun to create the reports. For time course analysis, additional visualizations are included, such as a fishplot and a violin plot that show changes over time in clone size and protein expression, respectively.
