Pipeline v3 - Correlation UMAP
We use the read amplification profile to identify two populations: singlets and non-cell barcodes. Non-cell barcodes have similar profiles and form dense clusters on the UMAP and correlation coverage plots. Because their read profile is bulk-like, they have a higher correlation coefficient than singlets. These characteristics help differentiate invalid-barcode clusters from singlet clusters.
The barcodes are classified using the following process:
- Identify candidate barcodes: any barcode whose total read count is greater than 1.25 times the number of amplicons in the panel and that has more than 1 read in 40% of the amplicons.
- Perform a UMAP on the normalized read counts matrix merged with the correlation coverage matrix of the candidate barcodes. Run flat-HDBSCAN clustering on the resulting embedding to generate a predefined number of clusters.
- Classify the clusters based on the correlation coverage values. The clusters are scored on how well they are separated on the correlation coverage plot, whether enough cells are assigned to them, and the relative position of the non-cell barcode cluster and the cell cluster on the correlation coverage plot.
- The clustering with the highest score is selected and its clusters are classified as “valid-cell” or “invalid-barcode” (non-cell barcodes). The unassigned barcodes are recovered based on their proximity to the identified clusters.
- If only 1 cluster is identified and there are fewer than 5000 candidate barcodes, all candidate barcodes are called cells. Otherwise, if the clusters cannot be classified, the completeness method from the v2 pipeline is used.
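The candidate filter and the fallback rule can be sketched in a few lines of Python. This is an illustrative sketch, not the pipeline's implementation: the function names `is_candidate_barcode` and `choose_calling_method`, their parameters, and the returned labels are hypothetical. It also assumes that "in 40% of the amplicons" means at least 40%, and that a single cluster with 5000 or more candidates counts as a clustering that cannot be classified.

```python
def is_candidate_barcode(amplicon_reads, reads_factor=1.25, min_fraction=0.40):
    """Return True if one barcode passes both candidate criteria.

    amplicon_reads: list of read counts, one entry per amplicon in the panel.
    """
    n_amplicons = len(amplicon_reads)
    # Criterion 1: total reads greater than 1.25 x the number of amplicons.
    total_ok = sum(amplicon_reads) > reads_factor * n_amplicons
    # Criterion 2: more than 1 read in 40% of the amplicons
    # (assumption: interpreted here as "at least 40%").
    covered = sum(1 for reads in amplicon_reads if reads > 1)
    coverage_ok = covered >= min_fraction * n_amplicons
    return total_ok and coverage_ok


def choose_calling_method(n_clusters, n_candidates, clusters_classified,
                          single_cluster_limit=5000):
    """Pick the calling route described in the last step of the list.

    clusters_classified: whether the scored clustering could be classified
    (hypothetical flag; the scoring itself is not shown here).
    """
    if n_clusters == 1 and n_candidates < single_cluster_limit:
        return "all-candidates-are-cells"
    if not clusters_classified:
        return "v2-completeness"
    return "v3-cluster-classification"


# A 5-amplicon toy panel: the total-reads cutoff is 1.25 * 5 = 6.25.
print(is_candidate_barcode([3, 3, 2, 0, 0]))   # 8 reads, 3 of 5 amplicons covered
print(is_candidate_barcode([10, 0, 0, 0, 0]))  # enough reads, but only 1 amplicon covered
print(choose_calling_method(1, 4000, False))
```

The UMAP and flat-HDBSCAN steps are left out on purpose, since their parameters are not given in this description.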
Pipeline v2 - Completeness Algorithm
Reads that lack an insert sequence between gene-specific primers or that map to off-target loci are discarded. We use amplicon read completeness in each barcode to call cells from barcodes, and we require barcodes to have at least 80% data completeness for the working amplicons. This method has been effective at improving the data completeness in called cells and reducing the noise from bad barcodes.
The barcodes are identified as cells using a three-step process:
- We select the barcodes that pass a total reads cutoff, which is defined as 8 reads times the number of amplicons in the panel.
- We calculate the panel performance threshold as 0.2 times the mean of all amplicon reads for all qualified barcodes, and we identify good-performing amplicons as those that pass this threshold.
- We identify the barcodes that have at least 80% of these good-performing amplicons. These barcodes are then called cells.
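The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's code: the function name `call_cells_by_completeness` and its parameters are hypothetical. It assumes that "pass" means greater than or equal to, that an amplicon is good-performing when its mean read count across qualified barcodes passes the threshold, and that a barcode "has" an amplicon when its own read count for that amplicon passes the same threshold.

```python
def call_cells_by_completeness(reads, reads_per_amplicon=8,
                               performance_factor=0.2, min_completeness=0.80):
    """Return the barcodes called as cells.

    reads: dict mapping barcode -> list of per-amplicon read counts,
           with amplicons in the same order for every barcode.
    """
    if not reads:
        return []
    n_amplicons = len(next(iter(reads.values())))

    # Step 1: keep barcodes whose total reads pass 8 x number of amplicons.
    cutoff = reads_per_amplicon * n_amplicons
    qualified = {bc: r for bc, r in reads.items() if sum(r) >= cutoff}
    if not qualified:
        return []

    # Step 2: threshold = 0.2 x mean of all amplicon reads over qualified barcodes.
    all_counts = [count for r in qualified.values() for count in r]
    threshold = performance_factor * sum(all_counts) / len(all_counts)
    # Assumption: an amplicon is good-performing if its mean passes the threshold.
    good = [
        i for i in range(n_amplicons)
        if sum(r[i] for r in qualified.values()) / len(qualified) >= threshold
    ]
    if not good:
        return []

    # Step 3: call barcodes that have at least 80% of the good amplicons.
    # Assumption: "has" means the barcode's reads for that amplicon pass the threshold.
    cells = []
    for bc, r in qualified.items():
        present = sum(1 for i in good if r[i] >= threshold)
        if present >= min_completeness * len(good):
            cells.append(bc)
    return cells


toy_reads = {
    "A": [20, 20, 20, 20, 0],  # qualified, complete on the good amplicons
    "B": [30, 30, 30, 30, 0],  # qualified, complete on the good amplicons
    "C": [50, 50, 0, 0, 0],    # qualified, but only half of the good amplicons
    "D": [1, 1, 1, 1, 1],      # fails the total reads cutoff (5 < 40)
}
print(call_cells_by_completeness(toy_reads))
```

In this toy panel the fifth amplicon has no reads in any qualified barcode, so it is excluded from the completeness calculation rather than counted against every barcode.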
Cell Calling in the DNA + Protein Pipeline
The Protein pipeline retains all barcodes with at least 1 valid protein read. After the DNA and Protein portions of the pipeline complete, only the barcodes common to both the DNA and Protein assays are retained for the final merged DNA+Protein data.
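The merge amounts to a set intersection, which can be sketched as below. The function name `merge_dna_protein_barcodes` and the input shapes (a list of DNA-called cell barcodes and a dict of valid protein read counts per barcode) are assumptions made for illustration.

```python
def merge_dna_protein_barcodes(dna_cells, protein_read_counts, min_protein_reads=1):
    """Return the DNA-called barcodes that are also retained by the protein side.

    dna_cells: list of barcodes called as cells by the DNA pipeline.
    protein_read_counts: dict mapping barcode -> number of valid protein reads.
    """
    # Protein side: keep every barcode with at least 1 valid protein read.
    protein_barcodes = {
        bc for bc, n in protein_read_counts.items() if n >= min_protein_reads
    }
    # Merged data: barcodes present in both assays, in DNA call order.
    return [bc for bc in dna_cells if bc in protein_barcodes]


dna_cells = ["AAAC", "AAAG", "AAAT"]
protein_read_counts = {"AAAC": 12, "AAAG": 0, "CCCC": 5}
print(merge_dna_protein_barcodes(dna_cells, protein_read_counts))
```

Here "AAAG" is dropped because it has no valid protein reads, "AAAT" because it is absent from the protein data, and "CCCC" because it was not called as a cell on the DNA side.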