DNA Pipeline output files overview
Tapestri DNA outputs these files:
- allele.drop.out.report.txt
This text file lists the positions used to calculate the allele dropout (ADO) rate and the minimum requirements needed for the calculation. It displays the number of valid ADO alleles and the average percentage of cells with ADO. Information for each of the ADO alleles that were not used for the calculation is also listed along with the reason they were not included.
If the panel does not have ADO amplicons, then the ADO calculation is not done, and the file shows, “No file found for ADO calculation.”
- barcode.cell.coverage.tsv
This file lists each amplicon and its mean number of reads per cell. For each amplicon, it also lists whether the amplicon passed the threshold. The threshold is 0.2 times the mean of the per-amplicon mean reads, i.e., the mean of column B * 0.2. If the mean reads for an amplicon is above the threshold, the amplicon passed and has a value of TRUE. If it is below the threshold, the value is FALSE.
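The threshold rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's own code: the function name, the amplicon names, and the read values are hypothetical, and treating a value exactly equal to the threshold as FALSE is an assumption, since the description only covers values above or below it.

```python
from statistics import mean


def flag_amplicons(mean_reads_per_cell, factor=0.2):
    """Return (threshold, {amplicon: passed}) from per-amplicon mean reads per cell."""
    # Threshold = factor * mean of the per-amplicon means (column B in the file).
    threshold = factor * mean(mean_reads_per_cell.values())
    # Assumption: an amplicon exactly at the threshold is reported as not passing.
    passed = {amp: reads > threshold for amp, reads in mean_reads_per_cell.items()}
    return threshold, passed


# Hypothetical amplicon names and values, for illustration only.
example = {"AMP_1": 50.0, "AMP_2": 8.0, "AMP_3": 92.0}
threshold, passed = flag_amplicons(example)
print(threshold, passed)
```

With these example values the mean of the column is 50, so the threshold is 10 and only AMP_2 falls below it.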
- barcode.cell.distribution.merged.tsv
A file that reports the number of forward reads assigned to each amplicon for each cell found for all tubes in a run. For more information, see this file.
- barcode.cell.distribution.tsv
A file that reports the number of forward reads assigned to each amplicon for each cell found. This file is index-specific, meaning for only one tube. For example, rg0.barcode.cell.coverage.tsv contains the information for tube 1 and rg1.barcode.cell.coverage.tsv for tube 2.
The rows list the barcodes and the columns the amplicons. The number of barcodes is the number of rows minus one, which accounts for the header row.
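As a rough sketch of how this layout can be read, the snippet below counts barcodes (rows minus the header) and sums the reads per barcode. The sample text, the header label of the first column, and the barcode and amplicon names are all hypothetical; only the rows-are-barcodes, columns-are-amplicons layout comes from the description above.

```python
import csv
import io


def summarize_distribution(tsv_text):
    """Return (number_of_barcodes, {barcode: total_reads}) for a barcode-by-amplicon table."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    data_rows = rows[1:]  # the first row is the header, so it is not a barcode
    # Assumption: the first column holds the barcode, the rest hold read counts.
    totals = {row[0]: sum(int(value) for value in row[1:]) for row in data_rows}
    return len(data_rows), totals


# Hypothetical file content, for illustration only.
sample = "cell_barcode\tAMP_1\tAMP_2\nAACC-1\t10\t5\nGGTT-1\t0\t7\n"
n_barcodes, totals = summarize_distribution(sample)
print(n_barcodes, totals)
```

For a real file, the same function could be called on the text returned by reading the .tsv file from disk.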
The barcode.cell.distribution.merged.tsv contains all the information from all tubes in a run.
- barcode.cellfinder.distribution.tsv
Similar to the barcode.cell.distribution.tsv file, except that the barcodes (rows) are not restricted to barcodes associated with cells. They include all error-corrected barcodes that are identified by the cellfinder and listed in the cellfinder plot (i.e., all barcodes displayed on the x-axis in the cellfinder plot).
- barcode.cumsum.txt
This text file lists the total cumulative reads for all barcodes to be used for creating the log-log plot in the CellFinder_report.pdf.
- CellFinder_report.pdf
This file contains three plots, which visualize barcode distribution and include details on cell uniformity and amplicon distribution. It is index-specific, meaning for only one tube. For more information, see this file.
- cells.loom
This file is an input file for Tapestri Insights. LOOM is an efficient file format for very large omics datasets, consisting of a main matrix, optional layers, and a variable number of row and column annotations. For more information, see this file.
- cells.txt
Lists all the barcodes called as cells by cellfinder.
- cells.vcf.gz
This compressed annotated .vcf file conforms to the standard GATK format. It contains all the variants for all the barcodes called as cells.
- DIR.info
A text file listing the parameters Tapestri DNA Pipeline used, including minimum values and quality thresholds, GATK version, and directory locations.
- DIR.yaml
A text file listing the parameters for the genome, the directory location, and dependencies for the Tapestri DNA Pipeline run.
- flagstat.txt
This text file shows the mapping statistics from the reference genome alignment step, listing the number of QC-passed reads and QC-failed reads for items like read1, read2, properly paired, and mapped. This file is index-specific, meaning for only one tube. For more information on the flagstat format, see this description.
- flt3.itd.report.txt
This report details the FLT3-ITD variants for each cell. The columns contain the following information, which follows the VCF specification:
- Cell barcode with the index – If all 8 tubes are pooled, then all the barcodes have -1 appended. If there are multiple tubes, each barcode has -1 or -2 appended to represent the appropriate tube.
- #CHROM – Chromosome number
- POS – Position in the genome
- ID – Unique identifier
- REF allele – If there is a • in this column, it means that there is an insertion.
- ALT allele – If the previous column has a •, then this is what is to be inserted. If the previous column was not a •, then this is the indel.
- QUAL – Phred-scaled quality score. For FLT3-ITD calls, these cells are set to a specific value instead of being calculated. Refer to this FAQ for more information.
- FILTER – Filter status
- INFO – Additional information
- The 10th column contains the header information for the next column. The acronyms follow the VCF protocol. GT:AD:DP:GQ:PL
- GT – Genotype: 0/0 means the cell is homozygous reference, 0/1 means the cell is heterozygous, and 1/1 means the cell is homozygous alternative
- AD – Unfiltered allele depth
- DP – Filtered depth
- GQ – Quality of the assigned genotype
- PL – Normalized Phred-scaled likelihood of possible genotypes
- The final column contains the values for the previous column, separated by colons
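To make the relationship between the last two columns concrete, here is a minimal Python sketch that pairs the GT:AD:DP:GQ:PL keys with a colon-separated value string. The function name and the example values are hypothetical; the only facts relied on are the colon-separated layout described above and the VCF convention that AD and PL hold comma-separated integers.

```python
GENOTYPE_LABELS = {
    "0/0": "homozygous reference",
    "0/1": "heterozygous",
    "1/1": "homozygous alternative",
}


def parse_sample(format_keys, sample_values):
    """Pair FORMAT keys (e.g. GT:AD:DP:GQ:PL) with the values from the final column."""
    keys = format_keys.split(":")
    values = sample_values.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT keys and sample values differ in length")
    record = dict(zip(keys, values))
    # AD and PL are comma-separated integer lists; "." marks a missing value.
    for key in ("AD", "PL"):
        if key in record and record[key] != ".":
            record[key] = [int(x) for x in record[key].split(",")]
    return record


# Hypothetical values, for illustration only.
record = parse_sample("GT:AD:DP:GQ:PL", "0/1:12,9:21:99:255,0,255")
label = GENOTYPE_LABELS.get(record["GT"], "other")
print(record, label)
```

Other fields are left as strings here; a fuller parser would also convert DP and GQ and handle missing entries inside a list.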
- gatk_progress.csv
This file shows the GATK progress throughout the run. It displays Done when complete.
- index.cells.bam
This file lists a read group (RG) tag for each read, which can be used to find the number of reads for a barcode. It is index-specific, meaning for only one tube.
A .bam file is a binary version of a sequence alignment map (SAM). For more information on these formats, refer to this PDF. To visualize these files, use tools like IGV.
- index.mapped.bam
This file is generated after mapping to the reference genome and selecting high-quality primary alignments with a mapping quality score of > 30. It is generated after alignment but before barcode correction or cell calling. It maps all the barcodes and is index-specific, meaning for only one tube.
A .bam file is a binary version of a sequence alignment map (SAM). For more information on these formats, refer to this PDF. To visualize these files, use tools like IGV.
- mapped.target.count.txt
This file lists the total reads mapped to each barcode.
- metadata.csv
This comma-separated file lists the following data for each tube in a run:
- Tube prefix
- Total reads
- Reads trimmed
- Reads mapped to reference genome
- Reads mapped to insert
- Cells called
- Mapping error rate
- Reads mapped to cell
- Chemistry version
- Flagstat file name
The data for each tube is on a separate row.
- part2_progress.csv
This file shows the progress through Part2 of the run and lists all the steps that were successfully executed.
- pipeline_input.json
A text file mapping the tubes and lanes.
- pipeline_output.json
A text file that breaks down metrics by the tube and for the run.
For each tube, it displays:
- Total read pairs
- Percent of read pairs for valid cells
- Number of read pairs for valid cells
- Number of cells found
- Average reads per cell
- Panel uniformity
For the run, it displays:
- The information above consolidated for all tubes
- Percentage dropout rate
- Chemistry version
- run.options
A text file listing the parameters used by DNA Pipeline, such as fastq.gz file names; minimums and maximums for FLT3, ALT alleles, ADO, DP, GQ, AF, and LOD; whitelist; and adapters.
- summary.txt
This file reports the sequencing metrics. For each tube of the run, it lists the following information:
- Total number of read pairs
- Percentage of reads trimmed
- Percentage of reads discarded
- Percentage of reads mapped to genome
- Percentage of reads mapped to insert
- Percentage of error rate
- Number of cells found
- Percentage of reads assigned to cells
- tapestri_log.txt
A detailed report on the individual steps performed by Tapestri DNA Pipeline. It contains information from summary.txt and flagstat.txt, including:
- Number of read pairs, read pairs used as BWA input, pairs discarded
- Reads mapped, % reads mapped, reads mapped to insert, % reads mapped to insert
- Number of cells found, number of reads assigned to cells, % of reads assigned to cells
- Error rate, as calculated by samtools, which is the ratio of mismatches to bases mapped (cigar)
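The error rate in the last item is a simple ratio, which can be written out as follows. The function name and the input numbers are hypothetical and only illustrate the arithmetic; they are not taken from a real log.

```python
def error_rate(mismatches, bases_mapped_cigar):
    """Error rate = mismatches / bases mapped (cigar)."""
    if bases_mapped_cigar <= 0:
        raise ValueError("bases mapped (cigar) must be positive")
    return mismatches / bases_mapped_cigar


# Hypothetical counts, for illustration only.
rate = error_rate(mismatches=1500, bases_mapped_cigar=1_000_000)
print(f"{rate:.4f} ({rate * 100:.2f}%)")
```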
- vcf_header.txt
This file contains all the barcodes that are in the .vcf file and in the same order. This is a single line but wraps into column format. The barcodes are for all tubes in a run. If each barcode has an appended -1, then the 8 tubes were pooled. If -1 and -2 are appended, then that represents separate tubes in the same run.
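The suffix convention above can be checked with a short Python sketch that counts barcodes per appended index. It assumes the barcodes on the single line are separated by whitespace, which is not stated in the description, and the barcode strings shown are hypothetical.

```python
from collections import Counter


def count_by_suffix(header_line):
    """Count barcodes per appended index (the part after the last hyphen)."""
    counts = Counter()
    # Assumption: barcodes on the line are separated by whitespace.
    for barcode in header_line.split():
        _, sep, suffix = barcode.rpartition("-")
        counts[suffix if sep else "unknown"] += 1
    return counts


# Hypothetical barcodes, for illustration only.
counts = count_by_suffix("AACCGGTT-1 GGTTAACC-1 CCAAGGTT-2")
print(counts)
```

A result containing only the suffix 1 would be consistent with pooled tubes, while both 1 and 2 would indicate separate tubes in the same run.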