Genome Editing Pipeline Output Files Overview

The Genome Editing (GE) Pipeline creates the following output files for GE DNA-only and GE DNA+Protein runs, covering both knock-out (KO) and base editing (BE) reports.

 

Pipeline Files

  • SAMPLE_NAME_raw_barcodes.txt (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This file is generated after barcode correction but before alignment. It contains the total pre-alignment reads for each barcode identified during barcode correction. It is a two-column, tab-separated file: the first column is the corrected barcode and the second column is the number of reads for that barcode.

  • SAMPLE_NAME_cells.txt (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This file is generated after cell calling. It contains a list of all barcodes which were identified as cells. It is a single column file and contains one barcode per line.
  • SAMPLE_NAME.ge.h5 (GE-KO, GE-BE) / SAMPLE_NAME.ge_protein.h5 (GE+Protein-KO, GE+Protein-BE) This is a multi-omics file format that contains the data for all the analytes in a single file. For more details, read this article.
  • SAMPLE_NAME_cell_barcode_distribution.tsv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A file that reports the number of forward reads assigned to each amplicon for each cell found in a run. For more information, see this file.
  • SAMPLE_NAME_all_barcode_distribution.tsv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This file reports the number of forward reads assigned to each amplicon for each barcode found in the run. 
  • SAMPLE_NAME.primers_window.tab (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This file contains the amplicon coordinates (from the .designSummary.tab file) that overlap with target coordinates (from the .submitted file). These coordinates are used for haplotype construction and for downstream steps such as report generation.
  • SAMPLE_NAME.primers.tab (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This file contains amplicon coordinates (from the .designSummary.tab file).
  • SAMPLE_NAME.amplicon_groups.tab (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This file records which amplicons are grouped together before read tags are added to the SAMPLE_NAME.primers.bam file. Consecutive amplicons that are within 10 kb of each other are grouped into a single amplicon group.
  • SAMPLE_NAME.report.html (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This .html report is generated for each run. It summarizes the run details in the form of various metrics and plots that can be used to understand the performance. For more information about the report, please refer to the GE User Guide.
  • SAMPLE_NAME_primers.bam (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This file lists a read group (RG) tag for each read, which can be used to find the number of reads for the barcode. Additionally it contains the following custom tags for each read:

 

Read tag | Description | Example
rp | The SAM tag to use for the current record's assigned primer genomic coordinates | chr1:26644621-26644641
mp | The SAM tag to use for the current record's mate's assigned primer genomic coordinates | chr1:26644531-26644551
ra | The SAM tag to use for the current record's assigned primer identifier | CO123_AMP1
ma | The SAM tag to use for the current record's mate's assigned primer identifier | CO123_AMP2

A .bam file is a binary version of a sequence alignment map (SAM). For more information on these formats, refer to this PDF. To visualize these files, use tools like IGV.
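The custom read tags can be inspected without special tooling once the records are in SAM text form (for example, the output of samtools view). The sketch below is a minimal illustration, not part of the pipeline: it assumes the tags are stored as string-typed (Z) optional fields, and the record itself, including the RG value, is invented for the example.

```python
def parse_optional_tags(sam_line):
    """Return {tag: value} for the optional fields of one SAM text record."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 SAM columns are mandatory
        # Optional fields look like TAG:TYPE:VALUE; the value may contain ':'
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def parse_region(region):
    """Split a 'chr1:100-200' style string into (chrom, start, end)."""
    chrom, span = region.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)


# Hypothetical record: every value here is a placeholder, and the Z type
# of the custom tags is an assumption rather than documented behavior.
example = "\t".join([
    "read1", "83", "chr1", "26644621", "60", "20M", "=", "26644531", "-110",
    "ACGTACGTACGTACGTACGT", "IIIIIIIIIIIIIIIIIIII",
    "RG:Z:AACAACCTAACGTACGAA",
    "rp:Z:chr1:26644621-26644641",
    "mp:Z:chr1:26644531-26644551",
    "ra:Z:CO123_AMP1",
    "ma:Z:CO123_AMP2",
])

tags = parse_optional_tags(example)
print(tags["ra"], parse_region(tags["rp"]))
```

Splitting with a maximum of two separators matters here because the coordinate values themselves contain a colon.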

  • SAMPLE_NAME_primers.bai (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This is the .bam index file used by the IGV tool to view the alignments.
  • SAMPLE_NAME.aligned.bam (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This file is generated after barcode correction and alignment to the reference genome, but before cell calling, so it contains alignments for all the barcodes in the run. A .bam file is a binary version of a sequence alignment map (SAM). For more information on these formats, refer to this PDF. To visualize these files, use tools like IGV.
  • SAMPLE_NAME.pipeline_metrics.json (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) This file contains several basic run metrics in .json format. The format is machine-readable, which makes it easy to run additional analysis on the metrics.
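As an illustration of how the two barcode files described above fit together, the following sketch reads the two-column SAMPLE_NAME_raw_barcodes.txt layout and the one-barcode-per-line SAMPLE_NAME_cells.txt layout, then computes the share of pre-alignment reads that belong to called cells. It assumes that neither file has a header row and that barcodes are written identically in both files; the in-memory data is made up for the example.

```python
import csv
import io


def read_barcode_counts(handle):
    """Parse tab-separated lines of: corrected barcode, read count."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        counts[row[0]] = int(row[1])
    return counts


def read_cell_barcodes(handle):
    """Parse a single-column file with one cell barcode per line."""
    return {line.strip() for line in handle if line.strip()}


def fraction_of_reads_in_cells(counts, cells):
    """Share of all pre-alignment reads whose barcode was called a cell."""
    total = sum(counts.values())
    in_cells = sum(n for barcode, n in counts.items() if barcode in cells)
    return in_cells / total if total else 0.0


# Made-up stand-ins for the two files; real barcodes are longer.
raw_file = io.StringIO("AAA\t100\nCCC\t50\nGGG\t50\n")
cells_file = io.StringIO("AAA\nGGG\n")

counts = read_barcode_counts(raw_file)
called = read_cell_barcodes(cells_file)
print(fraction_of_reads_in_cells(counts, called))
```

Because the counts are taken before alignment, this figure is not expected to match any post-alignment metric in the report.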

Report Files

CSV

  • top_on_target_alleles.csv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A CSV file containing the data used to create the “Top on-target variants” table from the “Advanced” tab of the report.
  • top_off_target_alleles.csv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A CSV file containing the data used to create the “Top off-target variants” table from the “Advanced” tab of the report.
  • top_edit_sites_per_guide.csv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A CSV file containing the data used to create the “Top edit sites per group” plot from the “Advanced” tab of the report.
  • top_edit_combinations_per_guide.csv (GE-KO, GE+Protein-KO) A CSV file containing the data used to create the “Top edit combinations per group” table from the “Advanced” tab of the report.
  • top10_ontarget_variant_zygosity.csv (GE-KO, GE+Protein-KO) A CSV file containing the data used to create the “Zygosity of top 10 on-target variants” plot from the “Advanced” tab of the report.
  • summary_of_editing.csv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A CSV file containing the data used to create the “Summary of editing” table from the “Summary” tab of the report.
  • panel_uniformity.csv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A CSV file containing the data used to create the “Panel uniformity summary” table from the “Summary” tab of the report.
  • ontarget_edit_cooccurrence.csv (GE-KO, GE+Protein-KO) A CSV file containing the data used to create the “On-target editing co-occurrence” plot/table from the “Advanced” tab of the report. It has one column per on-target, named after the on-target (from column 4 of the .submitted panel file), and an additional column called “% cells”. Each row represents a unique combination of editing outcomes across the cells that have genotyping information for all on-targets, and the value in the “% cells” column is the percentage of cells with that combination. The value for each on-target is the numerical genotype, where 0 denotes wild-type, 1 denotes a mono-allelic edit and 2 denotes a bi-allelic edit.
  • ontarget_editing_zygosity.csv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A CSV file containing the data used to create the “Zygosity of on-target edits” plot from the “Advanced” tab of the report.
  • ontarget_allele_status.csv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A CSV file containing the data used to create the “Distribution of on-target alleles” plot from the “Summary” tab of the report.
  • indel_start_locations.csv (GE-KO, GE+Protein-KO) A CSV file containing the data used to create the “On-target INDEL start position” plot from the “Advanced” tab of the report.
  • indel_length_distribution.csv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A CSV file containing the data used to create the “On-target INDEL lengths” plot from the “Advanced” tab of the report.
  • ontarget_mutation_distribution.csv (GE-BE, GE+Protein-BE) A CSV file containing the data used to create the “On-target variant distribution” plot from the “Advanced” tab of the report.
  • ontarget_edit_cooccurrence_with_protein.csv (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) A CSV file containing the data used to create the “On-target editing status v/s Protein Expression” plot from the “Advanced” tab of the report.
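The genotype encoding used in ontarget_edit_cooccurrence.csv (0 for wild-type, 1 for mono-allelic, 2 for bi-allelic) can be decoded with a few lines of code. The sketch below is only an illustration: the on-target names TARGET_A and TARGET_B and all values are invented, and it assumes that the “% cells” column holds plain numbers without a percent sign.

```python
import csv
import io

GENOTYPE_LABELS = {0: "wild-type", 1: "mono-allelic", 2: "bi-allelic"}
PERCENT_COLUMN = "% cells"


def load_cooccurrence(handle):
    """Return a list of (genotypes_by_target, percent_of_cells) tuples."""
    rows = []
    for record in csv.DictReader(handle):
        percent = float(record.pop(PERCENT_COLUMN))
        # int(float(...)) tolerates values written as "2" or "2.0"
        genotypes = {t: int(float(v)) for t, v in record.items()}
        rows.append((genotypes, percent))
    return rows


def describe(genotypes):
    """Render one combination as readable text."""
    return ", ".join(f"{t}: {GENOTYPE_LABELS[g]}" for t, g in genotypes.items())


def percent_fully_edited(rows):
    """Sum '% cells' over rows where every on-target is bi-allelic."""
    return sum(p for g, p in rows if all(v == 2 for v in g.values()))


# Invented example content with two hypothetical on-targets.
example = io.StringIO(
    "TARGET_A,TARGET_B,% cells\n"
    "0,0,40.0\n"
    "2,0,25.0\n"
    "2,2,35.0\n"
)

rows = load_cooccurrence(example)
for genotypes, percent in rows:
    print(f"{percent:5.1f}%  {describe(genotypes)}")
print(percent_fully_edited(rows))
```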

HTML

  • top_on_target_alleles.html (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) An HTML file containing a standalone version of the “Top on-target variants” table from the “Advanced” tab of the report.
  • top_off_target_alleles.html (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) An HTML file containing a standalone version of the “Top off-target variants” table from the “Advanced” tab of the report.
  • top_edit_sites_per_guide.html (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) An HTML file containing a standalone version of the “Top edit sites per group” plot from the “Advanced” tab of the report.
  • top_edit_combinations_per_guide.html (GE-KO, GE+Protein-KO) An HTML file containing a standalone version of the “Top edit combinations per group” table from the “Advanced” tab of the report.
  • top10_ontarget_variant_zygosity.html (GE-KO, GE+Protein-KO) An HTML file containing a standalone version of the “Zygosity of top 10 on-target variants” plot from the “Advanced” tab of the report.
  • summary_of_editing.html (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) An HTML file containing a standalone version of the “Summary of editing” table from the “Summary” tab of the report.
  • panel_uniformity.html (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) An HTML file containing a standalone version of the “Panel uniformity summary” table from the “Summary” tab of the report.
  • ontarget_edit_cooccurrence.html (GE-KO, GE+Protein-KO) An HTML file containing a standalone version of the “On-target editing co-occurrence” plot/table from the “Advanced” tab of the report.
  • ontarget_editing_zygosity.html (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) An HTML file containing a standalone version of the “Zygosity of on-target edits” plot from the “Advanced” tab of the report.
  • ontarget_allele_status.html (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) An HTML file containing a standalone version of the “Distribution of on-target alleles” plot from the “Summary” tab of the report.
  • indel_start_locations.html (GE-KO, GE+Protein-KO) An HTML file containing a standalone version of the “On-target INDEL start position” plot from the “Advanced” tab of the report.
  • indel_length_distribution.html (GE-KO, GE-BE, GE+Protein-KO, GE+Protein-BE) An HTML file containing a standalone version of the “On-target INDEL lengths” plot from the “Advanced” tab of the report.
  • ontarget_mutation_distribution.html (GE-BE, GE+Protein-BE) An HTML file containing a standalone version of the “On-target variant distribution” plot from the “Advanced” tab of the report.

GE Multi-sample KO and BE Files

  • multi_sample.report.html  (KO, BE) This new .html report is generated for each run. It summarizes the run details in the form of various tables and plots that can be used to understand the performance of all samples run through the multi-sample pipeline.
  • top_edit_sites.csv  (KO, BE) A CSV file containing the data used to create the “Top 5 edit sites per group” plot.
  • top_edit_combinations.csv  (KO) A CSV file containing the data used to create the “Top 5 edit combinations per group” table.
  • ontarget_edit_zygosity.csv  (KO, BE) A CSV file containing the data used to create the “On-target editing status” plot.
  • ontarget_edit_cooccurrence.csv  (KO) A CSV file containing the data used to create the “On-target editing co-occurrence” plot.
  • ontarget_mutation_distribution.csv (BE) A CSV file containing the data used to create the “On-target variant distribution” plot.

Demultiplexed GE DNA Files

A demultiplexed run contains run-level and sample-level files. The following files are generated for each:

Run Level Files

All the files listed above under Pipeline Files are generated, except the report file, which should not be multi-sample in nature.

Sample Level Files

All the reporting files are created for each sample, in addition to an .h5 file and a primers.bam file. The sample-level files are listed below:

  • KO or BE report files per sample - Multiple report files are present because one set is generated per sample.
  • <run-prefix>.<sample_ID>.dmx.h5 - H5 file for the demultiplexed samples where assays are filtered to retain only the cells from the sample.
  • <run-prefix>.<sample_ID>.primers.bam - BAM file filtered on cells for the sample.
  • <run-prefix>.<sample_ID>.primers.bam.csi - Index file for the sample primers.bam file.
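The naming pattern above can be turned into a small helper for checking that a sample's files are all present. This is a minimal sketch built only from the three patterns listed; the run prefix "run1", the sample ID "sampleA" and the function names are hypothetical.

```python
from pathlib import Path


def sample_level_files(run_prefix, sample_id):
    """Build the expected sample-level file names from the listed patterns."""
    base = f"{run_prefix}.{sample_id}"
    return {
        "h5": f"{base}.dmx.h5",
        "bam": f"{base}.primers.bam",
        "bam_index": f"{base}.primers.bam.csi",
    }


def missing_sample_files(directory, run_prefix, sample_id):
    """List the expected file names that are absent from the directory."""
    directory = Path(directory)
    expected = sample_level_files(run_prefix, sample_id).values()
    return [name for name in expected if not (directory / name).exists()]


# Hypothetical run prefix and sample ID, used only for illustration.
names = sample_level_files("run1", "sampleA")
print(names["h5"])
print(names["bam_index"])
```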