Genome Editing Input Files


 

The Genome Editing Pipeline requires 3 different types of input files to run:

  1. FASTQ files
  2. Panel files
    1. *.amplicons
    2. *.bed
    3. *.designSummary.tab
    4. *.submitted
    5. *.target_groups.csv
  3. Reference genome files

 

FASTQ files

Input FASTQ files are one or more pairs of forward and reverse FASTQ files (R1/R2). These files should be compressed (.gz). DNA FASTQs are always required to run this Pipeline. Protein FASTQs are only required for the GE DNA + Protein Pipeline. 
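
As a pre-upload sanity check, the R1/R2 pairing described above can be verified with a few lines of Python. This is only a sketch: it assumes the common Illumina-style naming in which the read tag appears as `_R1` or `_R2` in the file name, which this article does not specify, and the function name `pair_fastqs` is hypothetical.

```python
import re

# Assumption: file names carry an Illumina-style "_R1"/"_R2" read tag
# followed by "_" or "." (e.g. sample_S1_L001_R1_001.fastq.gz).
READ_TAG = re.compile(r"_R([12])(?=[_.])")


def pair_fastqs(filenames):
    """Return sorted (R1, R2) name pairs; raise ValueError on any problem."""
    r1, r2 = {}, {}
    for name in filenames:
        if not name.endswith(".gz"):
            raise ValueError(f"not gzip-compressed: {name}")
        match = READ_TAG.search(name)
        if match is None:
            raise ValueError(f"no _R1/_R2 tag found: {name}")
        # Replace the read tag with a placeholder so mates share one key.
        key = name[:match.start()] + "_R#" + name[match.end():]
        (r1 if match.group(1) == "1" else r2)[key] = name
    if r1.keys() != r2.keys():
        unpaired = sorted(r1.keys() ^ r2.keys())
        raise ValueError(f"unpaired FASTQ files: {unpaired}")
    return [(r1[key], r2[key]) for key in sorted(r1)]
```

Calling `pair_fastqs` on a list of file names either returns the matched pairs or raises a `ValueError` that names the unpaired or uncompressed file.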

 

Panel files

The pipeline reads panel and target information from the 5 files described below. These files need to be zipped together and uploaded to Tapestri Pipeline as a ‘Genome Editing Panel’ file type. For further details on how to upload these files, please refer to the Tapestri Pipeline GE User Guide. The first 2 files (*.amplicons and *.bed) are standard output from Tapestri Designer or the White Glove panel design process and can be used without modification, other than removing any header row as noted below. The *.designSummary.tab and *.submitted files are also standard output files but require modification, as described below. The *.target_groups.csv is a new file type.

  1. *.amplicons

This is a file with four columns – chromosome, amplicon start position, amplicon end position, and amplicon ID. The amplicon coordinates include the forward primer, insert region, and reverse primer. It shows one-based, fully closed coordinates for the amplicons. View more details about the coordinate system here. This file should be included in the package generated by the Tapestri Designer. Note: remove the header and save.

Example:

Note that in the examples below, newline characters (\n) are not shown. They are required at the end of each line if these files are created programmatically (e.g., via Python) rather than with a text editor. All target names used below are arbitrary examples only and are not intended to suggest a specific naming convention.

A four-column, tab-separated file that specifies the amplicon locations (chromosome, start coordinate, stop coordinate) and the amplicon name. This file is typically generated by Tapestri Designer or our White Glove panel design process as standard output. This file should not contain a header.
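
If the files are prepared programmatically, the header removal and the newline requirement noted above can be handled together. The sketch below is illustrative only: it treats the first row as a header when its second and third columns are not integers, which is an assumption about how a header looks, and `strip_header` is a hypothetical helper name.

```python
def strip_header(lines):
    """Drop a leading header row and end every remaining row with a newline.

    Assumption: a row is a header if columns 2 and 3 are not both integers.
    """
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if rows:
        fields = rows[0].split("\t")
        looks_like_data = (
            len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit()
        )
        if not looks_like_data:
            rows = rows[1:]
    # Re-attach the newline characters the pipeline input needs.
    return [row + "\n" for row in rows]
```

The same helper applies to the *.bed and *.submitted files, which are also four-column files without a header.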

  2. *.bed

This is a four-column .bed file with chromosome, insert start position, insert end position, and amplicon ID. It shows zero-based, half-open coordinates for the insert. View more details about the coordinate system here. This file should be included in the package generated by the Tapestri Designer. Note: remove the header and save.

Example:

A four-column, tab-separated file that specifies the insert locations (chromosome, start coordinate, stop coordinate) and the amplicon name. This file is typically generated by Tapestri Designer or our White Glove panel design process as standard output. This file should not contain a header.
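
Because the *.amplicons file uses one-based, fully closed coordinates and the *.bed file uses zero-based, half-open coordinates, it is easy to introduce off-by-one errors when editing these files by hand. The following sketch shows the standard conversion between the two systems. The function names are hypothetical, and the example describes the coordinate systems only: the two files cover different regions (amplicon versus insert), so one is not simply a converted copy of the other.

```python
from typing import Tuple


def closed_to_bed(start: int, end: int) -> Tuple[int, int]:
    """1-based, fully closed [start, end] -> 0-based, half-open [start, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def bed_to_closed(start: int, end: int) -> Tuple[int, int]:
    """0-based, half-open [start, end) -> 1-based, fully closed [start, end]."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def bed_length(start: int, end: int) -> int:
    """Number of bases covered by a 0-based, half-open interval."""
    return end - start
```

Only the start coordinate shifts by one. The end coordinate is numerically the same in both systems, and the number of bases covered does not change.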

  3. *.designSummary.tab

This file summarizes amplicon information: amplicon position, insert position, primer sequences, and amplicon sequence. It is a tab-delimited file with a header row and 10 columns, in this order: ampliconID, chromosome, amplicon_start_position, insert_start_position, insert_end_position, amplicon_end_position, Ref_Genome, fwd_seq, rev_seq, amplicon_seq. All coordinates are 1-based. This file should be included in the package generated by the Tapestri Designer. In some cases, the amplicon_seq column may be empty or ‘NA’; this is compatible with the GE pipelines. Note: Tapestri Designer produces this file as a .csv, which must be converted to a .tab file. To do this, open the .csv file, click ‘Save as’, and save the file as a TAB or tab-delimited file. Ensure that the new file has a .tab file extension.

Example:

A ten-column, tab-separated file, which is typically generated by Tapestri Designer or our White Glove panel design process as standard output. There is a header (column names) for this file.
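
The .csv to .tab conversion can also be scripted rather than done through ‘Save as’. The sketch below uses Python's standard csv module. The function name `csv_text_to_tab` is hypothetical, and the 10-column check simply mirrors the column list given above.

```python
import csv
import io


def csv_text_to_tab(csv_text, expected_columns=10):
    """Convert comma-separated text to tab-separated text, keeping the header."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != expected_columns:
            raise ValueError(
                f"row {line_no}: expected {expected_columns} columns, got {len(row)}"
            )
        writer.writerow(row)  # empty or 'NA' amplicon_seq values pass through
    return out.getvalue()
```

To use it on a file, read the .csv contents as text, pass them to `csv_text_to_tab`, and write the result to a file whose name ends in .designSummary.tab.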

 

  4. *.submitted

File format

This file is usually provided by the user when designing the panel and is included in the files package when panel design is completed. This file uses standard BED format coordinates for all targets (editing sites). This input file needs to contain a 4th column, specifying the target’s name. These names must match the names in the first column of the *.target_groups.csv file. Note: remove the header from this file.

Target coordinates

The target coordinates (columns 1-3) in this file are used by the pipeline to call cells as edited/unedited. The predicted cut site of each target should be within its region; any variants (INDELs and/or SNVs) overlapping with these coordinates will be classified as an edit event.

A user can modify the coordinates of any target before running the gene editing pipeline. Modifications can be made using a text editor. Each target region is expected to span at least the predicted cut site. If a single precise cut site is not predicted, the user may expand the target region on either or both ends, at the risk of more false-positive edit calls.
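
To see why widening a target region can increase false-positive edit calls, consider how an interval overlap test behaves. The sketch below is not the pipeline's implementation, which this article does not describe. It assumes both the target and the variant are given as zero-based, half-open intervals, and the coordinates are made up.

```python
def overlaps(target_start, target_end, variant_start, variant_end):
    """True if two 0-based, half-open intervals share at least one base."""
    return variant_start < target_end and target_start < variant_end


# Hypothetical target region and variants (coordinates are made up).
narrow_target = (1000, 1020)
wide_target = (990, 1030)
deletion_at_cut_site = (1018, 1025)
unrelated_snv = (1025, 1026)

# The deletion overlaps both versions of the target region.
assert overlaps(*narrow_target, *deletion_at_cut_site)
assert overlaps(*wide_target, *deletion_at_cut_site)

# The SNV misses the narrow region but falls inside the widened one,
# so it would now be counted toward an edit call.
assert not overlaps(*narrow_target, *unrelated_snv)
assert overlaps(*wide_target, *unrelated_snv)
```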

Target list

If the panel covers genomic regions not used for gene editing detection, or regions not covered by any amplicon, those target regions should not be listed in this file or the .target_groups.csv file below. 

Example:

A four-column, tab-separated file, which specifies the target locations (chromosome, start coordinate, stop coordinate) and the target names. There is no header for this file. 
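
A quick format check can catch a leftover header or a malformed row before upload. The following sketch parses *.submitted content under the rules stated above (tab-separated, at least four columns, integer BED coordinates, no header). The function name `parse_submitted` and the target names in the usage note are hypothetical.

```python
def parse_submitted(text):
    """Parse *.submitted text into (chrom, start, end, name) tuples."""
    targets = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) < 4:
            raise ValueError(f"line {line_no}: expected 4 tab-separated columns")
        chrom, name = fields[0], fields[3]
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            raise ValueError(
                f"line {line_no}: non-integer coordinates (leftover header?)"
            ) from None
        if start < 0 or end <= start:
            raise ValueError(f"line {line_no}: expected 0 <= start < end")
        targets.append((chrom, start, end, name))
    return targets
```

For example, `parse_submitted("chr1\t100\t120\tTGT_A\n")` returns a single tuple, and the fourth element of each tuple is the name that must also appear in the *.target_groups.csv file.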

 

  5. *.target_groups.csv

A target group file is a comma-delimited (.csv) file that provides the grouping information for targets. This information is used for selective visualization in the final analysis report. The file name must end with .target_groups.csv.

 

The target group file contains three columns: Target, Group, and Category, as described in the following:

  1. Target: the exact list of target names from the 4th column of the *.submitted BED file. All targets from the *.submitted file must be included.
  2. Group: the on-target edit and any associated predicted off-target edits (e.g., for CRISPR, a group may include both on- and off-targets edited by the same guide RNA).
    1. The Group names are designated in the “Group” column in the target group file (*.target_groups.csv).
    2. It is recommended to keep the group names as short as possible and to avoid spaces and other special characters, except dash “-” and underscore “_”.
  3. Category: either On_target or Off_target for each target, specifying the intended and unintended editing targets. The category string must exactly match “On_target” or “Off_target”; differences in capitalization or in the underscore will not work with the GE pipelines.

Example:

A three-column .csv formatted file that specifies the target grouping and category information to be considered during results reporting and visualizations. There is a header (column names) for this file.
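
The rules for the three columns can be checked against the target names from the *.submitted file before zipping the panel. The sketch below is a minimal validator, not part of the pipeline. It assumes the header row is spelled exactly `Target,Group,Category`, and the function name `check_target_groups` and all target and group names in the example are hypothetical.

```python
import csv
import io
import re

ALLOWED_CATEGORIES = {"On_target", "Off_target"}  # exact, case-sensitive
RECOMMENDED_GROUP = re.compile(r"^[A-Za-z0-9_-]+$")  # no spaces or other symbols


def check_target_groups(csv_text, submitted_names):
    """Return a list of problems found; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["Target", "Group", "Category"]:
        return [f"unexpected header: {reader.fieldnames}"]
    rows = list(reader)
    listed = {row["Target"] for row in rows}
    submitted = set(submitted_names)
    problems = []
    for name in sorted(submitted - listed):
        problems.append(f"target missing from target_groups: {name}")
    for name in sorted(listed - submitted):
        problems.append(f"target not in .submitted: {name}")
    for row in rows:
        if row["Category"] not in ALLOWED_CATEGORIES:
            problems.append(f"{row['Target']}: bad category {row['Category']!r}")
        if not RECOMMENDED_GROUP.match(row["Group"] or ""):
            problems.append(f"{row['Target']}: group name not recommended")
    return problems


example = "Target,Group,Category\nTGT_A,guide1,On_target\nTGT_B,guide1,off_target\n"
print(check_target_groups(example, ["TGT_A", "TGT_B"]))
```

In this example the second row is reported because “off_target” does not match the required capitalization.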

 

Reference genome files

We recommend that you use one of the Mission Bio-provided reference genomes (hg19, hg38, or mm10). These catalog reference genomes can be found pre-uploaded on Tapestri Designer and Tapestri Pipeline (Files → Other Files). If you want to use a custom reference genome, it may help to use one of these catalog reference genomes as the starting point. The reference genome used for the pipeline must match the reference genome the panel was created with. If a custom reference genome was used, please upload the .fa.zip file of the genome to your Tapestri Pipeline account, following the instructions provided here.

 

Note: this .zip cannot be created using the Finder function on Mac. This function adds an additional folder to the zip that will cause the panel file to fail uploading. 
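
One way to avoid the extra folder is to build the archive with a script that stores each file at the top level of the zip. The sketch below uses Python's standard zipfile module. The function name `make_flat_zip` and the file names in the usage note are hypothetical, and it assumes, based on the note above, that the files must not be nested inside a folder in the archive.

```python
import zipfile
from pathlib import Path


def make_flat_zip(file_paths, zip_path):
    """Write the given files into zip_path with no folder structure."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, file_paths):
            # arcname=path.name drops the directory part, so every file
            # sits at the top level of the archive.
            archive.write(path, arcname=path.name)
    return zip_path
```

For example, `make_flat_zip(["panel.amplicons", "panel.bed"], "panel.zip")` produces an archive whose entries are just the two file names, without a leading folder.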

 

Attached is an example Genome Editing Panel .zip, including examples of all 5 file types.


