The Genome Integrity (GI) Pipeline requires the following input files to run:
- FASTQ files
- Panel files
  - *.gatk
  - *.bed
  - *.amplicons
  - *.amplicon.info.csv
- Reference genome files
- Somatic variant whitelist/blacklist CSV file
- Spike-in Variants file
- Spike-in CNV file
- Demultiplexing Variants file
FASTQ files
Input FASTQ files are one or more pairs of forward and reverse FASTQ files (R1/R2). These files should be compressed (.gz). DNA FASTQs are always required to run this Pipeline. Protein FASTQs are only required for the GI DNA + Protein Pipeline.
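Before uploading, it can help to confirm that every FASTQ is compressed and has its mate. The Python sketch below is one way to do that, not part of the pipeline. It assumes Illumina-style file names in which an `_R1` or `_R2` tag is followed by `_` or `.`; that naming convention and the helper `pair_fastqs` are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumed naming: an "_R1"/"_R2" tag followed by "_" or ".",
# e.g. sample_L001_R1_001.fastq.gz (hypothetical file name).
READ_TAG = re.compile(r"_(R[12])(?=[_.])")


def pair_fastqs(filenames):
    """Group file names into (R1, R2) pairs; raise ValueError on problems."""
    groups = defaultdict(dict)
    for name in filenames:
        if not name.endswith((".fastq.gz", ".fq.gz")):
            raise ValueError(f"not a gzip-compressed FASTQ: {name}")
        match = READ_TAG.search(name)
        if match is None:
            raise ValueError(f"no R1/R2 tag in file name: {name}")
        # Drop the read tag so both mates of a pair share the same key.
        key = name[:match.start(1)] + name[match.end(1):]
        groups[key][match.group(1)] = name

    pairs = []
    for key in sorted(groups):
        reads = groups[key]
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"unpaired FASTQ: {sorted(reads.values())}")
        pairs.append((reads["R1"], reads["R2"]))
    return pairs
```

Calling `pair_fastqs` on the DNA files and, for the DNA + Protein Pipeline, separately on the protein files returns the list of pairs or raises on the first problem found.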
Panel files
These 5 files need to be zipped together and uploaded to Tapestri Pipeline as a ‘MM DNA’ panel file type. For further information on how to upload these files, please refer to the Tapestri Pipeline GI User Guide. The first 2 files (*.bed and *.amplicons) are standard output from Tapestri Designer or the White Glove panel design pipeline and can be used without modification. The *.amplicons.info.csv and *.per-variant-background-error.csv files are unique to the Multiple Myeloma and Genome Integrity pipelines.
- *.bed
This file is created as standard output for all files and is further described here.
- *.amplicons
This file is created as standard output for all files and is further described here.
- *.amplicon.info.csv
This is a 3-column CSV file with the columns amplicon, purpose, and gene. Each amplicon in the DNA panel used should appear as one row in the file. Each column is described below.
- amplicon: The name of the amplicon. Every amplicon in the DNA panel used should be listed in this column as its own row.
- purpose: The purpose of this particular amplicon. The purpose can be one of the following 3 categories: ‘mutation’, ‘gwCNV’, and ‘focalCNV’. If an amplicon has multiple purposes, separate them with a semicolon, for example ‘mutation;focalCNV’. Each of the 3 purposes is described below.
  - ‘mutation’: Used for finding somatic SNPs and INDELs
  - ‘gwCNV’: Used for genome-wide CNV calls
  - ‘focalCNV’: Used for calling focal CNV events
- gene: The gene name for the amplicon. This applies when the amplicon is used for focal CNV calls.
Example:
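As a rough illustration of this layout, the Python sketch below checks an amplicon info CSV against the column descriptions above. The amplicon names and gene in `EXAMPLE` are invented placeholders, the helper `check_amplicon_info` is hypothetical, and the rule that a ‘focalCNV’ amplicon needs a gene is our reading of the gene column description rather than a documented pipeline check.

```python
import csv
import io

EXPECTED_COLUMNS = ["amplicon", "purpose", "gene"]
ALLOWED_PURPOSES = {"mutation", "gwCNV", "focalCNV"}

# Invented rows for illustration only; not taken from a real panel.
EXAMPLE = """amplicon,purpose,gene
AMPL0001,mutation,
AMPL0002,mutation;focalCNV,GENE_A
AMPL0003,gwCNV,
"""


def check_amplicon_info(csv_text, panel_amplicons):
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]

    problems, seen = [], set()
    for row in reader:
        name = row["amplicon"]
        seen.add(name)
        # Multiple purposes are delimited with a semicolon.
        purposes = set((row["purpose"] or "").split(";"))
        unknown = purposes - ALLOWED_PURPOSES
        if unknown:
            problems.append(f"{name}: unknown purpose {sorted(unknown)}")
        # Assumption: focal CNV amplicons should carry a gene name.
        if "focalCNV" in purposes and not row["gene"]:
            problems.append(f"{name}: focalCNV amplicon has no gene")

    # Every amplicon in the panel should appear as its own row.
    missing = set(panel_amplicons) - seen
    if missing:
        problems.append(f"amplicons missing from file: {sorted(missing)}")
    return problems
```

The second argument is the list of amplicon names in the panel, which could be read from the panel's own files.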
Reference genome files
We recommend that you use one of the Mission Bio-provided reference genomes (hg19, hg38, or mm10). These catalog reference genomes can be found pre-uploaded on Tapestri Designer and Tapestri Pipeline (Files → Other Files). If you want to use a custom reference genome, it may help to use one of these catalog reference genomes as the starting point. The reference genome used for the pipeline must match the reference genome the panel was created with. If a custom reference genome was used, please upload the .fa.zip file of the genome to your Tapestri Pipeline account, following the instructions provided here.
Somatic variant whitelist/blacklist file
This is a 7-column CSV file with the columns: chromosome, position, ref_allele, alt_allele, type, sample_id and genotype. This is an optional file that is used during the somatic clone detection step to include or exclude variants. The type column should specify if the variant is ‘whitelist’ or ‘blacklist’. Unlike the Spike-In variants file and the demultiplexing variants file, this file does not require genotype information.
Example:
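To make the column order concrete, here is a minimal Python sketch that reads such a file and separates the two lists. The function `split_variant_lists` is hypothetical, and the sketch only mirrors the description above; it does not reproduce how the pipeline itself consumes the file. The genotype column is read but never required.

```python
import csv
import io

COLUMNS = ["chromosome", "position", "ref_allele", "alt_allele",
           "type", "sample_id", "genotype"]


def split_variant_lists(csv_text):
    """Return (whitelist, blacklist) as lists of (chrom, pos, ref, alt)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    lists = {"whitelist": [], "blacklist": []}
    # Data rows start on line 2 of the file, after the header.
    for line_no, row in enumerate(reader, start=2):
        kind = row["type"]
        if kind not in lists:
            raise ValueError(
                f"line {line_no}: type must be whitelist or blacklist, "
                f"got {kind!r}"
            )
        variant = (row["chromosome"], int(row["position"]),
                   row["ref_allele"], row["alt_allele"])
        lists[kind].append(variant)
    return lists["whitelist"], lists["blacklist"]
```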
Spike-In Variants file
This is a 7-column CSV file with the columns: chromosome, position, ref_allele, alt_allele, genotype, type, and sample_id. This file is used to demultiplex the sample from the spike-in cells. The type column will specify whether the variant is germline or somatic.
NOTE: only germline variants should be used for this file. This file format is similar to the demultiplexing variant file used for multiplexed runs.
Example:
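The germline-only rule in the note can be checked before upload. The Python sketch below is one possible pre-flight check under stated assumptions: `check_spike_in_variants` is a hypothetical helper, and because the accepted genotype encoding is not described here, the sketch only verifies that a genotype is present, not that its value is valid.

```python
import csv
import io

COLUMNS = ["chromosome", "position", "ref_allele", "alt_allele",
           "genotype", "type", "sample_id"]


def check_spike_in_variants(csv_text):
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != COLUMNS:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    # Data rows start on line 2 of the file, after the header.
    for line_no, row in enumerate(reader, start=2):
        if row["type"] != "germline":
            problems.append(f"line {line_no}: type is {row['type']!r}, "
                            "only germline variants should be used")
        if not row["genotype"]:
            problems.append(f"line {line_no}: genotype is empty")
    return problems
```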
Spike-In CNV file
This is a 4-column CSV file with the columns: chromosome, start, end, and copy_number. It details the regions with CNVs for the spike-in cell line. For example, if the spike-in cells used are XY, then they are haploid for chrX, and that needs to be detailed in this file as copy_number = 1. If the spike-in cell line is known to be fully diploid, this file is not needed.
Example:
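A small Python sketch of how such a file can be read and queried follows. The coordinates in `EXAMPLE` are illustrative placeholders rather than real chromosome bounds, the helper names are hypothetical, and two choices are assumptions not stated above: regions are treated as inclusive at both ends, and any position outside the listed regions is treated as diploid.

```python
import csv
import io

COLUMNS = ["chromosome", "start", "end", "copy_number"]
DEFAULT_COPY_NUMBER = 2  # assumption: unlisted regions are diploid

# Placeholder coordinates for an XY spike-in line; not real chrX bounds.
EXAMPLE = """chromosome,start,end,copy_number
chrX,1,150000000,1
"""


def load_cnv_regions(csv_text):
    """Parse the CSV text into (chromosome, start, end, copy_number) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    regions = []
    for row in reader:
        start, end = int(row["start"]), int(row["end"])
        if start > end:
            raise ValueError(f"start > end in region on {row['chromosome']}")
        regions.append((row["chromosome"], start, end,
                        int(row["copy_number"])))
    return regions


def expected_copy_number(regions, chromosome, position):
    """Return the listed copy number at a position, else the diploid default."""
    for chrom, start, end, copies in regions:
        # Assumption: both start and end are inclusive.
        if chrom == chromosome and start <= position <= end:
            return copies
    return DEFAULT_COPY_NUMBER
```

With the placeholder file, a position on chrX inside the listed region yields 1, and any position on an unlisted chromosome yields 2.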
Demultiplexing Variants file
This is a 7-column CSV file with the columns: chromosome, position, ref_allele, alt_allele, genotype, type, and sample_id. This file is used to demultiplex the samples. The type column will specify that the variant is germline. This file is only needed for runs that were multiplexed. This file can be user-created or the output of the Merge Bulk Runs pipeline. The file format is explained in depth here.
Example:
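For a user-created file, it can be useful to see which variants and genotypes each sample contributes. The Python sketch below builds that lookup from the 7 columns described above. The function `genotypes_by_sample` is hypothetical, genotype values are kept as plain strings because their encoding is not described here, and the sketch says nothing about how the pipeline performs demultiplexing internally.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["chromosome", "position", "ref_allele", "alt_allele",
           "genotype", "type", "sample_id"]


def genotypes_by_sample(csv_text):
    """Map sample_id -> {(chrom, pos, ref, alt): genotype}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    table = defaultdict(dict)
    # Data rows start on line 2 of the file, after the header.
    for line_no, row in enumerate(reader, start=2):
        # The type column is expected to mark every variant as germline.
        if row["type"] != "germline":
            raise ValueError(f"line {line_no}: type is {row['type']!r}")
        key = (row["chromosome"], int(row["position"]),
               row["ref_allele"], row["alt_allele"])
        table[row["sample_id"]][key] = row["genotype"]
    return dict(table)
```

A sample with very few rows, or two samples with identical genotypes at every listed variant, would be worth reviewing before the run.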