Genome Integrity Input Files

Updated November 27, 2024 17:46

The Genome Integrity (GI) Pipeline requires the following input files to run:

FASTQ files
Panel files
1. *.gatk
2. *.bed
3. *.amplicons
4. *.amplicon.info.csv
Reference genome files
Somatic variant whitelist/blacklist CSV file
Spike-In Variants file
Spike-In CNV file
Demultiplexing Variants file

FASTQ files

Input FASTQ files are one or more pairs of forward and reverse FASTQ files (R1/R2). These files should be compressed (.gz). DNA FASTQs are always required to run this Pipeline. Protein FASTQs are only required for the GI DNA + Protein Pipeline.
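
As a quick pre-upload check, the pairing and compression requirements above can be sketched in Python. The file naming pattern (an `_R1`/`_R2` tag followed by `.fastq.gz`) and the helper name `pair_fastqs` are assumptions made for illustration; they are not defined by Tapestri Pipeline.

```python
import re
from collections import defaultdict

# Assumed naming convention (not from this article): "<prefix>_R1<suffix>.fastq.gz"
READ_TAG = re.compile(r"^(?P<prefix>.+)_R(?P<read>[12])(?P<suffix>.*\.fastq\.gz)$")


def pair_fastqs(filenames):
    """Group file names into (R1, R2) pairs.

    Raises ValueError if a name is not a gzip-compressed R1/R2 FASTQ
    or if either mate of a pair is missing.
    """
    groups = defaultdict(dict)
    for name in filenames:
        match = READ_TAG.match(name)
        if match is None:
            raise ValueError(f"not a compressed R1/R2 FASTQ name: {name}")
        # Files that differ only in the read tag belong to the same pair.
        key = (match["prefix"], match["suffix"])
        groups[key]["R" + match["read"]] = name

    pairs = []
    for key, reads in sorted(groups.items()):
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"unpaired FASTQ for {key[0]}")
        pairs.append((reads["R1"], reads["R2"]))
    return pairs
```

Calling `pair_fastqs` on the list of file names you plan to upload either returns the pairs or raises an error naming the first problem it finds.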

Panel files

These 5 files need to be zipped together and uploaded to Tapestri Pipeline as an ‘MM DNA’ panel file type. For further information on how to upload these files, please refer to the Tapestri Pipeline GI User Guide. Two of the files (*.bed and *.amplicons) are standard output from Tapestri Designer or the White Glove panel design pipeline and can be used without modification. The *.amplicon.info.csv and *.per-variant-background-error.csv files are unique to the Multiple Myeloma and Genome Integrity pipelines.
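
The zipping step can be sketched with Python's standard `zipfile` module. The suffix list below is assembled from the file types named in this article, and the flat archive layout (no subfolders) is an assumption; the helper names `select_panel_files` and `zip_panel` are hypothetical.

```python
import zipfile
from pathlib import Path

# Suffixes taken from the file types named in this article.
PANEL_SUFFIXES = (
    ".gatk",
    ".bed",
    ".amplicons",
    ".amplicon.info.csv",
    ".per-variant-background-error.csv",
)


def select_panel_files(names):
    """Pick exactly one file name per expected suffix, in PANEL_SUFFIXES order."""
    selected = []
    for suffix in PANEL_SUFFIXES:
        matches = sorted(n for n in names if n.endswith(suffix))
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly one *{suffix} file, found {len(matches)}"
            )
        selected.append(matches[0])
    return selected


def zip_panel(panel_dir, zip_path):
    """Write the selected panel files into one zip archive (flat layout assumed)."""
    panel_dir = Path(panel_dir)
    names = select_panel_files([p.name for p in panel_dir.iterdir() if p.is_file()])
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            archive.write(panel_dir / name, arcname=name)
    return names
```

The resulting archive is what would then be uploaded as the panel file; the upload itself is done in Tapestri Pipeline as described in the user guide.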

*.bed

This file is created as standard output for every panel design and is further described here.

*.amplicons

This file is created as standard output for every panel design and is further described here.

*.amplicon.info.csv

This is a 3-column CSV file with the columns: amplicon, purpose, and gene. Each amplicon in the DNA panel used should appear as one row in the file. Each column is described below.

amplicon: The name of the amplicon. Every amplicon in the DNA panel used should be listed in this column, one per row.
purpose: The purpose of the amplicon. It can be one of the following 3 categories: ‘mutation’, ‘gwCNV’, and ‘focalCNV’. If an amplicon has multiple purposes, separate them with a semicolon, for example: ‘mutation;focalCNV’. Each of the 3 purposes is described below.
1. ‘mutation’: Used for finding somatic SNPs and INDELs
2. ‘gwCNV’: Used for genome-wide CNV calls
3. ‘focalCNV’: Used for calling focal CNV events
gene: The gene name for the amplicon. It is used when the amplicon serves focal CNV calls.
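
The column rules above can be checked before upload with a short script. This is a minimal sketch, not the pipeline's own validation: it assumes the header row is exactly `amplicon,purpose,gene` and treats a missing gene on a ‘focalCNV’ amplicon as a problem. The function name `check_amplicon_info` is hypothetical.

```python
import csv
import io

ALLOWED_PURPOSES = {"mutation", "gwCNV", "focalCNV"}


def check_amplicon_info(csv_text, panel_amplicons):
    """Return a list of readable problems; an empty list means none were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["amplicon", "purpose", "gene"]:
        return [f"unexpected header: {reader.fieldnames}"]

    problems, seen = [], set()
    for row in reader:
        name = row["amplicon"]
        seen.add(name)
        # Multiple purposes are delimited with a semicolon.
        purposes = set((row["purpose"] or "").split(";"))
        unknown = purposes - ALLOWED_PURPOSES
        if unknown:
            problems.append(f"{name}: unknown purpose {sorted(unknown)}")
        # Assumption: a focalCNV amplicon needs a gene name.
        if "focalCNV" in purposes and not row["gene"]:
            problems.append(f"{name}: focalCNV amplicon has no gene")

    # Every amplicon in the panel should have its own row.
    for name in sorted(set(panel_amplicons) - seen):
        problems.append(f"{name}: in panel but missing from file")
    return problems
```

Here `panel_amplicons` would be the amplicon names from the panel itself, for example as read from the *.amplicons file.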

Reference genome files

We recommend that you use one of the Mission Bio-provided reference genomes (hg19, hg38, or mm10). These catalog reference genomes are pre-uploaded on Tapestri Designer and Tapestri Pipeline (Files → Other Files). If you want to use a custom reference genome, it may help to use one of these catalog reference genomes as the starting point. The reference genome used for the pipeline must match the one the panel was created with. If a custom reference genome was used, please upload the .fa.zip file of the genome to your Tapestri Pipeline account, following the instructions provided here.

Somatic variant whitelist/blacklist file

This is a 7-column CSV file with the columns: chromosome, position, ref_allele, alt_allele, type, sample_id, and genotype. It is an optional file used during the somatic clone detection step to include or exclude variants. The type column should specify whether the variant is ‘whitelist’ or ‘blacklist’. Unlike the Spike-In Variants file and the Demultiplexing Variants file, this file does not require genotype information.
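
To illustrate the layout, the sketch below reads such a file and splits its variants by the type column. It assumes the header uses the column names exactly as listed above and that the genotype field may be left empty; the function name `split_variant_lists` and the rows in the usage note are hypothetical.

```python
import csv
import io

COLUMNS = [
    "chromosome", "position", "ref_allele", "alt_allele",
    "type", "sample_id", "genotype",
]


def split_variant_lists(csv_text):
    """Return {'whitelist': [...], 'blacklist': [...]} of (chrom, pos, ref, alt)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    lists = {"whitelist": [], "blacklist": []}
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_number, row in enumerate(reader, start=2):
        if row["type"] not in lists:
            raise ValueError(f"line {line_number}: type must be whitelist or blacklist")
        lists[row["type"]].append(
            (row["chromosome"], int(row["position"]), row["ref_allele"], row["alt_allele"])
        )
    return lists
```

For instance, a made-up row such as `chr1,100,A,T,whitelist,sample_1,` would land in the whitelist with its genotype left blank.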

Spike-In Variants file

This is a 7-column CSV file with the columns: chromosome, position, ref_allele, alt_allele, genotype, type, and sample_id. This file is used to demultiplex the sample from the spike-in cells. The type column specifies whether the variant is germline or somatic.

NOTE: Only germline variants should be used for this file. The file format is similar to that of the Demultiplexing Variants file used for multiplexed runs.
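
Because only germline variants belong in this file, a draft file can be screened before upload. The sketch below is an illustration under assumptions: the type value is taken to be the literal string `germline`, the genotype encoding is not checked (only that the field is filled in), and the function name `screen_spike_in_variants` is hypothetical.

```python
import csv
import io

COLUMNS = [
    "chromosome", "position", "ref_allele", "alt_allele",
    "genotype", "type", "sample_id",
]


def screen_spike_in_variants(csv_text):
    """Split rows into (kept, rejected).

    A row is kept when its type is 'germline' and its genotype is not empty.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    kept, rejected = [], []
    for row in reader:
        if row["type"] == "germline" and row["genotype"]:
            kept.append(row)
        else:
            rejected.append(row)
    return kept, rejected
```

Rows in `rejected` are candidates to remove or correct before the file is used.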

Spike-In CNV file

This is a 4-column CSV file with the columns: chromosome, start, end, and copy_number. It details the regions with CNVs for the spike-in cell line. For example, if the spike-in cells used are XY, they are haploid for chrX, and that needs to be detailed in this file as copy_number = 1. If the spike-in cell line is known to be fully diploid, this file is not needed.
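
A file in this layout can be generated from a list of regions, as in the sketch below. The coordinates in the usage note are placeholders rather than real chromosome lengths, the coordinate convention (0- or 1-based) is not specified in this article, and skipping regions at the diploid baseline is an assumption based on the file listing only regions with CNVs. The function name `spike_in_cnv_csv` is hypothetical.

```python
import csv
import io


def spike_in_cnv_csv(regions, baseline_copy_number=2):
    """Build CSV text from (chromosome, start, end, copy_number) tuples."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["chromosome", "start", "end", "copy_number"])
    for chromosome, start, end, copy_number in regions:
        if start >= end:
            raise ValueError(f"{chromosome}: start must be smaller than end")
        if copy_number < 0:
            raise ValueError(f"{chromosome}: copy_number cannot be negative")
        # Assumption: regions at the diploid baseline are not CNVs, so skip them.
        if copy_number == baseline_copy_number:
            continue
        writer.writerow([chromosome, start, end, copy_number])
    return buffer.getvalue()
```

For an XY spike-in line, a call such as `spike_in_cnv_csv([("chrX", 1, 1000, 1)])` produces a header row followed by one chrX row with copy_number 1.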

Demultiplexing Variants file

This is a 7-column CSV file with the columns: chromosome, position, ref_allele, alt_allele, genotype, type, and sample_id. This file is used to demultiplex the samples and is only needed for runs that were multiplexed. The type column specifies that the variant is germline. The file can be user-created or the output of the Merge Bulk Runs pipeline. The file format is explained in depth here.
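
For a user-created file, it can be useful to confirm that every multiplexed sample is represented. The sketch below counts variants per sample_id and rejects rows whose type is not germline; the literal value `germline` and the function name `variants_per_sample` are assumptions for illustration.

```python
import csv
import io
from collections import Counter


def variants_per_sample(csv_text):
    """Count rows per sample_id; raise ValueError on any non-germline row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_number, row in enumerate(reader, start=2):
        if row["type"] != "germline":
            raise ValueError(f"line {line_number}: type is not germline")
        counts[row["sample_id"]] += 1
    return dict(counts)
```

A sample that is missing from the returned counts has no variants in the file and could not be told apart from the others.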
