What are the Sample Variants file format specifications and requirements?

Sample Variants CSV File for Genotype Demultiplexing

The sample variants file is an input file for the genotype-based demultiplexing pipeline. It should be a comma-separated file with a header row followed by one variant per line. The header must contain the following columns:

| Column name | Description | Example |
| --- | --- | --- |
| chromosome | The chromosome the variant is on. Must include the “chr” prefix. | chr1 |
| position | The position of the variant in 1-based coordinates. | 6123151 |
| ref_allele | The reference allele of the variant. Insertions: must be the base in the reference genome just before where the insertion starts. Deletions: must start with the base in the reference genome just before where the deletion starts. | A |
| alt_allele | The alternate allele of the variant. Insertions: must start with the reference allele. Deletions: must be the base in the reference genome just before where the deletion starts. | T |
| genotype | The expected genotype of the variant. Acceptable values: “0” (wildtype), “1” (heterozygous), “2” (homozygous), “3” (missing). | 1 |
| type | The type of variant. Acceptable values (see the Variant types table below): germline, whitelist (scMRD only), blacklist (scMRD only). | germline |
| sample_id | The name of your sample. Must be at least 3 characters long, must contain letters, and may contain numbers, "-", or "_". | SampleA |

* All columns are required.
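
The column rules above can be checked before a file is submitted. The following is a minimal sketch, not part of the pipeline itself: the function name `validate_sample_variants` is hypothetical, and the restriction of alleles to the bases A, C, G, and T is an assumption that the article does not state.

```python
import csv
import io
import re

REQUIRED_COLUMNS = ["chromosome", "position", "ref_allele", "alt_allele",
                    "genotype", "type", "sample_id"]
GENOTYPES = {"0", "1", "2", "3"}
TYPES = {"germline", "whitelist", "blacklist"}
POSITION_RE = re.compile(r"[1-9][0-9]*")  # 1-based, so zero is rejected
# Assumption: alleles are written with the bases A, C, G, T only.
ALLELE_RE = re.compile(r"[ACGT]+")
# At least 3 characters drawn from letters, digits, "-" or "_",
# with at least one letter somewhere in the name.
SAMPLE_ID_RE = re.compile(r"(?=.*[A-Za-z])[A-Za-z0-9_-]{3,}")


def validate_sample_variants(text):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    # Line 1 is the header, so data rows start at line 2
    # (assuming no field spans several lines).
    for line_no, raw in enumerate(reader, start=2):
        # Treat absent fields as empty strings and ignore surplus fields.
        row = {k: (v or "").strip() for k, v in raw.items() if k is not None}
        if not row["chromosome"].startswith("chr"):
            problems.append(f"line {line_no}: chromosome lacks the 'chr' prefix")
        if not POSITION_RE.fullmatch(row["position"]):
            problems.append(f"line {line_no}: position is not a positive integer")
        for col in ("ref_allele", "alt_allele"):
            if not ALLELE_RE.fullmatch(row[col]):
                problems.append(f"line {line_no}: {col} is empty or not made of A/C/G/T")
        if row["genotype"] not in GENOTYPES:
            problems.append(f"line {line_no}: genotype must be 0, 1, 2 or 3")
        if row["type"] not in TYPES:
            problems.append(f"line {line_no}: unknown type")
        if not SAMPLE_ID_RE.fullmatch(row["sample_id"]):
            problems.append(f"line {line_no}: invalid sample_id")
    return problems
```

An empty list means no problems were found under these assumptions. It does not guarantee that the pipeline will accept the file.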

A user can create this file from existing germline variant information, or it can be generated using the Merge Bulk Runs pipeline. That pipeline merges the H5 files from the bulk NGS data and generates the germline_truth.csv file, which lists the variants that differentiate the samples.

Additional requirements for successful genotype-based demultiplexing:

  • A minimum of 5 variants must overlap between the variants file and the panel, as well as between samples, for demultiplexing to work.
  • Samples from close relatives can present a challenge, so a minimum of 5 differentiating variants between the samples must be provided. A differentiating variant is a variant whose median genotype differs from one sample to the other. For example, if a variant is HOM in one sample, it should be HET or WT in the other; if it is HET in one, it should be WT or HOM in the other; and if it is WT in one, it should be HET or HOM in the other. In general, customers should avoid multiplexing related samples together and should identify samples from relatives (and their relationship) prior to data processing.
  • Samples with donor background may also present a challenge. Customers should identify samples from individuals who have received a bone marrow transplant, along with the relationship of the donor, prior to data processing; more time for data processing may be needed. Ideally, germline mutations for the donor, if available, should be submitted for analysis.
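
The 5-variant minimum for differentiating variants can be estimated from the file itself before submission. The sketch below counts, for every pair of samples, the germline variants whose listed genotypes differ. The function name `count_differentiating` is hypothetical, and excluding variants with a missing genotype (code 3) from the count is an assumption, not documented pipeline behavior.

```python
from collections import defaultdict
from itertools import combinations

MISSING = 3              # genotype code for "missing"
MIN_DIFFERENTIATING = 5  # minimum stated in the requirements above


def count_differentiating(rows):
    """Count differentiating germline variants for each pair of samples.

    rows: iterable of dicts keyed by the file's column names.
    Returns {(sample_a, sample_b): count} with sample names in sorted order.
    """
    calls_by_variant = defaultdict(dict)  # variant key -> {sample_id: genotype}
    samples = set()
    for row in rows:
        if row["type"] != "germline":
            continue
        key = (row["chromosome"], int(row["position"]),
               row["ref_allele"], row["alt_allele"])
        calls_by_variant[key][row["sample_id"]] = int(row["genotype"])
        samples.add(row["sample_id"])

    counts = {}
    for a, b in combinations(sorted(samples), 2):
        counts[(a, b)] = sum(
            1
            for calls in calls_by_variant.values()
            if a in calls and b in calls
            # Assumption: a missing genotype cannot differentiate two samples.
            and MISSING not in (calls[a], calls[b])
            and calls[a] != calls[b]
        )
    return counts


def pairs_below_minimum(rows):
    """Return the sample pairs with fewer than MIN_DIFFERENTIATING variants."""
    return [pair for pair, n in count_differentiating(rows).items()
            if n < MIN_DIFFERENTIATING]
```

The rows can come from `csv.DictReader`. Any pair returned by `pairs_below_minimum` is a candidate for adding more variants, or for not multiplexing those samples together.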

Example:

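| chromosome | position | ref_allele | alt_allele | genotype | type | sample_id |
| --- | --- | --- | --- | --- | --- | --- |
| chr1 | 115256669 | G | A | 1 | germline | SampleA |
| chr1 | 115256669 | G | A | 0 | germline | SampleB |
| chr1 | 115256669 | G | A | 2 | germline | SampleC |
| chr4 | 55599436 | T | C | 0 | germline | SampleA |
| chr4 | 55599436 | T | C | 2 | germline | SampleB |
| chr4 | 55599436 | T | C | 1 | germline | SampleC |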

To download an example sample variants file click here.
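
Because the example above is displayed as a table, it may help to see how the same rows could be written out as comma-separated text. This is a minimal sketch using Python's standard `csv` module. The helper name `to_csv` is hypothetical, and the exact quoting and line endings the pipeline expects are not specified in this article.

```python
import csv
import io

HEADER = ["chromosome", "position", "ref_allele", "alt_allele",
          "genotype", "type", "sample_id"]

# The six rows from the example table above.
ROWS = [
    ("chr1", 115256669, "G", "A", 1, "germline", "SampleA"),
    ("chr1", 115256669, "G", "A", 0, "germline", "SampleB"),
    ("chr1", 115256669, "G", "A", 2, "germline", "SampleC"),
    ("chr4", 55599436, "T", "C", 0, "germline", "SampleA"),
    ("chr4", 55599436, "T", "C", 2, "germline", "SampleB"),
    ("chr4", 55599436, "T", "C", 1, "germline", "SampleC"),
]


def to_csv(rows):
    """Serialize the header and rows as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(ROWS), end="")
```

The first two lines of the output are `chromosome,position,ref_allele,alt_allele,genotype,type,sample_id` and `chr1,115256669,G,A,1,germline,SampleA`.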

Sample Variants CSV File for MRD

The same file can also be used in the scMRD pipeline for demultiplexing and to define the whitelist and blacklist variants. More details on each variant type are given below:

Variant types

| type | Use case | Required or optional | What happens if it is missing? |
| --- | --- | --- | --- |
| germline | For demultiplexing. | Required (if multiplexed) | Demux cannot be done. |
| germline | Remove germline SNPs from somatic variant calls. | Optional | scMRD pipeline will still work, but in rare cases some germline variants may falsely appear as somatic variants. |
| whitelist | Call “known” or previously detected mutations. | Optional | scMRD pipeline will still work, but in rare cases some somatic variants may be filtered out based on their annotation. |
| blacklist | Remove variants that are biologically uninteresting or are false positive variants. | Optional | Report will contain irrelevant or artifact mutations. |


The scMRD pipeline v1.0.2 supports this CSV file format in addition to the VCF format.

Example file. The bottom two rows show an insertion and deletion, respectively.

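| chromosome | position | ref_allele | alt_allele | genotype | type | sample_id |
| --- | --- | --- | --- | --- | --- | --- |
| chr1 | 115256669 | G | A | 1 | germline | SampleA |
| chr2 | 25458546 | C | T | 2 | blacklist | SampleA |
| chr1 | 115256669 | G | A | 0 | germline | SampleB |
| chr2 | 25469913 | C | T | 2 | whitelist | SampleB |
| chr1 | 115256669 | G | A | 2 | germline | SampleC |
| chrX | 6125123 | A | ATACT | 1 | whitelist | SampleC |
| chr5 | 51241235 | GTAT | G | 1 | whitelist | SampleC |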
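
The insertion and deletion conventions described for ref_allele and alt_allele can be recognized from the allele strings alone. The sketch below uses a hypothetical helper, `classify_variant`, that applies those rules. The label "other" for anything that does not fit is an illustrative choice, not a category the pipeline defines.

```python
def classify_variant(ref_allele, alt_allele):
    """Classify a variant from its alleles using the file's conventions.

    Insertion: ref is the single base before the insertion, and alt starts
    with that base. Deletion: ref starts with the base before the deletion,
    and alt is that single base.
    """
    if len(ref_allele) == 1 and len(alt_allele) == 1:
        return "snv"
    if (len(ref_allele) == 1 and len(alt_allele) > 1
            and alt_allele.startswith(ref_allele)):
        return "insertion"
    if (len(alt_allele) == 1 and len(ref_allele) > 1
            and ref_allele.startswith(alt_allele)):
        return "deletion"
    return "other"  # does not follow the documented conventions


if __name__ == "__main__":
    # The bottom two rows of the example file, plus a single-base change.
    print(classify_variant("A", "ATACT"))  # insertion
    print(classify_variant("GTAT", "G"))   # deletion
    print(classify_variant("G", "A"))      # snv
```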

 
