MRD VCF File Format
VCF (Variant Call Format) is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF file is optional, and MRD runs can be processed without it unless the samples are multiplexed. The MRD pipeline uses it to demultiplex samples, to define known somatic variants, or to exclude variants that the pipeline calls incorrectly (false positives). The user-submitted VCF file should conform to the VCF v4.2 standard, with a header section followed by the variant information per sample. The VCF file is used in the following scenarios:
- Multiplexed samples: the VCF file can be used to define germline genotype information for demultiplexing samples that contain cells from multiple individuals. At least 15 germline variants are required to reliably differentiate the cells. Demultiplexing is not possible without the germline genotype information. Germline variants must be annotated with “.” in the INFO column.
- Samples with known variants: the VCF file can be used to define a list of known somatic variants for the sample which should be included in the analysis. These must be annotated with “SOMATIC” in the INFO column.
- Blacklisted variants: the VCF file can be used to define known false-positive variants that should be removed from the analysis. These must be annotated with “BLACKLIST” in the INFO column.
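The three INFO annotations above can be illustrated with a minimal sketch that groups VCF records by variant type. It assumes the INFO column holds exactly one of “.”, “SOMATIC” or “BLACKLIST”. The function name, the category labels and the coordinates in the example are hypothetical and are not part of the MRD pipeline.

```python
from collections import defaultdict

# INFO values come from the list above; the category labels are illustrative.
INFO_TO_CATEGORY = {".": "germline", "SOMATIC": "somatic", "BLACKLIST": "blacklist"}


def classify_records(vcf_text):
    """Group VCF data lines by the variant type given in the INFO column."""
    groups = defaultdict(list)
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 tab-separated columns: {line!r}")
        chrom, pos, ref, alt, info = fields[0], fields[1], fields[3], fields[4], fields[7]
        category = INFO_TO_CATEGORY.get(info)
        if category is None:
            raise ValueError(f"unrecognized INFO value {info!r} at {chrom}:{pos}")
        groups[category].append((chrom, int(pos), ref, alt))
    return dict(groups)


# Made-up records for illustration only (not real variants).
example = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t.\t.\t.",
    "chr2\t2000\t.\tC\tT\t.\t.\tSOMATIC",
    "chr3\t3000\t.\tG\tA\t.\t.\tBLACKLIST",
])
groups = classify_records(example)
```

Running such a check before submission makes it easy to confirm that each variant carries the intended annotation.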
Multiplexed Sample VCF File
The scMRD AML multi-omics kit allows you to multiplex up to three samples from unrelated individuals. When multiplexed Tapestri scMRD libraries are sequenced, the resulting FASTQ files (containing data from all multiplexed samples) are processed by the DNA+Protein pipeline to generate a single h5 file. The scMRD pipeline demultiplexes these samples using per-sample germline genotype information prior to running rare variant detection per sample.
For demultiplexing, the VCF file must contain germline variant information from all of the multiplexed patient samples. For example, if three samples are multiplexed, the VCF file must contain germline SNVs for all three (it is not possible to provide variants for two and then deduce the identity of the third patient). The VCF file should contain information for at least 15 SNVs. The file should contain a header defining the annotations used in the FORMAT and INFO columns (see example below). For each sample, one column is added to the VCF file to indicate the genotype of that sample for every variant in the file, using the notation 0/0 (wildtype), 0/1 or 1/0 (heterozygous), and 1/1 (homozygous for the variant allele). To specify germline variants together with somatic or blacklisted variants, include all three types in a single VCF file and use the INFO field to indicate the variant type.
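The requirements above can be checked before submission with a short script. The following is a minimal sketch, not the pipeline's own validation. It assumes tab-separated columns, a GT entry in the FORMAT column, and that germline SNVs are records with “.” in the INFO column and single-base REF and ALT alleles. The function name, sample names and coordinates are hypothetical.

```python
VALID_GENOTYPES = {"0/0", "0/1", "1/0", "1/1"}
MIN_GERMLINE_SNVS = 15
BASES = set("ACGT")


def check_multiplexed_vcf(vcf_text, expected_samples):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    sample_names = None
    germline_snvs = 0
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_names = fields[9:]  # sample columns follow FORMAT
            if len(sample_names) != expected_samples:
                problems.append(
                    f"expected {expected_samples} sample columns, found {len(sample_names)}"
                )
            continue
        if sample_names is None or len(fields) != 9 + len(sample_names):
            problems.append(f"malformed line: {line!r}")
            continue
        chrom, pos, ref, alt, info, fmt = (
            fields[0], fields[1], fields[3], fields[4], fields[7], fields[8]
        )
        keys = fmt.split(":")
        if "GT" not in keys:
            problems.append(f"{chrom}:{pos} has no GT entry in FORMAT")
            continue
        gt_index = keys.index("GT")
        # Every sample needs a genotype for every variant in the file.
        for name, value in zip(sample_names, fields[9:]):
            parts = value.split(":")
            genotype = parts[gt_index] if gt_index < len(parts) else ""
            if genotype not in VALID_GENOTYPES:
                problems.append(f"{chrom}:{pos} sample {name}: invalid genotype {genotype!r}")
        if info == "." and ref in BASES and alt in BASES:
            germline_snvs += 1
    if germline_snvs < MIN_GERMLINE_SNVS:
        problems.append(
            f"only {germline_snvs} germline SNVs; at least {MIN_GERMLINE_SNVS} are needed"
        )
    return problems


# Made-up file with three samples and 15 germline SNVs (illustration only).
header = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
]
records = [f"chr1\t{1000 + i}\t.\tA\tG\t.\t.\t.\tGT\t0/0\t0/1\t1/1" for i in range(15)]
example_vcf = "\n".join(header + records)
problems = check_multiplexed_vcf(example_vcf, expected_samples=3)
```

Collecting every problem in a list, instead of stopping at the first one, lets all issues in a file be fixed in a single pass.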
The pipeline performs basic VCF format validation, but incorrect variant locations or alleles can still cause failures. Ensure that all information is correct and that coordinates are provided for hg19 only.
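Some location and allele mistakes can be caught with simple field-level checks before the file is submitted. The sketch below is illustrative and uses a hypothetical function name. It only checks that POS is a positive integer and that REF and ALT contain valid bases. It cannot confirm that a coordinate is correct for hg19, which requires comparing against the reference genome.

```python
import re

POSITION_PATTERN = re.compile(r"[1-9][0-9]*")  # positive integer, no leading zero
ALLELE_PATTERN = re.compile(r"[ACGTN]+")       # plain base sequences only


def find_field_errors(chrom, pos, ref, alt):
    """Return a list of problems found in one record's POS, REF and ALT fields."""
    errors = []
    if not POSITION_PATTERN.fullmatch(pos):
        errors.append(f"{chrom}: POS {pos!r} is not a positive integer")
    if not ALLELE_PATTERN.fullmatch(ref.upper()):
        errors.append(f"{chrom}:{pos} REF {ref!r} contains invalid characters")
    # ALT may list several comma-separated alleles.
    for allele in alt.split(","):
        if not ALLELE_PATTERN.fullmatch(allele.upper()):
            errors.append(f"{chrom}:{pos} ALT {allele!r} contains invalid characters")
        elif allele.upper() == ref.upper():
            errors.append(f"{chrom}:{pos} ALT {allele!r} is identical to REF")
    return errors


# Made-up record for illustration only.
errors = find_field_errors("chr1", "1000", "A", "G")
```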
Non-multiplexed Sample VCF File
For non-multiplexed samples, a VCF file is not mandatory, but it can still be used to define the somatic and blacklisted variants for the sample (see above).