The following steps are taken to represent bi-allelic variants in our data structures.
After genotype calling by GATK, check whether any location has 2 alternate alleles called across all cells.
# Example where C1-C4 represent different cells; C1: T/T, C2: T/C, C3: C/C, C4: T/A
# REF  ALT  C1   C2   C3   C4
# T    C,A  0/0  0/1  1/1  0/2
If a variant has 2 alternate alleles, a new line is created for each alternate. On each new line, an allele that matches the other alternate is written as missing ('.').
# REF  ALT  C1   C2   C3   C4
# T    C    0/0  0/1  1/1  ./0
# T    A    0/0  ./0  ./.  0/1
The INFO field is not split between the new lines. DP keeps its original value on both created variants.
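The recoding shown in the example above can be sketched in Python. This is a minimal illustration, not the pipeline's actual code: the function names (`recode_genotype`, `split_alternates`) are hypothetical, genotypes are assumed to be unphased strings separated by '/', and the rule that a missing allele is written first ("./0") is inferred from the example rows.

```python
def recode_allele(allele, alt_index):
    """Recode one allele call relative to a single alternate (1-based index)."""
    if allele in (".", "0"):
        return allele  # missing and reference calls are kept as they are
    # The chosen alternate becomes 1; any other alternate becomes missing.
    return "1" if int(allele) == alt_index else "."


def recode_genotype(gt, alt_index):
    """Recode an unphased genotype such as '0/2' for one alternate."""
    recoded = [recode_allele(a, alt_index) for a in gt.split("/")]
    # Assumption from the example rows: the missing allele is written first.
    recoded.sort(key=lambda a: (a != ".", a))
    return "/".join(recoded)


def split_alternates(ref, alts, genotypes):
    """Return one (REF, ALT, genotypes) row per alternate allele."""
    return [
        (ref, alt, [recode_genotype(gt, i) for gt in genotypes])
        for i, alt in enumerate(alts, start=1)
    ]


rows = split_alternates("T", ["C", "A"], ["0/0", "0/1", "1/1", "0/2"])
for ref, alt, gts in rows:
    print(ref, alt, *gts)
```

In this sketch the INFO values, including DP, would simply be copied unchanged onto every row, matching the statement that INFO is not split.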
The following steps are taken to represent multi-allelic variants in our data structures.
After genotype calling by GATK, check whether any cell has a 1/2 genotype (see C5 in the example below).
# Example where C1-C6 represent different cells; C1: T/T, C2: T/C, C3: C/C, C4: T/A, C5: C/A, C6: A/A
# REF  ALT  C1   C2   C3   C4   C5   C6
# T    C,A  0/0  0/1  1/1  0/2  1/2  2/2
If a cell has a 1/2 genotype, a new line is created in addition to the per-alternate lines. Its ALT joins both alternates (C+A in the example), and only the cells with the 1/2 genotype receive a call (0/1) on it.
# REF  ALT  C1   C2   C3   C4   C5   C6
# T    C    0/0  0/1  1/1  ./0  ./1  ./.
# T    A    0/0  ./0  ./.  0/1  ./1  1/1
# *    C+A  ./.  ./.  ./.  ./.  0/1  ./.
The INFO field is not split between the new lines. DP keeps its original value on all created variants. For multi-allelic variants, the reference is listed as '*' because the reference allele is not present in these cells.
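The construction of the extra '*' line can be sketched as follows. This is a hedged illustration only: `combined_alt_row` is a hypothetical name, the sketch assumes exactly two alternates and unphased genotypes separated by '/', and it builds only the combined line, leaving out the per-alternate lines.

```python
def combined_alt_row(alts, genotypes):
    """Return the extra '*' row, or None if no cell carries both alternates."""
    if len(alts) != 2:
        raise ValueError("this sketch only handles sites with exactly two alternates")
    # A cell carries both alternates when its alleles are {1, 2}, in either order.
    both = [set(gt.split("/")) == {"1", "2"} for gt in genotypes]
    if not any(both):
        return None  # no 1/2 genotype, so no extra line is needed
    # Cells with both alternates get 0/1; every other cell is missing.
    calls = ["0/1" if hit else "./." for hit in both]
    return ("*", "+".join(alts), calls)


row = combined_alt_row(["C", "A"], ["0/0", "0/1", "1/1", "0/2", "1/2", "2/2"])
print(row)
```

Returning `None` when no cell has a 1/2 genotype keeps the caller's logic simple: the extra line is appended only when the function yields a row.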