Genome Editing (GE) h5 File

h5 file basics

The h5 file generated by the Genome Editing (GE) pipeline is structured into assays, which are HDF5 groups. Each assay contains metadata, column attributes, row attributes, and layers, each of which is itself a group under the parent assay group. Assuming that any layer is a matrix of shape (n_rows x n_cols), these groups follow these rules:

  • All column attributes are HDF5 datasets, each containing a 1-d array of values of shape n_cols
  • All row attributes are HDF5 datasets, each containing a 1-d array of values of shape n_rows
  • All layers are HDF5 datasets, each containing a 2-d array of values of shape n_rows x n_cols
  • All metadata entries are HDF5 datasets, each containing a 1-d array of shape 1 (a single value)
  • A value in position i in any column attribute is shared by all cells of column i in any matrix in the layers group
  • A value in position j in any row attribute is shared by all cells of row j in any matrix in the layers group
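
The shape rules above can be checked mechanically. The following is a minimal sketch, not part of the pipeline: check_shapes is a hypothetical helper that works on plain shape tuples, and read_shapes assumes the third-party h5py package plus the group names assays/<assay>/layers, ca and ra. Those group names are not stated in this article and should be confirmed against your own file. The metadata rule is left out for brevity.

```python
def check_shapes(layers, col_attrs, row_attrs):
    """Return a list of rule violations for one assay.

    Each argument maps a dataset name to its shape tuple.
    """
    problems = []
    two_d = [tuple(shape) for shape in layers.values() if len(shape) == 2]
    if not two_d:
        return ["no 2-d layer found, so n_rows and n_cols are undefined"]
    # The first 2-d layer defines the reference matrix shape.
    n_rows, n_cols = two_d[0]
    for name, shape in layers.items():
        if tuple(shape) != (n_rows, n_cols):
            problems.append(
                f"layer {name}: expected {(n_rows, n_cols)}, found {tuple(shape)}"
            )
    for name, shape in col_attrs.items():
        if tuple(shape) != (n_cols,):
            problems.append(
                f"column attribute {name}: expected {(n_cols,)}, found {tuple(shape)}"
            )
    for name, shape in row_attrs.items():
        if tuple(shape) != (n_rows,):
            problems.append(
                f"row attribute {name}: expected {(n_rows,)}, found {tuple(shape)}"
            )
    return problems


def read_shapes(path, assay):
    """Collect dataset shapes from an h5 file.

    ASSUMPTION: the groups are named assays/<assay>/layers, ca and ra.
    """
    import h5py  # third-party; imported here so check_shapes works without it

    with h5py.File(path, "r") as handle:
        group = handle[f"assays/{assay}"]
        return tuple(
            {name: dataset.shape for name, dataset in group[key].items()}
            for key in ("layers", "ca", "ra")
        )
```

With a real file, a call such as check_shapes(*read_shapes("sample.h5", "ge_dna_variants")) would return an empty list when every dataset follows the rules. The file name here is a placeholder.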

GE DNA Variants h5 assay

The GE DNA variants h5 assay (named: ge_dna_variants) contains per-cell variant information, where each row represents a cell and each column represents a unique variant. This assay has the following layers:

 

Layer name Layer description
GT Contains the phased genotype information. More details on the contents of this layer are in the “GT layer details” section below
GQ Contains the genotyping quality as calculated by GATK
AD Contains the alternate allele depth as calculated by GATK
DP Contains the total depth as calculated by GATK

The assay has the following column attributes:

Attribute name Attribute description
amplicons The name of the amplicon that this variant belongs to
chr The chromosome of the genome on which this variant is present
pos The genomic coordinate of the variant start position
ref The reference base(s) for this variant
alt The alternate base(s) for this variant
id The variant described in the format of chromosome:position:reference bases/alternative bases
is_ref 1 if this is a reference variant, representing the WT allele. Else 0
is_snv 1 if this is a SNV variant, else 0
is_nocall 1 if this is a no-call variant, else 0

The assay has the following row attributes:

 

Attribute name Attribute description
barcode The cell barcode which corresponds to this row
sample The sample name for this row (this is the same for all rows for a standard GE h5 file)

GT layer details

The values in the GT layer are encoded as bit flags using the formula Σ 2^n, where n is the zero-based index of each allele on which the variant is present (n = 0 for allele-1, n = 1 for allele-2). For a diploid cell, the values are interpreted as follows:

 

NGT Value Outcome
NaN This variant is not called in the cell
0 Not possible
1 Heterozygous variant on allele-1, i.e. 0|1
2 Heterozygous variant on allele-2, i.e. 1|0
3 Homozygous variant on both alleles, i.e. 1|1
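
The table above is equivalent to treating each value as a bitmask in which bit 0 stands for allele-1 and bit 1 for allele-2. The sketch below illustrates that reading. decode_gt and phased_string are hypothetical helper names, not functions from any Mission Bio package, and the ploidy parameter is an assumption added for illustration.

```python
import math


def decode_gt(value, ploidy=2):
    """Return the 1-based allele numbers that carry the variant.

    Returns None when the value is NaN (variant not called in the cell).
    """
    if value is None or math.isnan(value):
        return None
    bits = int(value)
    if not 0 < bits < 2 ** ploidy:
        # 0 is listed as "Not possible" in the table above.
        raise ValueError(f"GT value {value!r} is outside 1..{2 ** ploidy - 1}")
    return [
        allele
        for allele in range(1, ploidy + 1)
        if bits & (1 << (allele - 1))
    ]


def phased_string(value, ploidy=2):
    """Render a GT value in the phased notation used in the table above."""
    alleles = decode_gt(value, ploidy)
    if alleles is None:
        return None
    # The highest-numbered allele is written first, so allele-1 alone is 0|1.
    return "|".join(
        "1" if allele in alleles else "0" for allele in range(ploidy, 0, -1)
    )
```

For example, decode_gt(3) lists both alleles, and phased_string(2) gives the same notation as the row for value 2.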

 

The matrix has one column per target to denote the WT allele and one column per target to denote the no-call allele. These columns have the following IDs:

  • No-call allele column: <target’s chromosome>:<target’s start coordinate>:./.
  • WT allele column: <target’s chromosome>:<target’s start coordinate>:A/A

Currently, all targets have the WT allele as “A/A” regardless of the reference base at that genomic coordinate.
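
The ID conventions above can be turned into a small parser. This is a minimal sketch: parse_variant_id is a hypothetical helper, not part of any Mission Bio package, and it assumes that IDs always follow the chromosome:position:ref/alt pattern described in this article. In a real file, the is_ref and is_nocall column attributes carry the same information and are the safer source.

```python
def parse_variant_id(variant_id):
    """Split a column ID into its parts and label the column type."""
    # rsplit keeps any ":" inside the chromosome name intact.
    chrom, pos, alleles = variant_id.rsplit(":", 2)
    ref, alt = alleles.split("/", 1)
    if alleles == "./.":
        kind = "no-call"
    elif alleles == "A/A":
        # WT columns are always written as A/A, whatever the reference base.
        kind = "WT"
    else:
        kind = "variant"
    return {"chr": chrom, "pos": int(pos), "ref": ref, "alt": alt, "kind": kind}
```

Because the WT column is always “A/A”, the ref field of a WT column should not be read as the actual reference base at that position.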

If either of these columns has a GT value, the entire allele indicated by that value (allele-1, allele-2, or both alleles) is no-call or WT, depending on which column holds the value. Please check the example below for more details.

GT layer example

 

Column Attributes
Amplicon  | Target_1          | Target_1          | Target_1          | Target_1          | Target_1             | Target_1
ID        | chr1:26644621:./. | chr1:26644621:A/A | chr1:26644621:A/T | chr1:26644625:G/C | chr1:26644621:A/ATGC | chr1:26644625:GCTA/G
is_ref    | 0                 | 1                 | 0                 | 0                 | 0                    | 0
is_nocall | 1                 | 0                 | 0                 | 0                 | 0                    | 0
is_snv    | 0                 | 0                 | 1                 | 1                 | 0                    | 0

NGT Layer
Cell_1    |                   | 3                 |                   |                   |                      |
Cell_2    | 3                 |                   |                   |                   |                      |
Cell_3    |                   | 2                 | 1                 |                   |                      | 1
Cell_4    |                   |                   |                   | 2                 |                      | 1
Cell_5    |                   |                   |                   |                   |                      | 3

This example shows information only for 1 target and 5 cells. The way to interpret the GT information for each of the 5 cells is explained below:

  • Cell_1 has a value of 3 for the WT variant
    • This means that both alleles for this cell are WT
  • Cell_2 has a value of 3 for the no-call variant
    • This means that both alleles for this cell are no-calls
  • Cell_3 has a value of 2 for WT variant and 1 for the A->T SNV and 3bp deletion of “CTA”
    • This means that allele-2 of this cell is WT and allele-1 has both the SNV and the deletion
  • Cell_4 has a value of 2 for the G->C SNV and 1 for the 3bp “CTA” deletion
    • This means that allele-1 of this cell has the deletion and allele-2 has the SNV
  • Cell_5 has a value of 3 for the 3bp “CTA” deletion
    • This means that both the alleles of this cell have the deletion
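
The interpretation above can be reproduced by decoding each cell's row allele by allele. The sketch below hard-codes the example matrix. Empty cells are written as None here, which is an assumption about how missing values would be represented after loading, and alleles_for_cell is a hypothetical helper name.

```python
variant_ids = [
    "chr1:26644621:./.",     # no-call column
    "chr1:26644621:A/A",     # WT column
    "chr1:26644621:A/T",
    "chr1:26644625:G/C",
    "chr1:26644621:A/ATGC",
    "chr1:26644625:GCTA/G",
]

# One row per cell, in the same column order as variant_ids.
gt = {
    "Cell_1": [None, 3, None, None, None, None],
    "Cell_2": [3, None, None, None, None, None],
    "Cell_3": [None, 2, 1, None, None, 1],
    "Cell_4": [None, None, None, 2, None, 1],
    "Cell_5": [None, None, None, None, None, 3],
}


def alleles_for_cell(row, ids, ploidy=2):
    """Map each allele number (1-based) to the column IDs present on it."""
    per_allele = {allele: [] for allele in range(1, ploidy + 1)}
    for value, variant_id in zip(row, ids):
        if value is None:
            continue  # nothing recorded for this column in this cell
        for allele in per_allele:
            # Bit 0 corresponds to allele-1, bit 1 to allele-2.
            if int(value) & (1 << (allele - 1)):
                per_allele[allele].append(variant_id)
    return per_allele


for cell, row in gt.items():
    print(cell, alleles_for_cell(row, variant_ids))
```

For Cell_3 this places the A->T SNV and the “CTA” deletion on allele-1 and the WT column on allele-2, matching the explanation above.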

Note: The ge_dna_variants assay is currently not supported by Mosaic.

GE DNA Translocations assay

The GE DNA translocations h5 assay (named: ge_dna_translocations) contains per-cell translocation information, where each row represents a cell and each column represents a unique translocation. This assay has the following layers:

Layer name Layer description
translocations Contains the number of reads supporting the translocation event in that particular cell
translocation_filters Contains the chimeric:normal read ratio for the translocation event in each cell/translocation combination
breakpoints Contains the identified breakpoint for each translocation/cell combination

 

The assay has the following column attributes:

Attribute name Attribute description
id Contains the name of the 2 amplicons involved in translocation
region_1 Contains the coordinates of the left amplicon involved in translocation
region_2 Contains the coordinates of the right amplicon involved in translocation

 

The assay has the following row attributes:

Attribute name Attribute description
barcode The cell barcode which corresponds to this row
sample_name The sample name for this row (this is the same for all rows for a standard GE h5 file)

Other assays present in the h5 file

dna_read_counts

For DNA-only samples, in addition to the ge_dna_variants assay, the h5 file also contains the dna_read_counts assay. This assay is the same as the dna_read_counts assay generated by the Tapestri DNA pipeline, and it can be analyzed using MissionBio’s Python-based package, Mosaic.

protein_read_counts

For DNA+Protein samples, the h5 file additionally includes the protein_read_counts assay. This assay is the same as the protein_read_counts assay generated by the Tapestri DNA+Protein pipeline and can be analyzed using MissionBio’s Python-based package, Mosaic.

The h5 file generated by either the GE DNA pipeline or the GE DNA+Protein pipeline can be loaded using Mosaic. This results in the dna_read_counts and protein_read_counts assays being loaded, while the ge_dna_variants assay is ignored by the Python package.
