Genome Editing (GE) h5 File

Updated May 21, 2024 07:05

h5 file basics

The h5 file generated by the Genome Editing (GE) pipeline is structured into assays, which are HDF5 groups. Each assay contains metadata, column attributes, row attributes, and layers, each of which is itself a group under the parent assay group. Assuming that any layer is a matrix of shape (n_rows x n_cols), these groups follow these rules:

All column attributes are HDF5 datasets, each containing a 1-d array of values of shape n_cols
All row attributes are HDF5 datasets, each containing a 1-d array of values of shape n_rows
All layers are HDF5 datasets, each containing a 2-d array of values of shape n_rows x n_cols
All metadata are HDF5 datasets, each containing a 1-d array of values of shape 1
A value in position i in any column attribute is shared by all cells of column i in any matrix in the layers group
A value in position j in any row attribute is shared by all cells of row j in any matrix in the layers group
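
The shape rules above can be checked programmatically. The following is a minimal sketch and not part of the GE pipeline. The function name check_assay_shapes, the group names layers, ca, ra, and metadata, and the metadata key sample_name are all assumptions made for illustration, so adjust them to the names found in the actual file. The function accepts any mapping whose datasets expose a shape tuple, so an assay group opened with h5py should work as well as the stand-in objects used here.

```python
from types import SimpleNamespace


def check_assay_shapes(assay, layers="layers", col_attrs="ca",
                       row_attrs="ra", metadata="metadata"):
    """Return a list of shape-rule violations for one assay group.

    The default group names are assumptions; pass the names used in the file.
    """
    layer_shapes = {name: tuple(ds.shape) for name, ds in assay[layers].items()}
    distinct = set(layer_shapes.values())
    # Every layer must be a 2-d matrix with the same (n_rows, n_cols) shape.
    if len(distinct) != 1 or len(next(iter(distinct))) != 2:
        return [f"layers are not 2-d with one common shape: {layer_shapes}"]
    n_rows, n_cols = next(iter(distinct))

    # Expected 1-d shapes for the remaining groups, per the rules above.
    expected = {col_attrs: (n_cols,), row_attrs: (n_rows,), metadata: (1,)}
    problems = []
    for group, shape in expected.items():
        for name, ds in assay[group].items():
            if tuple(ds.shape) != shape:
                problems.append(
                    f"{group}/{name}: shape {tuple(ds.shape)}, expected {shape}"
                )
    return problems


def fake_dataset(*shape):
    """Hypothetical stand-in for an HDF5 dataset: only carries a shape."""
    return SimpleNamespace(shape=shape)


# Stand-in assay with 5 cells (rows) and 6 variants (columns).
demo = {
    "layers": {"GT": fake_dataset(5, 6), "DP": fake_dataset(5, 6)},
    "ca": {"id": fake_dataset(6), "chr": fake_dataset(6)},
    "ra": {"barcode": fake_dataset(5)},
    "metadata": {"sample_name": fake_dataset(1)},  # hypothetical metadata key
}
print(check_assay_shapes(demo))  # an empty list means every rule holds
```

An empty result means every dataset matches the layer shape. Each entry in a non-empty result names the offending dataset together with its actual and expected shape.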

GE DNA Variants h5 assay

The GE DNA variants h5 assay (named: ge_dna_variants) contains per-cell variant information, where each row represents a cell and each column represents a unique variant. This assay has the following layers:

Layer name	Layer description
GT	Contains the phased genotype information. More details on the contents of this layer are in the "GT layer details" section below
GQ	Contains the genotyping quality as calculated by GATK
AD	Contains the alternate allele depth as calculated by GATK
DP	Contains the total depth as calculated by GATK

The assay has the following column attributes:

Attribute name	Attribute description
amplicons	The name of the amplicon that this variant belongs to
chr	The chromosome of the genome on which this variant is present
pos	The genomic coordinate of the variant start position
ref	The reference base(s) for this variant
alt	The alternate base(s) for this variant
id	The variant described in the format of chromosome:position:reference bases/alternative bases
is_ref	1 if this is a reference variant representing the WT allele, else 0
is_snv	1 if this is a SNV variant, else 0
is_nocall	1 if this is a no-call variant, else 0

The assay has the following row attributes:

Attribute name	Attribute description
barcode	The cell barcode which corresponds to this row
sample	The sample name for this row (this is the same for all rows for a standard GE h5 file)

GT layer details

The values in the GT layer are encoded into bits using the formula Σ 2^n, summed over the alleles on which the variant is present, where n is the zero-based index of the allele (n = 0 for allele-1, n = 1 for allele-2). For a diploid cell, the values are therefore interpreted as follows:

GT value	Outcome
NaN	This variant is not called in the cell
0	Not possible
1	Heterozygous variant on allele-1, i.e. 0|1
2	Heterozygous variant on allele-2, i.e. 1|0
3	Homozygous variant on both alleles, i.e. 1|1
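
The table above can be reproduced with a few lines of bit arithmetic. This is a minimal sketch rather than code from the GE pipeline: the function names alleles_with_variant and phased_string and the ploidy parameter are hypothetical, and the sketch assumes that allele-1 maps to the lowest bit, as the diploid table implies.

```python
def alleles_with_variant(gt_value, ploidy=2):
    """Return the 1-based alleles carrying the variant, or None if not called."""
    if gt_value != gt_value:  # NaN is the only value not equal to itself
        return None
    bits = int(gt_value)
    if not 0 < bits < 2 ** ploidy:
        raise ValueError(f"unexpected GT value {gt_value!r} for ploidy {ploidy}")
    # Bit n (zero-based) is set when allele n+1 carries the variant.
    return [n + 1 for n in range(ploidy) if bits & (1 << n)]


def phased_string(gt_value, ploidy=2):
    """Render a GT value in the 0|1 notation used in the table."""
    if alleles_with_variant(gt_value, ploidy) is None:
        return None
    # The binary digits, highest allele first, are the phased genotype.
    return "|".join(format(int(gt_value), f"0{ploidy}b"))


for value in (1, 2, 3, float("nan")):
    print(value, alleles_with_variant(value), phased_string(value))
```

A value of 0 raises a ValueError, which matches the "Not possible" row of the table.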

The matrix has one column per target to denote WT allele and one column per target to denote no-call allele. These columns have the following IDs:

No-call allele column: <target’s chromosome>:<target’s start coordinate>:./.
WT allele column: <target’s chromosome>:<target’s start coordinate>:A/A

Currently all targets will have the WT allele as “A/A” regardless of the reference base at that genomic coordinate.

If either of these columns has a GT value, that value denotes that the entire allele (allele-1, allele-2, or both alleles) is no-call or WT, depending on which column holds the value. Please check the example below for more details.

GT layer example

Column Attributes	Amplicon	Target_1	Target_1	Target_1	Target_1	Target_1	Target_1
	ID	chr1:26644621:./.	chr1:26644621:A/A	chr1:26644621:A/T	chr1:26644625:G/C	chr1:26644621:A/ATGC	chr1:26644625:GCTA/G
	is_ref	0	1	0	0	0	0
	is_nocall	1	0	0	0	0	0
	is_snv	0	0	1	1	0	0

GT Layer	Cell_1		3
	Cell_2	3
	Cell_3		2	1			1
	Cell_4				2		1
	Cell_5						3

This example shows information only for 1 target and 5 cells. The way to interpret the GT information for each of the 5 cells is explained below:

Cell_1 has a value of 3 for the WT variant
- This means that both alleles for this cell are WT
Cell_2 has a value of 3 for the no-call variant
- This means that both alleles for this cell are no-calls
Cell_3 has a value of 2 for the WT variant and 1 for both the A->T SNV and the 3bp deletion of “CTA”
- This means that allele-2 of this cell is WT and allele-1 has both the SNV and the deletion
Cell_4 has a value of 2 for the G->C SNV and 1 for the 3bp “CTA” deletion
- This means that allele-1 of this cell has the deletion and allele-2 has the SNV
Cell_5 has a value of 3 for the 3bp “CTA” deletion
- This means that both the alleles of this cell have the deletion
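
The interpretation above can be reproduced by grouping the column IDs by allele for each cell. This is a minimal sketch under stated assumptions: the helper name variants_per_allele is hypothetical, the empty cells of the example table are assumed to be stored as NaN, and allele-1 is assumed to map to the lowest bit of the GT value.

```python
NAN = float("nan")

# Column IDs and GT rows copied from the example table; blank cells become NaN.
variant_ids = [
    "chr1:26644621:./.",
    "chr1:26644621:A/A",
    "chr1:26644621:A/T",
    "chr1:26644625:G/C",
    "chr1:26644621:A/ATGC",
    "chr1:26644625:GCTA/G",
]
gt_rows = {
    "Cell_1": [NAN, 3, NAN, NAN, NAN, NAN],
    "Cell_2": [3, NAN, NAN, NAN, NAN, NAN],
    "Cell_3": [NAN, 2, 1, NAN, NAN, 1],
    "Cell_4": [NAN, NAN, NAN, 2, NAN, 1],
    "Cell_5": [NAN, NAN, NAN, NAN, NAN, 3],
}


def variants_per_allele(row, ids, ploidy=2):
    """Map each 1-based allele to the IDs of the columns set on that allele."""
    alleles = {n + 1: [] for n in range(ploidy)}
    for value, variant_id in zip(row, ids):
        if value != value:  # NaN: nothing recorded for this column
            continue
        for n in range(ploidy):
            if int(value) & (1 << n):  # bit n set -> allele n+1 has this column
                alleles[n + 1].append(variant_id)
    return alleles


for cell, row in gt_rows.items():
    print(cell, variants_per_allele(row, variant_ids))
```

For Cell_3, this places the A/T SNV and the GCTA/G deletion on allele-1 and the WT column on allele-2, in line with the explanation above.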

Note: The ge_dna_variants assay is currently not supported by Mosaic.

GE DNA Translocations assay

The GE DNA translocations h5 assay (named: ge_dna_translocations) contains per-cell translocation information, where each row represents a cell and each column represents a unique translocation. This assay has the following layers:

Layer name	Layer description
translocations	Contains the number of reads supporting the translocation event in that particular cell
translocation_filters	Contains the chimeric:normal read ratio for the translocation event in each cell/translocation combination
breakpoints	Contains the identified breakpoint for each translocation/cell combination

The assay has the following column attributes:

Attribute name	Attribute description
id	Contains the name of the 2 amplicons involved in translocation
region_1	Contains the coordinates of the left amplicon involved in translocation
region_2	Contains the coordinates of the right amplicon involved in translocation

The assay has the following row attributes:

Attribute name	Attribute description
barcode	The cell barcode which corresponds to this row
sample_name	The sample name for this row (this is the same for all rows for a standard GE h5 file)

Other assays present in the h5 file

dna_read_counts

For DNA-only samples, in addition to the ge_dna_variants assay, the h5 file also contains the dna_read_counts assay. This assay is the same as the dna_read_counts assay generated by the Tapestri DNA pipeline, and it can be analyzed using MissionBio’s Python-based package, Mosaic.

protein_read_counts

For DNA+Protein samples, the h5 file will additionally include the protein_read_counts assay. This assay is the same as the protein_read_counts assay generated by the Tapestri DNA+Protein pipeline and can be analyzed using MissionBio’s Python-based package, Mosaic.

The h5 file generated by either the GE DNA pipeline or the GE DNA+Protein pipeline can be loaded using Mosaic. Doing so loads the dna_read_counts and protein_read_counts assays, while the ge_dna_variants assay is ignored by the Python package.
