h5 file basics
The h5 files generated by the Genome Editing (GE) pipeline are structured into assays, which are HDF5 groups. Each assay contains metadata, column attributes, row attributes, and layers, each of which is itself a group under the parent assay group. Assuming that any layer is a matrix of shape (n_rows x n_cols), these groups follow the rules below:
- All column attributes are HDF5 datasets that contain a 1-d array of values of shape n_cols
- All row attributes are HDF5 datasets that contain a 1-d array of values of shape n_rows
- All layers are HDF5 datasets that contain a 2-d array of values of shape n_rows x n_cols
- All metadata are HDF5 datasets that contain a 1-d array of values of shape 1
- A value in position i in any column attribute is shared by all entries of column i in any matrix in the layers group
- A value in position j in any row attribute is shared by all entries of row j in any matrix in the layers group
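The shape rules above can be checked mechanically. The following is a minimal sketch that uses plain Python lists in place of HDF5 datasets; the function name `check_assay_shapes` and the toy values are hypothetical and are not part of the GE pipeline or of Mosaic.

```python
def check_assay_shapes(layers, col_attrs, row_attrs):
    """Return a list of shape-rule violations (empty if the assay is consistent).

    layers:    dict of layer name -> 2-d list (n_rows lists of n_cols values)
    col_attrs: dict of attribute name -> 1-d list of length n_cols
    row_attrs: dict of attribute name -> 1-d list of length n_rows
    """
    if not layers:
        return ["assay has no layers"]

    # Take the expected shape from the first layer.
    first = next(iter(layers.values()))
    n_rows = len(first)
    n_cols = len(first[0]) if first else 0

    problems = []
    for name, matrix in layers.items():
        if len(matrix) != n_rows or any(len(row) != n_cols for row in matrix):
            problems.append(f"layer {name!r} is not {n_rows} x {n_cols}")
    for name, values in col_attrs.items():
        if len(values) != n_cols:
            problems.append(
                f"column attribute {name!r} has length {len(values)}, expected {n_cols}"
            )
    for name, values in row_attrs.items():
        if len(values) != n_rows:
            problems.append(
                f"row attribute {name!r} has length {len(values)}, expected {n_rows}"
            )
    return problems


# Toy assay (made-up values): 2 cells (rows) x 3 variants (columns).
layers = {"DP": [[10, 4, 0], [8, 7, 9]], "AD": [[5, 0, 0], [8, 3, 9]]}
col_attrs = {"id": ["v1", "v2", "v3"]}
row_attrs = {"barcode": ["cell_a", "cell_b"]}
print(check_assay_shapes(layers, col_attrs, row_attrs))
```

The same checks would apply to arrays read from the h5 file, since each rule only compares lengths.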
GE DNA Variants h5 assay
The GE DNA variants h5 assay (named: ge_dna_variants) contains per-cell variant information, where each row represents a cell and each column represents a unique variant. This assay has the following layers:
Layer name | Layer description |
GT | Contains the phased genotype information. More details on the contents of this layer are in the next section |
GQ | Contains the genotyping quality as calculated by GATK |
AD | Contains the alternate allele depth as calculated by GATK |
DP | Contains the total depth as calculated by GATK |
The assay has the following column attributes:
Attribute name | Attribute description |
amplicons | The name of the amplicon that this variant belongs to |
chr | The chromosome of the genome on which this variant is present |
pos | The genomic coordinate of the variant start position |
ref | The reference base(s) for this variant |
alt | The alternate base(s) for this variant |
id | The variant described in the format of chromosome:position:reference bases/alternative bases |
is_ref | 1 if this is a reference variant representing the WT allele, else 0 |
is_snv | 1 if this is an SNV variant, else 0 |
is_nocall | 1 if this is a no-call variant, else 0 |
The assay has the following row attributes:
Attribute name | Attribute description |
barcode | The cell barcode which corresponds to this row |
sample | The sample name for this row (this is the same for all rows for a standard GE h5 file) |
GT layer details
The values in the GT layer are encoded as bit flags using the formula Σ 2^n, where n is the index of each allele on which the variant is present, counted from 0 (n = 0 for allele-1 and n = 1 for allele-2). In the case of a diploid cell, the values are interpreted as follows:
GT value | Outcome |
NaN | This variant is not called in the cell |
0 | Not possible |
1 | Heterozygous variant on allele-1, i.e. 0|1 |
2 | Heterozygous variant on allele-2, i.e. 1|0 |
3 | Homozygous variant on both alleles, i.e. 1|1 |
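The encoding in the table can be sketched in a few lines of Python. This sketch assumes that allele-1 corresponds to bit 0 and allele-2 to bit 1, which is what the table values imply; the function names `encode_gt` and `decode_gt` are hypothetical helpers, not part of the pipeline.

```python
import math


def encode_gt(allele_indices):
    """Sum 2**n over the zero-based indices of the alleles carrying the variant."""
    return sum(2 ** n for n in set(allele_indices))


def decode_gt(value, ploidy=2):
    """Return a phased string such as '0|1', or None when the variant is not called."""
    if value is None or math.isnan(value):
        return None
    value = int(value)
    if not 0 < value < 2 ** ploidy:
        # 0 is listed as "Not possible" in the table above.
        raise ValueError(f"GT value {value} is not valid for ploidy {ploidy}")
    # Highest allele first, so allele-1 ends up in the right-most position.
    bits = [(value >> n) & 1 for n in reversed(range(ploidy))]
    return "|".join(str(bit) for bit in bits)


for gt_value in (1, 2, 3, float("nan")):
    print(gt_value, "->", decode_gt(gt_value))
```

The `ploidy` parameter is included only for illustration; the table above covers the diploid case.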
The matrix has one column per target to denote WT allele and one column per target to denote no-call allele. These columns have the following IDs:
- No-call allele column: <target’s chromosome>:<target’s start coordinate>:./.
- WT allele column: <target’s chromosome>:<target’s start coordinate>:A/A
Currently all targets will have the WT allele as “A/A” regardless of the reference base at that genomic coordinate.
If either of these columns has a GT value, that value denotes that the entire allele (allele-1, allele-2, or both alleles) is no-call or WT, depending on which column holds the value. Please check the example below for more details.
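As a small illustration of the ID conventions above, the sketch below builds the WT and no-call column IDs for a target. The helper names are hypothetical; only the ID formats come from this document.

```python
def variant_id(chrom, pos, ref, alt):
    """Format an ID as chromosome:position:reference bases/alternative bases."""
    return f"{chrom}:{pos}:{ref}/{alt}"


def wt_column_id(chrom, start):
    # WT columns are currently always written as "A/A",
    # regardless of the reference base at that coordinate.
    return variant_id(chrom, start, "A", "A")


def nocall_column_id(chrom, start):
    return variant_id(chrom, start, ".", ".")


print(wt_column_id("chr1", 26644621))
print(nocall_column_id("chr1", 26644621))
```

Because the WT ID does not reflect the real reference base, the is_ref and is_nocall column attributes are likely the more reliable way to locate these columns.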
GT layer example
Column Attributes | Amplicon | Target_1 | Target_1 | Target_1 | Target_1 | Target_1 | Target_1 |
Column Attributes | ID | chr1:26644621:./. | chr1:26644621:A/A | chr1:26644621:A/T | chr1:26644625:G/C | chr1:26644621:A/ATGC | chr1:26644625:GCTA/G |
Column Attributes | is_ref | 0 | 1 | 0 | 0 | 0 | 0 |
Column Attributes | is_nocall | 1 | 0 | 0 | 0 | 0 | 0 |
Column Attributes | is_snv | 0 | 0 | 1 | 1 | 0 | 0 |
GT Layer | Cell_1 | | 3 | | | | |
GT Layer | Cell_2 | 3 | | | | | |
GT Layer | Cell_3 | | 2 | 1 | | | 1 |
GT Layer | Cell_4 | | | | 2 | | 1 |
GT Layer | Cell_5 | | | | | | 3 |
This example shows information only for 1 target and 5 cells. The way to interpret the GT information for each of the 5 cells is explained below:
- Cell_1 has a value of 3 for the WT variant
  - This means that both alleles for this cell are WT
- Cell_2 has a value of 3 for the no-call variant
  - This means that both alleles for this cell are no-calls
- Cell_3 has a value of 2 for the WT variant and 1 for the A->T SNV and the 3bp deletion of “CTA”
  - This means that allele-2 of this cell is WT and allele-1 has both the SNV and the deletion
- Cell_4 has a value of 2 for the G->C SNV and 1 for the 3bp “CTA” deletion
  - This means that allele-1 of this cell has the deletion and allele-2 has the SNV
- Cell_5 has a value of 3 for the 3bp “CTA” deletion
  - This means that both alleles of this cell have the deletion
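The interpretation above can be reproduced programmatically. The sketch below copies the column IDs and per-cell values from this example and groups the columns by allele, assuming allele-1 is bit 0 and allele-2 is bit 1. `None` stands in for an empty cell, and the function name `variants_per_allele` is hypothetical.

```python
import math

# Column IDs from the example, in column order.
ids = [
    "chr1:26644621:./.",
    "chr1:26644621:A/A",
    "chr1:26644621:A/T",
    "chr1:26644625:G/C",
    "chr1:26644621:A/ATGC",
    "chr1:26644625:GCTA/G",
]

# GT values per cell from the example; None marks a variant that is not called.
gt = {
    "Cell_1": [None, 3, None, None, None, None],
    "Cell_2": [3, None, None, None, None, None],
    "Cell_3": [None, 2, 1, None, None, 1],
    "Cell_4": [None, None, None, 2, None, 1],
    "Cell_5": [None, None, None, None, None, 3],
}


def variants_per_allele(gt_row, column_ids, ploidy=2):
    """Map 'allele-1', 'allele-2', ... to the column IDs present on that allele."""
    alleles = {f"allele-{n + 1}": [] for n in range(ploidy)}
    for value, column_id in zip(gt_row, column_ids):
        if value is None or math.isnan(value):
            continue  # variant not called in this cell
        for n in range(ploidy):
            if (int(value) >> n) & 1:
                alleles[f"allele-{n + 1}"].append(column_id)
    return alleles


for cell, row in gt.items():
    print(cell, variants_per_allele(row, ids))
```

For Cell_3, for instance, this places the A/T SNV and the GCTA/G deletion on allele-1 and the WT column on allele-2, matching the explanation above.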
Note: The ge_dna_variants assay is currently not supported by Mosaic.
GE DNA Translocations assay
The GE DNA translocations h5 assay (named: ge_dna_translocations) contains per-cell translocation information, where each row represents a cell and each column represents a unique translocation. This assay has the following layers:
Layer name | Layer description |
translocations | Contains the number of reads supporting the translocation event in that particular cell |
translocation_filters | Contains the chimeric:normal read ratio for the translocation event in each cell/translocation combination |
breakpoints | Contains the identified breakpoint for each translocation/cell combination |
The assay has the following column attributes:
Attribute name | Attribute description |
id | Contains the name of the 2 amplicons involved in translocation |
region_1 | Contains the coordinates of the left amplicon involved in translocation |
region_2 | Contains the coordinates of the right amplicon involved in translocation |
The assay has the following row attributes:
Attribute name | Attribute description |
barcode | The cell barcode which corresponds to this row |
sample_name | The sample name for this row (this is the same for all rows for a standard GE h5 file) |
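To show how the row and column attributes line up with a layer, the sketch below pairs each cell barcode with the translocations that have supporting reads in the translocations layer. It uses plain lists with made-up values. The function name, the `min_reads` threshold, and the example ID strings are assumptions for illustration only; this document does not specify the exact ID format or any recommended threshold.

```python
def translocations_per_cell(read_counts, barcodes, translocation_ids, min_reads=1):
    """Map each barcode to {translocation id: supporting reads} for counts >= min_reads.

    read_counts is an n_rows x n_cols matrix (one row per cell), barcodes is the
    'barcode' row attribute, and translocation_ids is the 'id' column attribute.
    """
    calls = {}
    for barcode, row in zip(barcodes, read_counts):
        calls[barcode] = {
            t_id: count
            for t_id, count in zip(translocation_ids, row)
            if count >= min_reads  # NaN compares False, so missing values are skipped
        }
    return calls


# Made-up toy data: 2 cells x 2 translocations.
barcodes = ["cell_a", "cell_b"]
translocation_ids = ["AMP_1/AMP_2", "AMP_1/AMP_3"]  # hypothetical ID format
read_counts = [[12, 0], [0, 3]]

print(translocations_per_cell(read_counts, barcodes, translocation_ids))
```

The translocation_filters and breakpoints layers share the same shape, so they can be indexed with the same row and column positions.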
Other assays present in the h5 file
dna_read_counts
For DNA-only samples, the h5 file contains the dna_read_counts assay in addition to the ge_dna_variants assay. This assay is the same as the dna_read_counts assay generated by the Tapestri DNA pipeline, and it can be analyzed using MissionBio’s Python-based package, Mosaic.
protein_read_counts
For DNA+Protein samples, the h5 file additionally includes the protein_read_counts assay. This assay is the same as the protein_read_counts assay generated by the Tapestri DNA+Protein pipeline, and it can be analyzed using MissionBio’s Python-based package, Mosaic.
The h5 file generated by either the GE DNA pipeline or the GE DNA+Protein pipeline can be loaded using Mosaic. Doing so loads the dna_read_counts assay and, if present, the protein_read_counts assay, while the ge_dna_variants assay is ignored by the package.