Genome Integrity H5 format

Updated January 10, 2025 03:26

h5 file basics

The h5 files generated by the Genome Integrity (GI) pipeline are structured into assays, which are HDF5 groups. Each assay contains metadata, column attributes, row attributes, and layers, each of which is itself a group under the parent assay group. Assuming that every layer is a matrix of shape (n_rows x n_cols), these groups follow these rules:

All column attributes are HDF5 datasets which contain a 1-d array of values of shape n_cols
All row attributes are HDF5 datasets which contain a 1-d array of values of shape n_rows
All layers are HDF5 datasets which contain a 2-d array of values of shape n_rows x n_cols
All metadata are HDF5 datasets which contain a 1-d array of values of shape 1
A value in position i in any column attribute is shared by all cells of column i in any matrix in the layers group
A value in position j in any row attribute is shared by all cells of row j in any matrix in the layers group
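
The shape rules above can be checked mechanically. The following is a minimal sketch, not part of the pipeline: the group names ca, ra, layers and metadata, the top-level assays group, and the dataset names in the demo are all assumptions made for illustration and may not match the names in a real file. The checker only needs objects with a .shape attribute, so it works on an open h5py group as well as on the stand-in objects used in the demo.

```python
from types import SimpleNamespace


def check_assay(assay, ca="ca", ra="ra", layers="layers", metadata="metadata"):
    """Return a list of shape-rule violations for one assay group.

    `assay` maps group names to mappings of dataset name -> object with a
    `.shape` attribute (for example an open h5py.Group).
    The default group names are assumptions, not confirmed by the article.
    """
    layer_shapes = {name: tuple(d.shape) for name, d in assay[layers].items()}
    if not layer_shapes:
        return ["no layers found"]
    shapes = set(layer_shapes.values())
    if len(shapes) != 1 or len(next(iter(shapes))) != 2:
        return [f"layers must share one 2-d shape, got {layer_shapes}"]
    n_rows, n_cols = next(iter(shapes))

    # Expected 1-d shapes for the other three groups.
    expected = {ca: (n_cols,), ra: (n_rows,), metadata: (1,)}
    problems = []
    for group, want in expected.items():
        for name, d in assay[group].items():
            if tuple(d.shape) != want:
                problems.append(
                    f"{group}/{name}: shape {tuple(d.shape)}, expected {want}"
                )
    return problems


def check_file(path, root="assays"):
    """Check every assay in an h5 file; `root` is an assumed group name."""
    import h5py  # third-party, imported lazily so check_assay has no dependencies

    with h5py.File(path, "r") as f:
        return {name: check_assay(group) for name, group in f[root].items()}


def ds(*shape):
    """Stand-in for an HDF5 dataset: only carries a shape."""
    return SimpleNamespace(shape=shape)


# Hypothetical assay with 3 rows and 2 columns (all names are made up).
demo = {
    "layers": {"read_counts": ds(3, 2), "ploidy": ds(3, 2)},
    "ca": {"id": ds(2)},
    "ra": {"barcode": ds(3)},
    "metadata": {"sample_name": ds(1)},
}
print(check_assay(demo))  # an empty list means every rule holds
```

Calling check_file on a pipeline output would return one list of violations per assay, provided the assumed group names are adjusted to the actual ones in the file.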

h5 files generated

The Genome Integrity pipeline generates 4 h5 files as output:

*.dna.sample.h5: this is the final h5 file created, with CNV readouts (typically not used for downstream analysis)
*.dna.sample.dmx.h5: this is the final demultiplexed h5 file (per sample) with CNV readouts
*.dna.*.spikein.h5: this is an h5 file for just the spike-in cells detected (typically not used for downstream analysis)
*.dna.h5: this is the pre-demultiplexing h5 file, which can be used for reprocessing
DNA + Protein runs will have the extension *.dna+protein.* in place of *.dna.*
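
The naming scheme above can be turned into a small classifier for sorting pipeline outputs. This is a sketch under assumptions: it treats the patterns listed above literally as glob patterns (if "sample" is a placeholder for an actual sample name, the patterns need adjusting), and the short labels and file names are invented for this example.

```python
from fnmatch import fnmatchcase

# Patterns copied from the list above; the labels are made up for this sketch.
PATTERNS = [
    ("*.dna.sample.dmx.h5", "demultiplexed"),
    ("*.dna.*.spikein.h5", "spike-in"),
    ("*.dna.sample.h5", "final"),
    ("*.dna.h5", "pre-demultiplexing"),
]


def classify(filename):
    """Return the label of the first matching pattern, or None."""
    # DNA + Protein runs use ".dna+protein." where DNA-only runs use ".dna."
    normalized = filename.replace(".dna+protein.", ".dna.")
    for pattern, kind in PATTERNS:
        if fnmatchcase(normalized, pattern):
            return kind
    return None


# Hypothetical file names, for illustration only.
for name in ["run1.dna.sample.dmx.h5", "run1.dna+protein.h5", "notes.txt"]:
    print(name, "->", classify(name))
```

fnmatchcase is used instead of fnmatch so that matching stays case-sensitive on every platform.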

DNA assay

The h5 file will include the DNA assay. This assay is the same as the DNA assay generated by the Tapestri DNA + Protein pipeline.

CNV assay

The CNV assay is unique to this h5 file. It contains four layers: read_counts, ploidy, flat-ploidy, and smooth-flat-ploidy. The “ploidy” layer stores the relative copy number for each cell and amplicon as floating point values. Since a cell can only have an integer number of copies, “flat-ploidy” stores an integer estimate of the copy number. This layer is computed by clustering the cells and calling the copy number of each cluster for each amplicon. “smooth-flat-ploidy” stores the error-corrected flat-ploidy values. The correction is performed for each amplicon using the heterozygous SNVs on that amplicon and the ploidy of neighboring amplicons.
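
To make the relationship between “ploidy” and “flat-ploidy” concrete, here is a toy illustration. It is not the pipeline's algorithm: the article does not describe the clustering or calling method, so this sketch assumes cluster labels are already given, assumes rows are cells and columns are amplicons, assumes the ploidy values are scaled so that rounding yields a copy number, and uses a rounded per-cluster median purely as a stand-in for the calling step. All numbers are invented.

```python
from statistics import median


def flat_ploidy_sketch(ploidy, clusters):
    """Assign each cell the rounded per-amplicon median of its cluster.

    ploidy   -- list of rows (one per cell), each a list of floats (one per amplicon)
    clusters -- one cluster label per cell (assumed to be known already)
    """
    n_amplicons = len(ploidy[0])
    calls = {}
    for label in set(clusters):
        members = [row for row, c in zip(ploidy, clusters) if c == label]
        calls[label] = [
            round(median(row[j] for row in members)) for j in range(n_amplicons)
        ]
    # Every cell in a cluster receives the same integer call per amplicon.
    return [list(calls[c]) for c in clusters]


# Four hypothetical cells, three amplicons, two clusters.
ploidy = [
    [1.9, 2.1, 3.2],
    [2.2, 1.8, 2.9],
    [1.1, 2.0, 2.1],
    [0.9, 2.1, 1.8],
]
clusters = ["a", "a", "b", "b"]
print(flat_ploidy_sketch(ploidy, clusters))
```

The point of the sketch is only that the floating point layer varies per cell, while the flat layer holds one integer per cluster and amplicon, repeated for every cell in that cluster.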

VDJ assay (Multiple Myeloma specific)

This assay contains only a read_counts layer, which stores the number of reads for each cell and clonotype.

Protein assay

For DNA + Protein samples, the h5 file will additionally include the protein assay. This assay is the same as the protein assay generated by the Tapestri DNA + Protein pipeline.
