h5 file basics
The h5 files generated by the Genome Integrity (GI) pipeline are structured into assays, which are HDF5 groups. Each assay contains metadata, column attributes, row attributes, and layers, each of which is itself a group under the parent assay group. Assuming that every layer is a matrix of shape (n_rows x n_cols), these groups follow these rules:
- All column attributes are HDF5 datasets which contain a 1-d array of values of shape n_cols
- All row attributes are HDF5 datasets which contain a 1-d array of values of shape n_rows
- All layers are HDF5 datasets which contain a 2-d array of values of shape n_rows x n_cols
- All metadata are HDF5 datasets which contain a 1-d array of values of shape 1
- A value in position i in any column attribute is shared by all cells of column i in any matrix in the layers group
- A value in position j in any row attribute is shared by all cells of row j in any matrix in the layers group
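The shape rules above can be checked mechanically. The following is a minimal sketch, not part of the pipeline: the group names ("layers", "col_attrs", "row_attrs", "metadata") and the dataset names in the toy input are hypothetical placeholders, so substitute the names found in your own file. The function only uses mapping-style access and a .shape attribute, which h5py groups and datasets also provide, so the same check should work on an assay group opened with h5py.

```python
from collections import namedtuple

# Stand-in for an HDF5 dataset: only the .shape attribute is used below.
FakeDataset = namedtuple("FakeDataset", ["shape"])


def check_assay(assay, layers="layers", col_attrs="col_attrs",
                row_attrs="row_attrs", metadata="metadata"):
    """Return a list of shape-rule violations for one assay (empty if none).

    The four group names are assumptions; pass the real ones for your file.
    """
    shapes = [tuple(ds.shape) for ds in assay[layers].values()]
    if not shapes or len(shapes[0]) != 2:
        return ["first layer is missing or not 2-d"]
    # The first layer defines n_rows and n_cols for the whole assay.
    n_rows, n_cols = shapes[0]
    expected = {
        layers: (n_rows, n_cols),
        col_attrs: (n_cols,),
        row_attrs: (n_rows,),
        metadata: (1,),
    }
    problems = []
    for group_name, want in expected.items():
        for name, ds in assay[group_name].items():
            got = tuple(ds.shape)
            if got != want:
                problems.append(f"{group_name}/{name}: shape {got}, expected {want}")
    return problems


# Toy assay with 3 rows and 5 columns (dataset names are made up).
toy_assay = {
    "layers": {"read_counts": FakeDataset((3, 5)), "ploidy": FakeDataset((3, 5))},
    "col_attrs": {"id": FakeDataset((5,))},
    "row_attrs": {"barcode": FakeDataset((3,))},
    "metadata": {"sample_name": FakeDataset((1,))},
}
print(check_assay(toy_assay))
```

An empty list means every dataset in the assay has the shape the rules require.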
h5 files generated
The Genome Integrity pipeline generates 4 h5 files as output:
- *.dna.sample.h5: this is the final h5 file created, with CNV readouts (typically not used for downstream analysis)
- *.dna.sample.dmx.h5: this is the final demultiplexed h5 file (per sample) with CNV readouts
- *.dna.*.spikein.h5: this is an h5 file for just the spike-in cells detected (typically not used for downstream analysis)
- *.dna.h5: this is the pre-demultiplexing h5 file, which can be used for reprocessing
- DNA + Protein runs will have the extension *.dna+protein.* in place of *.dna.*
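When scripting around the pipeline output, it can help to sort files by the naming patterns listed above. The sketch below is illustrative only: the labels are made-up names, and it assumes the patterns are literal globs exactly as written (including the word "sample"), with ".dna+protein." substituted for ".dna." in DNA + Protein runs.

```python
from fnmatch import fnmatchcase

# Glob patterns copied from the list above; the labels are illustrative.
# More specific patterns come first so they are tried before "*.dna.h5".
PATTERNS = [
    ("demultiplexed", "*.dna.sample.dmx.h5"),
    ("final", "*.dna.sample.h5"),
    ("spike-in", "*.dna.*.spikein.h5"),
    ("pre-demultiplexing", "*.dna.h5"),
]


def classify_h5(filename):
    """Return the label of the first matching pattern, or None."""
    for label, pattern in PATTERNS:
        # DNA + Protein runs use ".dna+protein." in place of ".dna.".
        protein_pattern = pattern.replace(".dna.", ".dna+protein.")
        if fnmatchcase(filename, pattern) or fnmatchcase(filename, protein_pattern):
            return label
    return None


# Hypothetical file names, for illustration only.
for name in ["run1.dna.sample.dmx.h5", "run1.dna+protein.h5", "notes.txt"]:
    print(name, "->", classify_h5(name))
```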
DNA assay
The h5 file will include the DNA assay. This assay is the same as the DNA assay generated by the Tapestri DNA + Protein pipeline.
CNV assay
The CNV assay is unique to this h5 file. It contains four layers: read_counts, ploidy, flat-ploidy, and smooth-flat-ploidy. The “ploidy” layer stores the relative copy number for each cell and amplicon as floating point values. Since a cell can only have an integer number of copies, “flat-ploidy” stores an integer copy number estimate. This layer is computed by clustering the cells and calling one copy number per cluster for each amplicon. “smooth-flat-ploidy” stores the error-corrected flat-ploidy values. The correction is performed for each amplicon by using the heterozygous SNVs on that amplicon and the ploidy of neighboring amplicons.
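To make the relationship between “ploidy” and “flat-ploidy” concrete, here is a toy sketch of a per-cluster integer call. It is not the pipeline's algorithm: the clustering and calling methods are not described here, so this example takes cluster labels as given and uses a rounded median purely as a stand-in. It also assumes rows are cells and columns are amplicons, and all values are invented.

```python
from statistics import median


def flat_ploidy_sketch(ploidy, clusters):
    """Toy per-cluster integer call (NOT the pipeline's method).

    ploidy:   list of rows, one per cell, each a list of floats per amplicon
    clusters: one cluster label per cell, in the same order as the rows
    """
    n_amplicons = len(ploidy[0])
    # Group cell indices by cluster label.
    members = {}
    for cell, label in enumerate(clusters):
        members.setdefault(label, []).append(cell)
    # One integer call per (cluster, amplicon): rounded median of the members.
    calls = {
        label: [round(median(ploidy[c][a] for c in cells))
                for a in range(n_amplicons)]
        for label, cells in members.items()
    }
    # Every cell inherits a copy of its cluster's calls.
    return [list(calls[label]) for label in clusters]


# Four cells, three amplicons, two clusters (made-up numbers).
ploidy = [
    [1.9, 2.2, 3.1],
    [2.1, 1.8, 2.9],
    [1.1, 2.0, 2.1],
    [0.9, 2.1, 1.8],
]
clusters = [0, 0, 1, 1]
for row in flat_ploidy_sketch(ploidy, clusters):
    print(row)
```

The result has the same n_rows x n_cols shape as the input layer, and all cells in a cluster share the same integer value for a given amplicon.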
VDJ assay (Multiple Myeloma specific)
This assay only contains a read_counts layer which stores the number of reads for each cell and clonotype.
Protein assay
For DNA + Protein samples, the h5 file will additionally include the protein assay. This assay is the same as the protein assay generated by the Tapestri DNA + Protein pipeline.