Reference genome files are required in genome alignment, haplotype calling, and multiple other steps. In addition to the catalog genomes that Tapestri Pipeline provides, you can upload your own custom reference genome files to create runs.
The Tapestri Pipeline app now creates all the necessary genome-related files once you upload a *.fa.zip file by following the upload steps. If you choose to upload a *.fa.zip file, skip the instructions below for preparing a custom genome .zip file; the app prepares those files automatically from your FASTA file.
Note: To be used by the Pipeline, these files must be naturally sorted and indexed, and a sequence dictionary must be created from them. The instructions below describe how to do this.
Preparation of a custom genome .zip file
Follow this step-by-step procedure to generate all the reference files required to run the Pipeline.
- Create a single line reference genome file – The first line is for the header. On the next line, include the entire sequence for that header. Do not include any blank lines in between.
- Optional step. Sort the genome file by chromosome name in natural order using the seqkit package, so that the chromosomes are ordered as chr1, chr2, chr3, and so on, followed by chrM, chrX, and chrY. Install the seqkit package via conda by using the command mentioned here.
Once installed, sort using the following command in a bash shell:
seqkit sort -N -2 reference.fa > reference_sorted.fa
- Create the BWA index files.
bwa index -a bwtsw reference_sorted.fa
This step takes some time to complete.
- Create a genome index file using samtools.
samtools faidx reference_sorted.fa
- Generate the sequence dictionary.
java -jar tapestripipeline/share/picard-1.126-5/picard.jar CreateSequenceDictionary REFERENCE=reference_sorted.fa OUTPUT=reference_sorted.dict
- Create reference_sorted.fa.dict by copying reference_sorted.dict. The current implementation of the Pipeline requires both versions of the dict file.
cp reference_sorted.dict reference_sorted.fa.dict
- Once created, zip these files so that all of them sit at the root of the .zip archive. Open the location where the files are stored, select them, and compress them into a single .zip file. The .zip file should contain the following files:
- reference_sorted.fa
- reference_sorted.fa.amb
- reference_sorted.fa.ann
- reference_sorted.fa.bwt
- reference_sorted.fa.pac
- reference_sorted.fa.sa
- reference_sorted.fa.fai
- reference_sorted.dict
- reference_sorted.fa.dict
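Before zipping, it can help to confirm that all nine files listed above are present. The following is a minimal sketch, not part of Tapestri Pipeline: the function name check_genome_files is hypothetical, and it assumes the files use the reference_sorted prefix from the steps above and sit together in one directory.

```shell
# Hypothetical helper: verify that a directory holds every file the
# custom genome .zip is expected to contain (per the list above).
# Usage: check_genome_files <directory> [prefix]
check_genome_files() {
    dir=$1
    prefix=${2:-reference_sorted}   # assumed prefix, taken from the steps above
    missing=0
    for ext in fa fa.amb fa.ann fa.bwt fa.pac fa.sa fa.fai dict fa.dict; do
        if [ ! -f "$dir/$prefix.$ext" ]; then
            echo "missing: $prefix.$ext" >&2
            missing=1
        fi
    done
    # Returns 0 when all nine files exist, 1 otherwise.
    return "$missing"
}
```

If the check passes and the Info-ZIP zip tool is installed, running `zip -j hg19_custom.fa.zip reference_sorted.*` from that directory is one way to place every file at the root of the archive; hg19_custom.fa.zip is only a placeholder name.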
Note: The zip file name is treated as the genome version in our pipelines. Tapestri Pipeline supports annotations for the hg19 and hg38 genomes. This information comes from VarSome. To use these annotations, we highly recommend naming your custom hg19 reference fasta file as "hg19_<suffix>.fa.zip" and your hg38 reference as "hg38_<suffix>.fa.zip".
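For the first step of the procedure (a single-line reference genome file), a multi-line FASTA can be flattened with awk. This is a minimal sketch under stated assumptions: the function name fasta_to_single_line is hypothetical, the input is an uncompressed plain-text FASTA, and blank lines and Windows line endings are simply dropped.

```shell
# Hypothetical helper: write each FASTA record as one header line
# followed by one sequence line, with no blank lines in between.
# Usage: fasta_to_single_line input.fa > reference.fa
fasta_to_single_line() {
    awk '
        { sub(/\r$/, "") }              # strip a trailing carriage return, if any
        /^>/ {                          # header line starts a new record
            if (seq != "") print seq    # flush the previous record
            print
            seq = ""
            next
        }
        NF { seq = seq $0 }             # append non-blank sequence lines
        END { if (seq != "") print seq }
    ' "$1"
}
```

The output can then be used as reference.fa in the sorting step.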
Upload the custom genome .zip file to Tapestri Pipeline
Launch Tapestri Pipeline from Tapestri Portal.
- Click Add Files.
- Click Other.
There are two ways to upload a genome reference file.
- Upload from Local Computer
Drag and drop the .zip file into the upload area or click Choose Files.
Click Upload.
- Import from Amazon S3
Add the Amazon S3 URI for the .zip genome file.
Click Import.
This starts the uploading process. The file will be added to the Other Files table on the Files page. The upload time varies based on the genome .zip file size.
This genome file will now be available for use when creating custom DNA or DNA + Protein runs in Step 01 - Select the Pipeline & Parameters. The dropdown menu for Select the Reference Genome will list the file with the .zip name.