How do I create reference genome files and upload them to Tapestri Pipeline?

Reference genome files are required for genome alignment, haplotype calling, and several other Pipeline steps. In addition to the catalog genomes that Tapestri Pipeline provides, you can upload your own custom reference genome files and use them to create runs.

Note: Before the Pipeline can use a custom genome, the reference file must be naturally sorted and indexed, and a sequence dictionary must be created from it. The instructions below describe how to do this.

Preparation of a custom genome .zip file

Follow this step-by-step procedure to generate all the reference files required to run the Pipeline.

  1. Create a single-line reference genome file. For each sequence, the first line is the header (beginning with ">") and the next line contains the entire sequence for that header. Do not include any blank lines in between.
  2. Optional step. Sort the genome file by chromosome name in natural order, so that the chromosomes are ordered chr1, chr2, chr3, and so on, followed by chrM, chrX, and chrY.
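    Sorting can be done using the following command in a bash shell. The paste step joins each header with its sequence line so that the two are sorted together, and the -V option (natural sort) requires GNU sort:
    cat reference.fa | paste - - | sort -V -k1,1 -t " " | tr "\t" "\n" > reference_sorted.fa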
  3. Create the BWA index files.
    bwa index -a bwtsw reference_sorted.fa
    This step takes some time to complete, especially for large genomes.
  4. Create a genome index file using samtools.
    samtools faidx reference_sorted.fa
  5. Generate the sequence dictionary.
    java -jar tapestripipeline/share/picard-1.126-5/picard.jar CreateSequenceDictionary \
        REFERENCE=reference_sorted.fa OUTPUT=reference_sorted.dict
  6. Create reference_sorted.fa.dict by copying reference_sorted.dict. The current implementation of the Pipeline requires both versions of the dict file.
    cp reference_sorted.dict reference_sorted.fa.dict
  7. Zip the files so that they all sit at the root of the zip archive. Open the location that contains the files, select the files themselves, and compress them to a zip file. The zip file should contain the following files:
    1. reference_sorted.fa
    2. reference_sorted.fa.amb
    3. reference_sorted.fa.ann
    4. reference_sorted.fa.bwt
    5. reference_sorted.fa.pac
    6. reference_sorted.fa.sa
    7. reference_sorted.fa.fai
    8. reference_sorted.dict
    9. reference_sorted.fa.dict
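
Step 1 above does not give a command for producing the single-line layout. As a minimal sketch, assuming the input is an ordinary multi-line FASTA file, an awk one-pass rewrite along these lines could be used. The function name linearize_fasta and the file names are illustrative and are not part of Tapestri Pipeline.

```shell
# Sketch: rewrite a multi-line FASTA so that each record is exactly two lines
# (the header, then the full sequence). The function name is hypothetical.
linearize_fasta() {
    # $1 = input FASTA file; the result is written to stdout
    awk '
        /^>/ {                          # a header line starts a new record
            if (seq != "") print seq    # flush the previous sequence, if any
            print                       # emit the header unchanged
            seq = ""
            next
        }
        NF { seq = seq $0 }             # append sequence text; blank lines are skipped
        END { if (seq != "") print seq }
    ' "$1"
}
```

For example, linearize_fasta genome.fa > reference.fa would produce a file suitable as input to the sorting command in step 2, where genome.fa is a placeholder for your own file.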

Note: The Pipeline treats the zip file name as the genome version. Tapestri Pipeline supports annotations for the hg19 genome; this information comes from VarSome. To use these annotations, we highly recommend naming your custom hg19 reference zip file "hg19.zip".
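
Before compressing, it can help to confirm that all nine files listed in step 7 exist and are not empty. The following is a rough sketch of such a check; the function name check_reference_files is hypothetical, and the default base name reference_sorted is taken from the commands above.

```shell
# Sketch: confirm that the nine expected reference files are present
# and non-empty before zipping. The function name is hypothetical.
check_reference_files() {
    # $1 = directory holding the files
    # $2 = base name of the files (defaults to reference_sorted)
    dir=$1
    base=${2:-reference_sorted}
    missing=0
    for suffix in fa fa.amb fa.ann fa.bwt fa.pac fa.sa fa.fai dict fa.dict; do
        if [ ! -s "$dir/$base.$suffix" ]; then
            echo "missing or empty: $base.$suffix"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing
    [ "$missing" -eq 0 ]
}
```

If the check passes, running a command such as zip -j hg19.zip reference_sorted.* from inside that directory should place every file at the root of the archive, assuming the Info-ZIP zip tool is installed (its -j option stores file names without directory paths).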

Upload the custom genome .zip file to Tapestri Pipeline

Launch Tapestri Pipeline from Tapestri Portal.

  1. Click Add Files.
  2. Click Other.
    There are two ways to upload a genome reference file.
    1. Upload from Local Computer
      Drag and drop the .zip file into the upload area or click Choose Files.
      Click Upload.
    2. Import from Amazon S3
      Add the Amazon S3 URI for the .zip genome file.
      Click Import.

Either option starts the upload. The file will be added to the Other Files table on the Files page. The upload time varies with the size of the genome .zip file.

This genome file will now be available for use when creating custom DNA or DNA + Protein runs in Step 01 - Select the Pipeline & Parameters. The dropdown menu for Select the Reference Genome will list the file by its .zip name.
