How to troubleshoot V3 DNA/DNA+Protein pipeline failures?

A DNA or DNA+Protein run can fail for several reasons. Common ones include incorrect inputs, corrupt FASTQ files, and over-sequenced data. Below is a step-by-step guide to identifying the reason for a run failure.

DNA Pipeline

  1. Download the file "logs/tapestri_run_output.txt" from the Output Files tab if available. This file should show the step where the run has failed. The following failures are possible: 
    1. Preflight checks (QC) failure - The log file will show the message "Preflight Checks Failed" as the last statement. The preflight checks, also called Quality Control (QC) checks, capture the most common reasons for a run to fail, and the specific failure can be identified by reviewing the QC JSON file. These checks run as the first step of both the DNA and Protein pipelines to identify issues with input files or data quality.

      NOTE: If you have never worked with a JSON file before, use any text editor like Notepad for Windows or TextEdit for Mac to open the file.

      All the checks along with the message seen in the JSON are summarized below:
      1. Download the file qc/<prefix>-qc.json and look at the various checks:
        • GZIP check failure - FASTQ files fail the GZIP check when they are corrupt. Either R1, R2, or both can be corrupt, and a file can be corrupt at the source or become corrupted during upload. When the GZIP check fails, the following message is seen in the JSON file:
          "description": "Run requested checks on input files",
          "messages": {
          "error": [
          "'/home/runner-agent/input/TD02142-UCI-2018-010_S1_R2_001.fastq.gz': gzip integrity failure",
          "'/home/runner-agent/input/TD02142-UCI-2018-010_S1_R1_001.fastq.gz': gzip integrity failure"
          ]
          },
          Fix: Check the integrity of local files, upload the files again, and start the run with the new files. 
        • Read number mismatch - The R1 and R2 files have a different number of reads. The following message is seen on failure:
          "description": "Run requested checks on input files",
          "messages": {
          "error": [
          "Read count error. The number of reads in R1 - 13562039 do not match R2 - 13562002"
          ]
          },

          Fix: Upload or assign the correct R1/R2 pair during Step 3 of the start run workflow.

        • Panel Genome mismatch - This error is seen when the genome is missing some of the chromosomes/contigs present in the panel files. The error message in the QC file is:
          "amplicons_filename": "/home/runner-agent/input/dna_panel/<panel>.amplicons",
          "description": "Check if all amplicons in amplicon file specified are present in the reference genome.",
          "name": "amplicon_overlap",
          "not_in_reference": "chrA, chrB",
          "reference_genome": "/home/runner-agent/input/genome/<genome>.fa",
          "status": "failed"

          Fix: Select/Upload the correct genome with all the chromosomes/contigs as seen in the panel. To upload a new custom panel follow the steps here

          NOTE: An hg19/hg38 mismatch will not give this error, but will typically result in a completed run with 0 cells called or very low panel uniformity.

        • Read name mismatch - This error is seen when R1 and R2 are from different sequencing runs, or when lanes are not merged correctly. The read names of the first 100 reads are checked, and if any mismatch is found the run is stopped with the following error:
          "description": "Validate R1/R2 match.",
          "error": "Error in sequence file at unknown line: Reads are improperly paired. Read name 'A01789:20:HHKGKDRX2:1:2101:1108:1000 1:N:0:CTTGTCGA+CGATGTTC' in file 1 does not match 'A01789:73:HW2HTDRX2:2:1101:1280:1016 2:N:0:GAGACGAT+ACCGGTTA' in file 2., file 1=<file-name>.fastq.gz, file 2=<file-name>.fastq.gz",
          "name": "r1r2_match",
          "status": "failed"

          Fix: Upload the correctly paired FASTQ files.

        • Coverage check - Mission Bio recommends an average coverage of 60 – 80x, and the pipeline allows up to 320x coverage assuming 10,000 cells. Coverage beyond 320x affects the downstream results and is not recommended. If the expected coverage, calculated with the coverage formula defined here, is more than 320x, the run is stopped and the following error message is seen:
          "messages": {
          "error": [
          "Oversequenced sample. Estimated coverage (<xx>e+02) above oversequencing threshold (3.2e+02)"
          ]
          },
          Fix: Downsample the FASTQ files and start a run with these downsampled FASTQ files.

          NOTE: The JSON will indicate 320x max coverage for 20,000 estimated cells, but it counts R1+R2 reads (read pairs). We recommend 80x coverage as paired reads, so adjust the value seen in the JSON by a factor of 0.5.

        • QC warnings - In addition to errors, the checks can produce warnings. These do not cause the run to fail and are informational messages about data quality, for example:
          "description": "Run requested checks on input files",
          "messages": {
          "warning": [
          "Recommended free space on results drive is 2.0TB or more",
          "fastp maximum N (1.0) above warning threshold (0.05)",
          "Undersequenced sample. Estimated coverage (2.4e-05) below undersequencing threshold (80)"
          ]
          },
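The checks listed above can also be pulled out of the QC JSON programmatically instead of reading the file by eye. The following is a minimal Python sketch, not part of the pipeline. It assumes only what the fragments above show: a check either carries "status": "failed" (with details in "error" or "not_in_reference") or reports through "messages" with "error" and "warning" lists. The overall layout of the JSON file is not documented here, so the sketch walks the whole document recursively. The function names are hypothetical.

```python
import json
from typing import Any, Iterator, Tuple


def iter_findings(node: Any, path: str = "$") -> Iterator[Tuple[str, str, str]]:
    """Yield (severity, check label, message) for every problem found under node."""
    if isinstance(node, dict):
        # Prefer the check's own name or description as a label (assumed keys).
        label = str(node.get("name") or node.get("description") or path)
        # Checks such as amplicon_overlap or r1r2_match carry a "status" field.
        if node.get("status") == "failed":
            detail = node.get("error") or node.get("not_in_reference") or "no detail given"
            yield ("error", label, str(detail))
        # Other checks report through "messages": {"error": [...], "warning": [...]}.
        messages = node.get("messages")
        if isinstance(messages, dict):
            for severity in ("error", "warning"):
                entries = messages.get(severity) or []
                if isinstance(entries, list):
                    for text in entries:
                        yield (severity, label, str(text))
        # Keep walking, because the nesting of checks is not known in advance.
        for key, value in node.items():
            yield from iter_findings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_findings(value, f"{path}[{index}]")


def print_report(json_path: str) -> None:
    """Load a QC JSON file and print one line per error or warning."""
    with open(json_path, encoding="utf-8") as handle:
        report = json.load(handle)
    for severity, check, message in iter_findings(report):
        print(f"[{severity}] {check}: {message}")
```

Calling print_report on the downloaded qc/<prefix>-qc.json file prints errors and warnings together, so a warning such as an undersequenced sample is visible next to the error that stopped the run.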
    2. If the preflight checks completed successfully, scroll down the same log file ("logs/tapestri_run_output.txt") to find the failure step. There are few reasons for failure once all preflight checks have passed. Issues we have seen in the past are:
      1. Cell Finder failure - If the run uses an incorrect panel or genome, most of the reads get filtered out and there are not enough reads left for the cell finder to work successfully. To fix, review and update the panel and the genome selected for the run.
      2. Variant calling failure - The infrastructure used to process the data may run out of memory in the joint genotyping step for large runs. This can be identified by looking at the last few statements in the log file. If you see the statement "Variant Calling Started" but not "Variant Calling Completed", reach out to support@missionbio.com.
      3. Output file creation failure - For large data sets, this step might fail because the server processing the data lacks available memory. In this case, please reach out to support@missionbio.com.
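The log review described in this section can be sketched as a small Python helper. It looks only for the three statements quoted in this article ("Preflight Checks Failed", "Variant Calling Started", "Variant Calling Completed"). No other log statements are assumed, so anything else falls back to reporting the last non-empty line for manual review. The function name is hypothetical and matching is done case-insensitively as a precaution.

```python
def diagnose_log(log_text: str) -> str:
    """Return a short diagnosis based on markers quoted in the troubleshooting guide."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    if not lines:
        return "Log is empty; nothing to diagnose."

    last_line = lines[-1]
    lowered = log_text.casefold()

    # The guide states this message appears as the last statement of the log.
    if "preflight checks failed" in last_line.casefold():
        return "Preflight (QC) failure: review the QC JSON file."

    # Variant calling started but never reported completion.
    if "variant calling started" in lowered and "variant calling completed" not in lowered:
        return "Variant calling did not complete: contact support@missionbio.com."

    # No known marker; show where the log stops so it can be reviewed by hand.
    return f"No known marker found; last log statement was: {last_line}"
```

To use it, read the downloaded log file into a string and pass it to diagnose_log. The fallback branch is deliberate: cell finder and output file creation failures are not tied to a quoted log statement in this article, so the sketch does not guess at their wording.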

DNA+Protein Pipeline

The DNA+Protein pipeline runs the Protein pipeline, the DNA pipeline, and the DNA+Protein merge pipeline sequentially. Follow the same process as above to identify the error.

  1. Unlike the DNA pipeline, there is no single log file to download, but the process is the same: check the preflight check JSON file first. The following failures are possible:
      1. Preflight checks (QC) failure - Download the file "protein/qc/qc.json" and look for the same checks as in a DNA run. In addition, the protein preflight checks validate the protein panel file. The protein panel should follow the structure as defined here. If it does not, the following errors are seen:
          • Incorrect file encoding - If the panel file is not UTF-8 encoded, an error like the following is seen (the exact message will differ based on the invalid character in the file):
            "messages": {
            "error": [
            "Error reading barcode file ('utf-8' codec can't decode byte 0xa0 in position 2364: invalid start byte)"
            ]
            },

            Fix: Save the panel file with UTF-8 encoding, upload the new file, and restart the run.

          • Missing "Sequence" or "Name" column names - If either mandatory column header is missing, a message like the following is seen:
            "messages": {
            "error": [
            "Mandatory columns not found: ['Sequence']"
            ]
            },

            Fix: Update the column headers and re-upload the panel file.

          • Invalid barcode characters - If a barcode sequence contains any character other than A, T, C, or G:
            "messages": {
            "error": [
            "Invalid barcode (NTACGATTTGTCAGGG). Allowed characters: ATCG"
            ]
            },

            Fix: Update the barcode sequence and re-upload the panel file.

          • Duplicate barcodes - If the same sequence is present twice in the panel: 
            "messages": {
            "error": [
            "Duplicated barcodes found: GCAGAAATCTCCCTT"
            ]
            },

            Fix: Update the barcode sequence and re-upload the panel file.

      2. Once the preflight checks for protein are complete, no failures are generally observed in the rest of the protein pipeline.

        NOTE: The protein QC and pipeline run before the DNA QC. If the protein QC fails, there could still be undiagnosed issues in the DNA pipeline.

  2. For the DNA part of the run, refer to the steps above to identify the issues.
  3. If both the DNA and Protein runs show no error and the DNA+Protein run has failed, then the DNA+Protein merge step has failed. This is due to the system running out of memory. Please reach out to support@missionbio.com for help.
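The protein panel checks described in this section (UTF-8 encoding, the "Name" and "Sequence" columns, allowed barcode characters, and duplicate barcodes) can be approximated locally before uploading a panel. The Python sketch below is an illustration only and is not the pipeline's validator. It assumes the panel is a comma-separated file with a header row, which this article does not confirm, and that barcode characters are checked case-sensitively. The function name and its messages are hypothetical, although they echo the errors quoted above.

```python
import csv
import io
from collections import Counter
from typing import List

REQUIRED_COLUMNS = ("Name", "Sequence")  # mandatory headers named in the guide
ALLOWED_BASES = set("ATCG")              # allowed barcode characters per the guide


def validate_protein_panel(raw: bytes) -> List[str]:
    """Return a list of problems found in the raw bytes of a panel file."""
    # 1. Encoding: the file must decode as UTF-8.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"File is not valid UTF-8: {exc}"]

    # 2. Mandatory columns (assumes a CSV layout with a header row).
    reader = csv.DictReader(io.StringIO(text, newline=""))
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return [f"Mandatory columns not found: {missing}"]

    problems: List[str] = []
    sequences = [(row["Sequence"] or "").strip() for row in reader]

    # 3. Barcode characters: only A, T, C and G are accepted.
    for sequence in sequences:
        if not sequence or set(sequence) - ALLOWED_BASES:
            problems.append(f"Invalid barcode ({sequence}). Allowed characters: ATCG")

    # 4. Duplicates: each barcode may appear only once in the panel.
    duplicates = sorted(seq for seq, count in Counter(sequences).items() if count > 1)
    if duplicates:
        problems.append(f"Duplicated barcodes found: {', '.join(duplicates)}")
    return problems
```

Reading the panel file in binary mode and passing the bytes to validate_protein_panel returns an empty list when none of the four problems is found. Encoding and header problems return immediately because the later checks cannot run without them.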

To summarize, the runs generally fail due to failure of one of the preflight checks and can be fixed by addressing the root cause. If you cannot identify the reason after going through the steps above then please reach out to support@missionbio.com for help. Restarting a failed run without any updates will likely cause the run to fail again and is not recommended.
