From 1dcbce0b09a71abaec4b1e72dd70a2f573e4dc6b Mon Sep 17 00:00:00 2001 From: Peter Amstutz Date: Mon, 18 Jan 2021 17:51:10 -0500 Subject: [PATCH] Running is lesson 2. Arvados-DCO-1.1-Signed-off-by: Peter Amstutz --- README.md | 15 --- dockerfile/Dockerfile | 2 +- lesson1/lesson1.md | 183 ++++++++++++++++++++++++++++++++- lesson2/lesson2.md | 209 ++++++++++++++++++++++++++++++++++++++ scripts/genomeGenerate.sh | 10 -- 5 files changed, 392 insertions(+), 27 deletions(-) create mode 100644 lesson2/lesson2.md delete mode 100755 scripts/genomeGenerate.sh diff --git a/README.md b/README.md index fd9a630..f35a870 100644 --- a/README.md +++ b/README.md @@ -1,19 +1,4 @@ -# Downloading reference data - -``` -mkdir rnaseq -cd rnaseq -wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=pirca-4zz18-sa9s62lzc00jeds/ -``` - -# Building the Docker image - -``` -cd dockerfile -docker build -t cwl-training-rnaseq-tools . -``` - # Acknowledgements These CWL lessons are based on "Introduction to RNA-seq using diff --git a/dockerfile/Dockerfile b/dockerfile/Dockerfile index 69a9327..c73bcf2 100644 --- a/dockerfile/Dockerfile +++ b/dockerfile/Dockerfile @@ -1,2 +1,2 @@ FROM debian:10 -RUN apt-get update && apt-get -y --no-install-recommends install rna-star fastqc samtools subread \ No newline at end of file +RUN apt-get update && apt-get -y --no-install-recommends install subread \ No newline at end of file diff --git a/lesson1/lesson1.md b/lesson1/lesson1.md index 7d4e1b1..060597a 100644 --- a/lesson1/lesson1.md +++ b/lesson1/lesson1.md @@ -1,3 +1,184 @@ -# Turning a bash script into a workflow +# Turning a bash script into a workflow using existing tools In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow. + +# Setting up + +We will create a new git repository and import a library of existing +tool definitions that will help us build our workflow. + +1. Select "Terminal->New terminal" + +2. Create a new git repository to hold our workflow with this command: + +## Arvados + +``` +git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises +``` + +## Generic + +``` +git init rnaseq-cwl-training-exercises +``` + + +3. Go to File->Open Folder and select rnaseq-cwl-training-exercises + +4. Go to the terminal window + +5. Import bio-cwl-tools with this command: + +``` +git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git +``` + +# Writing the workflow + +1. Create a new file "main.cwl" + +2. Start with this header. + + +``` +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow +``` + +3. Workflow Inputs + +The purpose of a workflow is to consume some input parameters, run a +series of steps, and produce output values. + +For this analysis, the input parameters are the fastq file and the reference data required by STAR. + +In CWL, these are declared in the `inputs` section. + +The inputs section lists each input parameter and its type. Valid +types include `File`, `Directory`, `string`, `boolean`, `int`, and +`float`. + +In this case, the fastq and gene annotation file are individual files. The STAR index is a directory. We can describe these inputs in CWL like this: + +``` +inputs: + fq: File + genome: Directory + gtf: File +``` + +4. Workflow Steps + +A workflow consists of one or more steps. This is the `steps` section. + +Now we need to describe the first step of the workflow. This step is to run `fastqc`. + +A workflow step consists of the name of the step, the tool to `run`, +the input parameters to be passed to the tool in `in`, and the output +parameters expected from the tool in `out`. + +The value of `run` references the tool file. Tip: while typing the +file name, you can get suggestions and auto-completion on a partial +name using control+space. + +The `in` block lists input parameters to the tool and the workflow +parameters that will be assigned to those inputs. + +The `out` block lists output parameters to the tool that are used +later in the workflow. + +You need to know which input and output parameters are available for +each tool. In vscode, click on the value of `run` and select "Go to +definition" to open the tool file. Look for the `inputs` and +`outputs` sections of the tool file to find out what parameters are +defined. + +``` +steps: + fastqc: + run: bio-cwl-tools/fastqc/fastqc_2.cwl + in: + reads_file: fq + out: [html_file, summary_file] +``` + +5. Running alignment with STAR + +STAR has more parameters. Sometimes we want to provide input values +to a step without making them as workflow-level inputs. We can do +this with `{default: N}` + + +``` + STAR: + requirements: + ResourceRequirement: + ramMin: 6000 + run: bio-cwl-tools/STAR/STAR-Align.cwl + in: + RunThreadN: {default: 4} + GenomeDir: genome + ForwardReads: fq + OutSAMtype: {default: BAM} + OutSAMunmapped: {default: Within} + out: [alignment] +``` + +6. Running samtools + +The third step is to generate an index for the aligned BAM. + +For this step, we need to use the output of a previous step as input +to this step. We refer the output of a step by with name of the step +(STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment` + +This creates a dependency between steps. This means the `samtools` +step will not run until the `STAR` step has completed successfully. + +``` + samtools: + run: bio-cwl-tools/samtools/samtools_index.cwl + in: + bam_sorted: STAR/alignment + out: [bam_sorted_indexed] +``` + +7. featureCounts + +As of this writing, the `subread` package that provides +`featureCounts` is not available in bio-cwl-tools (and if it has been +added since writing this, let's pretend that it isn't there.) We will +dive into how to write a CWL wrapper for a command line tool in +lesson 2. For now, we will leave off the final step. + +8. Workflow Outputs + +The last thing to do is declare the workflow outputs in the `outputs` section. + +For each output, we need to declare the type of output, and what +parameter has the output value. + +Output types are the same as input types, valid types include `File`, +`Directory`, `string`, `boolean`, `int`, and `float`. + +The `outputSource` field refers the a step output in the same way that +the `in` block does, the name of the step, a slash, and the name of +the output parameter. + +For our final outputs, we want the results from fastqc and the +aligned, sorted and indexed BAM file. + +``` +outputs: + qc_html: + type: File + outputSource: fastqc/html_file + qc_summary: + type: File + outputSource: fastqc/summary_file + bam_sorted_indexed: + type: File + outputSource: samtools/bam_sorted_indexed +``` diff --git a/lesson2/lesson2.md b/lesson2/lesson2.md new file mode 100644 index 0000000..a5ec6a7 --- /dev/null +++ b/lesson2/lesson2.md @@ -0,0 +1,209 @@ +# Running and debugging a workflow + +1. The input parameter file + +CWL input values are provided in the form of a YAML or JSON file. +Create one by right clicking on the explorer, select "New File" and +create a called file "input.yaml". + +This file gives the values for parameters declared in the `inputs` +section of our workflow. Our workflow takes `fq`, `genome` and `gtf` +as input parameters. + +When setting inputs, Files and Directories are given as an object with +`class: File` or `class: Directory`. This distinguishes them from +plain strings that may or may not be file paths. + + +## Arvados + +``` +fq: + class: File + location: keep:9178fe1b80a08a422dbe02adfd439764+925/raw_fastq/Mov10_oe_1.subset.fq + format: http://edamontology.org/format_1930 +genome: + class: Directory + location: keep:02a12ce9e2707610991bd29d38796b57+2912 +gtf: + class: File + location: keep:9178fe1b80a08a422dbe02adfd439764+925/reference_data/chr1-hg19_genes.gtf +``` + +## Generic + +Note: if you don't have example sequence data or the STAR index files, see the Appendix below. + +``` +fq: + class: File + location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq + format: http://edamontology.org/format_1930 +genome: + class: Directory + location: hg19-chr1-STAR-index +gtf: + class: File + location: rnaseq/reference_data/chr1-hg19_genes.gtf +``` + +2. Running the workflow + +## Arvados + +In vscode, select "main.cwl" and then choose "Terminal -> Run task -> Run CWL workflow on Arvados" + +## Generic + +Type this into the terminal: + +``` +cwl-runner main.cwl input.yaml +``` + +3. Debugging the workflow + +A workflow can fail for many reasons: some possible reasons include +bad input, bugs in the code, or running out memory. In this case, the +STAR workflow might fail with an out of memory error. + +To help diagnose these errors, the workflow runner produces logs that +record what happened, either in the terminal or the web interface. + +Some errors you might see in the logs that would indicate an out of +memory condition: + +``` +EXITING: fatal error trying to allocate genome arrays, exception thrown: std::bad_alloc +Possible cause 1: not enough RAM. Check if you have enough RAM 5711762337 bytes +Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 5711762337 +``` + +or + +``` +Container exited with code: 137 +``` + +(Exit code 137 most commonly occurs when a container goes "out of memory" and is terminated by the operating system). + +If this happens, you will need to request more RAM. + +4. Setting runtime RAM requirements + +By default, a step is allocated 256 MB of RAM. From the STAR error message: + +> Check if you have enough RAM 5711762337 bytes + +We can see that STAR requires quite a bit more RAM than that. To +request more RAM, add a "requirements" section with +"ResourceRequirement" to the "STAR" step: + +``` + STAR: + requirements: + ResourceRequirement: + ramMin: 7000 + run: bio-cwl-tools/STAR/STAR-Align.cwl +``` + +Resource requirements you can set include: + +* coresMin: CPU cores +* ramMin: RAM (in megabytes) +* tmpdirMin: temporary directory available space +* outdirMin: output directory available space + +After setting the RAM requirements, re-run the workflow. + +5. Workflow results + +The CWL runner will print a results JSON object. It will look something like this (it may include additional fields). + + +``` +{ + "bam_sorted_indexed": { + "location": "file:///home/username/rnaseq-cwl-training-exercises/Aligned.sortedByCoord.out.bam", + "basename": "Aligned.sortedByCoord.out.bam", + "class": "File", + "size": 25370707, + "secondaryFiles": [ + { + "basename": "Aligned.sortedByCoord.out.bam.bai", + "location": "file:///home/username/rnaseq-cwl-training-exercises/Aligned.sortedByCoord.out.bam.bai", + "class": "File", + "size": 176552, + } + ] + }, + "qc_html": { + "location": "file:///home/username/rnaseq-cwl-training-exercises/Mov10_oe_1.subset_fastqc.html", + "basename": "Mov10_oe_1.subset_fastqc.html", + "class": "File", + "size": 383589 + }, + "qc_summary": { + "location": "file:///home/username/rnaseq-cwl-training-exercises/summary.txt", + "basename": "summary.txt", + "class": "File", + "size": 590 + } +} +``` + +This has the same structure as `input.yaml`. The each output +parameter is listed, with the `location` field of each `File` object +indicating where the output file can be found. + +# Appendix + +## Downloading sample and reference data + +Start from your rnaseq-cwl-exercises directory. + +``` +mkdir rnaseq +cd rnaseq +wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=9178fe1b80a08a422dbe02adfd439764+925/ +``` + +## Downloading or generating STAR index + +Running STAR requires index files generated from the reference. + +This is a rather large download (4 GB). Depending on your bandwidth, it may be faster to generate it yourself. + +### Downloading + +Go to the "Terminal" tab in the lower vscode panel. If necessary, select `bash` from the dropdown list in the upper right corner. + +``` +mkdir hg19-chr1-STAR-index +cd hg19-chr1-STAR-index +wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=02a12ce9e2707610991bd29d38796b57+2912/ +``` + +### Generating + +Create `chr1-star-index.yaml`: + +``` +InputFiles: + - class: File + location: rnaseq/reference_data/chr1.fa + format: http://edamontology.org/format_1930 +IndexName: 'hg19-chr1-STAR-index' +Gtf: + class: File + location: rnaseq/reference_data/chr1-hg19_genes.gtf +Overhang: 99 +``` + +Next, go to the "Terminal" tab in the lower vscode panel. If +necessary, select `bash` from the dropdown list in the upper right +corner. Generate the index with your local cwl-runner. + +``` +cwl-runner bio-cwl-tools/STAR/STAR-Index.cwl chr1-star-index.yaml +``` diff --git a/scripts/genomeGenerate.sh b/scripts/genomeGenerate.sh deleted file mode 100755 index c714640..0000000 --- a/scripts/genomeGenerate.sh +++ /dev/null @@ -1,10 +0,0 @@ -#!/bin/bash - -# Generate STAR genome index - -STAR --runThreadN 4 \ ---runMode genomeGenerate \ ---genomeDir unix_lesson/reference_data \ ---genomeFastaFiles unix_lesson/reference_data/chr1.fa \ ---sjdbGTFfile unix_lesson/reference_data/chr1-hg19_genes.gtf \ ---sjdbOverhang 99 -- 2.30.2