From: Peter Amstutz Date: Tue, 26 Jan 2021 19:45:32 +0000 (-0500) Subject: Add more background to lesson 1. X-Git-Url: https://git.arvados.org/rnaseq-cwl-training.git/commitdiff_plain/d177113e4a57ff8fe980212e51fe4317cf0fcb5a Add more background to lesson 1. Add answers to each section. Arvados-DCO-1.1-Signed-off-by: Peter Amstutz --- diff --git a/README.md b/README.md index 48ad847..510d494 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ rnaseq. | Lesson | Description | |----------|-------------| -| [Lesson 1](lesson1/lesson1.md) | Turning a shell script into a workflow from existing tool wrappers | +| [Lesson 1](lesson1/lesson1.md) | Turning a shell script into a workflow by composing existing tools | | [Lesson 2](lesson2/lesson2.md) | Running and debugging a workflow | | [Lesson 3](lesson3/lesson3.md) | Writing a tool wrapper | | [Lesson 4](lesson4/lesson4.md) | Analyzing multiple samples | diff --git a/lesson1/RNAseqWorkflow.png b/lesson1/RNAseqWorkflow.png new file mode 100644 index 0000000..1878db4 Binary files /dev/null and b/lesson1/RNAseqWorkflow.png differ diff --git a/lesson1/answers/main.cwl b/lesson1/answers/main.cwl new file mode 100644 index 0000000..bad27f4 --- /dev/null +++ b/lesson1/answers/main.cwl @@ -0,0 +1,48 @@ +### 1. File header +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +### 2. Workflow Inputs +inputs: + fq: File + genome: Directory + gtf: File + +### 3. Workflow Steps +steps: + fastqc: + run: bio-cwl-tools/fastqc/fastqc_2.cwl + in: + reads_file: fq + out: [html_file] + + ### 4. Running alignment with STAR + STAR: + requirements: + ResourceRequirement: + ramMin: 6000 + run: bio-cwl-tools/STAR/STAR-Align.cwl + in: + RunThreadN: {default: 4} + GenomeDir: genome + ForwardReads: fq + OutSAMtype: {default: BAM} + OutSAMunmapped: {default: Within} + out: [alignment] + + ### 5. 
Running samtools + samtools: + run: bio-cwl-tools/samtools/samtools_index.cwl + in: + bam_sorted: STAR/alignment + out: [bam_sorted_indexed] + +### 7. Workflow Outputs +outputs: + qc_html: + type: File + outputSource: fastqc/html_file + bam_sorted_indexed: + type: File + outputSource: samtools/bam_sorted_indexed diff --git a/lesson1/lesson1.md b/lesson1/lesson1.md index 8677c10..5c45265 100644 --- a/lesson1/lesson1.md +++ b/lesson1/lesson1.md @@ -1,31 +1,47 @@ -# Turning a shell script into a workflow using existing tools +# Turning a shell script into a workflow by composing existing tools -In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow. +## Introduction -## Setting up +The goal of this training is to walk through the development of a +best-practices CWL workflow by translating an existing bioinformatics +shell script into CWL. Specific knowledge of the biology of RNA-seq +is *not* a prerequisite for these lessons. -We will create a new git repository and import a library of existing -tool definitions that will help us build our workflow. +These lessons are based on "Introduction to RNA-seq using +high-performance computing (HPC)" lessons developed by members of the +teaching team at the Harvard Chan Bioinformatics Core (HBC). The +original training, which includes additional lectures about the +biology of RNA-seq, can be found here: -Create a new git repository to hold our workflow with this command: +https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2 -``` -git init rnaseq-cwl-training-exercises -``` +## Background -On Arvados use this: +RNA-seq is the process of sequencing RNA in a biological sample. From +the sequence reads, we want to measure the relative number of RNA +molecules appearing in the sample that were produced by particular +genes. This analysis is called "differential gene expression". 
-``` -git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises -``` +The entire process looks like this: -Next, import bio-cwl-tools with this command: +![](RNAseqWorkflow.png) -``` -git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git -``` +For this training, we are only concerned with the middle analytical +steps (skipping adapter trimming). + +* Quality control (FASTQC) +* Alignment (mapping) +* Counting reads associated with genes -## The shell script +## Analysis shell script + +This analysis is already available as a Unix shell script, which we +will refer to in order to build the workflow. + +Some of the reasons to use CWL over a plain shell script: portability, +scalability, ability to run on platforms that are not traditional HPC. + +rnaseq_analysis_on_input_file.sh ``` #!/bin/bash @@ -81,6 +97,29 @@ samtools index $counts_input_bam featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam ``` +## Setting up + +We will create a new git repository and import a library of existing +tool definitions that will help us build our workflow. + +Create a new git repository to hold our workflow with this command: + +``` +git init rnaseq-cwl-training-exercises +``` + +On Arvados use this: + +``` +git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises +``` + +Next, import bio-cwl-tools with this command: + +``` +git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git +``` + ## Writing the workflow ### 1. File header @@ -160,7 +199,7 @@ steps: fastqc: run: bio-cwl-tools/fastqc/fastqc_2.cwl in: - reads_file: fq + reads_file: fq out: [html_file] ``` diff --git a/lesson3/answers/featureCounts.cwl b/lesson3/answers/featureCounts.cwl new file mode 100644 index 0000000..9653391 --- /dev/null +++ b/lesson3/answers/featureCounts.cwl @@ -0,0 +1,29 @@ +### 1. File header +cwlVersion: v1.2 +class: CommandLineTool + +### 2. 
Command line tool inputs +inputs: + gtf: File + counts_input_bam: File + +### 3. Specifying the program to run +baseCommand: featureCounts + +### 4. Command arguments +arguments: [-T, $(runtime.cores), + -a, $(inputs.gtf), + -o, featurecounts.tsv, + $(inputs.counts_input_bam)] + +### 5. Outputs section +outputs: + featurecounts: + type: File + outputBinding: + glob: featurecounts.tsv + +### 6. Running in a container +hints: + DockerRequirement: + dockerPull: quay.io/biocontainers/subread:1.5.0p3--0 diff --git a/lesson3/answers/main.cwl b/lesson3/answers/main.cwl new file mode 100644 index 0000000..7eaf62e --- /dev/null +++ b/lesson3/answers/main.cwl @@ -0,0 +1,58 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +inputs: + fq: File + genome: Directory + gtf: File + +steps: + fastqc: + run: bio-cwl-tools/fastqc/fastqc_2.cwl + in: + reads_file: fq + out: [html_file] + + STAR: + requirements: + ResourceRequirement: + ramMin: 6000 + run: bio-cwl-tools/STAR/STAR-Align.cwl + in: + RunThreadN: {default: 4} + GenomeDir: genome + ForwardReads: fq + OutSAMtype: {default: BAM} + OutSAMunmapped: {default: Within} + out: [alignment] + + samtools: + run: bio-cwl-tools/samtools/samtools_index.cwl + in: + bam_sorted: STAR/alignment + out: [bam_sorted_indexed] + + ### 8. Adding it to the workflow + featureCounts: + requirements: + ResourceRequirement: + ramMin: 500 + run: featureCounts.cwl + in: + counts_input_bam: samtools/bam_sorted_indexed + gtf: gtf + out: [featurecounts] + +outputs: + qc_html: + type: File + outputSource: fastqc/html_file + bam_sorted_indexed: + type: File + outputSource: samtools/bam_sorted_indexed + + ### 8. Adding it to the workflow + featurecounts: + type: File + outputSource: featureCounts/featurecounts diff --git a/lesson3/lesson3.md b/lesson3/lesson3.md index 622f446..9bd5a70 100644 --- a/lesson3/lesson3.md +++ b/lesson3/lesson3.md @@ -64,8 +64,8 @@ describes the resources allocated to running the program. 
Here we use ``` arguments: [-T, $(runtime.cores), -a, $(inputs.gtf), - -o, featurecounts.tsv, - $(inputs.counts_input_bam)] + -o, featurecounts.tsv, + $(inputs.counts_input_bam)] ``` ### 5. Outputs section @@ -89,8 +89,8 @@ output directory called `featurecounts.tsv` outputs: featurecounts: type: File - outputBinding: - glob: featurecounts.tsv + outputBinding: + glob: featurecounts.tsv ``` ### 6. Running in a container @@ -162,10 +162,10 @@ steps: ResourceRequirement: ramMin: 500 run: featureCounts.cwl - in: - counts_input_bam: samtools/bam_sorted_indexed - gtf: gtf - out: [featurecounts] + in: + counts_input_bam: samtools/bam_sorted_indexed + gtf: gtf + out: [featurecounts] ``` We will add the result from featurecounts to the output: @@ -175,8 +175,7 @@ outputs: ... featurecounts: type: File - outputSource: featureCounts/featurecounts - + outputSource: featureCounts/featurecounts ``` You should now be able to re-run the workflow and it will run the diff --git a/lesson4/answers/part1/alignment.cwl b/lesson4/answers/part1/alignment.cwl new file mode 100644 index 0000000..c46b568 --- /dev/null +++ b/lesson4/answers/part1/alignment.cwl @@ -0,0 +1,56 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +inputs: + fq: File + genome: Directory + gtf: File + +steps: + fastqc: + run: bio-cwl-tools/fastqc/fastqc_2.cwl + in: + reads_file: fq + out: [html_file] + + STAR: + requirements: + ResourceRequirement: + ramMin: 6000 + run: bio-cwl-tools/STAR/STAR-Align.cwl + in: + RunThreadN: {default: 4} + GenomeDir: genome + ForwardReads: fq + OutSAMtype: {default: BAM} + OutSAMunmapped: {default: Within} + out: [alignment] + + samtools: + run: bio-cwl-tools/samtools/samtools_index.cwl + in: + bam_sorted: STAR/alignment + out: [bam_sorted_indexed] + + featureCounts: + requirements: + ResourceRequirement: + ramMin: 500 + run: featureCounts.cwl + in: + counts_input_bam: samtools/bam_sorted_indexed + gtf: gtf + out: [featurecounts] + +outputs: + qc_html: + type: 
File + outputSource: fastqc/html_file + bam_sorted_indexed: + type: File + outputSource: samtools/bam_sorted_indexed + + featurecounts: + type: File + outputSource: featureCounts/featurecounts diff --git a/lesson4/answers/part1/featureCounts.cwl b/lesson4/answers/part1/featureCounts.cwl new file mode 100644 index 0000000..4407ec9 --- /dev/null +++ b/lesson4/answers/part1/featureCounts.cwl @@ -0,0 +1,23 @@ +cwlVersion: v1.2 +class: CommandLineTool + +inputs: + gtf: File + counts_input_bam: File + +baseCommand: featureCounts + +arguments: [-T, $(runtime.cores), + -a, $(inputs.gtf), + -o, featurecounts.tsv, + $(inputs.counts_input_bam)] + +outputs: + featurecounts: + type: File + outputBinding: + glob: featurecounts.tsv + +hints: + DockerRequirement: + dockerPull: quay.io/biocontainers/subread:1.5.0p3--0 diff --git a/lesson4/answers/part1/main.cwl b/lesson4/answers/part1/main.cwl new file mode 100644 index 0000000..33e0f05 --- /dev/null +++ b/lesson4/answers/part1/main.cwl @@ -0,0 +1,32 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +inputs: + fq: File + genome: Directory + gtf: File + +### 1. 
Subworkflows +steps: + alignment: + run: alignment.cwl + in: + fq: fq + genome: genome + gtf: gtf + out: [qc_html, bam_sorted_indexed, featurecounts] + +outputs: + qc_html: + type: File + outputSource: alignment/qc_html + bam_sorted_indexed: + type: File + outputSource: alignment/bam_sorted_indexed + featurecounts: + type: File + outputSource: alignment/featurecounts + +requirements: + SubworkflowFeatureRequirement: {} diff --git a/lesson4/answers/part2/alignment.cwl b/lesson4/answers/part2/alignment.cwl new file mode 100644 index 0000000..c46b568 --- /dev/null +++ b/lesson4/answers/part2/alignment.cwl @@ -0,0 +1,56 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +inputs: + fq: File + genome: Directory + gtf: File + +steps: + fastqc: + run: bio-cwl-tools/fastqc/fastqc_2.cwl + in: + reads_file: fq + out: [html_file] + + STAR: + requirements: + ResourceRequirement: + ramMin: 6000 + run: bio-cwl-tools/STAR/STAR-Align.cwl + in: + RunThreadN: {default: 4} + GenomeDir: genome + ForwardReads: fq + OutSAMtype: {default: BAM} + OutSAMunmapped: {default: Within} + out: [alignment] + + samtools: + run: bio-cwl-tools/samtools/samtools_index.cwl + in: + bam_sorted: STAR/alignment + out: [bam_sorted_indexed] + + featureCounts: + requirements: + ResourceRequirement: + ramMin: 500 + run: featureCounts.cwl + in: + counts_input_bam: samtools/bam_sorted_indexed + gtf: gtf + out: [featurecounts] + +outputs: + qc_html: + type: File + outputSource: fastqc/html_file + bam_sorted_indexed: + type: File + outputSource: samtools/bam_sorted_indexed + + featurecounts: + type: File + outputSource: featureCounts/featurecounts diff --git a/lesson4/answers/part2/featureCounts.cwl b/lesson4/answers/part2/featureCounts.cwl new file mode 100644 index 0000000..4407ec9 --- /dev/null +++ b/lesson4/answers/part2/featureCounts.cwl @@ -0,0 +1,23 @@ +cwlVersion: v1.2 +class: CommandLineTool + +inputs: + gtf: File + counts_input_bam: File + +baseCommand: featureCounts + 
+arguments: [-T, $(runtime.cores), + -a, $(inputs.gtf), + -o, featurecounts.tsv, + $(inputs.counts_input_bam)] + +outputs: + featurecounts: + type: File + outputBinding: + glob: featurecounts.tsv + +hints: + DockerRequirement: + dockerPull: quay.io/biocontainers/subread:1.5.0p3--0 diff --git a/lesson4/answers/part2/main.cwl b/lesson4/answers/part2/main.cwl new file mode 100644 index 0000000..9abc5a9 --- /dev/null +++ b/lesson4/answers/part2/main.cwl @@ -0,0 +1,34 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +### 2. Scattering +inputs: + fq: File[] + genome: Directory + gtf: File + +steps: + alignment: + run: alignment.cwl + scatter: fq + in: + fq: fq + genome: genome + gtf: gtf + out: [qc_html, bam_sorted_indexed, featurecounts] + +outputs: + qc_html: + type: File[] + outputSource: alignment/qc_html + bam_sorted_indexed: + type: File[] + outputSource: alignment/bam_sorted_indexed + featurecounts: + type: File[] + outputSource: alignment/featurecounts + +requirements: + SubworkflowFeatureRequirement: {} + ScatterFeatureRequirement: {} diff --git a/lesson4/answers/part4/alignment.cwl b/lesson4/answers/part4/alignment.cwl new file mode 100644 index 0000000..df31e9b --- /dev/null +++ b/lesson4/answers/part4/alignment.cwl @@ -0,0 +1,42 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +inputs: + fq: File + genome: Directory + gtf: File + +steps: + fastqc: + run: bio-cwl-tools/fastqc/fastqc_2.cwl + in: + reads_file: fq + out: [html_file] + + STAR: + requirements: + ResourceRequirement: + ramMin: 6000 + run: bio-cwl-tools/STAR/STAR-Align.cwl + in: + RunThreadN: {default: 4} + GenomeDir: genome + ForwardReads: fq + OutSAMtype: {default: BAM} + OutSAMunmapped: {default: Within} + out: [alignment] + + samtools: + run: bio-cwl-tools/samtools/samtools_index.cwl + in: + bam_sorted: STAR/alignment + out: [bam_sorted_indexed] + +outputs: + qc_html: + type: File + outputSource: fastqc/html_file + bam_sorted_indexed: + 
type: File + outputSource: samtools/bam_sorted_indexed diff --git a/lesson4/answers/part4/featureCounts.cwl b/lesson4/answers/part4/featureCounts.cwl new file mode 100644 index 0000000..38ace83 --- /dev/null +++ b/lesson4/answers/part4/featureCounts.cwl @@ -0,0 +1,26 @@ +cwlVersion: v1.2 +class: CommandLineTool + +### 4. Combining results +inputs: + gtf: File + counts_input_bam: + - File + - File[] + +baseCommand: featureCounts + +arguments: [-T, $(runtime.cores), + -a, $(inputs.gtf), + -o, featurecounts.tsv, + $(inputs.counts_input_bam)] + +outputs: + featurecounts: + type: File + outputBinding: + glob: featurecounts.tsv + +hints: + DockerRequirement: + dockerPull: quay.io/biocontainers/subread:1.5.0p3--0 diff --git a/lesson4/answers/part4/main.cwl b/lesson4/answers/part4/main.cwl new file mode 100644 index 0000000..fcbb235 --- /dev/null +++ b/lesson4/answers/part4/main.cwl @@ -0,0 +1,47 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +### 2. Scattering +inputs: + fq: File[] + genome: Directory + gtf: File + +steps: + alignment: + run: alignment.cwl + scatter: fq + in: + fq: fq + genome: genome + gtf: gtf + out: [qc_html, bam_sorted_indexed] + + ### 4. Combining results + featureCounts: + requirements: + ResourceRequirement: + ramMin: 500 + run: featureCounts.cwl + in: + counts_input_bam: alignment/bam_sorted_indexed + gtf: gtf + out: [featurecounts] + +outputs: + qc_html: + type: File[] + outputSource: alignment/qc_html + bam_sorted_indexed: + type: File[] + outputSource: alignment/bam_sorted_indexed + + ### 4. 
Combining results + featurecounts: + type: File + outputSource: featureCounts/featurecounts + +requirements: + SubworkflowFeatureRequirement: {} + ScatterFeatureRequirement: {} diff --git a/lesson4/lesson4.md b/lesson4/lesson4.md index 6aa45de..91df9bd 100644 --- a/lesson4/lesson4.md +++ b/lesson4/lesson4.md @@ -17,10 +17,10 @@ steps: alignment: run: alignment.cwl in: - fq: fq - genome: genome - gtf: gtf - out: [qc_html, bam_sorted_indexed, featurecounts] + fq: fq + genome: genome + gtf: gtf + out: [qc_html, bam_sorted_indexed, featurecounts] ``` In the outputs section, all the output sources are from the alignment step: @@ -71,12 +71,12 @@ run `alignment.cwl` for each value in the list in the `fq` parameter. steps: alignment: run: alignment.cwl - scatter: fq + scatter: fq in: - fq: fq - genome: genome - gtf: gtf - out: [qc_html, bam_sorted_indexed, featurecounts] + fq: fq + genome: genome + gtf: gtf + out: [qc_html, bam_sorted_indexed, featurecounts] ``` Because the scatter produces multiple outputs, each output parameter diff --git a/lesson5/answers/alignment.cwl b/lesson5/answers/alignment.cwl new file mode 100644 index 0000000..8a54fe4 --- /dev/null +++ b/lesson5/answers/alignment.cwl @@ -0,0 +1,47 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +inputs: + fq: File + genome: Directory + gtf: File + +requirements: + StepInputExpressionRequirement: {} + +steps: + fastqc: + run: bio-cwl-tools/fastqc/fastqc_2.cwl + in: + reads_file: fq + out: [html_file] + + STAR: + requirements: + ResourceRequirement: + ramMin: 6000 + run: bio-cwl-tools/STAR/STAR-Align.cwl + in: + RunThreadN: {default: 4} + GenomeDir: genome + ForwardReads: fq + OutSAMtype: {default: BAM} + OutSAMunmapped: {default: Within} + ### 1. 
Expressions on step inputs + OutFileNamePrefix: {valueFrom: "$(inputs.ForwardReads.nameroot)."} + out: [alignment] + + samtools: + run: bio-cwl-tools/samtools/samtools_index.cwl + in: + bam_sorted: STAR/alignment + out: [bam_sorted_indexed] + +outputs: + qc_html: + type: File + outputSource: fastqc/html_file + bam_sorted_indexed: + type: File + outputSource: samtools/bam_sorted_indexed diff --git a/lesson5/answers/featureCounts.cwl b/lesson5/answers/featureCounts.cwl new file mode 100644 index 0000000..681697e --- /dev/null +++ b/lesson5/answers/featureCounts.cwl @@ -0,0 +1,25 @@ +cwlVersion: v1.2 +class: CommandLineTool + +inputs: + gtf: File + counts_input_bam: + - File + - File[] + +baseCommand: featureCounts + +arguments: [-T, $(runtime.cores), + -a, $(inputs.gtf), + -o, featurecounts.tsv, + $(inputs.counts_input_bam)] + +outputs: + featurecounts: + type: File + outputBinding: + glob: featurecounts.tsv + +hints: + DockerRequirement: + dockerPull: quay.io/biocontainers/subread:1.5.0p3--0 diff --git a/lesson5/answers/main.cwl b/lesson5/answers/main.cwl new file mode 100644 index 0000000..e934079 --- /dev/null +++ b/lesson5/answers/main.cwl @@ -0,0 +1,50 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +inputs: + fq: File[] + genome: Directory + gtf: File + +steps: + alignment: + run: alignment.cwl + scatter: fq + in: + fq: fq + genome: genome + gtf: gtf + out: [qc_html, bam_sorted_indexed] + + featureCounts: + requirements: + ResourceRequirement: + ramMin: 500 + run: featureCounts.cwl + in: + counts_input_bam: alignment/bam_sorted_indexed + gtf: gtf + out: [featurecounts] + + ### 2. 
Organizing output files into Directories + output-subdirs: + run: subdirs.cwl + in: + fq: fq + bams: alignment/bam_sorted_indexed + qc: alignment/qc_html + out: [dirs] + +outputs: + dirs: + type: Directory[] + outputSource: output-subdirs/dirs + + featurecounts: + type: File + outputSource: featureCounts/featurecounts + +requirements: + SubworkflowFeatureRequirement: {} + ScatterFeatureRequirement: {} diff --git a/lesson5/answers/subdirs.cwl b/lesson5/answers/subdirs.cwl new file mode 100644 index 0000000..fc4fe7d --- /dev/null +++ b/lesson5/answers/subdirs.cwl @@ -0,0 +1,22 @@ +cwlVersion: v1.2 +class: ExpressionTool +requirements: + InlineJavascriptRequirement: {} +inputs: + fq: File[] + bams: File[] + qc: File[] +outputs: + dirs: Directory[] +expression: |- + ${ + var dirs = []; + for (var i = 0; i < inputs.bams.length; i++) { + dirs.push({ + "class": "Directory", + "basename": inputs.fq[i].nameroot, + "listing": [inputs.bams[i], inputs.qc[i]] + }); + } + return {"dirs": dirs}; + } diff --git a/lesson5/lesson5.md b/lesson5/lesson5.md index 4567620..6640176 100644 --- a/lesson5/lesson5.md +++ b/lesson5/lesson5.md @@ -22,6 +22,7 @@ filename. ``` requirements: StepInputExpressionRequirement: {} + steps: ... STAR: