Add answers to each section.
Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <peter.amstutz@curii.com>
| Lesson | Description |
|----------|-------------|
-| [Lesson 1](lesson1/lesson1.md) | Turning a shell script into a workflow from existing tool wrappers |
+| [Lesson 1](lesson1/lesson1.md) | Turning a shell script into a workflow by composing existing tools |
| [Lesson 2](lesson2/lesson2.md) | Running and debugging a workflow |
| [Lesson 3](lesson3/lesson3.md) | Writing a tool wrapper |
| [Lesson 4](lesson4/lesson4.md) | Analyzing multiple samples |
--- /dev/null
+### 1. File header
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+### 2. Workflow Inputs
+inputs:
+  fq: File
+  genome: Directory
+  gtf: File
+
+### 3. Workflow Steps
+steps:
+  fastqc:
+    run: bio-cwl-tools/fastqc/fastqc_2.cwl
+    in:
+      reads_file: fq
+    out: [html_file]
+
+  ### 4. Running alignment with STAR
+  STAR:
+    requirements:
+      ResourceRequirement:
+        ramMin: 6000
+    run: bio-cwl-tools/STAR/STAR-Align.cwl
+    in:
+      RunThreadN: {default: 4}
+      GenomeDir: genome
+      ForwardReads: fq
+      OutSAMtype: {default: BAM}
+      OutSAMunmapped: {default: Within}
+    out: [alignment]
+
+  ### 5. Running samtools
+  samtools:
+    run: bio-cwl-tools/samtools/samtools_index.cwl
+    in:
+      bam_sorted: STAR/alignment
+    out: [bam_sorted_indexed]
+
+### 7. Workflow Outputs
+outputs:
+  qc_html:
+    type: File
+    outputSource: fastqc/html_file
+  bam_sorted_indexed:
+    type: File
+    outputSource: samtools/bam_sorted_indexed
-# Turning a shell script into a workflow using existing tools
+# Turning a shell script into a workflow by composing existing tools
-In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow.
+## Introduction
-## Setting up
+The goal of this training is to walk through the development of a
+best-practices CWL workflow by translating an existing bioinformatics
+shell script into CWL. Specific knowledge of the biology of RNA-seq
+is *not* a prerequisite for these lessons.
-We will create a new git repository and import a library of existing
-tool definitions that will help us build our workflow.
+These lessons are based on "Introduction to RNA-seq using
+high-performance computing (HPC)" lessons developed by members of the
+teaching team at the Harvard Chan Bioinformatics Core (HBC). The
+original training, which includes additional lectures about the
+biology of RNA-seq, can be found here:
-Create a new git repository to hold our workflow with this command:
+https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2
-```
-git init rnaseq-cwl-training-exercises
-```
+## Background
-On Arvados use this:
+RNA-seq is the process of sequencing RNA in a biological sample. From
+the sequence reads, we want to measure the relative number of RNA
+molecules appearing in the sample that were produced by particular
+genes. This analysis is called "differential gene expression".
-```
-git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
-```
+The entire process looks like this:
-Next, import bio-cwl-tools with this command:
+![](RNAseqWorkflow.png)
-```
-git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
-```
+For this training, we are only concerned with the middle analytical
+steps (skipping adapter trimming).
+
+* Quality control (FASTQC)
+* Alignment (mapping)
+* Counting reads associated with genes
-## The shell script
+## Analysis shell script
+
+This analysis is already available as a Unix shell script, which we
+will refer to in order to build the workflow.
+
+Reasons to use CWL over a plain shell script include portability,
+scalability, and the ability to run on platforms that are not
+traditional HPC.
+
+rnaseq_analysis_on_input_file.sh
```
#!/bin/bash
featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
```
+## Setting up
+
+We will create a new git repository and import a library of existing
+tool definitions that will help us build our workflow.
+
+Create a new git repository to hold our workflow with this command:
+
+```
+git init rnaseq-cwl-training-exercises
+```
+
+On Arvados use this:
+
+```
+git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
+```
+
+Next, import bio-cwl-tools with this command:
+
+```
+git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
+```
+
## Writing the workflow
### 1. File header
fastqc:
run: bio-cwl-tools/fastqc/fastqc_2.cwl
in:
- reads_file: fq
+ reads_file: fq
out: [html_file]
```
--- /dev/null
+### 1. File header
+cwlVersion: v1.2
+class: CommandLineTool
+
+### 2. Command line tool inputs
+inputs:
+  gtf: File
+  counts_input_bam: File
+
+### 3. Specifying the program to run
+baseCommand: featureCounts
+
+### 4. Command arguments
+arguments: [-T, $(runtime.cores),
+            -a, $(inputs.gtf),
+            -o, featurecounts.tsv,
+            $(inputs.counts_input_bam)]
+
+### 5. Outputs section
+outputs:
+  featurecounts:
+    type: File
+    outputBinding:
+      glob: featurecounts.tsv
+
+### 6. Running in a container
+hints:
+  DockerRequirement:
+    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
--- /dev/null
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+inputs:
+  fq: File
+  genome: Directory
+  gtf: File
+
+steps:
+  fastqc:
+    run: bio-cwl-tools/fastqc/fastqc_2.cwl
+    in:
+      reads_file: fq
+    out: [html_file]
+
+  STAR:
+    requirements:
+      ResourceRequirement:
+        ramMin: 6000
+    run: bio-cwl-tools/STAR/STAR-Align.cwl
+    in:
+      RunThreadN: {default: 4}
+      GenomeDir: genome
+      ForwardReads: fq
+      OutSAMtype: {default: BAM}
+      OutSAMunmapped: {default: Within}
+    out: [alignment]
+
+  samtools:
+    run: bio-cwl-tools/samtools/samtools_index.cwl
+    in:
+      bam_sorted: STAR/alignment
+    out: [bam_sorted_indexed]
+
+  ### 8. Adding it to the workflow
+  featureCounts:
+    requirements:
+      ResourceRequirement:
+        ramMin: 500
+    run: featureCounts.cwl
+    in:
+      counts_input_bam: samtools/bam_sorted_indexed
+      gtf: gtf
+    out: [featurecounts]
+
+outputs:
+  qc_html:
+    type: File
+    outputSource: fastqc/html_file
+  bam_sorted_indexed:
+    type: File
+    outputSource: samtools/bam_sorted_indexed
+
+  ### 8. Adding it to the workflow
+  featurecounts:
+    type: File
+    outputSource: featureCounts/featurecounts
```
arguments: [-T, $(runtime.cores),
-a, $(inputs.gtf),
- -o, featurecounts.tsv,
- $(inputs.counts_input_bam)]
+ -o, featurecounts.tsv,
+ $(inputs.counts_input_bam)]
```
### 5. Outputs section
outputs:
featurecounts:
type: File
- outputBinding:
- glob: featurecounts.tsv
+ outputBinding:
+ glob: featurecounts.tsv
```
### 6. Running in a container
ResourceRequirement:
ramMin: 500
run: featureCounts.cwl
- in:
- counts_input_bam: samtools/bam_sorted_indexed
- gtf: gtf
- out: [featurecounts]
+ in:
+ counts_input_bam: samtools/bam_sorted_indexed
+ gtf: gtf
+ out: [featurecounts]
```
We will add the result from featurecounts to the output:
...
featurecounts:
type: File
- outputSource: featureCounts/featurecounts
-
+ outputSource: featureCounts/featurecounts
```
You should now be able to re-run the workflow and it will run the
--- /dev/null
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+inputs:
+  fq: File
+  genome: Directory
+  gtf: File
+
+steps:
+  fastqc:
+    run: bio-cwl-tools/fastqc/fastqc_2.cwl
+    in:
+      reads_file: fq
+    out: [html_file]
+
+  STAR:
+    requirements:
+      ResourceRequirement:
+        ramMin: 6000
+    run: bio-cwl-tools/STAR/STAR-Align.cwl
+    in:
+      RunThreadN: {default: 4}
+      GenomeDir: genome
+      ForwardReads: fq
+      OutSAMtype: {default: BAM}
+      OutSAMunmapped: {default: Within}
+    out: [alignment]
+
+  samtools:
+    run: bio-cwl-tools/samtools/samtools_index.cwl
+    in:
+      bam_sorted: STAR/alignment
+    out: [bam_sorted_indexed]
+
+  featureCounts:
+    requirements:
+      ResourceRequirement:
+        ramMin: 500
+    run: featureCounts.cwl
+    in:
+      counts_input_bam: samtools/bam_sorted_indexed
+      gtf: gtf
+    out: [featurecounts]
+
+outputs:
+  qc_html:
+    type: File
+    outputSource: fastqc/html_file
+  bam_sorted_indexed:
+    type: File
+    outputSource: samtools/bam_sorted_indexed
+
+  featurecounts:
+    type: File
+    outputSource: featureCounts/featurecounts
--- /dev/null
+cwlVersion: v1.2
+class: CommandLineTool
+
+inputs:
+  gtf: File
+  counts_input_bam: File
+
+baseCommand: featureCounts
+
+arguments: [-T, $(runtime.cores),
+            -a, $(inputs.gtf),
+            -o, featurecounts.tsv,
+            $(inputs.counts_input_bam)]
+
+outputs:
+  featurecounts:
+    type: File
+    outputBinding:
+      glob: featurecounts.tsv
+
+hints:
+  DockerRequirement:
+    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
--- /dev/null
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+inputs:
+  fq: File
+  genome: Directory
+  gtf: File
+
+### 1. Subworkflows
+steps:
+  alignment:
+    run: alignment.cwl
+    in:
+      fq: fq
+      genome: genome
+      gtf: gtf
+    out: [qc_html, bam_sorted_indexed, featurecounts]
+
+outputs:
+  qc_html:
+    type: File
+    outputSource: alignment/qc_html
+  bam_sorted_indexed:
+    type: File
+    outputSource: alignment/bam_sorted_indexed
+  featurecounts:
+    type: File
+    outputSource: alignment/featurecounts
+
+requirements:
+  SubworkflowFeatureRequirement: {}
--- /dev/null
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+inputs:
+  fq: File
+  genome: Directory
+  gtf: File
+
+steps:
+  fastqc:
+    run: bio-cwl-tools/fastqc/fastqc_2.cwl
+    in:
+      reads_file: fq
+    out: [html_file]
+
+  STAR:
+    requirements:
+      ResourceRequirement:
+        ramMin: 6000
+    run: bio-cwl-tools/STAR/STAR-Align.cwl
+    in:
+      RunThreadN: {default: 4}
+      GenomeDir: genome
+      ForwardReads: fq
+      OutSAMtype: {default: BAM}
+      OutSAMunmapped: {default: Within}
+    out: [alignment]
+
+  samtools:
+    run: bio-cwl-tools/samtools/samtools_index.cwl
+    in:
+      bam_sorted: STAR/alignment
+    out: [bam_sorted_indexed]
+
+  featureCounts:
+    requirements:
+      ResourceRequirement:
+        ramMin: 500
+    run: featureCounts.cwl
+    in:
+      counts_input_bam: samtools/bam_sorted_indexed
+      gtf: gtf
+    out: [featurecounts]
+
+outputs:
+  qc_html:
+    type: File
+    outputSource: fastqc/html_file
+  bam_sorted_indexed:
+    type: File
+    outputSource: samtools/bam_sorted_indexed
+
+  featurecounts:
+    type: File
+    outputSource: featureCounts/featurecounts
--- /dev/null
+cwlVersion: v1.2
+class: CommandLineTool
+
+inputs:
+  gtf: File
+  counts_input_bam: File
+
+baseCommand: featureCounts
+
+arguments: [-T, $(runtime.cores),
+            -a, $(inputs.gtf),
+            -o, featurecounts.tsv,
+            $(inputs.counts_input_bam)]
+
+outputs:
+  featurecounts:
+    type: File
+    outputBinding:
+      glob: featurecounts.tsv
+
+hints:
+  DockerRequirement:
+    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
--- /dev/null
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+### 2. Scattering
+inputs:
+  fq: File[]
+  genome: Directory
+  gtf: File
+
+steps:
+  alignment:
+    run: alignment.cwl
+    scatter: fq
+    in:
+      fq: fq
+      genome: genome
+      gtf: gtf
+    out: [qc_html, bam_sorted_indexed, featurecounts]
+
+outputs:
+  qc_html:
+    type: File[]
+    outputSource: alignment/qc_html
+  bam_sorted_indexed:
+    type: File[]
+    outputSource: alignment/bam_sorted_indexed
+  featurecounts:
+    type: File[]
+    outputSource: alignment/featurecounts
+
+requirements:
+  SubworkflowFeatureRequirement: {}
+  ScatterFeatureRequirement: {}
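
The effect of `scatter: fq` in the file above can be sketched in plain JavaScript. This is a toy model under stated assumptions, not how a CWL runner is implemented: `scatterStep` and `fakeAlignment` are hypothetical names, and the file names are made up. It only illustrates why every output of the scattered step becomes an array (`File[]`) in the workflow outputs.

```javascript
// Toy model of CWL scatter: run the step once per element of the
// scattered input and collect each named output into an array.
function scatterStep(runStep, scatterName, stepInputs) {
  var collected = {};
  stepInputs[scatterName].forEach(function (item) {
    // Each run sees a single element in place of the whole array.
    var oneRun = Object.assign({}, stepInputs);
    oneRun[scatterName] = item;
    var outputs = runStep(oneRun);
    Object.keys(outputs).forEach(function (name) {
      if (!collected[name]) { collected[name] = []; }
      collected[name].push(outputs[name]);
    });
  });
  return collected;
}

// Hypothetical stand-in for alignment.cwl; the output names are invented.
function fakeAlignment(inputs) {
  return {
    qc_html: inputs.fq + "_fastqc.html",
    bam_sorted_indexed: inputs.fq + ".bam"
  };
}

var scattered = scatterStep(fakeAlignment, "fq", {
  fq: ["sampleA.fq", "sampleB.fq"],
  genome: "genome_dir",
  gtf: "genes.gtf"
});
console.log(JSON.stringify(scattered, null, 2));
```

The non-scattered inputs (`genome`, `gtf`) are passed unchanged to every run, which mirrors how only `fq` is listed under `scatter`.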
--- /dev/null
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+inputs:
+  fq: File
+  genome: Directory
+  gtf: File
+
+steps:
+  fastqc:
+    run: bio-cwl-tools/fastqc/fastqc_2.cwl
+    in:
+      reads_file: fq
+    out: [html_file]
+
+  STAR:
+    requirements:
+      ResourceRequirement:
+        ramMin: 6000
+    run: bio-cwl-tools/STAR/STAR-Align.cwl
+    in:
+      RunThreadN: {default: 4}
+      GenomeDir: genome
+      ForwardReads: fq
+      OutSAMtype: {default: BAM}
+      OutSAMunmapped: {default: Within}
+    out: [alignment]
+
+  samtools:
+    run: bio-cwl-tools/samtools/samtools_index.cwl
+    in:
+      bam_sorted: STAR/alignment
+    out: [bam_sorted_indexed]
+
+outputs:
+  qc_html:
+    type: File
+    outputSource: fastqc/html_file
+  bam_sorted_indexed:
+    type: File
+    outputSource: samtools/bam_sorted_indexed
--- /dev/null
+cwlVersion: v1.2
+class: CommandLineTool
+
+### 4. Combining results
+inputs:
+  gtf: File
+  counts_input_bam:
+    - File
+    - File[]
+
+baseCommand: featureCounts
+
+arguments: [-T, $(runtime.cores),
+            -a, $(inputs.gtf),
+            -o, featurecounts.tsv,
+            $(inputs.counts_input_bam)]
+
+outputs:
+  featurecounts:
+    type: File
+    outputBinding:
+      glob: featurecounts.tsv
+
+hints:
+  DockerRequirement:
+    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
--- /dev/null
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+### 2. Scattering
+inputs:
+  fq: File[]
+  genome: Directory
+  gtf: File
+
+steps:
+  alignment:
+    run: alignment.cwl
+    scatter: fq
+    in:
+      fq: fq
+      genome: genome
+      gtf: gtf
+    out: [qc_html, bam_sorted_indexed]
+
+  ### 4. Combining results
+  featureCounts:
+    requirements:
+      ResourceRequirement:
+        ramMin: 500
+    run: featureCounts.cwl
+    in:
+      counts_input_bam: alignment/bam_sorted_indexed
+      gtf: gtf
+    out: [featurecounts]
+
+outputs:
+  qc_html:
+    type: File[]
+    outputSource: alignment/qc_html
+  bam_sorted_indexed:
+    type: File[]
+    outputSource: alignment/bam_sorted_indexed
+
+  ### 4. Combining results
+  featurecounts:
+    type: File
+    outputSource: featureCounts/featurecounts
+
+requirements:
+  SubworkflowFeatureRequirement: {}
+  ScatterFeatureRequirement: {}
alignment:
run: alignment.cwl
in:
- fq: fq
- genome: genome
- gtf: gtf
- out: [qc_html, bam_sorted_indexed, featurecounts]
+ fq: fq
+ genome: genome
+ gtf: gtf
+ out: [qc_html, bam_sorted_indexed, featurecounts]
```
In the outputs section, all the output sources are from the alignment step:
steps:
alignment:
run: alignment.cwl
- scatter: fq
+ scatter: fq
in:
- fq: fq
- genome: genome
- gtf: gtf
- out: [qc_html, bam_sorted_indexed, featurecounts]
+ fq: fq
+ genome: genome
+ gtf: gtf
+ out: [qc_html, bam_sorted_indexed, featurecounts]
```
Because the scatter produces multiple outputs, each output parameter
--- /dev/null
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+inputs:
+  fq: File
+  genome: Directory
+  gtf: File
+
+requirements:
+  StepInputExpressionRequirement: {}
+
+steps:
+  fastqc:
+    run: bio-cwl-tools/fastqc/fastqc_2.cwl
+    in:
+      reads_file: fq
+    out: [html_file]
+
+  STAR:
+    requirements:
+      ResourceRequirement:
+        ramMin: 6000
+    run: bio-cwl-tools/STAR/STAR-Align.cwl
+    in:
+      RunThreadN: {default: 4}
+      GenomeDir: genome
+      ForwardReads: fq
+      OutSAMtype: {default: BAM}
+      OutSAMunmapped: {default: Within}
+      ### 1. Expressions on step inputs
+      OutFileNamePrefix: {valueFrom: "$(inputs.ForwardReads.nameroot)."}
+    out: [alignment]
+
+  samtools:
+    run: bio-cwl-tools/samtools/samtools_index.cwl
+    in:
+      bam_sorted: STAR/alignment
+    out: [bam_sorted_indexed]
+
+outputs:
+  qc_html:
+    type: File
+    outputSource: fastqc/html_file
+  bam_sorted_indexed:
+    type: File
+    outputSource: samtools/bam_sorted_indexed
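
As a rough illustration of what the `OutFileNamePrefix` expression in the file above evaluates to, the sketch below approximates in plain JavaScript how `nameroot` relates to a File's `basename` (the basename with its final extension removed) and then appends the literal `.` from the `valueFrom` string. The helper names `splitBasename` and `outFileNamePrefix` and the file name `sampleA.subset.fq` are hypothetical; a real CWL runner fills in `nameroot` itself.

```javascript
// Approximation of how nameroot/nameext are derived from basename:
// nameext is the final extension (including its dot), nameroot is the rest.
function splitBasename(basename) {
  var dot = basename.lastIndexOf(".");
  // No dot, or only a leading dot (hidden file): treat as having no extension.
  if (dot <= 0) {
    return {nameroot: basename, nameext: ""};
  }
  return {nameroot: basename.slice(0, dot), nameext: basename.slice(dot)};
}

// Mirrors valueFrom: "$(inputs.ForwardReads.nameroot)."
function outFileNamePrefix(inputs) {
  return inputs.ForwardReads.nameroot + ".";
}

// Hypothetical input file, reduced to the fields used here.
var parts = splitBasename("sampleA.subset.fq");
var prefix = outFileNamePrefix({
  ForwardReads: {"class": "File", basename: "sampleA.subset.fq", nameroot: parts.nameroot}
});
console.log(prefix);
```

Because the prefix is derived from each input file's name, the alignment outputs for different samples get distinct names instead of colliding.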
--- /dev/null
+cwlVersion: v1.2
+class: CommandLineTool
+
+inputs:
+  gtf: File
+  counts_input_bam:
+    - File
+    - File[]
+
+baseCommand: featureCounts
+
+arguments: [-T, $(runtime.cores),
+            -a, $(inputs.gtf),
+            -o, featurecounts.tsv,
+            $(inputs.counts_input_bam)]
+
+outputs:
+  featurecounts:
+    type: File
+    outputBinding:
+      glob: featurecounts.tsv
+
+hints:
+  DockerRequirement:
+    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
--- /dev/null
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+
+inputs:
+  fq: File[]
+  genome: Directory
+  gtf: File
+
+steps:
+  alignment:
+    run: alignment.cwl
+    scatter: fq
+    in:
+      fq: fq
+      genome: genome
+      gtf: gtf
+    out: [qc_html, bam_sorted_indexed]
+
+  featureCounts:
+    requirements:
+      ResourceRequirement:
+        ramMin: 500
+    run: featureCounts.cwl
+    in:
+      counts_input_bam: alignment/bam_sorted_indexed
+      gtf: gtf
+    out: [featurecounts]
+
+  ### 2. Organizing output files into Directories
+  output-subdirs:
+    run: subdirs.cwl
+    in:
+      fq: fq
+      bams: alignment/bam_sorted_indexed
+      qc: alignment/qc_html
+    out: [dirs]
+
+outputs:
+  dirs:
+    type: Directory[]
+    outputSource: output-subdirs/dirs
+
+  featurecounts:
+    type: File
+    outputSource: featureCounts/featurecounts
+
+requirements:
+  SubworkflowFeatureRequirement: {}
+  ScatterFeatureRequirement: {}
--- /dev/null
+cwlVersion: v1.2
+class: ExpressionTool
+requirements:
+  InlineJavascriptRequirement: {}
+inputs:
+  fq: File[]
+  bams: File[]
+  qc: File[]
+outputs:
+  dirs: Directory[]
+expression: |-
+  ${
+    var dirs = [];
+    for (var i = 0; i < inputs.bams.length; i++) {
+      dirs.push({
+        "class": "Directory",
+        "basename": inputs.fq[i].nameroot,
+        "listing": [inputs.bams[i], inputs.qc[i]]
+      });
+    }
+    return {"dirs": dirs};
+  }
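
To see what the expression above produces, its body can be run outside CWL by wrapping it in an ordinary function and passing `inputs` explicitly. This is only a sketch: `makeSubdirs` is a hypothetical wrapper name, and the mock File objects carry invented names and just the fields the expression reads.

```javascript
// Stand-alone copy of the expression body from subdirs.cwl.
// In CWL the runner supplies `inputs`; here it is an ordinary argument.
function makeSubdirs(inputs) {
  var dirs = [];
  for (var i = 0; i < inputs.bams.length; i++) {
    // One Directory per sample, named after the FASTQ file's nameroot,
    // holding that sample's BAM file and QC report.
    dirs.push({
      "class": "Directory",
      "basename": inputs.fq[i].nameroot,
      "listing": [inputs.bams[i], inputs.qc[i]]
    });
  }
  return {"dirs": dirs};
}

// Hypothetical inputs for two samples.
var mockInputs = {
  fq: [
    {"class": "File", basename: "sampleA.fq", nameroot: "sampleA"},
    {"class": "File", basename: "sampleB.fq", nameroot: "sampleB"}
  ],
  bams: [
    {"class": "File", basename: "sampleA.bam"},
    {"class": "File", basename: "sampleB.bam"}
  ],
  qc: [
    {"class": "File", basename: "sampleA_fastqc.html"},
    {"class": "File", basename: "sampleB_fastqc.html"}
  ]
};

var subdirResult = makeSubdirs(mockInputs);
console.log(JSON.stringify(subdirResult, null, 2));
```

The loop pairs files by position, so it relies on `fq`, `bams`, and `qc` being in the same order, which holds when `bams` and `qc` come from a scatter over `fq`.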
```
requirements:
StepInputExpressionRequirement: {}
+
steps:
...
STAR: