From 1b547ed23c65a69e68ef33c4d91eb69b504c5c28 Mon Sep 17 00:00:00 2001 From: Peter Amstutz Date: Thu, 28 Jan 2021 17:47:28 -0500 Subject: [PATCH] Second pass. Lots of work. Arvados-DCO-1.1-Signed-off-by: Peter Amstutz --- _episodes/01-introduction.md | 101 +++------ _episodes/02-workflow.md | 355 +++++++++++++++++++++++++++----- _episodes/03-running.md | 90 ++++---- _episodes/04-commandlinetool.md | 234 +++++++++++++-------- _episodes/05-scatter.md | 88 +++++--- _episodes/06-expressions.md | 16 +- _episodes/07-resources.md | 2 +- answers/ep2/main.cwl | 15 +- answers/ep3/main.cwl | 43 ++++ answers/ep4/featureCounts.cwl | 12 +- answers/ep4/main.cwl | 4 +- answers/ep5/part1/alignment.cwl | 2 +- answers/ep5/part2/alignment.cwl | 2 +- answers/ep5/part4/alignment.cwl | 2 +- answers/ep6/alignment.cwl | 2 +- 15 files changed, 657 insertions(+), 311 deletions(-) create mode 100644 answers/ep3/main.cwl diff --git a/_episodes/01-introduction.md b/_episodes/01-introduction.md index 8ee870e..81819f9 100644 --- a/_episodes/01-introduction.md +++ b/_episodes/01-introduction.md @@ -4,33 +4,48 @@ teaching: 10 exercises: 0 questions: - "What is CWL?" +- "What are the requirements for this training?" - "What is the goal of this training?" objectives: -- "First learning objective. (FIXME)" +- "Understand how the training will be motivated by an example analysis." keypoints: -- "First key point. Brief Answer to questions. (FIXME)" +- "Common Workflow Language is a standard for describing data analysis workflows" +- "This training assumes some basic familiarity with editing text files, the Unix command line, and Unix shell scripts." +- "We will use an bioinformatics RNA-seq analysis as an example workflow, but does not require in-depth knowledge of biology." +- "After completing this training, you should be able to begin writing workflows for your own analysis, and know where to learn more." 
--- # Introduction to Common Worklow Language The Common Workflow Language (CWL) is an open standard for describing -analysis workflows and tools in a way that makes them portable and -scalable across a variety of software and hardware environments, from -workstations to cluster, cloud, and high performance computing (HPC) -environments. CWL is designed to meet the needs of data-intensive -science, such as Bioinformatics, Medical Imaging, Astronomy, High -Energy Physics, and Machine Learning. +automated, batch data analysis workflows. Unlike many programming +languages, CWL is a declarative language. This means it describes +_what_ should happen, but not _how_ it should happen. This enables +workflows written in CWL to be portable and scalable across a variety +of software and hardware environments, from workstations to cluster, +cloud, and high performance computing (HPC) environments. As a +standard with multiple implementations, CWL is particularly well +suited for research collaboration, publishing, and high-throughput +production data analysis. # Introduction to this training The goal of this training is to walk the student through the development of a best-practices CWL workflow, starting from an -existing shell script that performs a common bioinformatics analysis. +existing shell script that performs a simple RNA-seq bioinformatics +analysis. At the conclusion of this training, you should have a grasp +of the essential components of a workflow, and have a basis for +learning more. + +This training assumes some basic familiarity with editing text files, +the Unix command line, and Unix shell scripts. Specific knowledge of the biology of RNA-seq is *not* a prerequisite -for these lessons. CWL is not domain specific to bioinformatics. We -hope that you will find this training useful even if you work in some -other field of research. +for these lessons. 
Although originally developed to solve big data +problems in genomics, CWL is not domain specific to bioinformatics, +and is used in a number of other fields including medical imaging, +astronomy, geospatial, and machine learning. We hope that you will +find this training useful regardless of your area of research. These lessons are based on [Introduction to RNA-seq using high-performance computing @@ -60,65 +75,7 @@ steps (skipping adapter trimming). * Counting reads associated with genes In this training, we are not attempting to develop the analysis from -scratch, instead we we will be starting from an analysis written as a -shell script. We will be using the following shell script as a guide to build -our workflow. - -rnaseq_analysis_on_input_file.sh - -``` -#!/bin/bash - -# Based on -# https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html -# - -# This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it. -# USAGE: sh rnaseq_analysis_on_input_file.sh - -set -e - -# initialize a variable with an intuitive name to store the name of the input fastq file -fq=$1 - -# grab base of filename for naming outputs -base=`basename $fq .subset.fq` -echo "Sample name is $base" - -# specify the number of cores to use -cores=4 - -# directory with genome reference FASTA and index files + name of the gene annotation file -genome=rnaseq/reference_data -gtf=rnaseq/reference_data/chr1-hg19_genes.gtf - -# make all of the output directories -# The -p option means mkdir will create the whole path if it -# does not exist and refrain from complaining if it does exist -mkdir -p rnaseq/results/fastqc -mkdir -p rnaseq/results/STAR -mkdir -p rnaseq/results/counts - -# set up output filenames and locations -fastqc_out=rnaseq/results/fastqc -align_out=rnaseq/results/STAR/${base}_ -counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam -counts=rnaseq/results/counts/${base}_featurecounts.txt - -echo 
"Processing file $fq" - -# Run FastQC and move output to the appropriate folder -fastqc $fq - -# Run STAR -STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard - -# Create BAM index -samtools index $counts_input_bam - -# Count mapped reads -featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam -``` - +scratch, instead we we will be starting from an analysis already +written in a shell script, which will be supplied in lesson 2. {% include links.md %} diff --git a/_episodes/02-workflow.md b/_episodes/02-workflow.md index cfb133c..3271ba8 100644 --- a/_episodes/02-workflow.md +++ b/_episodes/02-workflow.md @@ -1,36 +1,184 @@ --- -title: "Make a workflow by composing tools" -teaching: 0 -exercises: 0 +title: "Create a Workflow by Composing Tools" +teaching: 20 +exercises: 10 questions: -- "Key question (FIXME)" +- "What is the syntax of CWL?" +- "What are the key components of a workflow?" objectives: -- "First learning objective. (FIXME)" +- "Write a workflow based on the source shell script, making use of existing tool wrappers." keypoints: -- "First key point. Brief Answer to questions. (FIXME)" +- "CWL documents are written using a syntax called YAML." +- "The key components of the workflow are: the header, the inputs, the steps, and the outputs." --- -# 1. File header +# Source shell script -Create a new file "main.cwl" +In this lesson, we will develop an initial workflow inspired by the +following shell script. + +rnaseq_analysis_on_input_file.sh + +``` +#!/bin/bash + +# Based on +# https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html +# + +# This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it. 
+# USAGE: sh rnaseq_analysis_on_input_file.sh + +set -e + +# initialize a variable with an intuitive name to store the name of the input fastq file +fq=$1 + +# grab base of filename for naming outputs +base=`basename $fq .subset.fq` +echo "Sample name is $base" + +# specify the number of cores to use +cores=4 + +# directory with genome reference FASTA and index files + name of the gene annotation file +genome=rnaseq/reference_data +gtf=rnaseq/reference_data/chr1-hg19_genes.gtf + +# make all of the output directories +# The -p option means mkdir will create the whole path if it +# does not exist and refrain from complaining if it does exist +mkdir -p rnaseq/results/fastqc +mkdir -p rnaseq/results/STAR +mkdir -p rnaseq/results/counts + +# set up output filenames and locations +fastqc_out=rnaseq/results/fastqc +align_out=rnaseq/results/STAR/${base}_ +counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam +counts=rnaseq/results/counts/${base}_featurecounts.txt + +echo "Processing file $fq" + +# Run FastQC and move output to the appropriate folder +fastqc $fq + +# Run STAR +STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard + +# Create BAM index +samtools index $counts_input_bam + +# Count mapped reads +featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam +``` +{: .language-bash } -Start with this header. +# CWL Syntax +CWL documents are written using a format called "YAML". Here is a crash-course in YAML: + +Data fields are written with the name, followed by a colon `:`, a space, +and then the value. + +``` +fieldName: value +``` +{: .language-yaml } + +The value is the remaining text to the end of the line. + +Special characters in YAML include `:`, `{`, `}` `[`, `]`, `#`, `!` +and `%`. If your text begins with any of these characters, you must +surround the string in single or double quotes. 
+ +``` +fieldName: "#quoted-value" +``` +{: .language-yaml } + +You can write multi-line text by putting `|-` after the colon and writing an indented +block. The leading whitespace will be removed from the actual value. + +``` +fieldName: |- + This is a multi- + line string. + Hooray! +``` +{: .language-yaml } + +Nested sections are indented: + +``` +section1: + field1: value1 + field2: value2 +``` +{: .language-yaml } + +Nested sections can _also_ be wrapped in curly brackets. In this case, fields must be comma-separated. + +``` +section1: {field1: value1, field2: value2} +``` +{: .language-yaml } + +When each item is on its own line starting with a dash `-`, it is a list. + +``` +section2: + - value1 + - value2 +``` +{: .language-yaml } + +Lists can _also_ be wrapped in square brackets. In this case, values must be comma-separated. + +``` +section2: [value1, value2] +``` +{: .language-yaml } + +Comments start with `#`. + +``` +# This is a comment about field3 +field3: stuff + +field4: stuff # This is a comment about field4 +``` +{: .language-yaml } + +Finally, YAML is a superset of JSON. Valid JSON is also valid YAML, +so you may sometimes see JSON format being used instead of YAML format +for CWL documents. + +# Workflow header + +Create a new file "main.cwl" + +Let's start with the header. ``` cwlVersion: v1.2 class: Workflow label: RNAseq CWL practice workflow ``` +{: .language-yaml } + +* cwlVersion - Every file must include this. It declares the version of CWL in use. +* class - This is the type of CWL document. We will see other types in future lessons. +* label - Optional title of your workflow. -# 2. Workflow Inputs +# Workflow Inputs The purpose of a workflow is to consume some input parameters, run a series of steps, and produce output values. For this analysis, the input parameters are the fastq file and the reference data required by STAR. 
-In the original shell script, the following variables are declared: +In the source shell script, the following variables are declared: ``` # initialize a variable with an intuitive name to store the name of the input fastq file @@ -40,6 +188,7 @@ fq=$1 genome=rnaseq/reference_data gtf=rnaseq/reference_data/chr1-hg19_genes.gtf ``` +{: .language-bash } In CWL, we will declare these variables in the `inputs` section. @@ -55,32 +204,87 @@ inputs: genome: Directory gtf: File ``` +{: .language-yaml } -# 3. Workflow Steps +# Workflow Steps A workflow consists of one or more steps. This is the `steps` section. -Now we need to describe the first step of the workflow. This step is to run `fastqc`. +Now we need to describe the first step of the workflow. In the source +script, the first step is to run `fastqc`. + +``` +# Run FastQC and move output to the appropriate folder +fastqc $fq +``` +{: .language-bash } A workflow step consists of the name of the step, the tool to `run`, the input parameters to be passed to the tool in `in`, and the output parameters expected from the tool in `out`. -The value of `run` references the tool file. Tip: while typing the -file name, you can get suggestions and auto-completion on a partial -name using control+space. +The value of `run` references the tool file. The tool file describes +how to run the tool (we will discuss how to write tool files in lesson +4). If we look in `bio-cwl-tools` (which you should have imported +when setting up a practice repository in the initial setup +instructions), we find `bio-cwl-tools/fastqc/fastqc_2.cwl`. + +Next, the `in` block is a mapping from the tool's input parameters to +the workflow parameters that will be assigned to those inputs. We +need to know what input parameters the tool accepts. 
+ +Let's open up the tool file and take a look: + +Find the `inputs` section of `bio-cwl-tools/fastqc/fastqc_2.cwl`: -The `in` block lists input parameters to the tool and the workflow -parameters that will be assigned to those inputs. +``` +inputs: + + reads_file: + type: + - File + inputBinding: + position: 50 + doc: | + Input bam,sam,bam_mapped,sam_mapped or fastq file +``` +{: .language-yaml } + +Now we know we need to provide an input parameter called `reads_file`. + +Next, the `out` section is a list of output parameters from the tool +that will be used later in the workflow, or as workflow output. We +need to know what output parameters the tool produces. Find the +`outputs` section of `bio-cwl-tools/fastqc/fastqc_2.cwl`: -The `out` block lists output parameters to the tool that are used -later in the workflow. +``` +outputs: + + zipped_file: + type: + - File + outputBinding: + glob: '*.zip' + html_file: + type: + - File + outputBinding: + glob: '*.html' + summary_file: + type: + - File + outputBinding: + glob: | + ${ + return "*/summary.txt"; + } +``` +{: .language-yaml } + +Now we know to expect an output parameter called `html_file`. -You need to know which input and output parameters are available for -each tool. In vscode, click on the value of `run` and select "Go to -definition" to open the tool file. Look for the `inputs` and -`outputs` sections of the tool file to find out what parameters are -defined. +Putting this all together, the `fastq` step consists of a `run`, `in` +and `out` subsections, and looks like this: ``` steps: @@ -90,33 +294,80 @@ steps: reads_file: fq out: [html_file] ``` +{: .language-yaml } -# 4. Running alignment with STAR +# Running alignment with STAR -STAR has more parameters. Sometimes we want to provide input values -to a step without making them as workflow-level inputs. We can do -this with `{default: N}` +The next step is to run the STAR aligner. 
+``` +# Run STAR +STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard +``` +{: .language-bash } + +We will go through the same process as in the previous section. We find +there is `bio-cwl-tools/STAR/STAR-Align.cwl`. We will open the file +and look at the `inputs` section to determine what input parameters +correspond to the command line parameters from our source script. +Command line flags generally appear in either the `arguments` +field, or the `prefix` field of the `inputBinding` section of an input +parameter declaration. For example, the following declaration tells us that the +`GenomeDir` input parameter corresponds to the `--genomeDir` command +line parameter. ``` - STAR: - requirements: - ResourceRequirement: - ramMin: 6000 - run: bio-cwl-tools/STAR/STAR-Align.cwl - in: - RunThreadN: {default: 4} - GenomeDir: genome - ForwardReads: fq - OutSAMtype: {default: BAM} - OutSAMunmapped: {default: Within} - out: [alignment] + GenomeDir: + type: Directory + inputBinding: + prefix: "--genomeDir" ``` +{: .language-yaml } + +Sometimes we want to provide input values to a step without making +them workflow-level inputs. We can do this with `{default: N}`. +For example: -# 5. Running samtools +``` + in: + RunThreadN: {default: 4} +``` +{: .language-yaml } + +> ## `Exercise` +> +> Look at `STAR-Align.cwl` and identify the other input parameters that +> correspond to the command line arguments used in the source script. +> Also identify the output parameter. Use these to write the STAR +> step. 
+> +> > ## `Solution` +> > +> > ``` +> > STAR: +> > run: bio-cwl-tools/STAR/STAR-Align.cwl +> > in: +> > RunThreadN: {default: 4} +> > GenomeDir: genome +> > ForwardReads: fq +> > OutSAMtype: {default: BAM} +> > OutSAMunmapped: {default: Within} +> > out: [alignment] +> > ``` +> > {: .language-yaml } +> {: .solution} +{: .challenge} + +# Running samtools The third step is to generate an index for the aligned BAM. +``` +# Create BAM index +samtools index $counts_input_bam +``` +{: .language-bash } + For this step, we need to use the output of a previous step as input to this step. We refer the output of a step by with name of the step (STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment` @@ -131,16 +382,23 @@ step will not run until the `STAR` step has completed successfully. bam_sorted: STAR/alignment out: [bam_sorted_indexed] ``` +{: .language-yaml } + +# featureCounts -# 6. featureCounts +``` +# Count mapped reads +featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam +``` +{: .language-bash } As of this writing, the `subread` package that provides -`featureCounts` is not available in bio-cwl-tools (and if it has been -added since writing this, let's pretend that it isn't there.) We will -go over how to write a CWL wrapper for a command line tool in -lesson 3. For now, we will leave off the final step. +`featureCounts` is not available in `bio-cwl-tools` (and if it has been +added since then, let's pretend that it isn't there.) We will go over +how to write a CWL wrapper for a command line tool in lesson 4. For +now, we will leave off the final step. -# 7. Workflow Outputs +# Workflow Outputs The last thing to do is declare the workflow outputs in the `outputs` section. 
@@ -166,3 +424,4 @@ outputs: type: File outputSource: samtools/bam_sorted_indexed ``` +{: .language-yaml } diff --git a/_episodes/03-running.md b/_episodes/03-running.md index c6ff7d5..f3c9779 100644 --- a/_episodes/03-running.md +++ b/_episodes/03-running.md @@ -1,20 +1,25 @@ --- -title: "Running and debugging a workflow" -teaching: 0 -exercises: 0 +title: "Running and Debugging a Workflow" +teaching: 10 +exercises: 20 questions: -- "Key question (FIXME)" +- "How do I provide input to run a workflow?" +- "What should I do if the workflow fails?" objectives: -- "First learning objective. (FIXME)" +- "Write an input parameter file." +- "Execute the workflow." +- "Diagnose workflow errors." keypoints: -- "First key point. Brief Answer to questions. (FIXME)" +- "The input parameter file is a YAML file with values for each input parameter." +- "A common reason a workflow step fails is insufficient RAM." +- "Use ResourceRequirement to set the amount of RAM to be allocated to the job." +- "Output parameter values are printed as JSON to standard output at the end of the run." --- -# 1. The input parameter file +# The input parameter file CWL input values are provided in the form of a YAML or JSON file. -Create one by right clicking on the explorer, select "New File" and -create a called file "main-input.yaml". +Create a file called `main-input.yaml`. This file gives the values for parameters declared in the `inputs` section of our workflow. Our workflow takes `fq`, `genome` and `gtf` @@ -26,6 +31,8 @@ plain strings that may or may not be file paths. Note: if you don't have example sequence data or the STAR index files, see [setup](/setup.html). 
+main-input.yaml + ``` fq: class: File @@ -38,35 +45,32 @@ gtf: class: File location: rnaseq/reference_data/chr1-hg19_genes.gtf ``` - -On Arvados, do this: - -``` -fq: - class: File - location: keep:9178fe1b80a08a422dbe02adfd439764+925/raw_fastq/Mov10_oe_1.subset.fq - format: http://edamontology.org/format_1930 -genome: - class: Directory - location: keep:02a12ce9e2707610991bd29d38796b57+2912 -gtf: - class: File - location: keep:9178fe1b80a08a422dbe02adfd439764+925/reference_data/chr1-hg19_genes.gtf -``` - -# 2. Running the workflow - -Type this into the terminal: - -``` -cwl-runner main.cwl main-input.yaml -``` - -# 3. Debugging the workflow +{: .language-yaml } + +> ## Running the workflow +> +> Type this into the terminal: +> +> ``` +> cwl-runner main.cwl main-input.yaml +> ``` +> {: .language-bash } +> +> This may take a few minutes to run, and will print some amount of +> logging. The logging you see, how to access other logs, and how to +> track workflow progress will depend on your CWL runner platform. +> +{: .challenge } + +# Debugging the workflow + +Depending on whether and how your workflow platform enforces memory +limits, your workflow may fail. Let's talk about what to do when a +workflow fails. A workflow can fail for many reasons: some possible reasons include -bad input, bugs in the code, or running out memory. In this case, the -STAR workflow might fail with an out of memory error. +bad input, bugs in the code, or running out of memory. In our example, +the STAR workflow may fail with an out-of-memory error. To help diagnose these errors, the workflow runner produces logs that record what happened, either in the terminal or the web interface. @@ -90,13 +94,13 @@ Container exited with code: 137 If this happens, you will need to request more RAM. -# 4. Setting runtime RAM requirements +# Setting runtime RAM requirements By default, a step is allocated 256 MB of RAM. 
From the STAR error message: > Check if you have enough RAM 5711762337 bytes -We can see that STAR requires quite a bit more RAM than that. To +We can see that STAR requires quite a bit more RAM than 256 MB. To request more RAM, add a "requirements" section with "ResourceRequirement" to the "STAR" step: @@ -104,9 +108,11 @@ request more RAM, add a "requirements" section with STAR: requirements: ResourceRequirement: - ramMin: 8000 + ramMin: 9000 run: bio-cwl-tools/STAR/STAR-Align.cwl + ... ``` +{: .language-yaml } Resource requirements you can set include: @@ -117,11 +123,10 @@ Resource requirements you can set include: After setting the RAM requirements, re-run the workflow. -# 5. Workflow results +# Workflow results The CWL runner will print a results JSON object to standard output. It will look something like this (it may include additional fields). - ``` { "bam_sorted_indexed": { @@ -146,7 +151,8 @@ The CWL runner will print a results JSON object to standard output. It will loo } } ``` +{: .language-yaml } -This has the same structure as `main-input.yaml`. The each output +This has a similar structure to `main-input.yaml`. Each output parameter is listed, with the `location` field of each `File` object indicating where the output file can be found. diff --git a/_episodes/04-commandlinetool.md b/_episodes/04-commandlinetool.md index cae1682..22110fa 100644 --- a/_episodes/04-commandlinetool.md +++ b/_episodes/04-commandlinetool.md @@ -1,47 +1,54 @@ --- -title: "Writing a tool wrapper" -teaching: 0 -exercises: 0 +title: "Writing a Tool Wrapper" +teaching: 15 +exercises: 20 questions: -- "Key question (FIXME)" +- "What are the key components of a tool wrapper?" +- "How do I use software containers to supply the software I want to run?" objectives: -- "First learning objective. (FIXME)" +- "Write a tool wrapper for the featureCounts tool." +- "Find a software container that has the software we want to use." +- "Add the tool wrapper to our main workflow." 
keypoints: -- "First key point. Brief Answer to questions. (FIXME)" +- "The key components of a command line tool wrapper are the header, inputs, baseCommand, arguments, and outputs." +- "Like workflows, CommandLineTools have `inputs` and `outputs`." +- "Use `baseCommand` and `arguments` to provide the program to run and the command line arguments to run it with." +- "Use `glob` to capture output files and assign them to output parameters." +- "Use DockerRequirement to supply the name of the Docker image that contains the software to run." --- It is time to add the last step in the analysis. +``` +# Count mapped reads +featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam +``` +{: .language-bash } + This will use the "featureCounts" tool from the "subread" package. -# 1. File header +# File header Create a new file "featureCounts.cwl" -Start with this header +Let's start with the header. This is very similar to the workflow header, except that we use `class: CommandLineTool`. ``` cwlVersion: v1.2 class: CommandLineTool +label: featureCounts tool ``` +{: .language-yaml } -# 2. Command line tool inputs +# Command line tool inputs A CommandLineTool describes a single invocation of a command line program. -It consumes some input parameters, runs a program, and produce output -values. - -Here is the original shell command: - -``` -featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam -``` +It consumes some input parameters, runs a program, and captures +output, mainly in the form of files produced by the program. The variables used in the bash script are `$cores`, `$gtf`, `$counts` and `$counts_input_bam`. -The parameters - This gives us two file inputs, `gtf` and `counts_input_bam` which we can declare in our `inputs` section: ``` @@ -49,27 +56,22 @@ inputs: gtf: File counts_input_bam: File ``` +{: .language-yaml } -# 3. Specifying the program to run +# Specifying the program to run Give the name of the program to run in `baseCommand`. 
``` baseCommand: featureCounts ``` +{: .language-yaml } -# 4. Command arguments +# Command arguments The easiest way to describe the command line is with an `arguments` section. This takes a comma-separated list of command line arguments. -Input variables are included on the command line as -`$(inputs.name_of_parameter)`. When the tool is executed, these input -parameter values are substituted for these variable. - -Special variables are also available. The runtime environment -describes the resources allocated to running the program. Here we use -`$(runtime.cores)` to decide how many threads to request. ``` arguments: [-T, $(runtime.cores), @@ -77,8 +79,42 @@ arguments: [-T, $(runtime.cores), -o, featurecounts.tsv, $(inputs.counts_input_bam)] ``` +{: .language-yaml } -# 5. Outputs section +Input variables are included on the command line as +`$(inputs.name_of_parameter)`. When the tool is executed, the +variables will be replaced with the input parameter values. + +There are also some special variables. The `runtime` object describes +the resources allocated to running the program. Here we use +`$(runtime.cores)` to decide how many threads to request. + +> ## `arguments` vs `inputBinding` +> +> You may recall from examining the existing fastqc and STAR tool +> wrappers in lesson 2 that another way to express command line parameters +> is with `inputBinding` and `prefix` on individual input parameters. +> +> ``` +> inputs: +> parametername: +> type: parametertype +> inputBinding: +> prefix: --some-option +> ``` +> {: .language-yaml } +> +> We use `arguments` in the example simply because it is easier to see +> how it lines up with the source shell script. +> +> You can use both `inputBinding` and `arguments` in the same +> CommandLineTool document. There is no "right" or "wrong" way, and +> one does not override the other; they are combined to produce the +> final command line invocation. 
+> +{: .callout} + +# Outputs section In CWL, you must explicitly identify the outputs of a program. This associates output parameters with specific files, and enables the @@ -102,38 +138,50 @@ outputs: outputBinding: glob: featurecounts.tsv ``` +{: .language-yaml } -# 6. Running in a container +# Running in a container In order to run the tool, it needs to be installed. Using software containers, a tool can be pre-installed into a compatible runtime environment, and that runtime environment (called a container image) can be downloaded and run on demand. -Many bioinformatics tools are already available as containers. One -resource is the BioContainers project. Let's find the "subread" software: - - 1. Visit https://biocontainers.pro/ - 2. Click on "Registry" - 3. Search for "subread" - 4. Click on the search result for "subread" - 5. Click on the tab "Packages and Containers" - 6. Choose a row with type "docker", then on the right side of the "Full -Tag" column for that row, click the "copy to clipboard" button. - -To declare that you want to run inside a container, create a section -called `hints` with a subsection `DockerRequirement`. Under -`DockerRequirement`, paste the text your copied in the above step. -Replace the text `docker pull` to `dockerPull:` and indent it so it is -in the `DockerRequirement` section. - -``` -hints: - DockerRequirement: - dockerPull: quay.io/biocontainers/subread:1.5.0p3--0 -``` - -# 7. Running a tool on its own +Although plain CWL does not _require_ the use of containers, many +popular platforms that run CWL do require the software be supplied in +the form of a container image. + +> ## Finding container images +> +> Many bioinformatics tools are already available as containers. One +> resource is the BioContainers project. Let's find the "subread" software: +> +> 1. Visit [https://biocontainers.pro/](https://biocontainers.pro/) +> 2. Click on "Registry" +> 3. Search for "subread" +> 4. Click on the search result for "subread" +> 5. 
Click on the tab "Packages and Containers" +> 6. Choose a row with type "docker", then on the right side of the "Full +> Tag" column for that row, click the "copy to clipboard" button. +> +> To declare that you want to run inside a container, add a section +> called `hints` to your tool document. Under `hints` add a +> subsection `DockerRequirement`. Under `DockerRequirement`, paste +> the text you copied in the above step. Replace the text +> `docker pull` with `dockerPull:` and ensure it is indented twice so it +> is a field of `DockerRequirement`. +> +> > ## Answer +> > ``` +> > hints: +> > DockerRequirement: +> > dockerPull: quay.io/biocontainers/subread:1.5.0p3--0 +> > ``` +> > {: .language-yaml } +> {: .solution} +{: .challenge} + +# Running a tool on its own When creating a tool wrapper, it is helpful to run it on its own to test it. @@ -150,43 +198,53 @@ gtf: class: File location: rnaseq/reference_data/chr1-hg19_genes.gtf ``` +{: .language-yaml } The invocation is also the same: ``` cwl-runner featureCounts.cwl featureCounts.yaml ``` - -# 8. Adding it to the workflow - -Now that we have confirmed that it works, we can add it to our workflow. -We add it to `steps`, connecting the output of samtools to -`counts_input_bam` and the `gtf` taking the workflow input of the same -name. - -``` -steps: - ... - featureCounts: - requirements: - ResourceRequirement: - ramMin: 500 - run: featureCounts.cwl - in: - counts_input_bam: samtools/bam_sorted_indexed - gtf: gtf - out: [featurecounts] -``` - -We will add the result from featurecounts to the output: - -``` -outputs: - ... - featurecounts: - type: File - outputSource: featureCounts/featurecounts -``` - -You should now be able to re-run the workflow and it will run the -"featureCounts" step and include "featurecounts" in the output. +{: .language-bash } + +# Adding it to the workflow + +> ## Exercise +> +> Now that we have confirmed that the tool wrapper works, it is time +> to add it to our workflow. +> +> 1. 
Add a new step called `featureCounts` that runs our tool +> wrapper. The new step should take input from +> `samtools/bam_sorted_indexed`, and should be allocated a +> minimum of 500 MB of RAM. +> 2. Add a new output parameter for the workflow called +> `featurecounts`. The output source should come from the output +> of the new `featureCounts` step. +> 3. When you have an answer, run the updated workflow, which +> should run the "featureCounts" step and produce the "featurecounts" +> output parameter. +> +> > ## Answer +> > ``` +> > steps: +> > ... +> > featureCounts: +> > requirements: +> > ResourceRequirement: +> > ramMin: 500 +> > run: featureCounts.cwl +> > in: +> > counts_input_bam: samtools/bam_sorted_indexed +> > gtf: gtf +> > out: [featurecounts] +> > +> > outputs: +> > ... +> > featurecounts: +> > type: File +> > outputSource: featureCounts/featurecounts +> > ``` +> > {: .language-yaml } +> {: .solution} +{: .challenge} diff --git a/_episodes/05-scatter.md b/_episodes/05-scatter.md index bc53672..2903952 100644 --- a/_episodes/05-scatter.md +++ b/_episodes/05-scatter.md @@ -1,26 +1,33 @@ --- -title: " Analyzing multiple samples" -teaching: 0 +title: "Analyzing Multiple Samples" +teaching: 20 exercises: 0 questions: -- "Key question (FIXME)" +- "How can you run the same workflow over multiple samples?" objectives: -- "First learning objective. (FIXME)" +- "Modify the workflow to process multiple samples, then perform a joint analysis." keypoints: -- "First key point. Brief Answer to questions. (FIXME)" +- "Separate the part of the workflow that you want to run multiple times into a subworkflow." +- "Use a scatter step to run the subworkflow over a list of inputs." +- "The result of a scatter is an array, which can be used in a combine step to get a single result." --- -Analyzing a single sample is great, but in the real world you probably -have a batch of samples that you need to analyze and then compare. 
+In the previous lesson, we completed converting the function of the +original source shell script into CWL. This lesson expands the scope +by demonstrating what changes to make to the workflow to be able to +analyze multiple samples in parallel. -# 1. Subworkflows +# Subworkflows In addition to running command line tools, a workflow step can also execute another workflow. -Let's copy "main.cwl" to "alignment.cwl". +First, copy `main.cwl` to `alignment.cwl`. -Now, edit open "main.cwl" for editing. We are going to replace the `steps` and `outputs` sections. +Next, open `main.cwl` for editing. We are going to replace the `steps` and `outputs` sections. + +Remove all the steps and replace them with a single `alignment` step +which invokes the `alignment.cwl` we just copied. ``` steps: @@ -32,8 +39,9 @@ steps: gtf: gtf out: [qc_html, bam_sorted_indexed, featurecounts] ``` +{: .language-yaml } -In the outputs section, all the output sources are from the alignment step: +In the `outputs` section, all the output sources are from the alignment step: ``` outputs: @@ -47,23 +55,27 @@ outputs: type: File outputSource: alignment/featurecounts ``` +{: .language-yaml } -We also need a little boilerplate to tell the workflow runner that we want to use subworkflows: +We also need to add "SubworkflowFeatureRequirement" to tell the workflow +runner that we are using subworkflows: ``` requirements: SubworkflowFeatureRequirement: {} ``` +{: .language-yaml } If you run this workflow, you will get exactly the same results as -before, we've just wrapped the inner workflow with an outer workflow. +before, as all we have done so far is to wrap the inner workflow with +an outer workflow. -# 2. Scattering +# Scattering -The wrapper lets us do something useful. We can modify the outer -workflow to accept a list of files, and then invoke the inner workflow -step for every one of those files. We will need to modify the -`inputs`, `steps`, `outputs`, and `requirements` sections. 
+The "wrapper" step lets us do something useful. We can modify the +outer workflow to accept a list of files, and then invoke the inner +workflow step for every one of those files. We will need to modify +the `inputs`, `steps`, `outputs`, and `requirements` sections. First we change the `fq` parameter to expect a list of files: @@ -73,9 +85,11 @@ inputs: genome: Directory gtf: File ``` +{: .language-yaml } -Next, we add `scatter` to the alignment step. The means it will -run `alignment.cwl` for each value in the list in the `fq` parameter. +Next, we add `scatter` to the alignment step. This means we want to +run `alignment.cwl` for each value in the list in the `fq` +parameter. ``` steps: @@ -88,6 +102,7 @@ steps: gtf: gtf out: [qc_html, bam_sorted_indexed, featurecounts] ``` +{: .language-yaml } Because the scatter produces multiple outputs, each output parameter becomes a list as well: @@ -104,17 +119,19 @@ outputs: type: File[] outputSource: alignment/featurecounts ``` +{: .language-yaml } -Finally, we need a little more boilerplate to tell the workflow runner -that we want to use scatter: +We also need to add "ScatterFeatureRequirement" to tell the workflow +runner that we are using scatter: ``` requirements: SubworkflowFeatureRequirement: {} ScatterFeatureRequirement: {} ``` +{: .language-yaml } -# 3. Running with list inputs +# Input parameter lists The `fq` parameter needs to be a list. You write a list in yaml by starting each list item with a dash. Example `main-input.yaml` @@ -146,20 +163,21 @@ gtf: class: File location: rnaseq/reference_data/chr1-hg19_genes.gtf ``` +{: .language-yaml } -Now you can run the workflow the same way as in Lesson 2. +If you run the workflow, you will get results for each one of the +input fastq files. -# 4. Combining results +# Combining results -Each instance of the alignment workflow produces its own featureCounts -file. However, to be able to compare results easily, we need them a -single file with all the results. 
+Each instance of the alignment workflow produces its own +`featurecounts.tsv` file. However, to be able to compare results +easily, we would like a single file with all the results. -The easiest way to do this is to run `featureCounts` just once at the -end of the workflow, with all the bam files listed on the command -line. +We can modify the workflow to run `featureCounts` once at the end of +the workflow, taking all the bam files listed on the command line. -We'll need to modify a few things. +We will need to change a few things. First, in `featureCounts.cwl` we need to modify it to accept either a single bam file or list of bam files. @@ -171,6 +189,7 @@ inputs: - File - File[] ``` +{: .language-yaml } Second, in `alignment.cwl` we need to remove the `featureCounts` step from alignment.cwl, as well as the `featurecounts` output parameter. @@ -197,6 +216,7 @@ steps: gtf: gtf out: [featurecounts] ``` +{: .language-yaml } Last, we modify the `featurecounts` output parameter. Instead of a list of files produced by the `alignment` step, it is now a single @@ -209,5 +229,7 @@ outputs: type: File outputSource: featureCounts/featurecounts ``` +{: .language-yaml } -Run this workflow to get a single `featurecounts.tsv` file with a column for each bam file. +Run this workflow to get a single `featurecounts.tsv` file with a +column for each bam file. diff --git a/_episodes/06-expressions.md b/_episodes/06-expressions.md index 54a5d32..cf26be4 100644 --- a/_episodes/06-expressions.md +++ b/_episodes/06-expressions.md @@ -1,16 +1,16 @@ --- -title: "Dynamic workflows with expressions" -teaching: 0 +title: "Dynamic Workflow Behavior" +teaching: 20 exercises: 0 questions: -- "Key question (FIXME)" +- "How can I adjust workflow behavior at runtime?" objectives: -- "First learning objective. (FIXME)" +- "Set " keypoints: - "First key point. Brief Answer to questions. (FIXME)" --- -# 1. 
Expressions on step inputs +# Expressions on step inputs You might have noticed that the output bam files are all named `Aligned.sortedByCoord.out.bam`. This happens because because when we @@ -20,6 +20,7 @@ During workflow execution, this is usually not a problem. The workflow runner is smart enough to know that these files are different and keep them separate. This can even make development easier by not having to worry about assigning unique file names to every file. +Also, if we intend to discard the BAM files as intermediate results However, it is a problem for humans interpreting the output. We can fix this by setting the parameter `OutFileNamePrefix` on STAR. We @@ -42,6 +43,7 @@ steps: ... OutFileNamePrefix: {valueFrom: "$(inputs.ForwardReads.nameroot)."} ``` +{: .language-yaml } The code between `$(...)` is called an "expression". It is evaluated when setting up the step to run, and the expression is replaced by the @@ -64,7 +66,7 @@ adds the remainder of the string, which just is a dot `.`. This is to separate the leading part of our filename from the "Aligned.bam" extension that will be added by STAR. -# 2. Organizing output files into Directories +# Organizing output files into Directories You probably noticed that all the output files appear in the same directory. 
You might prefer that each file appears in its own @@ -129,6 +131,7 @@ expression: |- return {"dirs": dirs}; } ``` +{: .language-yaml } Then change `main.cwl`: @@ -150,3 +153,4 @@ outputs: type: File outputSource: featureCounts/featurecounts ``` +{: .language-yaml } diff --git a/_episodes/07-resources.md b/_episodes/07-resources.md index 0ac9e5f..c68b5ac 100644 --- a/_episodes/07-resources.md +++ b/_episodes/07-resources.md @@ -1,6 +1,6 @@ --- title: " Resources for further learning" -teaching: 0 +teaching: 10 exercises: 0 questions: - "Key question (FIXME)" diff --git a/answers/ep2/main.cwl b/answers/ep2/main.cwl index bad27f4..cb1ef84 100644 --- a/answers/ep2/main.cwl +++ b/answers/ep2/main.cwl @@ -1,15 +1,15 @@ -### 1. File header +# Workflow header cwlVersion: v1.2 class: Workflow label: RNAseq CWL practice workflow -### 2. Workflow Inputs +# Workflow Inputs inputs: fq: File genome: Directory gtf: File -### 3. Workflow Steps +# Workflow Steps steps: fastqc: run: bio-cwl-tools/fastqc/fastqc_2.cwl @@ -17,11 +17,8 @@ steps: reads_file: fq out: [html_file] - ### 4. Running alignment with STAR + # Running alignment with STAR STAR: - requirements: - ResourceRequirement: - ramMin: 6000 run: bio-cwl-tools/STAR/STAR-Align.cwl in: RunThreadN: {default: 4} @@ -31,14 +28,14 @@ steps: OutSAMunmapped: {default: Within} out: [alignment] - ### 5. Running samtools + # Running samtools samtools: run: bio-cwl-tools/samtools/samtools_index.cwl in: bam_sorted: STAR/alignment out: [bam_sorted_indexed] -### 7. Workflow Outputs +# Workflow Outputs outputs: qc_html: type: File diff --git a/answers/ep3/main.cwl b/answers/ep3/main.cwl new file mode 100644 index 0000000..09af85f --- /dev/null +++ b/answers/ep3/main.cwl @@ -0,0 +1,43 @@ +cwlVersion: v1.2 +class: Workflow +label: RNAseq CWL practice workflow + +inputs: + fq: File + genome: Directory + gtf: File + +steps: + fastqc: + run: bio-cwl-tools/fastqc/fastqc_2.cwl + in: + reads_file: fq + out: [html_file] + + STAR: + # 4. 
Setting runtime RAM requirements + requirements: + ResourceRequirement: + ramMin: 9000 + run: bio-cwl-tools/STAR/STAR-Align.cwl + in: + RunThreadN: {default: 4} + GenomeDir: genome + ForwardReads: fq + OutSAMtype: {default: BAM} + OutSAMunmapped: {default: Within} + out: [alignment] + + samtools: + run: bio-cwl-tools/samtools/samtools_index.cwl + in: + bam_sorted: STAR/alignment + out: [bam_sorted_indexed] + +outputs: + qc_html: + type: File + outputSource: fastqc/html_file + bam_sorted_indexed: + type: File + outputSource: samtools/bam_sorted_indexed diff --git a/answers/ep4/featureCounts.cwl b/answers/ep4/featureCounts.cwl index c96a495..128ae4b 100644 --- a/answers/ep4/featureCounts.cwl +++ b/answers/ep4/featureCounts.cwl @@ -1,29 +1,29 @@ -### 1. File header +# File header cwlVersion: v1.2 class: CommandLineTool -### 2. Command line tool inputs +# Command line tool inputs inputs: gtf: File counts_input_bam: File -### 3. Specifying the program to run +# Specifying the program to run baseCommand: featureCounts -### 4. Command arguments +# Command arguments arguments: [-T, $(runtime.cores), -a, $(inputs.gtf), -o, featurecounts.tsv, $(inputs.counts_input_bam)] -### 5. Outputs section +# Outputs section outputs: featurecounts: type: File outputBinding: glob: featurecounts.tsv -### 6. Running in a container +# Running in a container hints: DockerRequirement: dockerPull: quay.io/biocontainers/subread:1.5.0p3--0 diff --git a/answers/ep4/main.cwl b/answers/ep4/main.cwl index 937dd3a..a6a8d3c 100644 --- a/answers/ep4/main.cwl +++ b/answers/ep4/main.cwl @@ -17,7 +17,7 @@ steps: STAR: requirements: ResourceRequirement: - ramMin: 6000 + ramMin: 9000 run: bio-cwl-tools/STAR/STAR-Align.cwl in: RunThreadN: {default: 4} @@ -52,7 +52,7 @@ outputs: type: File outputSource: samtools/bam_sorted_indexed - ### 8. 
Adding it to the workflow + # Adding it to the workflow featurecounts: type: File outputSource: featureCounts/featurecounts diff --git a/answers/ep5/part1/alignment.cwl b/answers/ep5/part1/alignment.cwl index 3c2d79e..8ab6bd2 100644 --- a/answers/ep5/part1/alignment.cwl +++ b/answers/ep5/part1/alignment.cwl @@ -17,7 +17,7 @@ steps: STAR: requirements: ResourceRequirement: - ramMin: 6000 + ramMin: 9000 run: bio-cwl-tools/STAR/STAR-Align.cwl in: RunThreadN: {default: 4} diff --git a/answers/ep5/part2/alignment.cwl b/answers/ep5/part2/alignment.cwl index 3c2d79e..8ab6bd2 100644 --- a/answers/ep5/part2/alignment.cwl +++ b/answers/ep5/part2/alignment.cwl @@ -17,7 +17,7 @@ steps: STAR: requirements: ResourceRequirement: - ramMin: 6000 + ramMin: 9000 run: bio-cwl-tools/STAR/STAR-Align.cwl in: RunThreadN: {default: 4} diff --git a/answers/ep5/part4/alignment.cwl b/answers/ep5/part4/alignment.cwl index df31e9b..b69fa6e 100644 --- a/answers/ep5/part4/alignment.cwl +++ b/answers/ep5/part4/alignment.cwl @@ -17,7 +17,7 @@ steps: STAR: requirements: ResourceRequirement: - ramMin: 6000 + ramMin: 9000 run: bio-cwl-tools/STAR/STAR-Align.cwl in: RunThreadN: {default: 4} diff --git a/answers/ep6/alignment.cwl b/answers/ep6/alignment.cwl index 8a54fe4..73d9323 100644 --- a/answers/ep6/alignment.cwl +++ b/answers/ep6/alignment.cwl @@ -20,7 +20,7 @@ steps: STAR: requirements: ResourceRequirement: - ramMin: 6000 + ramMin: 9000 run: bio-cwl-tools/STAR/STAR-Align.cwl in: RunThreadN: {default: 4} -- 2.30.2