---
-title: "Make a workflow by composing tools"
-teaching: 0
-exercises: 0
+title: "Create a Workflow by Composing Tools"
+teaching: 30
+exercises: 10
questions:
-- "Key question (FIXME)"
+- "What is the syntax of CWL?"
+- "What are the key components of a workflow?"
objectives:
-- "First learning objective. (FIXME)"
+- "Write a workflow based on the source shell script, making use of existing tool wrappers."
keypoints:
-- "First key point. Brief Answer to questions. (FIXME)"
+- "CWL documents are written using a syntax called YAML."
+- "The key components of the workflow are: the header, the inputs, the steps, and the outputs."
---
-# 1. File header
+# Source shell script
-Create a new file "main.cwl"
+In this lesson, we will develop an initial workflow inspired by the
+following shell script.
+
+rnaseq_analysis_on_input_file.sh
+
+```
+#!/bin/bash
+
+# Based on
+# https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html
+#
+
+# This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
+# USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
+
+set -e
+
+# initialize a variable with an intuitive name to store the name of the input fastq file
+fq=$1
+
+# grab base of filename for naming outputs
+base=`basename $fq .subset.fq`
+echo "Sample name is $base"
+
+# specify the number of cores to use
+cores=4
+
+# directory with genome reference FASTA and index files + name of the gene annotation file
+genome=rnaseq/reference_data
+gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
+
+# make all of the output directories
+# The -p option means mkdir will create the whole path if it
+# does not exist and refrain from complaining if it does exist
+mkdir -p rnaseq/results/fastqc
+mkdir -p rnaseq/results/STAR
+mkdir -p rnaseq/results/counts
+
+# set up output filenames and locations
+fastqc_out=rnaseq/results/fastqc
+align_out=rnaseq/results/STAR/${base}_
+counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
+counts=rnaseq/results/counts/${base}_featurecounts.txt
+
+echo "Processing file $fq"
+
+# Run FastQC and move output to the appropriate folder
+fastqc $fq
+
+# Run STAR
+STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
+
+# Create BAM index
+samtools index $counts_input_bam
+
+# Count mapped reads
+featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
+```
+{: .language-bash }
+
+# CWL Syntax
+
+CWL documents are written using a format called "YAML". Here is a crash-course in YAML:
+
+Data fields are written with the name, followed by a colon `:`, a space,
+and then the value.
+
+```
+fieldName: value
+```
+{: .language-yaml }
+
+The value is the remaining text to the end of the line.
+
+Special characters in YAML include `:`, `{`, `}` `[`, `]`, `#`, `!`
+and `%`. If your text begins with any of these characters, you must
+surround the string in single or double quotes.
+
+```
+fieldName: "#quoted-value"
+```
+{: .language-yaml }
-Start with this header.
+You can write multi-line text by putting `|-` and writing an indented
+block. The leading whitespace will be removed from the actual value.
+
+```
+fieldName: |-
+ This is a multi-
+ line string.
+ Horray!
+```
+{: .language-yaml }
+
+Nested sections are indented:
+
+```
+section1:
+ field1: value1
+ field2: value2
+```
+{: .language-yaml }
+
+Nested sections can _also_ be wrapped in curly brackets. In this case, fields must be comma-separated.
+
+```
+section1: {field1: value1, field2, value2}
+```
+{: .language-yaml }
+When each item is on its own line starting with a dash `-`, it is a list.
+
+```
+section2:
+ - value1
+ - value2
+```
+{: .language-yaml }
+
+List can _also_ be wrapped in square brackets. In this case, values must be comma-separated.
+
+```
+section2: [value1, value2]
+```
+{: .language-yaml }
+
+Comments start with `#`.
+
+```
+# This is a comment about field3
+field3: stuff
+
+field4: stuff # This is a comment about field4
+```
+{: .language-yaml }
+
+Finally, YAML is a superset of JSON. Valid JSON is also valid YAML,
+so you may sometimes see JSON format being used instead of YAML format
+for CWL documents.
+
+# Workflow header
+
+Create a new file "main.cwl"
+
+Let's start with the header.
```
cwlVersion: v1.2
class: Workflow
label: RNAseq CWL practice workflow
```
+{: .language-yaml }
+
+* cwlVersion - Every file must include this. It declares the version of CWL in use.
+* class - This is the type of CWL document. We will see other types in future lessons.
+* label - Optional title of your workflow.
-# 2. Workflow Inputs
+# Workflow Inputs
The purpose of a workflow is to consume some input parameters, run a
series of steps, and produce output values.
For this analysis, the input parameters are the fastq file and the reference data required by STAR.
-In the original shell script, the following variables are declared:
+In the source shell script, the following variables are declared:
```
# initialize a variable with an intuitive name to store the name of the input fastq file
genome=rnaseq/reference_data
gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
```
+{: .language-bash }
In CWL, we will declare these variables in the `inputs` section.
genome: Directory
gtf: File
```
+{: .language-yaml }
-# 3. Workflow Steps
+# Workflow Steps
A workflow consists of one or more steps. This is the `steps` section.
-Now we need to describe the first step of the workflow. This step is to run `fastqc`.
+Now we need to describe the first step of the workflow. In the source
+script, the first step is to run `fastqc`.
+
+```
+# Run FastQC and move output to the appropriate folder
+fastqc $fq
+```
+{: .language-bash }
A workflow step consists of the name of the step, the tool to `run`,
the input parameters to be passed to the tool in `in`, and the output
parameters expected from the tool in `out`.
-The value of `run` references the tool file. Tip: while typing the
-file name, you can get suggestions and auto-completion on a partial
-name using control+space.
+The value of `run` references the tool file. The tool file describes
+how to run the tool (we will discuss how to write tool files in lesson
+4). If we look in `bio-cwl-tools` (which you should have imported
+when setting up a practice repository in the initial setup
+instructions) we find `bio-cwl-tools/fastqc/fastqc_2.cwl`.
-The `in` block lists input parameters to the tool and the workflow
-parameters that will be assigned to those inputs.
+Next, the `in` block is mapping of input parameters to the tool and
+the workflow parameters that will be assigned to those inputs. We
+need to know what input parameters the tool accepts.
-The `out` block lists output parameters to the tool that are used
-later in the workflow.
+Let's open up the tool file and take a look:
-You need to know which input and output parameters are available for
-each tool. In vscode, click on the value of `run` and select "Go to
-definition" to open the tool file. Look for the `inputs` and
-`outputs` sections of the tool file to find out what parameters are
-defined.
+Find the `inputs` section of `bio-cwl-tools/fastqc/fastqc_2.cwl`:
+
+```
+inputs:
+
+ reads_file:
+ type:
+ - File
+ inputBinding:
+ position: 50
+ doc: |
+ Input bam,sam,bam_mapped,sam_mapped or fastq file
+```
+{: .language-yaml }
+
+Now we know we need to provide an input parameter called `reads_file`.
+
+Next, the `out` section is a list of output parameters from the tool
+that will be used later in the workflow, or as workflow output. We
+need to know what output parameters the tool produces. Find the
+`outputs` section of `bio-cwl-tools/fastqc/fastqc_2.cwl`:
+
+```
+outputs:
+
+ zipped_file:
+ type:
+ - File
+ outputBinding:
+ glob: '*.zip'
+ html_file:
+ type:
+ - File
+ outputBinding:
+ glob: '*.html'
+ summary_file:
+ type:
+ - File
+ outputBinding:
+ glob: |
+ ${
+ return "*/summary.txt";
+ }
+```
+{: .language-yaml }
+
+Now we know to expect an output parameter called `html_file`.
+
+Putting this all together, the `fastq` step consists of a `run`, `in`
+and `out` subsections, and looks like this:
```
steps:
reads_file: fq
out: [html_file]
```
+{: .language-yaml }
-# 4. Running alignment with STAR
-
-STAR has more parameters. Sometimes we want to provide input values
-to a step without making them as workflow-level inputs. We can do
-this with `{default: N}`
+# Running alignment with STAR
+The next step is to run the STAR aligner.
```
- STAR:
- requirements:
- ResourceRequirement:
- ramMin: 6000
- run: bio-cwl-tools/STAR/STAR-Align.cwl
- in:
- RunThreadN: {default: 4}
- GenomeDir: genome
- ForwardReads: fq
- OutSAMtype: {default: BAM}
- OutSAMunmapped: {default: Within}
- out: [alignment]
+# Run STAR
+STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
```
-
-# 5. Running samtools
+{: .language-bash }
+
+We will go through the same process as the first section. We find
+there is `bio-cwl-tools/STAR/STAR-Align.cwl`. We will open the file
+and look at the `inputs` section to determine what input parameters
+correspond to the command line parmeters from our source script.
+
+> ## Exercise
+>
+> Look at `STAR-Align.cwl` and identify the input parameters that
+> correspond to the command line arguments used in the source script:
+> `--runThreadN`, `--genomeDir`, `--outSAMtype`, and
+> `--outSAMunmapped`. Also identify the name of the output parameter.
+>
+> > ## Solution
+> >
+> > * `--runThreadN` → `RunThreadN`
+> > * `--genomeDir` → `GenomeDir`
+> > * `--readFilesIn` → `ForwardReads`
+> > * `--outSAMtype` → `OutSAMtype`, `SortedByCoordinate`
+> > * `--outSAMunmapped` → `OutSAMunmapped`
+> >
+> > output parameter name: `alignment`
+> {: .solution}
+{: .challenge}
+
+> ## Command line flags
+>
+> Command line flags generally appear appear in either the `arguments`
+> field, or the `prefix` field of the `inputBinding` section of an
+> input parameter declaration. For example, this section of
+> `STAR-Align.cwl` tells us that the `GenomeDir` input parameter
+> corresponds to the `--genomeDir` command line parameter.
+>
+> ```
+> GenomeDir:
+> type: Directory
+> inputBinding:
+> prefix: "--genomeDir"
+> ```
+> {: .language-yaml }
+{: .callout}
+
+> ## Default values
+>
+> Sometimes we want to provide input values to a step without making
+> them as workflow-level inputs. We can do this with `{default: N}`.
+> For example:
+>
+> ```
+> in:
+> RunThreadN: {default: 4}
+> ```
+> {: .language-yaml }
+{: .callout}
+
+> ## Exercise
+>
+> Using the input and output parameters identified in the previous
+> exercise, write the `run`, `in` and `out` sections of the STAR step.
+>
+> > ## Solution
+> >
+> > ```
+> > STAR:
+> > run: bio-cwl-tools/STAR/STAR-Align.cwl
+> > in:
+> > RunThreadN: {default: 4}
+> > GenomeDir: genome
+> > ForwardReads: fq
+> > OutSAMtype: {default: BAM}
+> > SortedByCoordinate: {default: true}
+> > OutSAMunmapped: {default: Within}
+> > out: [alignment]
+> > ```
+> > {: .language-yaml }
+> {: .solution}
+{: .challenge}
+
+
+# Running samtools
The third step is to generate an index for the aligned BAM.
+```
+# Create BAM index
+samtools index $counts_input_bam
+```
+{: .language-bash }
+
For this step, we need to use the output of a previous step as input
to this step. We refer the output of a step by with name of the step
(STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment`
bam_sorted: STAR/alignment
out: [bam_sorted_indexed]
```
+{: .language-yaml }
-# 6. featureCounts
+# featureCounts
+
+```
+# Count mapped reads
+featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
+```
+{: .language-bash }
As of this writing, the `subread` package that provides
-`featureCounts` is not available in bio-cwl-tools (and if it has been
-added since writing this, let's pretend that it isn't there.) We will
-go over how to write a CWL wrapper for a command line tool in
-lesson 3. For now, we will leave off the final step.
+`featureCounts` is not available in `bio-cwl-tools` (and if it has been
+added since then, let's pretend that it isn't there.) We will go over
+how to write a CWL wrapper for a command line tool in lesson 4. For
+now, we will leave off the final step.
-# 7. Workflow Outputs
+# Workflow Outputs
The last thing to do is declare the workflow outputs in the `outputs` section.
type: File
outputSource: samtools/bam_sorted_indexed
```
+{: .language-yaml }
+
+> ## Episode solution
+> * <a href="../assets/answers/ep2/main.cwl">main.cwl</a>
+{: .solution}