carpentry: "swc"
# Overall title for pages.
-title: "Lesson Title"
+title: "Getting started with CWL"
# Life cycle stage of the lesson
# See this page for more details: https://cdh.carpentries.org/the-lesson-life-cycle.html
---
title: "Introduction"
-teaching: 0
+teaching: 10
exercises: 0
questions:
-- "Key question (FIXME)"
+- "What is CWL?"
+- "What is the goal of this training?"
objectives:
- "First learning objective. (FIXME)"
keypoints:
- "First key point. Brief Answer to questions. (FIXME)"
---
-## Introduction
+# Introduction to Common Workflow Language
-The goal of this training is to walk through the development of a
-best-practices CWL workflow by translating an existing bioinformatics
-shell script into CWL. Specific knowledge of the biology of RNA-seq
-is *not* a prerequisite for these lessons.
+The Common Workflow Language (CWL) is an open standard for describing
+analysis workflows and tools in a way that makes them portable and
+scalable across a variety of software and hardware environments, from
+workstations to cluster, cloud, and high performance computing (HPC)
+environments. CWL is designed to meet the needs of data-intensive
+science, such as Bioinformatics, Medical Imaging, Astronomy, High
+Energy Physics, and Machine Learning.
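+
+As a rough illustration of what such a description looks like, here is
+a minimal sketch of a CWL tool wrapper. It is not part of the RNA-seq
+analysis: the `echo` command and the `message` input are placeholders
+chosen only for this example.
+
+```
+cwlVersion: v1.2
+class: CommandLineTool
+# Placeholder program; any command line tool could go here
+baseCommand: echo
+inputs:
+  message:
+    type: string
+    inputBinding:
+      position: 1
+outputs: []
+```
+
+The file describes what to run and which inputs it takes; the CWL
+runner decides how and where to execute it.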
-These lessons are based on "Introduction to RNA-seq using
-high-performance computing (HPC)" lessons developed by members of the
-teaching team at the Harvard Chan Bioinformatics Core (HBC). The
-original training, which includes additional lectures about the
-biology of RNA-seq can be found here:
+# Introduction to this training
-https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2
+The goal of this training is to walk the student through the
+development of a best-practices CWL workflow, starting from an
+existing shell script that performs a common bioinformatics analysis.
-## Background
+Specific knowledge of the biology of RNA-seq is *not* a prerequisite
+for these lessons. CWL is not specific to bioinformatics. We
+hope that you will find this training useful even if you work in some
+other field of research.
-RNA-seq is the process of sequencing RNA in a biological sample. From
-the sequence reads, we want to measure the relative number of RNA
-molecules appearing in the sample that were produced by particular
-genes. This analysis is called "differential gene expression".
+These lessons are based on [Introduction to RNA-seq using
+high-performance computing
+(HPC)](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2) lessons
+developed by members of the teaching team at the Harvard Chan
+Bioinformatics Core (HBC). The original training, which includes
+additional lectures about the biology of RNA-seq, can be found at that
+link.
+
+# Introduction to the example analysis
+
+RNA-seq is the process of sequencing RNA present in a biological
+sample. From the sequence reads, we want to measure the relative
+numbers of different RNA molecules appearing in the sample that were
+produced by particular genes. This analysis is called "differential
+gene expression".
The entire process looks like this:
-![](/assets/img/RNAseqWorkflow.png)
+![](/assets/img/RNAseqWorkflow.png){: height="400px"}
For this training, we are only concerned with the middle analytical
steps (skipping adapter trimming).
* Alignment (mapping)
* Counting reads associated with genes
-## Analysis shell script
-
-This analysis is already available as a Unix shell script, which we
-will refer to in order to build the workflow.
-
-Some of the reasons to use CWL over a plain shell script: portability,
-scalability, ability to run on platforms that are not traditional HPC.
+In this training, we are not attempting to develop the analysis from
+scratch. Instead, we will start from an analysis written as a shell
+script, and use the following script as a guide to build our
+workflow.
rnaseq_analysis_on_input_file.sh
---
-title: "Turning a shell script into a workflow by composing existing tools"
+title: "Make a workflow by composing tools"
teaching: 0
exercises: 0
questions:
- "First key point. Brief Answer to questions. (FIXME)"
---
-# Setting up
-
-We will create a new git repository and import a library of existing
-tool definitions that will help us build our workflow.
-
-Create a new git repository to hold our workflow with this command:
-
-```
-git init rnaseq-cwl-training-exercises
-```
-
-On Arvados use this:
-
-```
-git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
-```
-
-Next, import bio-cwl-tools with this command:
-
-```
-git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
-```
-
-# Writing the workflow
-
-## 1. File header
+# 1. File header
Create a new file "main.cwl"
label: RNAseq CWL practice workflow
```
-## 2. Workflow Inputs
+# 2. Workflow Inputs
The purpose of a workflow is to consume some input parameters, run a
series of steps, and produce output values.
gtf: File
```
-## 3. Workflow Steps
+# 3. Workflow Steps
A workflow consists of one or more steps. This is the `steps` section.
out: [html_file]
```
-## 4. Running alignment with STAR
+# 4. Running alignment with STAR
STAR has more parameters. Sometimes we want to provide input values
to a step without making them as workflow-level inputs. We can do
out: [alignment]
```
-## 5. Running samtools
+# 5. Running samtools
The third step is to generate an index for the aligned BAM.
out: [bam_sorted_indexed]
```
-## 6. featureCounts
+# 6. featureCounts
As of this writing, the `subread` package that provides
`featureCounts` is not available in bio-cwl-tools (and if it has been
go over how to write a CWL wrapper for a command line tool in
lesson 3. For now, we will leave off the final step.
-## 7. Workflow Outputs
+# 7. Workflow Outputs
The last thing to do is declare the workflow outputs in the `outputs` section.
- "First key point. Brief Answer to questions. (FIXME)"
---
-# Running and debugging a workflow
-
-### 1. The input parameter file
+# 1. The input parameter file
CWL input values are provided in the form of a YAML or JSON file.
Create one by right clicking on the explorer, select "New File" and
`class: File` or `class: Directory`. This distinguishes them from
plain strings that may or may not be file paths.
-Note: if you don't have example sequence data or the STAR index files, see the Appendix below.
+Note: if you don't have example sequence data or the STAR index files, see [setup](/setup.html).
```
fq:
location: keep:9178fe1b80a08a422dbe02adfd439764+925/reference_data/chr1-hg19_genes.gtf
```
-### 2. Running the workflow
+# 2. Running the workflow
Type this into the terminal:
cwl-runner main.cwl main-input.yaml
```
-### 3. Debugging the workflow
+# 3. Debugging the workflow
A workflow can fail for many reasons: some possible reasons include
bad input, bugs in the code, or running out of memory. In this case, the
If this happens, you will need to request more RAM.
-### 4. Setting runtime RAM requirements
+# 4. Setting runtime RAM requirements
By default, a step is allocated 256 MB of RAM. From the STAR error message:
After setting the RAM requirements, re-run the workflow.
-### 5. Workflow results
+# 5. Workflow results
The CWL runner will print a results JSON object to standard output. It will look something like this (it may include additional fields).
This has the same structure as `main-input.yaml`. Each output
parameter is listed, with the `location` field of each `File` object
indicating where the output file can be found.
-
-# Appendix
-
-## Downloading sample and reference data
-
-Start from your rnaseq-cwl-exercises directory.
-
-```
-mkdir rnaseq
-cd rnaseq
-wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=9178fe1b80a08a422dbe02adfd439764+925/
-```
-
-## Downloading or generating STAR index
-
-Running STAR requires index files generated from the reference.
-
-This is a rather large download (4 GB). Depending on your bandwidth, it may be faster to generate it yourself.
-
-### Downloading
-
-```
-mkdir hg19-chr1-STAR-index
-cd hg19-chr1-STAR-index
-wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=02a12ce9e2707610991bd29d38796b57+2912/
-```
-
-### Generating
-
-Create `chr1-star-index.yaml`:
-
-```
-InputFiles:
- - class: File
- location: rnaseq/reference_data/chr1.fa
- format: http://edamontology.org/format_1930
-IndexName: 'hg19-chr1-STAR-index'
-Gtf:
- class: File
- location: rnaseq/reference_data/chr1-hg19_genes.gtf
-Overhang: 99
-```
-
-Generate the index with your local cwl-runner.
-
-```
-cwl-runner bio-cwl-tools/STAR/STAR-Index.cwl chr1-star-index.yaml
-```
This will use the "featureCounts" tool from the "subread" package.
-### 1. File header
+# 1. File header
Create a new file "featureCounts.cwl"
class: CommandLineTool
```
-### 2. Command line tool inputs
+# 2. Command line tool inputs
A CommandLineTool describes a single invocation of a command line program.
counts_input_bam: File
```
-### 3. Specifying the program to run
+# 3. Specifying the program to run
Give the name of the program to run in `baseCommand`.
baseCommand: featureCounts
```
-### 4. Command arguments
+# 4. Command arguments
The easiest way to describe the command line is with an `arguments`
section. This takes a comma-separated list of command line arguments.
$(inputs.counts_input_bam)]
```
-### 5. Outputs section
+# 5. Outputs section
In CWL, you must explicitly identify the outputs of a program. This
associates output parameters with specific files, and enables the
glob: featurecounts.tsv
```
-### 6. Running in a container
+# 6. Running in a container
In order to run the tool, it needs to be installed.
Using software containers, a tool can be pre-installed into a
dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
```
-### 7. Running a tool on its own
+# 7. Running a tool on its own
When creating a tool wrapper, it is helpful to run it on its own to test it.
cwl-runner featureCounts.cwl featureCounts.yaml
```
-### 8. Adding it to the workflow
+# 8. Adding it to the workflow
Now that we have confirmed that it works, we can add it to our workflow.
We add it to `steps`, connecting the output of samtools to
- "First key point. Brief Answer to questions. (FIXME)"
---
-# Analyzing multiple samples
-
Analyzing a single sample is great, but in the real world you probably
have a batch of samples that you need to analyze and then compare.
-### 1. Subworkflows
+# 1. Subworkflows
In addition to running command line tools, a workflow step can also
execute another workflow.
If you run this workflow, you will get exactly the same results as
before; we have only wrapped the inner workflow with an outer workflow.
-### 2. Scattering
+# 2. Scattering
The wrapper lets us do something useful. We can modify the outer
workflow to accept a list of files, and then invoke the inner workflow
ScatterFeatureRequirement: {}
```
-### 3. Running with list inputs
+# 3. Running with list inputs
The `fq` parameter needs to be a list. You write a list in yaml by
starting each list item with a dash. Example `main-input.yaml`
Now you can run the workflow the same way as in Lesson 2.
-### 4. Combining results
+# 4. Combining results
Each instance of the alignment workflow produces its own featureCounts
file. However, to be able to compare results easily, we need them a
---
-title: "Dynamic Workflow behavior with expressions"
+title: "Dynamic workflows with expressions"
teaching: 0
exercises: 0
questions:
- "First key point. Brief Answer to questions. (FIXME)"
---
-### 1. Expressions on step inputs
+# 1. Expressions on step inputs
You might have noticed that the output bam files are all named
`Aligned.sortedByCoord.out.bam`. This happens because when we
separate the leading part of our filename from the "Aligned.bam"
extension that will be added by STAR.
-### 2. Organizing output files into Directories
+# 2. Organizing output files into Directories
You probably noticed that all the output files appear in the same
directory. You might prefer that each file appears in its own
further help you use CWL to solve your own scientific workflow
problems.
-## CWL Reference
+# CWL Reference
-Main CWL web page https://commonwl.org
+[Main CWL web page](https://commonwl.org)
-User guide https://www.commonwl.org/user_guide/
+[User guide](https://www.commonwl.org/user_guide/)
-Specification https://www.commonwl.org/v1.2/
+[Specification](https://www.commonwl.org/v1.2/)
-Github organization https://github.com/common-workflow-language/
+[Github organization](https://github.com/common-workflow-language/)
-## CWL Community
+# CWL Community
-CWL Forum, this is is best place to ask questions https://cwl.discourse.group/
+The [CWL Forum](https://cwl.discourse.group/) is the best place to ask questions.
-Gitter (chat) https://gitter.im/common-workflow-language/common-workflow-language
+[Gitter (chat)](https://gitter.im/common-workflow-language/common-workflow-language)
-Weekly video calls https://cwl.discourse.group/t/eu-us-timezone-cwl-video-chat/260
+[Weekly video calls](https://cwl.discourse.group/t/eu-us-timezone-cwl-video-chat/260)
-## Software resources
+# Software resources
-Github organization for repositories of CWL tool and workflow
-descriptions, including bio-cwl-tools
-https://github.com/common-workflow-library/
+Github organization for [repositories of CWL tool and workflow descriptions](https://github.com/common-workflow-library/),
+including [bio-cwl-tools](https://github.com/common-workflow-library/bio-cwl-tools).
-BioContainers https://biocontainers.pro/
+[BioContainers](https://biocontainers.pro/)
-Search for CWL files on github, try adding the name of a tool you are
-interested in to the search
-https://github.com/search?q=extension%3Acwl+cwlVersion
+[Search for CWL files](https://github.com/search?q=extension%3Acwl+cwlVersion) on
+GitHub. Try adding the name of a tool you are interested in to the
+search.
---
title: Setup
---
-FIXME
+
+# Setting up a practice repository
+
+We will create a new git repository and import a library of existing
+tool definitions that will help us build our workflow.
+
+Create a new git repository to hold our workflow with this command:
+
+```
+git init rnaseq-cwl-training-exercises
+```
+
+On Arvados use this:
+
+```
+git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
+```
+
+Next, import bio-cwl-tools with this command:
+
+```
+git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
+```
+
+# Downloading sample and reference data
+
+Start from your rnaseq-cwl-training-exercises directory.
+
+```
+mkdir rnaseq
+cd rnaseq
+wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=9178fe1b80a08a422dbe02adfd439764+925/
+```
+
+# Downloading or generating STAR index
+
+Running STAR requires index files generated from the reference.
+
+This is a rather large download (4 GB). Depending on your bandwidth, it may be faster to generate it yourself.
+
+## Downloading
+
+```
+mkdir hg19-chr1-STAR-index
+cd hg19-chr1-STAR-index
+wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=02a12ce9e2707610991bd29d38796b57+2912/
+```
+
+## Generating
+
+Create `chr1-star-index.yaml`:
+
+```
+InputFiles:
+ - class: File
+ location: rnaseq/reference_data/chr1.fa
+ format: http://edamontology.org/format_1930
+IndexName: 'hg19-chr1-STAR-index'
+Gtf:
+ class: File
+ location: rnaseq/reference_data/chr1-hg19_genes.gtf
+Overhang: 99
+```
+
+Generate the index with your local cwl-runner.
+
+```
+cwl-runner bio-cwl-tools/STAR/STAR-Index.cwl chr1-star-index.yaml
+```
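+
+With the data stored locally, an input parameter file can refer to
+these files by relative path. The fragment below is a sketch only: the
+`genome` parameter name is an assumption made for illustration, and
+the index path assumes the `hg19-chr1-STAR-index` directory sits next
+to the input parameter file.
+
+```
+gtf:
+  class: File
+  location: rnaseq/reference_data/chr1-hg19_genes.gtf
+# Hypothetical parameter name for the STAR index directory
+genome:
+  class: Directory
+  location: hg19-chr1-STAR-index
+```
+
+Relative `location` values are resolved relative to the input
+parameter file.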
{% include links.md %}