From: Peter Amstutz Date: Tue, 26 Jan 2021 22:41:21 +0000 (-0500) Subject: Formatting & content WIP X-Git-Url: https://git.arvados.org/rnaseq-cwl-training.git/commitdiff_plain/9126e9209dec22eae0093204c060cbe4c139f720 Formatting & content WIP Arvados-DCO-1.1-Signed-off-by: Peter Amstutz --- diff --git a/_config.yml b/_config.yml index a67f14b..7ddda3b 100644 --- a/_config.yml +++ b/_config.yml @@ -11,7 +11,7 @@ carpentry: "swc" # Overall title for pages. -title: "Lesson Title" +title: "Getting started with CWL" # Life cycle stage of the lesson # See this page for more details: https://cdh.carpentries.org/the-lesson-life-cycle.html diff --git a/_episodes/01-introduction.md b/_episodes/01-introduction.md index fa10f79..8ee870e 100644 --- a/_episodes/01-introduction.md +++ b/_episodes/01-introduction.md @@ -1,40 +1,56 @@ --- title: "Introduction" -teaching: 0 +teaching: 10 exercises: 0 questions: -- "Key question (FIXME)" +- "What is CWL?" +- "What is the goal of this training?" objectives: - "First learning objective. (FIXME)" keypoints: - "First key point. Brief Answer to questions. (FIXME)" --- -## Introduction +# Introduction to Common Workflow Language -The goal of this training is to walk through the development of a -best-practices CWL workflow by translating an existing bioinformatics -shell script into CWL. Specific knowledge of the biology of RNA-seq -is *not* a prerequisite for these lessons. +The Common Workflow Language (CWL) is an open standard for describing +analysis workflows and tools in a way that makes them portable and +scalable across a variety of software and hardware environments, from +workstations to cluster, cloud, and high performance computing (HPC) +environments. CWL is designed to meet the needs of data-intensive +science, such as Bioinformatics, Medical Imaging, Astronomy, High +Energy Physics, and Machine Learning. 
-These lessons are based on "Introduction to RNA-seq using -high-performance computing (HPC)" lessons developed by members of the -teaching team at the Harvard Chan Bioinformatics Core (HBC). The -original training, which includes additional lectures about the -biology of RNA-seq can be found here: +# Introduction to this training -https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2 +The goal of this training is to walk the student through the +development of a best-practices CWL workflow, starting from an +existing shell script that performs a common bioinformatics analysis. -## Background +Specific knowledge of the biology of RNA-seq is *not* a prerequisite +for these lessons. CWL is not domain specific to bioinformatics. We +hope that you will find this training useful even if you work in some +other field of research. -RNA-seq is the process of sequencing RNA in a biological sample. From -the sequence reads, we want to measure the relative number of RNA -molecules appearing in the sample that were produced by particular -genes. This analysis is called "differential gene expression". +These lessons are based on [Introduction to RNA-seq using +high-performance computing +(HPC)](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2) lessons +developed by members of the teaching team at the Harvard Chan +Bioinformatics Core (HBC). The original training, which includes +additional lectures about the biology of RNA-seq, can be found at that +link. + +# Introduction to the example analysis + +RNA-seq is the process of sequencing RNA present in a biological +sample. From the sequence reads, we want to measure the relative +numbers of different RNA molecules appearing in the sample that were +produced by particular genes. This analysis is called "differential +gene expression". 
The entire process looks like this: -![](/assets/img/RNAseqWorkflow.png) +![](/assets/img/RNAseqWorkflow.png){: height="400px"} For this training, we are only concerned with the middle analytical steps (skipping adapter trimming). @@ -43,13 +59,10 @@ steps (skipping adapter trimming). * Alignment (mapping) * Counting reads associated with genes -## Analysis shell script - -This analysis is already available as a Unix shell script, which we -will refer to in order to build the workflow. - -Some of the reasons to use CWL over a plain shell script: portability, -scalability, ability to run on platforms that are not traditional HPC. +In this training, we are not attempting to develop the analysis from +scratch; instead we will be starting from an analysis written as a +shell script. We will be using the following shell script as a guide to build +our workflow. rnaseq_analysis_on_input_file.sh diff --git a/_episodes/02-workflow.md b/_episodes/02-workflow.md index a3700a9..cfb133c 100644 --- a/_episodes/02-workflow.md +++ b/_episodes/02-workflow.md @@ -1,5 +1,5 @@ --- -title: "Turning a shell script into a workflow by composing existing tools" +title: "Make a workflow by composing tools" teaching: 0 exercises: 0 questions: @@ -10,32 +10,7 @@ keypoints: - "First key point. Brief Answer to questions. (FIXME)" --- -# Setting up - -We will create a new git repository and import a library of existing -tool definitions that will help us build our workflow. - -Create a new git repository to hold our workflow with this command: - -``` -git init rnaseq-cwl-training-exercises -``` - -On Arvados use this: - -``` -git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises -``` - -Next, import bio-cwl-tools with this command: - -``` -git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git -``` - -# Writing the workflow - -## 1. File header +# 1. 
File header Create a new file "main.cwl" @@ -48,7 +23,7 @@ class: Workflow label: RNAseq CWL practice workflow ``` -## 2. Workflow Inputs +# 2. Workflow Inputs The purpose of a workflow is to consume some input parameters, run a series of steps, and produce output values. @@ -81,7 +56,7 @@ inputs: gtf: File ``` -## 3. Workflow Steps +# 3. Workflow Steps A workflow consists of one or more steps. This is the `steps` section. @@ -116,7 +91,7 @@ steps: out: [html_file] ``` -## 4. Running alignment with STAR +# 4. Running alignment with STAR STAR has more parameters. Sometimes we want to provide input values to a step without making them as workflow-level inputs. We can do @@ -138,7 +113,7 @@ this with `{default: N}` out: [alignment] ``` -## 5. Running samtools +# 5. Running samtools The third step is to generate an index for the aligned BAM. @@ -157,7 +132,7 @@ step will not run until the `STAR` step has completed successfully. out: [bam_sorted_indexed] ``` -## 6. featureCounts +# 6. featureCounts As of this writing, the `subread` package that provides `featureCounts` is not available in bio-cwl-tools (and if it has been @@ -165,7 +140,7 @@ added since writing this, let's pretend that it isn't there.) We will go over how to write a CWL wrapper for a command line tool in lesson 3. For now, we will leave off the final step. -## 7. Workflow Outputs +# 7. Workflow Outputs The last thing to do is declare the workflow outputs in the `outputs` section. diff --git a/_episodes/03-running.md b/_episodes/03-running.md index b851a29..c6ff7d5 100644 --- a/_episodes/03-running.md +++ b/_episodes/03-running.md @@ -10,9 +10,7 @@ keypoints: - "First key point. Brief Answer to questions. (FIXME)" --- -# Running and debugging a workflow - -### 1. The input parameter file +# 1. The input parameter file CWL input values are provided in the form of a YAML or JSON file. 
Create one by right clicking on the explorer, select "New File" and @@ -26,7 +24,7 @@ When setting inputs, Files and Directories are given as an object with `class: File` or `class: Directory`. This distinguishes them from plain strings that may or may not be file paths. -Note: if you don't have example sequence data or the STAR index files, see the Appendix below. +Note: if you don't have example sequence data or the STAR index files, see [setup](/setup.html). ``` fq: @@ -56,7 +54,7 @@ gtf: location: keep:9178fe1b80a08a422dbe02adfd439764+925/reference_data/chr1-hg19_genes.gtf ``` -### 2. Running the workflow +# 2. Running the workflow Type this into the terminal: @@ -64,7 +62,7 @@ Type this into the terminal: cwl-runner main.cwl main-input.yaml ``` -### 3. Debugging the workflow +# 3. Debugging the workflow A workflow can fail for many reasons: some possible reasons include bad input, bugs in the code, or running out memory. In this case, the @@ -92,7 +90,7 @@ Container exited with code: 137 If this happens, you will need to request more RAM. -### 4. Setting runtime RAM requirements +# 4. Setting runtime RAM requirements By default, a step is allocated 256 MB of RAM. From the STAR error message: @@ -119,7 +117,7 @@ Resource requirements you can set include: After setting the RAM requirements, re-run the workflow. -### 5. Workflow results +# 5. Workflow results The CWL runner will print a results JSON object to standard output. It will look something like this (it may include additional fields). @@ -152,51 +150,3 @@ The CWL runner will print a results JSON object to standard output. It will loo This has the same structure as `main-input.yaml`. The each output parameter is listed, with the `location` field of each `File` object indicating where the output file can be found. - -# Appendix - -## Downloading sample and reference data - -Start from your rnaseq-cwl-exercises directory. 
- -``` -mkdir rnaseq -cd rnaseq -wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=9178fe1b80a08a422dbe02adfd439764+925/ -``` - -## Downloading or generating STAR index - -Running STAR requires index files generated from the reference. - -This is a rather large download (4 GB). Depending on your bandwidth, it may be faster to generate it yourself. - -### Downloading - -``` -mkdir hg19-chr1-STAR-index -cd hg19-chr1-STAR-index -wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=02a12ce9e2707610991bd29d38796b57+2912/ -``` - -### Generating - -Create `chr1-star-index.yaml`: - -``` -InputFiles: - - class: File - location: rnaseq/reference_data/chr1.fa - format: http://edamontology.org/format_1930 -IndexName: 'hg19-chr1-STAR-index' -Gtf: - class: File - location: rnaseq/reference_data/chr1-hg19_genes.gtf -Overhang: 99 -``` - -Generate the index with your local cwl-runner. - -``` -cwl-runner bio-cwl-tools/STAR/STAR-Index.cwl chr1-star-index.yaml -``` diff --git a/_episodes/04-commandlinetool.md b/_episodes/04-commandlinetool.md index 0575a01..cae1682 100644 --- a/_episodes/04-commandlinetool.md +++ b/_episodes/04-commandlinetool.md @@ -14,7 +14,7 @@ It is time to add the last step in the analysis. This will use the "featureCounts" tool from the "subread" package. -### 1. File header +# 1. File header Create a new file "featureCounts.cwl" @@ -25,7 +25,7 @@ cwlVersion: v1.2 class: CommandLineTool ``` -### 2. Command line tool inputs +# 2. Command line tool inputs A CommandLineTool describes a single invocation of a command line program. @@ -50,7 +50,7 @@ inputs: counts_input_bam: File ``` -### 3. Specifying the program to run +# 3. Specifying the program to run Give the name of the program to run in `baseCommand`. @@ -58,7 +58,7 @@ Give the name of the program to run in `baseCommand`. baseCommand: featureCounts ``` -### 4. Command arguments +# 4. 
Command arguments The easiest way to describe the command line is with an `arguments` section. This takes a comma-separated list of command line arguments. @@ -78,7 +78,7 @@ arguments: [-T, $(runtime.cores), $(inputs.counts_input_bam)] ``` -### 5. Outputs section +# 5. Outputs section In CWL, you must explicitly identify the outputs of a program. This associates output parameters with specific files, and enables the @@ -103,7 +103,7 @@ outputs: glob: featurecounts.tsv ``` -### 6. Running in a container +# 6. Running in a container In order to run the tool, it needs to be installed. Using software containers, a tool can be pre-installed into a @@ -133,7 +133,7 @@ hints: dockerPull: quay.io/biocontainers/subread:1.5.0p3--0 ``` -### 7. Running a tool on its own +# 7. Running a tool on its own When creating a tool wrapper, it is helpful to run it on its own to test it. @@ -157,7 +157,7 @@ The invocation is also the same: cwl-runner featureCounts.cwl featureCounts.yaml ``` -### 8. Adding it to the workflow +# 8. Adding it to the workflow Now that we have confirmed that it works, we can add it to our workflow. We add it to `steps`, connecting the output of samtools to diff --git a/_episodes/05-scatter.md b/_episodes/05-scatter.md index 6160bae..bc53672 100644 --- a/_episodes/05-scatter.md +++ b/_episodes/05-scatter.md @@ -10,12 +10,10 @@ keypoints: - "First key point. Brief Answer to questions. (FIXME)" --- -# Analyzing multiple samples - Analyzing a single sample is great, but in the real world you probably have a batch of samples that you need to analyze and then compare. -### 1. Subworkflows +# 1. Subworkflows In addition to running command line tools, a workflow step can also execute another workflow. @@ -60,7 +58,7 @@ requirements: If you run this workflow, you will get exactly the same results as before, we've just wrapped the inner workflow with an outer workflow. -### 2. Scattering +# 2. Scattering The wrapper lets us do something useful. 
We can modify the outer workflow to accept a list of files, and then invoke the inner workflow @@ -116,7 +114,7 @@ requirements: ScatterFeatureRequirement: {} ``` -### 3. Running with list inputs +# 3. Running with list inputs The `fq` parameter needs to be a list. You write a list in yaml by starting each list item with a dash. Example `main-input.yaml` @@ -151,7 +149,7 @@ gtf: Now you can run the workflow the same way as in Lesson 2. -### 4. Combining results +# 4. Combining results Each instance of the alignment workflow produces its own featureCounts file. However, to be able to compare results easily, we need them a diff --git a/_episodes/06-expressions.md b/_episodes/06-expressions.md index 7b83de6..54a5d32 100644 --- a/_episodes/06-expressions.md +++ b/_episodes/06-expressions.md @@ -1,5 +1,5 @@ --- -title: "Dynamic Workflow behavior with expressions" +title: "Dynamic workflows with expressions" teaching: 0 exercises: 0 questions: @@ -10,7 +10,7 @@ keypoints: - "First key point. Brief Answer to questions. (FIXME)" --- -### 1. Expressions on step inputs +# 1. Expressions on step inputs You might have noticed that the output bam files are all named `Aligned.sortedByCoord.out.bam`. This happens because because when we @@ -64,7 +64,7 @@ adds the remainder of the string, which just is a dot `.`. This is to separate the leading part of our filename from the "Aligned.bam" extension that will be added by STAR. -### 2. Organizing output files into Directories +# 2. Organizing output files into Directories You probably noticed that all the output files appear in the same directory. You might prefer that each file appears in its own diff --git a/_episodes/07-resources.md b/_episodes/07-resources.md index 81fd2e1..0ac9e5f 100644 --- a/_episodes/07-resources.md +++ b/_episodes/07-resources.md @@ -15,32 +15,31 @@ developing a CWL workflow. There are many resources out there to further help you use CWL to solve your own scientific workflow problems. 
-## CWL Reference +# CWL Reference -Main CWL web page https://commonwl.org +[Main CWL web page](https://commonwl.org) -User guide https://www.commonwl.org/user_guide/ +[User guide](https://www.commonwl.org/user_guide/) -Specification https://www.commonwl.org/v1.2/ +[Specification](https://www.commonwl.org/v1.2/) -Github organization https://github.com/common-workflow-language/ +[Github organization](https://github.com/common-workflow-language/) -## CWL Community +# CWL Community -CWL Forum, this is is best place to ask questions https://cwl.discourse.group/ +The [CWL Forum](https://cwl.discourse.group/) is the best place to ask questions -Gitter (chat) https://gitter.im/common-workflow-language/common-workflow-language +[Gitter (chat)](https://gitter.im/common-workflow-language/common-workflow-language) -Weekly video calls https://cwl.discourse.group/t/eu-us-timezone-cwl-video-chat/260 +[Weekly video calls](https://cwl.discourse.group/t/eu-us-timezone-cwl-video-chat/260) -## Software resources +# Software resources -Github organization for repositories of CWL tool and workflow -descriptions, including bio-cwl-tools -https://github.com/common-workflow-library/ +Github organization for [repositories of CWL tool and workflow descriptions](https://github.com/common-workflow-library/), +including [bio-cwl-tools](https://github.com/common-workflow-library/bio-cwl-tools). 
-BioContainers https://biocontainers.pro/ +[BioContainers](https://biocontainers.pro/) -Search for CWL files on github, try adding the name of a tool you are -interested in to the search -https://github.com/search?q=extension%3Acwl+cwlVersion +[Search for CWL files](https://github.com/search?q=extension%3Acwl+cwlVersion) on +Github, try adding the name of a tool you are interested in to the +search diff --git a/setup.md b/setup.md index b8c5032..f907ec7 100644 --- a/setup.md +++ b/setup.md @@ -1,7 +1,75 @@ --- title: Setup --- -FIXME + +# Setting up a practice repository + +We will create a new git repository and import a library of existing +tool definitions that will help us build our workflow. + +Create a new git repository to hold our workflow with this command: + +``` +git init rnaseq-cwl-training-exercises +``` + +On Arvados use this: + +``` +git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises +``` + +Next, import bio-cwl-tools with this command: + +``` +git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git +``` + +# Downloading sample and reference data + +Start from your rnaseq-cwl-training-exercises directory. + +``` +mkdir rnaseq +cd rnaseq +wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=9178fe1b80a08a422dbe02adfd439764+925/ +``` + +# Downloading or generating STAR index + +Running STAR requires index files generated from the reference. + +This is a rather large download (4 GB). Depending on your bandwidth, it may be faster to generate it yourself. 
+ +## Downloading + +``` +mkdir hg19-chr1-STAR-index +cd hg19-chr1-STAR-index +wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=02a12ce9e2707610991bd29d38796b57+2912/ +``` + +## Generating + +Create `chr1-star-index.yaml`: + +``` +InputFiles: + - class: File + location: rnaseq/reference_data/chr1.fa + format: http://edamontology.org/format_1930 +IndexName: 'hg19-chr1-STAR-index' +Gtf: + class: File + location: rnaseq/reference_data/chr1-hg19_genes.gtf +Overhang: 99 +``` + +Generate the index with your local cwl-runner. + +``` +cwl-runner bio-cwl-tools/STAR/STAR-Index.cwl chr1-star-index.yaml +``` {% include links.md %}