Formatting & content WIP

author Peter Amstutz <peter.amstutz@curii.com>

Tue, 26 Jan 2021 22:41:21 +0000 (17:41 -0500)

committer Peter Amstutz <peter.amstutz@curii.com>

Tue, 26 Jan 2021 22:41:21 +0000 (17:41 -0500)
diff --git a/_config.yml b/_config.yml

index a67f14b1d1f4da2ccebd8df341ddb6becf2a317e..7ddda3bd1718872e8664a6ed354877e77593c50e 100644 (file)
--- a/_config.yml
+++ b/_config.yml
@@ -11,7 +11,7 @@
  carpentry: "swc"
  
  # Overall title for pages.
-title: "Lesson Title"
+title: "Getting started with CWL"
  
  # Life cycle stage of the lesson
  # See this page for more details: https://cdh.carpentries.org/the-lesson-life-cycle.html
diff --git a/_episodes/01-introduction.md b/_episodes/01-introduction.md

index fa10f796d02d55a2be56d242e8249428a1f28d99..8ee870e5f36396d728382eb06ff5c4a05386824a 100644 (file)
--- a/_episodes/01-introduction.md
+++ b/_episodes/01-introduction.md
@@ -1,40 +1,56 @@
  ---
  title: "Introduction"
-teaching: 0
+teaching: 10
  exercises: 0
  questions:
-- "Key question (FIXME)"
+- "What is CWL?"
+- "What is the goal of this training?"
  objectives:
  - "First learning objective. (FIXME)"
  keypoints:
  - "First key point. Brief Answer to questions. (FIXME)"
  ---
  
-## Introduction
+# Introduction to Common Workflow Language
  
-The goal of this training is to walk through the development of a
-best-practices CWL workflow by translating an existing bioinformatics
-shell script into CWL.  Specific knowledge of the biology of RNA-seq
-is *not* a prerequisite for these lessons.
+The Common Workflow Language (CWL) is an open standard for describing
+analysis workflows and tools in a way that makes them portable and
+scalable across a variety of software and hardware environments, from
+workstations to cluster, cloud, and high performance computing (HPC)
+environments. CWL is designed to meet the needs of data-intensive
+science, such as Bioinformatics, Medical Imaging, Astronomy, High
+Energy Physics, and Machine Learning.
  
-These lessons are based on "Introduction to RNA-seq using
-high-performance computing (HPC)" lessons developed by members of the
-teaching team at the Harvard Chan Bioinformatics Core (HBC).  The
-original training, which includes additional lectures about the
-biology of RNA-seq can be found here:
+# Introduction to this training
  
-https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2
+The goal of this training is to walk the student through the
+development of a best-practices CWL workflow, starting from an
+existing shell script that performs a common bioinformatics analysis.
  
-## Background
+Specific knowledge of the biology of RNA-seq is *not* a prerequisite
+for these lessons.  CWL is not specific to bioinformatics.  We
+hope that you will find this training useful even if you work in some
+other field of research.
  
-RNA-seq is the process of sequencing RNA in a biological sample.  From
-the sequence reads, we want to measure the relative number of RNA
-molecules appearing in the sample that were produced by particular
-genes.  This analysis is called "differential gene expression".
+These lessons are based on [Introduction to RNA-seq using
+high-performance computing
+(HPC)](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2) lessons
+developed by members of the teaching team at the Harvard Chan
+Bioinformatics Core (HBC).  The original training, which includes
+additional lectures about the biology of RNA-seq, can be found at that
+link.
+
+# Introduction to the example analysis
+
+RNA-seq is the process of sequencing RNA present in a biological
+sample.  From the sequence reads, we want to measure the relative
+numbers of different RNA molecules appearing in the sample that were
+produced by particular genes.  This analysis is called "differential
+gene expression".
  
  The entire process looks like this:
  
-![](/assets/img/RNAseqWorkflow.png)
+![](/assets/img/RNAseqWorkflow.png){: height="400px"}
  
  For this training, we are only concerned with the middle analytical
  steps (skipping adapter trimming).
@@ -43,13 +59,10 @@ steps (skipping adapter trimming).
  * Alignment (mapping)
  * Counting reads associated with genes
  
-## Analysis shell script
-
-This analysis is already available as a Unix shell script, which we
-will refer to in order to build the workflow.
-
-Some of the reasons to use CWL over a plain shell script: portability,
-scalability, ability to run on platforms that are not traditional HPC.
+In this training, we are not attempting to develop the analysis from
+scratch; instead, we will start from an analysis written as a
+shell script.  We will use the following shell script as a guide to build
+our workflow.
  
  rnaseq_analysis_on_input_file.sh
  
diff --git a/_episodes/02-workflow.md b/_episodes/02-workflow.md

index a3700a9def6b1cf2727dbdcd609e959c78ca9d6c..cfb133cb47fb3bed9f18181d1f6fce557e4b6dee 100644 (file)
--- a/_episodes/02-workflow.md
+++ b/_episodes/02-workflow.md
@@ -1,5 +1,5 @@
  ---
-title: "Turning a shell script into a workflow by composing existing tools"
+title: "Make a workflow by composing tools"
  teaching: 0
  exercises: 0
  questions:
@@ -10,32 +10,7 @@ keypoints:
  - "First key point. Brief Answer to questions. (FIXME)"
  ---
  
-# Setting up
-
-We will create a new git repository and import a library of existing
-tool definitions that will help us build our workflow.
-
-Create a new git repository to hold our workflow with this command:
-
-```
-git init rnaseq-cwl-training-exercises
-```
-
-On Arvados use this:
-
-```
-git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
-```
-
-Next, import bio-cwl-tools with this command:
-
-```
-git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
-```
-
-# Writing the workflow
-
-## 1. File header
+# 1. File header
  
  Create a new file "main.cwl"
  
@@ -48,7 +23,7 @@ class: Workflow
  label: RNAseq CWL practice workflow
  ```
  
-## 2. Workflow Inputs
+# 2. Workflow Inputs
  
  The purpose of a workflow is to consume some input parameters, run a
  series of steps, and produce output values.
@@ -81,7 +56,7 @@ inputs:
    gtf: File
  ```
  
-## 3. Workflow Steps
+# 3. Workflow Steps
  
  A workflow consists of one or more steps.  This is the `steps` section.
  
@@ -116,7 +91,7 @@ steps:
      out: [html_file]
  ```
  
-## 4. Running alignment with STAR
+# 4. Running alignment with STAR
  
  STAR has more parameters.  Sometimes we want to provide input values
  to a step without making them as workflow-level inputs.  We can do
@@ -138,7 +113,7 @@ this with `{default: N}`
      out: [alignment]
  ```
  
-## 5. Running samtools
+# 5. Running samtools
  
  The third step is to generate an index for the aligned BAM.
  
@@ -157,7 +132,7 @@ step will not run until the `STAR` step has completed successfully.
      out: [bam_sorted_indexed]
  ```
  
-## 6. featureCounts
+# 6. featureCounts
  
  As of this writing, the `subread` package that provides
  `featureCounts` is not available in bio-cwl-tools (and if it has been
@@ -165,7 +140,7 @@ added since writing this, let's pretend that it isn't there.)  We will
  go over how to write a CWL wrapper for a command line tool in
  lesson 3.  For now, we will leave off the final step.
  
-## 7. Workflow Outputs
+# 7. Workflow Outputs
  
  The last thing to do is declare the workflow outputs in the `outputs` section.
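Reviewer sketch (not part of the patch): the `outputs` section mentioned in this context line could look roughly like the fragment below. Each workflow output names a type and an `outputSource` of the form `step_id/output_id`. The step outputs `html_file` and `bam_sorted_indexed` appear in this diff; the step ids `fastqc` and `samtools` and the output id `qc_html` are assumptions made for illustration.

```yaml
# Hypothetical `outputs` section for main.cwl; step ids are assumed.
outputs:
  qc_html:
    type: File
    outputSource: fastqc/html_file            # assumed step id `fastqc`
  bam_sorted_indexed:
    type: File
    outputSource: samtools/bam_sorted_indexed # assumed step id `samtools`
```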
  
diff --git a/_episodes/03-running.md b/_episodes/03-running.md

index b851a2959f70af583ad60987b7fb36069941cdcc..c6ff7d53642b0ddb1a8dcb89b5e97d7f78d63d43 100644 (file)
--- a/_episodes/03-running.md
+++ b/_episodes/03-running.md
@@ -10,9 +10,7 @@ keypoints:
  - "First key point. Brief Answer to questions. (FIXME)"
  ---
  
-# Running and debugging a workflow
-
-### 1. The input parameter file
+# 1. The input parameter file
  
  CWL input values are provided in the form of a YAML or JSON file.
  Create one by right clicking on the explorer, select "New File" and
@@ -26,7 +24,7 @@ When setting inputs, Files and Directories are given as an object with
  `class: File` or `class: Directory`.  This distinguishes them from
  plain strings that may or may not be file paths.
  
-Note: if you don't have example sequence data or the STAR index files, see the Appendix below.
+Note: if you don't have example sequence data or the STAR index files, see [setup](/setup.html).
  
  ```
  fq:
@@ -56,7 +54,7 @@ gtf:
    location: keep:9178fe1b80a08a422dbe02adfd439764+925/reference_data/chr1-hg19_genes.gtf
  ```
  
-### 2. Running the workflow
+# 2. Running the workflow
  
  Type this into the terminal:
  
@@ -64,7 +62,7 @@ Type this into the terminal:
  cwl-runner main.cwl main-input.yaml
  ```
  
-### 3. Debugging the workflow
+# 3. Debugging the workflow
  
  A workflow can fail for many reasons: some possible reasons include
  bad input, bugs in the code, or running out of memory.  In this case, the
@@ -92,7 +90,7 @@ Container exited with code: 137
  
  If this happens, you will need to request more RAM.
  
-### 4. Setting runtime RAM requirements
+# 4. Setting runtime RAM requirements
  
  By default, a step is allocated 256 MB of RAM.  From the STAR error message:
  
@@ -119,7 +117,7 @@ Resource requirements you can set include:
  
  After setting the RAM requirements, re-run the workflow.
  
-### 5. Workflow results
+# 5. Workflow results
  
  The CWL runner will print a results JSON object to standard output.  It will look something like this (it may include additional fields).
  
@@ -152,51 +150,3 @@ The CWL runner will print a results JSON object to standard output.  It will loo
  This has the same structure as `main-input.yaml`.  Each output
  parameter is listed, with the `location` field of each `File` object
  indicating where the output file can be found.
-
-# Appendix
-
-## Downloading sample and reference data
-
-Start from your rnaseq-cwl-exercises directory.
-
-```
-mkdir rnaseq
-cd rnaseq
-wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=9178fe1b80a08a422dbe02adfd439764+925/
-```
-
-## Downloading or generating STAR index
-
-Running STAR requires index files generated from the reference.
-
-This is a rather large download (4 GB).  Depending on your bandwidth, it may be faster to generate it yourself.
-
-### Downloading
-
-```
-mkdir hg19-chr1-STAR-index
-cd hg19-chr1-STAR-index
-wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=02a12ce9e2707610991bd29d38796b57+2912/
-```
-
-### Generating
-
-Create `chr1-star-index.yaml`:
-
-```
-InputFiles:
-  - class: File
-    location: rnaseq/reference_data/chr1.fa
-    format: http://edamontology.org/format_1930
-IndexName: 'hg19-chr1-STAR-index'
-Gtf:
-  class: File
-  location: rnaseq/reference_data/chr1-hg19_genes.gtf
-Overhang: 99
-```
-
-Generate the index with your local cwl-runner.
-
-```
-cwl-runner bio-cwl-tools/STAR/STAR-Index.cwl chr1-star-index.yaml
-```
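Reviewer sketch (not part of the patch): for the "Setting runtime RAM requirements" section retitled in this file, a RAM request in CWL is normally expressed with `ResourceRequirement`, where `ramMin` is given in mebibytes. The value below is hypothetical, and attaching the requirement to the `STAR` step is an assumption; the real figure should come from the STAR error message quoted in the lesson.

```yaml
# Fragment to place under the STAR step in main.cwl (placement is an assumption).
requirements:
  ResourceRequirement:
    ramMin: 9000  # mebibytes; hypothetical value, use the amount STAR reports it needs
```

A container exit code of 137 usually means the process was killed with SIGKILL (128 + 9), which is consistent with hitting a memory limit.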
diff --git a/_episodes/04-commandlinetool.md b/_episodes/04-commandlinetool.md

index 0575a01561fd71919f991a11bf925592f31f5943..cae16826d6d11d8308e038fc1a7780be6d2b9a0f 100644 (file)
--- a/_episodes/04-commandlinetool.md
+++ b/_episodes/04-commandlinetool.md
@@ -14,7 +14,7 @@ It is time to add the last step in the analysis.
  
  This will use the "featureCounts" tool from the "subread" package.
  
-### 1. File header
+# 1. File header
  
  Create a new file "featureCounts.cwl"
  
@@ -25,7 +25,7 @@ cwlVersion: v1.2
  class: CommandLineTool
  ```
  
-### 2. Command line tool inputs
+# 2. Command line tool inputs
  
  A CommandLineTool describes a single invocation of a command line program.
  
@@ -50,7 +50,7 @@ inputs:
    counts_input_bam: File
  ```
  
-### 3. Specifying the program to run
+# 3. Specifying the program to run
  
  Give the name of the program to run in `baseCommand`.
  
@@ -58,7 +58,7 @@ Give the name of the program to run in `baseCommand`.
  baseCommand: featureCounts
  ```
  
-### 4. Command arguments
+# 4. Command arguments
  
  The easiest way to describe the command line is with an `arguments`
  section.  This takes a comma-separated list of command line arguments.
@@ -78,7 +78,7 @@ arguments: [-T, $(runtime.cores),
              $(inputs.counts_input_bam)]
  ```
  
-### 5. Outputs section
+# 5. Outputs section
  
  In CWL, you must explicitly identify the outputs of a program.  This
  associates output parameters with specific files, and enables the
@@ -103,7 +103,7 @@ outputs:
        glob: featurecounts.tsv
  ```
  
-### 6. Running in a container
+# 6. Running in a container
  
  In order to run the tool, it needs to be installed.
  Using software containers, a tool can be pre-installed into a
@@ -133,7 +133,7 @@ hints:
      dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
  ```
  
-### 7. Running a tool on its own
+# 7. Running a tool on its own
  
  When creating a tool wrapper, it is helpful to run it on its own to test it.
  
@@ -157,7 +157,7 @@ The invocation is also the same:
  cwl-runner featureCounts.cwl featureCounts.yaml
  ```
  
-### 8. Adding it to the workflow
+# 8. Adding it to the workflow
  
  Now that we have confirmed that it works, we can add it to our workflow.
  We add it to `steps`, connecting the output of samtools to
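Reviewer sketch (not part of the patch): the step this hunk's context describes might be wired up roughly as follows. `featureCounts.cwl` and the input `counts_input_bam` appear in this diff; the step id `samtools`, the tool input name `gtf`, and the output id `featurecounts` are assumptions made for illustration.

```yaml
# Fragment to add under `steps:` in main.cwl; names marked "assumed" are not in this diff.
  featureCounts:
    run: featureCounts.cwl
    in:
      counts_input_bam: samtools/bam_sorted_indexed  # assumed step id `samtools`
      gtf: gtf                                       # assumed tool input name
    out: [featurecounts]                             # assumed output id
```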
diff --git a/_episodes/05-scatter.md b/_episodes/05-scatter.md

index 6160baeb27cec93a0be94030d08b52f57add9a3f..bc536727b9824a72446eeab60d43a716d9809b15 100644 (file)
--- a/_episodes/05-scatter.md
+++ b/_episodes/05-scatter.md
@@ -10,12 +10,10 @@ keypoints:
  - "First key point. Brief Answer to questions. (FIXME)"
  ---
  
-# Analyzing multiple samples
-
  Analyzing a single sample is great, but in the real world you probably
  have a batch of samples that you need to analyze and then compare.
  
-### 1. Subworkflows
+# 1. Subworkflows
  
  In addition to running command line tools, a workflow step can also
  execute another workflow.
@@ -60,7 +58,7 @@ requirements:
  If you run this workflow, you will get exactly the same results as
  before, we've just wrapped the inner workflow with an outer workflow.
  
-### 2. Scattering
+# 2. Scattering
  
  The wrapper lets us do something useful.  We can modify the outer
  workflow to accept a list of files, and then invoke the inner workflow
@@ -116,7 +114,7 @@ requirements:
    ScatterFeatureRequirement: {}
  ```
  
-### 3. Running with list inputs
+# 3. Running with list inputs
  
  The `fq` parameter needs to be a list.  You write a list in yaml by
  starting each list item with a dash.  Example `main-input.yaml`
@@ -151,7 +149,7 @@ gtf:
  
  Now you can run the workflow the same way as in Lesson 2.
  
-### 4. Combining results
+# 4. Combining results
  
  Each instance of the alignment workflow produces its own featureCounts
  file.  However, to be able to compare results easily, we need them a
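Reviewer sketch (not part of the patch): the scattering approach described in this file could be outlined as below. `ScatterFeatureRequirement` appears in this diff, and `SubworkflowFeatureRequirement` is the standard CWL requirement for running a workflow as a step. The file name `alignment.cwl`, the step id `alignment`, and the listed outputs are assumptions; any other inputs of the inner workflow are omitted here.

```yaml
# Hypothetical outer workflow fragment; cwlVersion, class and outputs are omitted.
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}

inputs:
  fq: File[]             # a list of FASTQ files instead of a single File
  gtf: File

steps:
  alignment:
    run: alignment.cwl   # assumed file name for the inner workflow
    scatter: fq          # run the inner workflow once per element of fq
    in:
      fq: fq
      gtf: gtf
    out: [bam_sorted_indexed]  # assumed output id; becomes a list when scattered
```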
diff --git a/_episodes/06-expressions.md b/_episodes/06-expressions.md

index 7b83de6d28c831f5b6f124f39377c0705d7df7ac..54a5d32b9065a9534370521f3b7ad4ce46892d87 100644 (file)
--- a/_episodes/06-expressions.md
+++ b/_episodes/06-expressions.md
@@ -1,5 +1,5 @@
  ---
-title: "Dynamic Workflow behavior with expressions"
+title: "Dynamic workflows with expressions"
  teaching: 0
  exercises: 0
  questions:
@@ -10,7 +10,7 @@ keypoints:
  - "First key point. Brief Answer to questions. (FIXME)"
  ---
  
-### 1. Expressions on step inputs
+# 1. Expressions on step inputs
  
  You might have noticed that the output bam files are all named
  `Aligned.sortedByCoord.out.bam`.  This happens because when we
@@ -64,7 +64,7 @@ adds the remainder of the string, which just is a dot `.`.  This is to
  separate the leading part of our filename from the "Aligned.bam"
  extension that will be added by STAR.
  
-### 2. Organizing output files into Directories
+# 2. Organizing output files into Directories
  
  You probably noticed that all the output files appear in the same
  directory.  You might prefer that each file appears in its own
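Reviewer sketch (not part of the patch): the step-input expression discussed in section 1 of this file could take the following shape. `nameroot` is a standard CWL `File` property (the base name without its extension), and `StepInputExpressionRequirement` is needed to use `valueFrom` on a step input. The input name `OutFileNamePrefix` is an assumption about the STAR tool wrapper, and the step's `run` and `out` fields are omitted.

```yaml
# Hypothetical fragment of main.cwl; only the renaming logic is shown.
requirements:
  StepInputExpressionRequirement: {}

steps:
  STAR:
    in:
      OutFileNamePrefix:               # assumed input name on the STAR wrapper
        source: fq
        valueFrom: "$(self.nameroot)."  # base name of the input plus a trailing dot
```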
diff --git a/_episodes/07-resources.md b/_episodes/07-resources.md

index 81fd2e1bd4915c67cb25c51f73ac67526480561f..0ac9e5f053d5de44887b2337493f03aa85fb8b1e 100644 (file)
--- a/_episodes/07-resources.md
+++ b/_episodes/07-resources.md
@@ -15,32 +15,31 @@ developing a CWL workflow. There are many resources out there to
  further help you use CWL to solve your own scientific workflow
  problems.
  
-## CWL Reference
+# CWL Reference
  
-Main CWL web page https://commonwl.org
+[Main CWL web page](https://commonwl.org)
  
-User guide https://www.commonwl.org/user_guide/
+[User guide](https://www.commonwl.org/user_guide/)
  
-Specification https://www.commonwl.org/v1.2/
+[Specification](https://www.commonwl.org/v1.2/)
  
-Github organization https://github.com/common-workflow-language/
+[Github organization](https://github.com/common-workflow-language/)
  
-## CWL Community
+# CWL Community
  
-CWL Forum, this is is best place to ask questions https://cwl.discourse.group/
+The [CWL Forum](https://cwl.discourse.group/) is the best place to ask questions.
  
-Gitter (chat) https://gitter.im/common-workflow-language/common-workflow-language
+[Gitter (chat)](https://gitter.im/common-workflow-language/common-workflow-language)
  
-Weekly video calls https://cwl.discourse.group/t/eu-us-timezone-cwl-video-chat/260
+[Weekly video calls](https://cwl.discourse.group/t/eu-us-timezone-cwl-video-chat/260)
  
-## Software resources
+# Software resources
  
-Github organization for repositories of CWL tool and workflow
-descriptions, including bio-cwl-tools
-https://github.com/common-workflow-library/
+Github organization for [repositories of CWL tool and workflow descriptions](https://github.com/common-workflow-library/),
+including [bio-cwl-tools](https://github.com/common-workflow-library/bio-cwl-tools).
  
-BioContainers https://biocontainers.pro/
+[BioContainers](https://biocontainers.pro/)
  
-Search for CWL files on github, try adding the name of a tool you are
-interested in to the search
-https://github.com/search?q=extension%3Acwl+cwlVersion
+[Search for CWL files](https://github.com/search?q=extension%3Acwl+cwlVersion) on
+Github; try adding the name of a tool you are interested in to the
+search.
diff --git a/setup.md b/setup.md

index b8c50321d8b07f8a76f8e925416957c3f274012e..f907ec716ffe9f6e5bbf86725bf02b1ee7d8e631 100644 (file)
--- a/setup.md
+++ b/setup.md
@@ -1,7 +1,75 @@
  ---
  title: Setup
  ---
-FIXME
+
+# Setting up a practice repository
+
+We will create a new git repository and import a library of existing
+tool definitions that will help us build our workflow.
+
+Create a new git repository to hold our workflow with this command:
+
+```
+git init rnaseq-cwl-training-exercises
+```
+
+On Arvados use this:
+
+```
+git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
+```
+
+Next, import bio-cwl-tools with this command:
+
+```
+git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
+```
+
+# Downloading sample and reference data
+
+Start from your rnaseq-cwl-training-exercises directory.
+
+```
+mkdir rnaseq
+cd rnaseq
+wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=9178fe1b80a08a422dbe02adfd439764+925/
+```
+
+# Downloading or generating STAR index
+
+Running STAR requires index files generated from the reference.
+
+This is a rather large download (4 GB).  Depending on your bandwidth, it may be faster to generate it yourself.
+
+## Downloading
+
+```
+mkdir hg19-chr1-STAR-index
+cd hg19-chr1-STAR-index
+wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=02a12ce9e2707610991bd29d38796b57+2912/
+```
+
+## Generating
+
+Create `chr1-star-index.yaml`:
+
+```
+InputFiles:
+  - class: File
+    location: rnaseq/reference_data/chr1.fa
+    format: http://edamontology.org/format_1930
+IndexName: 'hg19-chr1-STAR-index'
+Gtf:
+  class: File
+  location: rnaseq/reference_data/chr1-hg19_genes.gtf
+Overhang: 99
+```
+
+Generate the index with your local cwl-runner.
+
+```
+cwl-runner bio-cwl-tools/STAR/STAR-Index.cwl chr1-star-index.yaml
+```
  
  
  {% include links.md %}