From 1dcbce0b09a71abaec4b1e72dd70a2f573e4dc6b Mon Sep 17 00:00:00 2001
From: Peter Amstutz <peter.amstutz@curii.com>
Date: Mon, 18 Jan 2021 17:51:10 -0500
Subject: [PATCH] Running is lesson 2.

Arvados-DCO-1.1-Signed-off-by: Peter Amstutz <peter.amstutz@curii.com>
---
 README.md                 |  15 ---
 dockerfile/Dockerfile     |   2 +-
 lesson1/lesson1.md        | 183 ++++++++++++++++++++++++++++++++-
 lesson2/lesson2.md        | 209 ++++++++++++++++++++++++++++++++++++++
 scripts/genomeGenerate.sh |  10 --
 5 files changed, 392 insertions(+), 27 deletions(-)
 create mode 100644 lesson2/lesson2.md
 delete mode 100755 scripts/genomeGenerate.sh

diff --git a/README.md b/README.md
index fd9a630..f35a870 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,4 @@
 
-# Downloading reference data
-
-```
-mkdir rnaseq
-cd rnaseq
-wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=pirca-4zz18-sa9s62lzc00jeds/
-```
-
-# Building the Docker image
-
-```
-cd dockerfile
-docker build -t cwl-training-rnaseq-tools .
-```
-
 # Acknowledgements
 
 These CWL lessons are based on "Introduction to RNA-seq using
diff --git a/dockerfile/Dockerfile b/dockerfile/Dockerfile
index 69a9327..c73bcf2 100644
--- a/dockerfile/Dockerfile
+++ b/dockerfile/Dockerfile
@@ -1,2 +1,2 @@
 FROM debian:10
-RUN apt-get update && apt-get -y --no-install-recommends install rna-star fastqc samtools subread
\ No newline at end of file
+RUN apt-get update && apt-get -y --no-install-recommends install subread
\ No newline at end of file
diff --git a/lesson1/lesson1.md b/lesson1/lesson1.md
index 7d4e1b1..060597a 100644
--- a/lesson1/lesson1.md
+++ b/lesson1/lesson1.md
@@ -1,3 +1,184 @@
-# Turning a bash script into a workflow
+# Turning a bash script into a workflow using existing tools
 
 In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow.
+
+# Setting up
+
+We will create a new git repository and import a library of existing
+tool definitions that will help us build our workflow.
+
+1. Select "Terminal->New terminal"
+
+2. Create a new git repository to hold our workflow with this command:
+
+## Arvados
+
+```
+git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
+```
+
+## Generic
+
+```
+git init rnaseq-cwl-training-exercises
+```
+
+
+3. Go to File->Open Folder and select rnaseq-cwl-training-exercises
+
+4. Go to the terminal window
+
+5. Import bio-cwl-tools with this command:
+
+```
+git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
+```
+
+# Writing the workflow
+
+1. Create a new file "main.cwl"
+
+2. Start with this header.
+
+
+```
+cwlVersion: v1.2
+class: Workflow
+label: RNAseq CWL practice workflow
+```
+
+3. Workflow Inputs
+
+The purpose of a workflow is to consume some input parameters, run a
+series of steps, and produce output values.
+
+For this analysis, the input parameters are the fastq file and the reference data required by STAR.
+
+In CWL, these are declared in the `inputs` section.
+
+The inputs section lists each input parameter and its type.  Valid
+types include `File`, `Directory`, `string`, `boolean`, `int`, and
+`float`.
+
+In this case, the fastq and gene annotation file are individual files.  The STAR index is a directory.  We can describe these inputs in CWL like this:
+
+```
+inputs:
+  fq: File
+  genome: Directory
+  gtf: File
+```
+
+4. Workflow Steps
+
+A workflow consists of one or more steps.  This is the `steps` section.
+
+Now we need to describe the first step of the workflow.  This step is to run `fastqc`.
+
+A workflow step consists of the name of the step, the tool to `run`,
+the input parameters to be passed to the tool in `in`, and the output
+parameters expected from the tool in `out`.
+
+The value of `run` references the tool file.  Tip: while typing the
+file name, you can get suggestions and auto-completion on a partial
+name using control+space.
+
+The `in` block lists input parameters to the tool and the workflow
+parameters that will be assigned to those inputs.
+
+The `out` block lists output parameters to the tool that are used
+later in the workflow.
+
+You need to know which input and output parameters are available for
+each tool.  In vscode, click on the value of `run` and select "Go to
+definition" to open the tool file.  Look for the `inputs` and
+`outputs` sections of the tool file to find out what parameters are
+defined.
+
+```
+steps:
+  fastqc:
+    run: bio-cwl-tools/fastqc/fastqc_2.cwl
+    in:
+	  reads_file: fq
+    out: [html_file, summary_file]
+```
+
+5. Running alignment with STAR
+
+STAR has more parameters.  Sometimes we want to provide input values
+to a step without making them as workflow-level inputs.  We can do
+this with `{default: N}`
+
+
+```
+  STAR:
+    requirements:
+      ResourceRequirement:
+        ramMin: 6000
+    run: bio-cwl-tools/STAR/STAR-Align.cwl
+    in:
+      RunThreadN: {default: 4}
+      GenomeDir: genome
+      ForwardReads: fq
+      OutSAMtype: {default: BAM}
+      OutSAMunmapped: {default: Within}
+    out: [alignment]
+```
+
+6. Running samtools
+
+The third step is to generate an index for the aligned BAM.
+
+For this step, we need to use the output of a previous step as input
+to this step.  We refer the output of a step by with name of the step
+(STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment`
+
+This creates a dependency between steps.  This means the `samtools`
+step will not run until the `STAR` step has completed successfully.
+
+```
+  samtools:
+    run: bio-cwl-tools/samtools/samtools_index.cwl
+    in:
+      bam_sorted: STAR/alignment
+    out: [bam_sorted_indexed]
+```
+
+7. featureCounts
+
+As of this writing, the `subread` package that provides
+`featureCounts` is not available in bio-cwl-tools (and if it has been
+added since writing this, let's pretend that it isn't there.)  We will
+dive into how to write a CWL wrapper for a command line tool in
+lesson 2.  For now, we will leave off the final step.
+
+8. Workflow Outputs
+
+The last thing to do is declare the workflow outputs in the `outputs` section.
+
+For each output, we need to declare the type of output, and what
+parameter has the output value.
+
+Output types are the same as input types, valid types include `File`,
+`Directory`, `string`, `boolean`, `int`, and `float`.
+
+The `outputSource` field refers the a step output in the same way that
+the `in` block does, the name of the step, a slash, and the name of
+the output parameter.
+
+For our final outputs, we want the results from fastqc and the
+aligned, sorted and indexed BAM file.
+
+```
+outputs:
+  qc_html:
+    type: File
+    outputSource: fastqc/html_file
+  qc_summary:
+    type: File
+    outputSource: fastqc/summary_file
+  bam_sorted_indexed:
+    type: File
+    outputSource: samtools/bam_sorted_indexed
+```
diff --git a/lesson2/lesson2.md b/lesson2/lesson2.md
new file mode 100644
index 0000000..a5ec6a7
--- /dev/null
+++ b/lesson2/lesson2.md
@@ -0,0 +1,209 @@
+# Running and debugging a workflow
+
+1. The input parameter file
+
+CWL input values are provided in the form of a YAML or JSON file.
+Create one by right clicking on the explorer, select "New File" and
+create a called file "input.yaml".
+
+This file gives the values for parameters declared in the `inputs`
+section of our workflow.  Our workflow takes `fq`, `genome` and `gtf`
+as input parameters.
+
+When setting inputs, Files and Directories are given as an object with
+`class: File` or `class: Directory`.  This distinguishes them from
+plain strings that may or may not be file paths.
+
+
+## Arvados
+
+```
+fq:
+  class: File
+  location: keep:9178fe1b80a08a422dbe02adfd439764+925/raw_fastq/Mov10_oe_1.subset.fq
+  format: http://edamontology.org/format_1930
+genome:
+  class: Directory
+  location: keep:02a12ce9e2707610991bd29d38796b57+2912
+gtf:
+  class: File
+  location: keep:9178fe1b80a08a422dbe02adfd439764+925/reference_data/chr1-hg19_genes.gtf
+```
+
+## Generic
+
+Note: if you don't have example sequence data or the STAR index files, see the Appendix below.
+
+```
+fq:
+  class: File
+  location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
+  format: http://edamontology.org/format_1930
+genome:
+  class: Directory
+  location: hg19-chr1-STAR-index
+gtf:
+  class: File
+  location: rnaseq/reference_data/chr1-hg19_genes.gtf
+```
+
+2. Running the workflow
+
+## Arvados
+
+In vscode, select "main.cwl" and then choose "Terminal -> Run task -> Run CWL workflow on Arvados"
+
+## Generic
+
+Type this into the terminal:
+
+```
+cwl-runner main.cwl input.yaml
+```
+
+3. Debugging the workflow
+
+A workflow can fail for many reasons: some possible reasons include
+bad input, bugs in the code, or running out memory.  In this case, the
+STAR workflow might fail with an out of memory error.
+
+To help diagnose these errors, the workflow runner produces logs that
+record what happened, either in the terminal or the web interface.
+
+Some errors you might see in the logs that would indicate an out of
+memory condition:
+
+```
+EXITING: fatal error trying to allocate genome arrays, exception thrown: std::bad_alloc
+Possible cause 1: not enough RAM. Check if you have enough RAM 5711762337 bytes
+Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 5711762337
+```
+
+or
+
+```
+Container exited with code: 137
+```
+
+(Exit code 137 most commonly occurs when a container goes "out of memory" and is terminated by the operating system).
+
+If this happens, you will need to request more RAM.
+
+4. Setting runtime RAM requirements
+
+By default, a step is allocated 256 MB of RAM.  From the STAR error message:
+
+> Check if you have enough RAM 5711762337 bytes
+
+We can see that STAR requires quite a bit more RAM than that.  To
+request more RAM, add a "requirements" section with
+"ResourceRequirement" to the "STAR" step:
+
+```
+  STAR:
+    requirements:
+      ResourceRequirement:
+        ramMin: 7000
+    run: bio-cwl-tools/STAR/STAR-Align.cwl
+```
+
+Resource requirements you can set include:
+
+* coresMin: CPU cores
+* ramMin: RAM (in megabytes)
+* tmpdirMin: temporary directory available space
+* outdirMin: output directory available space
+
+After setting the RAM requirements, re-run the workflow.
+
+5. Workflow results
+
+The CWL runner will print a results JSON object.  It will look something like this (it may include additional fields).
+
+
+```
+{
+    "bam_sorted_indexed": {
+        "location": "file:///home/username/rnaseq-cwl-training-exercises/Aligned.sortedByCoord.out.bam",
+        "basename": "Aligned.sortedByCoord.out.bam",
+        "class": "File",
+        "size": 25370707,
+        "secondaryFiles": [
+            {
+                "basename": "Aligned.sortedByCoord.out.bam.bai",
+                "location": "file:///home/username/rnaseq-cwl-training-exercises/Aligned.sortedByCoord.out.bam.bai",
+                "class": "File",
+                "size": 176552,
+            }
+        ]
+    },
+    "qc_html": {
+        "location": "file:///home/username/rnaseq-cwl-training-exercises/Mov10_oe_1.subset_fastqc.html",
+        "basename": "Mov10_oe_1.subset_fastqc.html",
+        "class": "File",
+        "size": 383589
+    },
+    "qc_summary": {
+        "location": "file:///home/username/rnaseq-cwl-training-exercises/summary.txt",
+        "basename": "summary.txt",
+        "class": "File",
+        "size": 590
+    }
+}
+```
+
+This has the same structure as `input.yaml`.  The each output
+parameter is listed, with the `location` field of each `File` object
+indicating where the output file can be found.
+
+# Appendix
+
+## Downloading sample and reference data
+
+Start from your rnaseq-cwl-exercises directory.
+
+```
+mkdir rnaseq
+cd rnaseq
+wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=9178fe1b80a08a422dbe02adfd439764+925/
+```
+
+## Downloading or generating STAR index
+
+Running STAR requires index files generated from the reference.
+
+This is a rather large download (4 GB).  Depending on your bandwidth, it may be faster to generate it yourself.
+
+### Downloading
+
+Go to the "Terminal" tab in the lower vscode panel.  If necessary, select `bash` from the dropdown list in the upper right corner.
+
+```
+mkdir hg19-chr1-STAR-index
+cd hg19-chr1-STAR-index
+wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=02a12ce9e2707610991bd29d38796b57+2912/
+```
+
+### Generating
+
+Create `chr1-star-index.yaml`:
+
+```
+InputFiles:
+  - class: File
+    location: rnaseq/reference_data/chr1.fa
+    format: http://edamontology.org/format_1930
+IndexName: 'hg19-chr1-STAR-index'
+Gtf:
+  class: File
+  location: rnaseq/reference_data/chr1-hg19_genes.gtf
+Overhang: 99
+```
+
+Next, go to the "Terminal" tab in the lower vscode panel.  If
+necessary, select `bash` from the dropdown list in the upper right
+corner.  Generate the index with your local cwl-runner.
+
+```
+cwl-runner bio-cwl-tools/STAR/STAR-Index.cwl chr1-star-index.yaml
+```
diff --git a/scripts/genomeGenerate.sh b/scripts/genomeGenerate.sh
deleted file mode 100755
index c714640..0000000
--- a/scripts/genomeGenerate.sh
+++ /dev/null
@@ -1,10 +0,0 @@
-#!/bin/bash
-
-# Generate STAR genome index
-
-STAR --runThreadN 4 \
---runMode genomeGenerate \
---genomeDir unix_lesson/reference_data \
---genomeFastaFiles unix_lesson/reference_data/chr1.fa \
---sjdbGTFfile unix_lesson/reference_data/chr1-hg19_genes.gtf \
---sjdbOverhang 99
-- 
2.30.2