Remove software carpentries logo

diff --git a/_episodes/05-scatter.md b/_episodes/05-scatter.md

index 6160baeb27cec93a0be94030d08b52f57add9a3f..224fdc9ade45f644f39f7a9d13dd129e5b92827a 100644
--- a/_episodes/05-scatter.md
+++ b/_episodes/05-scatter.md
@@ -1,28 +1,33 @@
  ---
-title: " Analyzing multiple samples"
-teaching: 0
-exercises: 0
+title: "Analyzing Multiple Samples"
+teaching: 30
+exercises: 30
  questions:
-- "Key question (FIXME)"
+- "How can you run the same workflow over multiple samples?"
  objectives:
-- "First learning objective. (FIXME)"
+- "Modify the workflow to process multiple samples, then perform a joint analysis."
  keypoints:
-- "First key point. Brief Answer to questions. (FIXME)"
+- "Separate the part of the workflow that you want to run multiple times into a subworkflow."
+- "Use a scatter step to run the subworkflow over a list of inputs."
+- "The result of a scatter is an array, which can be used in a combine step to get a single result."
  ---
  
-# Analyzing multiple samples
+In the previous lesson, we completed converting the function of the
+original source shell script into CWL.  This lesson expands the scope
+by demonstrating what changes to make to the workflow to be able to
+analyze multiple samples in parallel.
  
-Analyzing a single sample is great, but in the real world you probably
-have a batch of samples that you need to analyze and then compare.
-
-### 1. Subworkflows
+# Subworkflows
  
  In addition to running command line tools, a workflow step can also
  execute another workflow.
  
-Let's copy "main.cwl" to "alignment.cwl".
+First, copy `main.cwl` to `alignment.cwl`.
+
+Next, open `main.cwl` for editing.  We are going to replace the `steps` and `outputs` sections.
  
-Now, edit open "main.cwl" for editing.  We are going to replace the `steps` and `outputs` sections.
+Remove all the steps and replace them with a single `alignment` step
+which invokes the `alignment.cwl` we just copied.
  
  ```
  steps:
@@ -34,8 +39,9 @@ steps:
        gtf: gtf
      out: [qc_html, bam_sorted_indexed, featurecounts]
  ```
+{: .language-yaml }
  
-In the outputs section, all the output sources are from the alignment step:
+In the `outputs` section, all the output sources are from the alignment step:
  
  ```
  outputs:
@@ -49,23 +55,37 @@ outputs:
      type: File
      outputSource: alignment/featurecounts
  ```
+{: .language-yaml }
  
-We also need a little boilerplate to tell the workflow runner that we want to use subworkflows:
+We also need to add "SubworkflowFeatureRequirement" to tell the workflow
+runner that we are using subworkflows:
  
  ```
  requirements:
    SubworkflowFeatureRequirement: {}
  ```
+{: .language-yaml }
+
+> ## Running the workflow
+>
+> Run this workflow.  You should get exactly the same results as
+> before, as all we have done so far is to wrap the inner workflow with
+> an outer workflow.
+>
+{: .challenge }
  
-If you run this workflow, you will get exactly the same results as
-before, we've just wrapped the inner workflow with an outer workflow.
+> ## Part 1 solution
+> * <a href="../assets/answers/ep5/part1/main.cwl">main.cwl</a>
+> * <a href="../assets/answers/ep5/part1/alignment.cwl">alignment.cwl</a>
+> * <a href="../assets/answers/ep5/part1/featureCounts.cwl">featureCounts.cwl</a>
+{: .solution}
  
-### 2. Scattering
+# Scattering
  
-The wrapper lets us do something useful.  We can modify the outer
-workflow to accept a list of files, and then invoke the inner workflow
-step for every one of those files.  We will need to modify the
-`inputs`, `steps`, `outputs`, and `requirements` sections.
+The "wrapper" step lets us do something useful.  We can modify the
+outer workflow to accept a list of files, and then invoke the inner
+workflow step for every one of those files.  We will need to modify
+the `inputs`, `steps`, `outputs`, and `requirements` sections.
  
  First we change the `fq` parameter to expect a list of files:
  
@@ -75,9 +95,11 @@ inputs:
    genome: Directory
    gtf: File
  ```
+{: .language-yaml }
  
-Next, we add `scatter` to the alignment step.  The means it will
-run `alignment.cwl` for each value in the list in the `fq` parameter.
+Next, we add `scatter` to the alignment step.  This means we want to
+run `alignment.cwl` once for each value in the list in the `fq`
+parameter.
  
  ```
  steps:
@@ -90,6 +112,7 @@ steps:
        gtf: gtf
      out: [qc_html, bam_sorted_indexed, featurecounts]
  ```
+{: .language-yaml }
  
  Because the scatter produces multiple outputs, each output parameter
  becomes a list as well:
@@ -106,17 +129,25 @@ outputs:
      type: File[]
      outputSource: alignment/featurecounts
  ```
+{: .language-yaml }
  
-Finally, we need a little more boilerplate to tell the workflow runner
-that we want to use scatter:
+We also need to add "ScatterFeatureRequirement" to tell the workflow
+runner that we are using scatter:
  
  ```
  requirements:
    SubworkflowFeatureRequirement: {}
    ScatterFeatureRequirement: {}
  ```
+{: .language-yaml }
  
-### 3. Running with list inputs
+> ## Part 2 solution
+> * <a href="../assets/answers/ep5/part2/main.cwl">main.cwl</a>
+> * <a href="../assets/answers/ep5/part2/alignment.cwl">alignment.cwl</a>
+> * <a href="../assets/answers/ep5/part2/featureCounts.cwl">featureCounts.cwl</a>
+{: .solution}
+
+# Input parameter lists
  
  The `fq` parameter needs to be a list.  You write a list in yaml by
  starting each list item with a dash.  Example `main-input.yaml`
@@ -148,20 +179,25 @@ gtf:
    class: File
    location: rnaseq/reference_data/chr1-hg19_genes.gtf
  ```
+{: .language-yaml }
  
-Now you can run the workflow the same way as in Lesson 2.
+> ## Running the workflow
+>
+> Run this workflow.  You will now get results for each one of the
+> input fastq files.
+>
+{: .challenge }
  
-### 4. Combining results
+# Combining results
  
-Each instance of the alignment workflow produces its own featureCounts
-file.  However, to be able to compare results easily, we need them a
-single file with all the results.
+Each instance of the alignment workflow produces its own
+`featurecounts.tsv` file.  However, to be able to compare results
+easily, we would like a single file with all the results.
  
-The easiest way to do this is to run `featureCounts` just once at the
-end of the workflow, with all the bam files listed on the command
-line.
+We can modify the workflow to run `featureCounts` once at the end of
+the workflow, taking all the bam files listed on the command line.
  
-We'll need to modify a few things.
+We will need to change a few things.
  
  First, in `featureCounts.cwl` we need to modify it to accept either a
  single bam file or list of bam files.
@@ -173,6 +209,7 @@ inputs:
     - File
     - File[]
  ```
+{: .language-yaml }
  
  Second, in `alignment.cwl` we need to remove the `featureCounts` step from alignment.cwl, as well as the `featurecounts` output parameter.
  
@@ -199,6 +236,7 @@ steps:
        gtf: gtf
      out: [featurecounts]
  ```
+{: .language-yaml }
  
  Last, we modify the `featurecounts` output parameter.  Instead of a
  list of files produced by the `alignment` step, it is now a single
@@ -211,5 +249,18 @@ outputs:
      type: File
      outputSource: featureCounts/featurecounts
  ```
-
-Run this workflow to get a single `featurecounts.tsv` file with a column for each bam file.
+{: .language-yaml }
+
+> ## Running the workflow
+>
+> Run this workflow.  You will still have separate results from fastq
+> and STAR, but now you will only have a single
+> `featurecounts.tsv` file with a column for each bam file.
+>
+{: .challenge }
+
+> ## Episode solution
+> * <a href="../assets/answers/ep5/part4/main.cwl">main.cwl</a>
+> * <a href="../assets/answers/ep5/part4/alignment.cwl">alignment.cwl</a>
+> * <a href="../assets/answers/ep5/part4/featureCounts.cwl">featureCounts.cwl</a>
+{: .solution}
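
For orientation, the outer `main.cwl` at the end of the "Scattering" part might look roughly like the sketch below. It is assembled from the fragments visible in the hunks above. The `cwlVersion` value, the `run:` line, the `fq` and `genome` input mappings, and the list types of the first two outputs fall outside the diff context and are assumptions, so check it against the linked Part 2 solution rather than treating it as the committed file.

```yaml
# Sketch only: lines marked "assumed" are not shown in the diff context.
cwlVersion: v1.2            # assumed; use the version from the earlier lessons
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}   # the step runs another workflow
  ScatterFeatureRequirement: {}       # the step is scattered over a list

inputs:
  fq: File[]                # a list of fastq files instead of a single File
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl      # assumed; the copy of the original main.cwl
    scatter: fq             # run the subworkflow once per item in fq
    in:
      fq: fq                # assumed mapping
      genome: genome        # assumed mapping
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]

outputs:
  qc_html:
    type: File[]            # assumed; scatter turns each output into a list
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]            # assumed, for the same reason
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```

In the "Combining results" part, the `featurecounts` output goes back to a single `File` sourced from `featureCounts/featurecounts`, as the last hunk shows.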