Second pass. Lots of work.

diff --git a/_episodes/04-commandlinetool.md b/_episodes/04-commandlinetool.md

index cae16826d6d11d8308e038fc1a7780be6d2b9a0f..22110fad4e51af48ea772a381d082535c311b160 100644
--- a/_episodes/04-commandlinetool.md
+++ b/_episodes/04-commandlinetool.md
@@ -1,47 +1,54 @@
  ---
-title: "Writing a tool wrapper"
-teaching: 0
-exercises: 0
+title: "Writing a Tool Wrapper"
+teaching: 15
+exercises: 20
  questions:
-- "Key question (FIXME)"
+- "What are the key components of a tool wrapper?"
+- "How do I use software containers to supply the software I want to run?"
  objectives:
-- "First learning objective. (FIXME)"
+- "Write a tool wrapper for the featureCounts tool."
+- "Find a software container that has the software we want to use."
+- "Add the tool wrapper to our main workflow."
  keypoints:
-- "First key point. Brief Answer to questions. (FIXME)"
+- "The key components of a command line tool wrapper are the header, inputs, baseCommand, arguments, and outputs."
+- "Like workflows, CommandLineTools have `inputs` and `outputs`."
+- "Use `baseCommand` and `arguments` to provide the program to run and the command line arguments to run it with."
+- "Use `glob` to capture output files and assign them to output parameters."
+- "Use DockerRequirement to supply the name of the Docker image that contains the software to run."
  ---
  
  It is time to add the last step in the analysis.
  
+```
+# Count mapped reads
+featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
+```
+{: .language-bash }
+
  This will use the "featureCounts" tool from the "subread" package.
  
-# 1. File header
+# File header
  
  Create a new file "featureCounts.cwl"
  
-Start with this header
+Let's start with the header.  This is very similar to the workflow, except that we use `class: CommandLineTool`.
  
  ```
  cwlVersion: v1.2
  class: CommandLineTool
+label: featureCounts tool
  ```
+{: .language-yaml }
  
-# 2. Command line tool inputs
+# Command line tool inputs
  
  A CommandLineTool describes a single invocation of a command line program.
  
-It consumes some input parameters, runs a program, and produce output
-values.
-
-Here is the original shell command:
-
-```
-featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
-```
+It consumes some input parameters, runs a program, and captures
+output, mainly in the form of files produced by the program.
  
  The variables used in the bash script are `$cores`, `$gtf`, `$counts` and `$counts_input_bam`.
  
-The parameters
-
  This gives us two file inputs, `gtf` and `counts_input_bam` which we can declare in our `inputs` section:
  
  ```
@@ -49,27 +56,22 @@ inputs:
    gtf: File
    counts_input_bam: File
  ```
+{: .language-yaml }
  
-# 3. Specifying the program to run
+# Specifying the program to run
  
  Give the name of the program to run in `baseCommand`.
  
  ```
  baseCommand: featureCounts
  ```
+{: .language-yaml }
  
-# 4. Command arguments
+# Command arguments
  
  The easiest way to describe the command line is with an `arguments`
  section.  This takes a comma-separated list of command line arguments.
  
-Input variables are included on the command line as
-`$(inputs.name_of_parameter)`.  When the tool is executed, these input
-parameter values are substituted for these variable.
-
-Special variables are also available.  The runtime environment
-describes the resources allocated to running the program.  Here we use
-`$(runtime.cores)` to decide how many threads to request.
  
  ```
  arguments: [-T, $(runtime.cores),
@@ -77,8 +79,42 @@ arguments: [-T, $(runtime.cores),
              -o, featurecounts.tsv,
              $(inputs.counts_input_bam)]
  ```
+{: .language-yaml }
  
-# 5. Outputs section
+Input variables are included on the command line as
+`$(inputs.name_of_parameter)`.  When the tool is executed, the
+variables will be replaced with the input parameter values.
+
+There are also some special variables.  The `runtime` object describes
+the resources allocated to running the program.  Here we use
+`$(runtime.cores)` to decide how many threads to request.
+
+> ## `arguments` vs `inputBinding`
+>
+> You may recall from examining the existing fastqc and STAR tool
+> wrappers in lesson 2 that another way to express command line
+> parameters is with `inputBinding` and `prefix` on individual input parameters.
+>
+> ```
+> inputs:
+>   parametername:
+>     type: parametertype
+>     inputBinding:
+>       prefix: --some-option
+> ```
+> {: .language-yaml }
+>
+> We use `arguments` in the example simply because it is easier to see
+> how it lines up with the source shell script.
+>
+> You can use both `inputBinding` and `arguments` in the same
+> CommandLineTool document.  There is no "right" or "wrong" way, and
+> one does not override the other; they are combined to produce the
+> final command line invocation.
+>
+{: .callout}
+
+# Outputs section
  
  In CWL, you must explicitly identify the outputs of a program.  This
  associates output parameters with specific files, and enables the
@@ -102,38 +138,50 @@ outputs:
      outputBinding:
        glob: featurecounts.tsv
  ```
+{: .language-yaml }
  
-# 6. Running in a container
+# Running in a container
  
  In order to run the tool, it needs to be installed.
  Using software containers, a tool can be pre-installed into a
  compatible runtime environment, and that runtime environment (called a
  container image) can be downloaded and run on demand.
  
-Many bioinformatics tools are already available as containers.  One
-resource is the BioContainers project.  Let's find the "subread" software:
-
-   1. Visit https://biocontainers.pro/
-   2. Click on "Registry"
-   3. Search for "subread"
-   4. Click on the search result for "subread"
-   5. Click on the tab "Packages and Containers"
-   6. Choose a row with type "docker", then on the right side of the "Full
-Tag" column for that row, click the "copy to clipboard" button.
-
-To declare that you want to run inside a container, create a section
-called `hints` with a subsection `DockerRequirement`.  Under
-`DockerRequirement`, paste the text your copied in the above step.
-Replace the text `docker pull` to `dockerPull:` and indent it so it is
-in the `DockerRequirement` section.
-
-```
-hints:
-  DockerRequirement:
-    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
-```
-
-# 7. Running a tool on its own
+Although plain CWL does not _require_ the use of containers, many
+popular platforms that run CWL do require the software be supplied in
+the form of a container image.
+
+> ## Finding container images
+>
+> Many bioinformatics tools are already available as containers.  One
+> resource is the BioContainers project.  Let's find the "subread" software:
+>
+>   1. Visit [https://biocontainers.pro/](https://biocontainers.pro/)
+>   2. Click on "Registry"
+>   3. Search for "subread"
+>   4. Click on the search result for "subread"
+>   5. Click on the tab "Packages and Containers"
+>   6. Choose a row with type "docker", then on the right side of the "Full
+> Tag" column for that row, click the "copy to clipboard" button.
+>
+> To declare that you want to run inside a container, add a section
+> called `hints` to your tool document.  Under `hints`, add a
+> subsection `DockerRequirement`.  Under `DockerRequirement`, paste
+> the text you copied in the above step.  Replace the text `docker
+> pull` with `dockerPull:` and ensure it is indented twice so it is a
+> field of `DockerRequirement`.
+>
+> > ## Answer
+> > ```
+> > hints:
+> >   DockerRequirement:
+> >     dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
+> > ```
+> > {: .language-yaml }
+> {: .solution}
+{: .challenge}
+
+# Running a tool on its own
  
  When creating a tool wrapper, it is helpful to run it on its own to test it.
  
@@ -150,43 +198,53 @@ gtf:
    class: File
    location: rnaseq/reference_data/chr1-hg19_genes.gtf
  ```
+{: .language-yaml }
  
  The invocation is also the same:
  
  ```
  cwl-runner featureCounts.cwl featureCounts.yaml
  ```
-
-# 8. Adding it to the workflow
-
-Now that we have confirmed that it works, we can add it to our workflow.
-We add it to `steps`, connecting the output of samtools to
-`counts_input_bam` and the `gtf` taking the workflow input of the same
-name.
-
-```
-steps:
-  ...
-  featureCounts:
-    requirements:
-      ResourceRequirement:
-        ramMin: 500
-    run: featureCounts.cwl
-    in:
-      counts_input_bam: samtools/bam_sorted_indexed
-      gtf: gtf
-    out: [featurecounts]
-```
-
-We will add the result from featurecounts to the output:
-
-```
-outputs:
-  ...
-  featurecounts:
-    type: File
-    outputSource: featureCounts/featurecounts
-```
-
-You should now be able to re-run the workflow and it will run the
-"featureCounts" step and include "featurecounts" in the output.
+{: .language-bash }
+
+# Adding it to the workflow
+
+> ## Exercise
+>
+> Now that we have confirmed that the tool wrapper works, it is time
+> to add it to our workflow.
+>
+>   1. Add a new step called `featureCounts` that runs our tool
+>      wrapper.  The new step should take input from
+>      `samtools/bam_sorted_indexed`, and should be allocated a
+>      minimum of 500 MB of RAM.
+>   2. Add a new output parameter for the workflow called
+>      `featurecounts`.  The output source should come from the output
+>      of the new `featureCounts` step.
+>   3. When you have an answer, run the updated workflow, which
+>      should run the "featureCounts" step and produce the
+>      "featurecounts" output parameter.
+>
+> > ## Answer
+> > ```
+> > steps:
+> >   ...
+> >   featureCounts:
+> >     requirements:
+> >       ResourceRequirement:
+> >         ramMin: 500
+> >     run: featureCounts.cwl
+> >     in:
+> >       counts_input_bam: samtools/bam_sorted_indexed
+> >       gtf: gtf
+> >     out: [featurecounts]
+> >
+> > outputs:
+> >   ...
+> >   featurecounts:
+> >     type: File
+> >     outputSource: featureCounts/featurecounts
+> > ```
+> > {: .language-yaml }
+> {: .solution}
+{: .challenge}