_episodes/04-commandlinetool.md

---
title: "Writing a Tool Wrapper"
teaching: 15
exercises: 20
questions:
- "What are the key components of a tool wrapper?"
- "How do I use software containers to supply the software I want to run?"
objectives:
- "Write a tool wrapper for the featureCounts tool."
- "Find a software container that has the software we want to use."
- "Add the tool wrapper to our main workflow."
keypoints:
- "The key components of a command line tool wrapper are the header, inputs, baseCommand, arguments, and outputs."
- "Like workflows, CommandLineTools have `inputs` and `outputs`."
- "Use `baseCommand` and `arguments` to provide the program to run and the command line arguments to run it with."
- "Use `glob` to capture output files and assign them to output parameters."
- "Use DockerRequirement to supply the name of the Docker image that contains the software to run."
---

It is time to add the last step in the analysis.

```
# Count mapped reads
featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
```
{: .language-bash }

This will use the "featureCounts" tool from the "subread" package.

# File header

Create a new file called `featureCounts.cwl`.

Let's start with the header.  This is very similar to the workflow, except that we use `class: CommandLineTool`.

```
cwlVersion: v1.2
class: CommandLineTool
label: featureCounts tool
```
{: .language-yaml }

# Command line tool inputs

A CommandLineTool describes a single invocation of a command line program.

It consumes some input parameters, runs a program, and captures
output, mainly in the form of files produced by the program.

The variables used in the bash script are `$cores`, `$gtf`, `$counts` and `$counts_input_bam`.

Two of these, `gtf` and `counts_input_bam`, are files that the program reads.
This gives us two file inputs, which we can declare in our `inputs` section:

```
inputs:
  gtf: File
  counts_input_bam: File
```
{: .language-yaml }
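
The short form `gtf: File` can also be written out in full, which leaves room for a description of each parameter. The sketch below is equivalent to the declaration above; the `doc` strings are illustrative wording added here, not text from the original tool.

```
inputs:
  gtf:
    type: File
    # doc text is an assumption, written for illustration
    doc: Gene annotation in GTF format, passed to featureCounts with -a
  counts_input_bam:
    type: File
    doc: Aligned reads in BAM format whose mapped reads will be counted
```
{: .language-yaml }

Either form should work; the long form is mainly useful once a parameter needs more than a type.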

# Specifying the program to run

Give the name of the program to run in `baseCommand`.

```
baseCommand: featureCounts
```
{: .language-yaml }

# Command arguments

The easiest way to describe the command line is with an `arguments`
section.  This takes a list of command line arguments, written here
as a comma-separated YAML list in square brackets.

```
arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]
```
{: .language-yaml }
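
The same list can be written one item per line, which makes it easier to compare against the shell command at the top of this episode. That command also passes `-s 2` (reverse-stranded counting), which the list above leaves out. The sketch below adds it back, on the assumption that you want the wrapper to mirror the script exactly; whether that setting is appropriate depends on how your library was prepared.

```
arguments:
  - "-T"                          # -T $cores
  - $(runtime.cores)
  - "-s"                          # -s 2: in the script, not in the list above
  - "2"                           # quoted so that YAML treats it as a string
  - "-a"                          # -a $gtf
  - $(inputs.gtf)
  - "-o"                          # -o $counts
  - featurecounts.tsv
  - $(inputs.counts_input_bam)    # $counts_input_bam
```
{: .language-yaml }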

Input variables are included on the command line as
`$(inputs.name_of_parameter)`.  When the tool is executed, the
variables will be replaced with the input parameter values.

There are also some special variables.  The `runtime` object describes
the resources allocated to running the program.  Here we pass
`$(runtime.cores)` to `-T` to set how many threads featureCounts uses.
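
The value of `runtime.cores` is chosen by the workflow runner. If the tool benefits from several threads, one way to ask for them is a `ResourceRequirement` in the tool document. This is a minimal sketch; the value `4` is an arbitrary number picked for illustration, and the number of cores actually allocated remains up to the runner.

```
requirements:
  ResourceRequirement:
    coresMin: 4    # illustrative value; request at least this many cores
```
{: .language-yaml }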

> ## `arguments` vs `inputBinding`
>
> You may recall from examining the existing fastqc and STAR tool
> wrappers in lesson 2 that another way to express command line parameters
> is with `inputBinding` and `prefix` on individual input parameters.
>
> ```
> inputs:
>   parametername:
>     type: parametertype
>     inputBinding:
>       prefix: --some-option
> ```
> {: .language-yaml }
>
> We use `arguments` in the example simply because it is easier to see
> how it lines up with the source shell script.
>
> You can use both `inputBinding` and `arguments` in the same
> CommandLineTool document.  There is no "right" or "wrong" way, and
> one does not override the other; they are combined to produce the
> final command line invocation.
>
{: .callout}
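
As a rough illustration of how the two styles combine, the sketch below moves the two file inputs onto `inputBinding` and leaves the remaining options in `arguments`. It is not the form used in the rest of this lesson. The explicit `position` values are an assumption made here so that the BAM file stays last on the command line.

```
inputs:
  gtf:
    type: File
    inputBinding:
      prefix: -a
      position: 1      # after the entries in `arguments` (default position 0)
  counts_input_bam:
    type: File
    inputBinding:
      position: 2      # no prefix: placed last, as a bare file path

arguments: [-T, $(runtime.cores),
            -o, featurecounts.tsv]
```
{: .language-yaml }

With these positions, the resulting command should have the shape `featureCounts -T <cores> -o featurecounts.tsv -a <gtf> <bam>`.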

# Outputs section

In CWL, you must explicitly identify the outputs of a program.  This
associates output parameters with specific files, and enables the
workflow runner to know which files must be saved and which files can
be discarded.

In the previous section, we told the featureCounts program that the name of
our output file should be `featurecounts.tsv`.

We can declare an output parameter called `featurecounts` that will
have that output file as its value.

The `outputBinding` section describes how to determine the value of
the parameter.  The `glob` field tells it to search for a file in the
output directory called `featurecounts.tsv`.

```
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
```
{: .language-yaml }
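
Any file that is not matched by an output parameter may be discarded. featureCounts normally writes a summary file next to its main output, named after the output file with `.summary` appended. If you want to keep it, a second output parameter could capture it, as in this sketch. The name `featurecounts_summary` is hypothetical, and the type is marked optional (`File?`) in case your version of the tool does not produce the file.

```
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
  featurecounts_summary:          # hypothetical output name
    type: File?                   # optional: null if no file matches
    outputBinding:
      glob: featurecounts.tsv.summary
```
{: .language-yaml }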

# Running in a container

In order to run the tool, it needs to be installed.
Using software containers, a tool can be pre-installed into a
compatible runtime environment, and that runtime environment (called a
container image) can be downloaded and run on demand.

Although plain CWL does not _require_ the use of containers, many
popular platforms that run CWL do require that the software be supplied in
the form of a container image.

> ## Finding container images
>
> Many bioinformatics tools are already available as containers.  One
> resource is the BioContainers project.  Let's find the "subread" software:
>
>   1. Visit [https://biocontainers.pro/](https://biocontainers.pro/)
>   2. Click on "Registry"
>   3. Search for "subread"
>   4. Click on the search result for "subread"
>   5. Click on the tab "Packages and Containers"
>   6. Choose a row with type "docker", then on the right side of the "Full
>      Tag" column for that row, click the "copy to clipboard" button.
>
> To declare that you want to run inside a container, add a section
> called `hints` to your tool document.  Under `hints` add a
> subsection `DockerRequirement`.  Under `DockerRequirement`, paste
> the text you copied in the above step.  Replace the text
> `docker pull` with `dockerPull:` and ensure it is indented twice so
> that it is a field of `DockerRequirement`.
>
> > ## Answer
> > ```
> > hints:
> >   DockerRequirement:
> >     dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
> > ```
> > {: .language-yaml }
> {: .solution}
{: .challenge}
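
Putting the pieces from this episode together, the complete `featureCounts.cwl` might look like the sketch below. Every section is copied from the fragments above. The image tag is the one shown in the answer, so the tag you copied from the registry may differ.

```
cwlVersion: v1.2
class: CommandLineTool
label: featureCounts tool

# Container image that provides the featureCounts program
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0

inputs:
  gtf: File
  counts_input_bam: File

baseCommand: featureCounts

arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]

# Capture the file named with -o as the tool's output
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
```
{: .language-yaml }

The order of the top-level sections does not matter to CWL; they are grouped here in roughly the order they were introduced.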

# Running a tool on its own

When creating a tool wrapper, it is helpful to run it on its own to test it.

The input to a single tool is the same kind of input parameters file
that we used as input to a workflow in the previous lesson.

featureCounts.yaml:

```
counts_input_bam:
  class: File
  location: Aligned.sortedByCoord.out.bam
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```
{: .language-yaml }

The invocation is also the same:

```
cwl-runner featureCounts.cwl featureCounts.yaml
```
{: .language-bash }

# Adding it to the workflow

> ## Exercise
>
> Now that we have confirmed that the tool wrapper works, it is time
> to add it to our workflow.
>
>   1. Add a new step called `featureCounts` that runs our tool
>      wrapper.  The new step should take input from
>      `samtools/bam_sorted_indexed`, and should be allocated a
>      minimum of 500 MB of RAM.
>   2. Add a new output parameter for the workflow called
>      `featurecounts`.  The output source should come from the output
>      of the new `featureCounts` step.
>   3. When you have an answer, run the updated workflow, which
>      should run the `featureCounts` step and produce the
>      `featurecounts` output parameter.
>
> > ## Answer
> > ```
> > steps:
> >   ...
> >   featureCounts:
> >     requirements:
> >       ResourceRequirement:
> >         ramMin: 500
> >     run: featureCounts.cwl
> >     in:
> >       counts_input_bam: samtools/bam_sorted_indexed
> >       gtf: gtf
> >     out: [featurecounts]
> >
> > outputs:
> >   ...
> >   featurecounts:
> >     type: File
> >     outputSource: featureCounts/featurecounts
> > ```
> > {: .language-yaml }
> {: .solution}
{: .challenge}