_episodes/04-commandlinetool.md

---
title: "Writing a Tool Wrapper"
teaching: 20
exercises: 30
questions:
- "What are the key components of a tool wrapper?"
- "How do I use software containers to supply the software I want to run?"
objectives:
- "Write a tool wrapper for the featureCounts tool."
- "Find a software container that has the software we want to use."
- "Add the tool wrapper to our main workflow."
keypoints:
- "The key components of a command line tool wrapper are the header, inputs, baseCommand, arguments, and outputs."
- "Like workflows, CommandLineTools have `inputs` and `outputs`."
- "Use `baseCommand` and `arguments` to provide the program to run and the command line arguments to run it with."
- "Use `glob` to capture output files and assign them to output parameters."
- "Use DockerRequirement to supply the name of the Docker image that contains the software to run."
---

It is time to add the last step in the analysis.

```
# Count mapped reads
featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
```
{: .language-bash }

This will use the "featureCounts" tool from the "subread" package.

# File header

A CommandLineTool describes a single invocation of a command line
program.  It consumes some input parameters, runs a program, and
captures output, mainly in the form of files produced by the
program.

Create a new file called `featureCounts.cwl`.

Let's start with the header.  This is very similar to the workflow header, except that we use `class: CommandLineTool`.

```
cwlVersion: v1.2
class: CommandLineTool
label: featureCounts tool
```
{: .language-yaml }

# Command line tool inputs

The `inputs` section describes input parameters with the same form as
the Workflow `inputs` section.

> ## Exercise
>
> The variables used in the bash script are `$cores`, `$gtf`, `$counts` and `$counts_input_bam`.
>
> * `$cores` is the number of CPU cores to use.
> * `$gtf` is the input .gtf file.
> * `$counts` is the name we will give to the output file.
> * `$counts_input_bam` is the input .bam file.
>
> Write the `inputs` section for the File inputs `gtf` and `counts_input_bam`.
>
> > ## Solution
> > ```
> > inputs:
> >   gtf: File
> >   counts_input_bam: File
> > ```
> > {: .language-yaml }
> {: .solution}
{: .challenge}

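The short form `gtf: File` can also be written in a longer form that leaves room for extra fields on each parameter.  The sketch below shows one way this might look, using the `doc` field for a human-readable description.  The description text is illustrative wording and is not taken from the lesson.

```
inputs:
  gtf:
    type: File
    # Illustrative description; the wording is an assumption
    doc: Gene annotation in GTF format
  counts_input_bam:
    type: File
    # Illustrative description; the wording is an assumption
    doc: Aligned reads in BAM format
```
{: .language-yaml }

Both forms declare the same two File inputs, so either can be used in the rest of this episode.
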
# Specifying the program to run

Give the name of the program to run in `baseCommand`.

```
baseCommand: featureCounts
```
{: .language-yaml }

# Command arguments

The easiest way to describe the command line is with an `arguments`
section.  This takes a list of command line arguments, written here as
a comma-separated YAML list in square brackets.

```
arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]
```
{: .language-yaml }

Input variables are included on the command line as
`$(inputs.name_of_parameter)`.  When the tool is executed, the
variables will be replaced with the input parameter values.

There are also some special variables.  The `runtime` object describes
the resources allocated to running the program.  Here we use
`$(runtime.cores)` to tell featureCounts how many threads to use.

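The value of `$(runtime.cores)` is decided by the workflow runner.  A tool can state how many cores it would like with a `ResourceRequirement`, as in the minimal sketch below.  The value `2` is an arbitrary illustration, not a recommendation from the lesson, and how a runner behaves when the request cannot be met depends on the runner.

```
requirements:
  ResourceRequirement:
    # Illustrative value (an assumption): request at least 2 cores
    coresMin: 2
```
{: .language-yaml }

With this in place, `-T $(runtime.cores)` passes the number of cores the runner actually reserved for the job to featureCounts.
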
> ## `arguments` vs `inputBinding`
>
> You may recall from examining the existing fastqc and STAR tool
> wrappers in lesson 2 that another way to express command line
> parameters is with `inputBinding` and `prefix` on individual input
> parameters.
>
> ```
> inputs:
>   parametername:
>     type: parametertype
>     inputBinding:
>       prefix: --some-option
> ```
> {: .language-yaml }
>
> We use `arguments` in the example simply because it is easier to see
> how it lines up with the source shell script.
>
> You can use both `inputBinding` and `arguments` in the same
> CommandLineTool document.  There is no "right" or "wrong" way, and
> one does not override the other; they are combined to produce the
> final command line invocation.
>
{: .callout}

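As a rough illustration of how the two styles combine, the featureCounts command line could be split between `arguments` and `inputBinding` as sketched below.  This is not the form used in the rest of this episode.  The explicit `position` values are a choice made here so that the input files are placed after the fixed options.

```
inputs:
  gtf:
    type: File
    inputBinding:
      prefix: -a      # rendered as: -a <path to the gtf file>
      position: 1
  counts_input_bam:
    type: File
    inputBinding:
      position: 2     # no prefix, so only the file path is added

# Fixed options stay in arguments, which sort at position 0 by default
arguments: [-T, $(runtime.cores), -o, featurecounts.tsv]
```
{: .language-yaml }

The options appear in a different order than in the `arguments`-only version, but the same flags and files are passed to featureCounts.
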
# Outputs section

In CWL, you must explicitly identify the outputs of a program.  This
associates output parameters with specific files, and enables the
workflow runner to know which files must be saved and which files can
be discarded.

In the previous section, we told the featureCounts program that the
name of our output file should be `featurecounts.tsv`.

We can declare an output parameter called `featurecounts` that will
have that output file as its value.

The `outputBinding` section describes how to determine the value of
the parameter.  The `glob` field tells it to search for a file in the
output directory called `featurecounts.tsv`.

```
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
```
{: .language-yaml }

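A tool can declare more than one output, each with its own `glob`.  The sketch below assumes that featureCounts also writes a summary file named `featurecounts.tsv.summary` next to the counts file.  The parameter name `featurecounts_summary` is hypothetical and is not used elsewhere in this lesson.

```
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
  # Hypothetical extra output; assumes a ".summary" file is written
  featurecounts_summary:
    type: File?       # optional: the value is null if no file matches
    outputBinding:
      glob: featurecounts.tsv.summary
```
{: .language-yaml }

Declaring the second output as `File?` means the tool does not fail if the summary file is absent.
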
# Running in a container

Before the tool can be run, it needs to be installed.
Using software containers, a tool can be pre-installed into a
compatible runtime environment, and that runtime environment (called a
container image) can be downloaded and run on demand.

Although plain CWL does not _require_ the use of containers, many
popular platforms that run CWL do require the software to be supplied
in the form of a container image.

> ## Finding container images
>
> Many bioinformatics tools are already available as containers.  One
> resource is the BioContainers project.  Let's find the "subread" software:
>
>   1. Visit [https://biocontainers.pro/](https://biocontainers.pro/)
>   2. Click on "Registry"
>   3. Search for "subread"
>   4. Click on the search result for "subread"
>   5. Click on the tab "Packages and Containers"
>   6. Choose a row with type "docker", then on the right side of the
>      "Full Tag" column for that row, click the "copy to clipboard" button.
>
> To declare that you want to run inside a container, add a section
> called `hints` to your tool document.  Under `hints`, add a
> subsection `DockerRequirement`.  Under `DockerRequirement`, paste
> the text you copied in the above step.  Replace the text
> `docker pull` with `dockerPull:` and ensure it is indented twice so
> that it is a field of `DockerRequirement`.
>
> > ## Answer
> > ```
> > hints:
> >   DockerRequirement:
> >     dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
> > ```
> > {: .language-yaml }
> {: .solution}
{: .challenge}

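Putting the pieces from the sections above together, the complete `featureCounts.cwl` might look like the following.  The order of the top-level sections is a choice made here for readability; CWL does not depend on it.

```
cwlVersion: v1.2
class: CommandLineTool
label: featureCounts tool

# Container image that provides the featureCounts program
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0

inputs:
  gtf: File
  counts_input_bam: File

baseCommand: featureCounts

arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]

# Capture the file named with -o above
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
```
{: .language-yaml }
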
# Running a tool on its own

When creating a tool wrapper, it is helpful to run it on its own to test it.

The input to a single tool is the same kind of input parameters file
that we used as input to a workflow in the previous lesson.

`featureCounts.yaml`

```
counts_input_bam:
  class: File
  location: Aligned.sortedByCoord.out.bam
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```
{: .language-yaml }

> ## Running the tool
>
> Run the tool on its own to confirm that it behaves correctly:
>
> ```
> cwl-runner featureCounts.cwl featureCounts.yaml
> ```
> {: .language-bash }
{: .challenge }

# Adding it to the workflow

Now that we have confirmed that the tool wrapper works, it is time to
add it to our workflow.

> ## Exercise
>
>   1. Add a new step called `featureCounts` that runs our tool
>      wrapper.  The new step should take input from
>      `samtools/bam_sorted_indexed`, and should be allocated a
>      minimum of 500 MB of RAM.
>   2. Add a new output parameter for the workflow called
>      `featurecounts`.  The output source should come from the output
>      of the new `featureCounts` step.
>   3. When you have an answer, run the updated workflow, which
>      should run the `featureCounts` step and produce the
>      `featurecounts` output parameter.
>
> > ## Answer
> > ```
> > steps:
> >   ...
> >   featureCounts:
> >     requirements:
> >       ResourceRequirement:
> >         ramMin: 500
> >     run: featureCounts.cwl
> >     in:
> >       counts_input_bam: samtools/bam_sorted_indexed
> >       gtf: gtf
> >     out: [featurecounts]
> >
> > outputs:
> >   ...
> >   featurecounts:
> >     type: File
> >     outputSource: featureCounts/featurecounts
> > ```
> > {: .language-yaml }
> {: .solution}
{: .challenge}

> ## Episode solution
> * <a href="../assets/answers/ep4/main.cwl">main.cwl</a>
> * <a href="../assets/answers/ep4/featureCounts.cwl">featureCounts.cwl</a>
{: .solution}