---
title: "Writing a tool wrapper"
teaching: 0
exercises: 0
questions:
- "Key question (FIXME)"
objectives:
- "First learning objective. (FIXME)"
keypoints:
- "First key point. Brief Answer to questions. (FIXME)"
---

It is time to add the last step in the analysis.

This step uses the "featureCounts" tool from the "subread" package.

# 1. File header

Create a new file called `featureCounts.cwl`.

Start with this header:

```
cwlVersion: v1.2
class: CommandLineTool
```

# 2. Command line tool inputs

A CommandLineTool describes a single invocation of a command line program.

It consumes some input parameters, runs a program, and produces output
values.

Here is the original shell command:

```
featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
```

The variables used in the bash script are `$cores`, `$gtf`, `$counts` and `$counts_input_bam`.

Two of these, `$gtf` and `$counts_input_bam`, refer to input files.  The other two, `$cores` and `$counts`, set the thread count and the output file name, and are handled in later sections.

This gives us two file inputs, `gtf` and `counts_input_bam`, which we can declare in our `inputs` section:

```
inputs:
  gtf: File
  counts_input_bam: File
```
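
The short form `name: Type` can also be written in an expanded form, which leaves room for extra fields such as `doc`.  A minimal sketch, equivalent to the declaration above; the `doc` strings are illustrative wording and do not come from the original script:

```yaml
inputs:
  gtf:
    type: File
    # Illustrative description (assumption, not from the original script)
    doc: "Gene annotation file, passed to featureCounts with -a"
  counts_input_bam:
    type: File
    # Illustrative description (assumption, not from the original script)
    doc: "BAM file whose reads are counted"
```

Both forms declare the same two `File` inputs, so either can be used in `featureCounts.cwl`.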

# 3. Specifying the program to run

Give the name of the program to run in `baseCommand`.

```
baseCommand: featureCounts
```

# 4. Command arguments

The easiest way to describe the command line is with an `arguments`
section.  This takes a list of command line arguments, written here as
a comma-separated list in square brackets.

Input variables are included on the command line as
`$(inputs.name_of_parameter)`.  When the tool is executed, the input
parameter values are substituted for these variables.

Special variables are also available.  The runtime environment
describes the resources allocated to running the program.  Here we use
`$(runtime.cores)` to decide how many threads to request.

```
arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]
```
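
The same list can also be written in YAML block style, one argument per line, which can be easier to extend.  The sketch below additionally includes the `-s 2` option from the original shell command, which the list above leaves out; whether your analysis needs it is an assumption to check.  The value is quoted so that it is passed as a string rather than a number:

```yaml
arguments:
  - -T
  - $(runtime.cores)            # number of cores allocated by the runner
  - -s
  - "2"                         # assumption: carried over from the original shell command
  - -a
  - $(inputs.gtf)
  - -o
  - featurecounts.tsv           # output file name, collected in the outputs section
  - $(inputs.counts_input_bam)
```

Use one form or the other in the tool file, not both.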

# 5. Outputs section

In CWL, you must explicitly identify the outputs of a program.  This
associates output parameters with specific files, and enables the
workflow runner to know which files must be saved and which files can
be discarded.

In the previous section, we told the featureCounts program that the name of
its output file should be `featurecounts.tsv`.

We can declare an output parameter called `featurecounts` that will
have that output file as its value.

The `outputBinding` section describes how to determine the value of
the parameter.  The `glob` field tells the runner to search the
output directory for a file called `featurecounts.tsv`.

```
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
```

# 6. Running in a container

In order to run the tool, it needs to be installed.
Using software containers, a tool can be pre-installed into a
compatible runtime environment, and that runtime environment (called a
container image) can be downloaded and run on demand.

Many bioinformatics tools are already available as containers.  One
resource is the BioContainers project.  Let's find the "subread" software:

   1. Visit https://biocontainers.pro/
   2. Click on "Registry"
   3. Search for "subread"
   4. Click on the search result for "subread"
   5. Click on the tab "Packages and Containers"
   6. Choose a row with type "docker", then on the right side of the "Full
      Tag" column for that row, click the "copy to clipboard" button.

To declare that you want to run inside a container, create a section
called `hints` with a subsection `DockerRequirement`.  Under
`DockerRequirement`, paste the text you copied in the above step.
Replace the text `docker pull` with `dockerPull:` and indent it so it is
in the `DockerRequirement` section.

```
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
```
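
Putting the pieces from sections 1 to 6 together, the complete `featureCounts.cwl` would look roughly like this.  Nothing new is introduced; only the comments are added for orientation:

```yaml
cwlVersion: v1.2
class: CommandLineTool

# Files the tool needs as input
inputs:
  gtf: File
  counts_input_bam: File

# Program to run, followed by its arguments
baseCommand: featureCounts
arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]

# Collect the file written via the -o option
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv

# Container image that provides featureCounts
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
```

The order of the top-level sections does not matter to the runner; they are listed here in the order they were introduced.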

# 7. Running a tool on its own

When creating a tool wrapper, it is helpful to run it on its own to test it.

The input to a single tool is the same kind of input parameters file
that we used as input to a workflow in the previous lesson.

featureCounts.yaml:

```
counts_input_bam:
  class: File
  location: Aligned.sortedByCoord.out.bam
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

The invocation is also the same:

```
cwl-runner featureCounts.cwl featureCounts.yaml
```

# 8. Adding it to the workflow

Now that we have confirmed that it works, we can add it to our workflow.
We add it to `steps`, connecting the output of samtools to
`counts_input_bam`, with `gtf` taking the workflow input of the same
name.  The `ResourceRequirement` asks for at least 500 MiB of RAM for
this step.

```
steps:
  ...
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: samtools/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```
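
Each entry under `in` is shorthand for a mapping with a `source` field.  As a sketch, the same step written in the expanded form looks like this; the entry sits under `steps` alongside the steps already in the workflow, which are omitted here:

```yaml
featureCounts:
  run: featureCounts.cwl
  requirements:
    ResourceRequirement:
      ramMin: 500            # mebibytes
  in:
    counts_input_bam:
      source: samtools/bam_sorted_indexed   # output of the samtools step
    gtf:
      source: gtf                           # workflow-level input of the same name
  out: [featurecounts]
```

The short form is usually enough; the expanded form is only needed when an input takes additional fields besides `source`.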

We will add the result from the `featureCounts` step to the workflow outputs:

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```

You should now be able to re-run the workflow and it will run the
"featureCounts" step and include "featurecounts" in the output.
 192 "featureCounts" step and include "featurecounts" in the output.