lesson3/lesson3.md

# Writing a tool wrapper

It is time to add the last step in the analysis.

This will use the "featureCounts" tool from the "subread" package.

### 1. File header

Create a new file "featureCounts.cwl".

Start with this header:

```
cwlVersion: v1.2
class: CommandLineTool
```

### 2. Command line tool inputs

A CommandLineTool describes a single invocation of a command line program.

It consumes some input parameters, runs a program, and produces output
values.

Here is the original shell command:

```
featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
```

The variables used in the bash script are `$cores`, `$gtf`, `$counts` and `$counts_input_bam`.

Of these, `$gtf` and `$counts_input_bam` refer to input files.

This gives us two file inputs, `gtf` and `counts_input_bam`, which we can declare in our `inputs` section:

```
inputs:
  gtf: File
  counts_input_bam: File
```
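
The short form above gives only a type for each input. As a minimal sketch, the same two inputs can also be written in the long form, which leaves room for a `doc` description. The description strings here are illustrative assumptions, not part of the original lesson:

```
inputs:
  gtf:
    type: File
    # Illustrative description (assumption); shown by tools that display CWL docs
    doc: Gene annotation file, passed to featureCounts with -a
  counts_input_bam:
    type: File
    # Illustrative description (assumption)
    doc: Aligned reads in BAM format to be counted
```

Both forms declare the same parameters; the rest of this lesson keeps the short form.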

### 3. Specifying the program to run

Give the name of the program to run in `baseCommand`.

```
baseCommand: featureCounts
```

### 4. Command arguments

The easiest way to describe the command line is with an `arguments`
section.  This takes a comma-separated list of command line arguments.

Input variables are included on the command line as
`$(inputs.name_of_parameter)`.  When the tool is executed, the input
parameter values are substituted for these variables.

Special variables are also available.  The runtime environment
describes the resources allocated to running the program.  Here we use
`$(runtime.cores)` to decide how many threads to request.

```
arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]
```
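
Note that the `arguments` list above leaves out the `-s 2` option that appears in the original shell command. If your analysis relies on that option (it tells featureCounts to treat the reads as reversely stranded), one possible way to carry it over is sketched below. Whether you need it depends on your data, so treat this as an optional variation rather than part of the lesson's wrapper:

```
# Variation on the arguments above that keeps "-s 2" from the shell command.
# The value is quoted so that YAML reads it as the string "2", not a number.
arguments: [-T, $(runtime.cores),
            -s, "2",
            -a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]
```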

### 5. Outputs section

In CWL, you must explicitly identify the outputs of a program.  This
associates output parameters with specific files, and enables the
workflow runner to know which files must be saved and which files can
be discarded.

In the previous section, we told the featureCounts program that the name of
our output file should be `featurecounts.tsv`.

We can declare an output parameter called `featurecounts` that will
have that output file as its value.

The `outputBinding` section describes how to determine the value of
the parameter.  The `glob` field tells it to search for a file in the
output directory called `featurecounts.tsv`.

```
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
```
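
Any other file the program writes can be captured the same way. As a sketch, assuming featureCounts also writes a summary file next to its main output with `.summary` appended to the name, a second output could be declared as below. The parameter name `featurecounts_summary` is hypothetical, and the `?` marks the output as optional so the tool does not fail if no such file is found:

```
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
  # Hypothetical extra output (assumption: featureCounts writes this file)
  featurecounts_summary:
    type: File?
    outputBinding:
      glob: featurecounts.tsv.summary
```

The rest of this lesson uses only the `featurecounts` output.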

### 6. Running in a container

In order to run the tool, it needs to be installed.
Using software containers, a tool can be pre-installed into a
compatible runtime environment, and that runtime environment (called a
container image) can be downloaded and run on demand.

Many bioinformatics tools are already available as containers.  One
resource is the BioContainers project.  Let's find the "subread" software:

   1. Visit https://biocontainers.pro/
   2. Click on "Registry"
   3. Search for "subread"
   4. Click on the search result for "subread"
   5. Click on the tab "Packages and Containers"
   6. Choose a row with type "docker", then on the right side of the "Full
      Tag" column for that row, click the "copy to clipboard" button.

To declare that you want to run inside a container, create a section
called `hints` with a subsection `DockerRequirement`.  Under
`DockerRequirement`, paste the text you copied in the above step.
Replace the text `docker pull` with `dockerPull:` and indent it so it is
in the `DockerRequirement` section.

```
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
```
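
Putting the pieces from sections 1 through 6 together, the complete "featureCounts.cwl" should look roughly like the following. The order of the top-level sections does not matter; the comments are added here for orientation only:

```
cwlVersion: v1.2
class: CommandLineTool

# The two file inputs taken from the original shell command
inputs:
  gtf: File
  counts_input_bam: File

# The program to run
baseCommand: featureCounts

# Command line arguments; $(...) references are filled in at run time
arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]

# The file written by "-o" above becomes the tool's output
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv

# Container image that has the subread package installed
hints:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/subread:1.5.0p3--0
```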

### 7. Running a tool on its own

When creating a tool wrapper, it is helpful to run it on its own to test it.

The input to a single tool is the same kind of input parameters file
that we used as input to a workflow in the previous lesson.

featureCounts.yaml:

```
counts_input_bam:
  class: File
  location: Aligned.sortedByCoord.out.bam
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

The invocation is also the same:

```
cwl-runner featureCounts.cwl featureCounts.yaml
```

### 8. Adding it to the workflow

Now that we have confirmed that it works, we can add it to our workflow.
We add it to `steps`, connecting the output of the samtools step to
`counts_input_bam`, while `gtf` takes the workflow input of the same
name.

```
steps:
  ...
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: samtools/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```
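
The `ResourceRequirement` on this step sets `ramMin`, which CWL measures in mebibytes. The same section can also request CPU cores with `coresMin`, which typically determines the value the tool sees as `$(runtime.cores)` and therefore the `-T` thread count. A minimal sketch follows; the value `2` is an arbitrary assumption, not something the lesson prescribes:

```
steps:
  ...
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500     # minimum RAM in mebibytes, as in the step above
        coresMin: 2     # assumed value: request at least two cores
    run: featureCounts.cwl
    in:
      counts_input_bam: samtools/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```

How many cores are actually granted still depends on the workflow runner and the machine it runs on.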
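
We also add the result of the featureCounts step to the workflow outputs:

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```

You should now be able to re-run the workflow and it will run the
"featureCounts" step and include "featurecounts" in the output.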
 181 You should now be able to re-run the workflow and it will run the
 182 "featureCounts" step and include "featurecounts" in the output.