lesson3/lesson3.md

# Writing a tool wrapper

It is time to add the last step in the analysis.

This will use the "featureCounts" tool from the "subread" package.

# Writing the tool wrapper

1. Create a new file "featureCounts.cwl"

2. Start with this header

```
cwlVersion: v1.2
class: CommandLineTool
```

3. Command line tool inputs

A CommandLineTool describes a single invocation of a command line program.

It consumes some input parameters, runs a program, and produces output
values.

Here's the original bash script:

```
featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
```

The variables used in the bash script are `$cores`, `$gtf`, `$counts` and `$counts_input_bam`.

Of these, `$gtf` and `$counts_input_bam` are files that the program reads.  The other two, `$cores` and `$counts`, are handled in the `arguments` section.

This gives us two file inputs, `gtf` and `counts_input_bam`, which we can declare in our `inputs` section:

```
inputs:
  gtf: File
  counts_input_bam: File
```
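
The short form above can also be written out in full, which leaves room for a `doc` description on each input.  This is a minimal sketch; the description strings are illustrative and not taken from the original script:

```yaml
inputs:
  gtf:
    type: File
    # Illustrative description (assumption): the annotation passed to -a
    doc: Gene annotation file in GTF format
  counts_input_bam:
    type: File
    # Illustrative description (assumption): the alignments to be counted
    doc: Aligned reads in BAM format
```

Both forms declare the same two `File` inputs, so either can be used in "featureCounts.cwl".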

4. The base command

This one is easy.  This is the name of the program to run:

```
baseCommand: featureCounts
```

5. The command arguments

The easiest way to describe the command line is with an `arguments`
section.  This takes a list of command line arguments, written here as
a comma-separated list in square brackets.

Input variables are included on the command line as
`$(inputs.name_of_parameter)`.  When the tool is executed, the input
parameter values are substituted for these variables.

Special variables are also available.  The runtime environment
describes the resources allocated to running the program.  Here we use
`$(runtime.cores)` to decide how many threads to request.
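
The value of `$(runtime.cores)` is decided by the workflow runner.  As a sketch of how a tool can ask for more cores, CWL provides a `ResourceRequirement`; the value 4 below is an arbitrary example and not a recommendation from this lesson:

```yaml
requirements:
  ResourceRequirement:
    # Example value (assumption): ask the runner for at least 4 cores.
    # The number actually reserved is what $(runtime.cores) evaluates to.
    coresMin: 4
```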

File variables can also yield a partial filename, by adding
`.nameroot`.  This is the filename with the final dot-extension
stripped off.

```
arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, $(inputs.counts_input_bam.nameroot)_featurecounts.txt,
            $(inputs.counts_input_bam)]
```
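
To see how the substitution plays out, consider a hypothetical input file.  The file names here are made up for illustration:

```yaml
# Hypothetical input object for featureCounts.cwl
gtf:
  class: File
  path: annotation.gtf    # made-up file name
counts_input_bam:
  class: File
  path: sample1.bam       # made-up file name
```

With these inputs, `$(inputs.counts_input_bam.nameroot)` is `sample1`, so the `-o` argument becomes `sample1_featurecounts.txt`.  The two file arguments are replaced by the paths where the runner makes the files available, which may differ from the paths written in the input file.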

6. The outputs section

In CWL, you must explicitly identify the outputs of a program.  This
associates output parameters with specific files, and allows the
workflow runner to know which files must be saved and which files can
be discarded.

In the previous section, we told the featureCounts program that the
name of our output file should be
`$(inputs.counts_input_bam.nameroot)_featurecounts.txt`.

We can declare an output parameter called `featurecounts` that will
have that output file as its value.

The `outputBinding` section describes how to determine the value of
the parameter.  The `glob` field tells it to search the output
directory for a file named
`$(inputs.counts_input_bam.nameroot)_featurecounts.txt`.

```
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: $(inputs.counts_input_bam.nameroot)_featurecounts.txt
```
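
Putting the pieces from steps 2 through 6 together, "featureCounts.cwl" would look roughly like this at this point.  Nothing new is introduced here; the comments only point back to the step each part came from:

```yaml
# Step 2: header
cwlVersion: v1.2
class: CommandLineTool

# Step 3: the two file inputs
inputs:
  gtf: File
  counts_input_bam: File

# Step 4: the program to run
baseCommand: featureCounts

# Step 5: the command line, built from inputs and runtime values
arguments: [-T, $(runtime.cores),
            -a, $(inputs.gtf),
            -o, $(inputs.counts_input_bam.nameroot)_featurecounts.txt,
            $(inputs.counts_input_bam)]

# Step 6: the file that featureCounts writes via -o
outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: $(inputs.counts_input_bam.nameroot)_featurecounts.txt
```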

7. Running in a container

The most portable way to run a tool is to wrap it in a Docker
container.  (Some CWL platforms, such as Arvados, require it.)  Many
bioinformatics tools are already available as containers.  One
resource is the BioContainers project.

Visit https://biocontainers.pro/

Click on "Registry"

Search for "subread"

Click on the search result for "subread"

Click on the tab "Packages and Containers"

Choose a row with type "docker", then click the "copy to clipboard"
button on the right side of the "Full Tag" column for that row.
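
One way the copied tag might then be used is in a `DockerRequirement` in "featureCounts.cwl".  This is a sketch only: it assumes the copied value has the form `quay.io/biocontainers/subread:TAG`, and `TAG` is a placeholder for whatever version you copied, not a real tag:

```yaml
hints:
  DockerRequirement:
    # Placeholder (assumption): paste the "Full Tag" value you copied here
    dockerPull: quay.io/biocontainers/subread:TAG
```

Listing it under `hints` lets a runner without Docker still attempt to run the tool; placing the same entry under `requirements` would make the container mandatory.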