doc/user/cwl/cwl-style.html.textile.liquid

---
layout: default
navsection: userguide
title: Best Practices for writing CWL
...
{% comment %}
Copyright (C) The Arvados Authors. All rights reserved.

SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}

* To run on Arvados, a workflow should provide a @DockerRequirement@ in the @hints@ section.
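
As a minimal sketch of the point above, a tool might declare its container image as shown below.  The image name @debian:stable-slim@ and the @echo@ command are placeholders chosen for illustration, not something Arvados prescribes:

<pre>
cwlVersion: v1.0
class: CommandLineTool
hints:
  DockerRequirement:
    # Placeholder image; substitute the image that contains your tool.
    dockerPull: debian:stable-slim
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding: {position: 1}
outputs: []
</pre>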

* Build a reusable library of components.  Share tool wrappers and subworkflows between projects.  Make use of and contribute to "community-maintained workflows and tools":https://github.com/common-workflow-language/workflows and tool registries such as "Dockstore":http://dockstore.org .

* When combining a parameter value with a string, such as adding a filename extension, write @$(inputs.file.basename).ext@ instead of @$(inputs.file.basename + '.ext')@.  The first form is evaluated as a simple text substitution.  The second form (using the @+@ operator) is evaluated as an arbitrary JavaScript expression and requires that you declare @InlineJavascriptRequirement@.
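
The simple substitution form can be seen in a small tool.  This is a sketch only: the tool, its use of @cp@, and the @.bak@ extension are hypothetical, chosen to show that no extra requirement is needed:

<pre>
cwlVersion: v1.0
class: CommandLineTool
baseCommand: cp
inputs:
  file:
    type: File
    inputBinding: {position: 1}
arguments:
  # Plain parameter reference with a literal suffix: no JavaScript needed.
  - valueFrom: $(inputs.file.basename).bak
    position: 2
outputs:
  copy:
    type: File
    outputBinding:
      glob: $(inputs.file.basename).bak
</pre>

Writing the same value as @$(inputs.file.basename + '.bak')@ would require adding @InlineJavascriptRequirement@ to the tool.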

* Avoid declaring @InlineJavascriptRequirement@ or @ShellCommandRequirement@ unless you specifically need them.  Don't include them "just in case" because they change the default behavior and may imply extra overhead.

* Don't write CWL scripts that access the Arvados SDK.  This is non-portable; a script that accesses Arvados directly won't work with @cwltool@ or crunch v2.

* CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value.  Use @secondaryFiles@ for scripts that consist of multiple files.  For example:

<pre>
cwlVersion: v1.0
class: CommandLineTool
baseCommand: python
inputs:
  script:
    type: File
    inputBinding: {position: 1}
    default:
      class: File
      location: bclfastq.py
      secondaryFiles:
        - class: File
          location: helper1.py
        - class: File
          location: helper2.py
  inputfile:
    type: File
    inputBinding: {position: 2}
outputs:
  out:
    type: File
    outputBinding:
      glob: "*.fastq"
</pre>

* You can get the designated temporary directory using @$(runtime.tmpdir)@ in your CWL file, or from the @$TMPDIR@ environment variable in your script.

* Similarly, you can get the designated output directory using @$(runtime.outdir)@, or from the @$HOME@ environment variable in your script.
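
For illustration, both runtime values can be passed to a program as arguments.  The tool below is a hypothetical sketch built around @sort@ (GNU coreutils), whose @-T@ option selects the directory for temporary files and whose @-o@ option names the output file; the input name @unsorted@ and the file name @sorted.txt@ are placeholders:

<pre>
cwlVersion: v1.0
class: CommandLineTool
baseCommand: sort
arguments:
  # Keep sort's scratch files in the designated temporary directory.
  - prefix: -T
    valueFrom: $(runtime.tmpdir)
  # Write the result into the designated output directory.
  - prefix: -o
    valueFrom: $(runtime.outdir)/sorted.txt
inputs:
  unsorted:
    type: File
    inputBinding: {position: 1}
outputs:
  sorted:
    type: File
    outputBinding:
      glob: sorted.txt
</pre>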

* Use @ExpressionTool@ to efficiently rearrange input files between steps of a Workflow.  For example, the following expression accepts a directory containing files paired by @_R1_@ and @_R2_@ and produces an array of Directories containing each pair.

<pre>
class: ExpressionTool
cwlVersion: v1.0
inputs:
  inputdir: Directory
outputs:
  out: Directory[]
requirements:
  InlineJavascriptRequirement: {}
expression: |
  ${
    // Group files by the part of the name that precedes _R1_ or _R2_.
    var samples = {};
    for (var i = 0; i < inputs.inputdir.listing.length; i++) {
      var file = inputs.inputdir.listing[i];
      var groups = file.basename.match(/^(.+)(_R[12]_)(.+)$/);
      if (groups) {
        if (!samples[groups[1]]) {
          samples[groups[1]] = [];
        }
        samples[groups[1]].push(file);
      }
    }
    // Emit one Directory per sample, listing the files of that sample.
    var dirs = [];
    for (var key in samples) {
      dirs.push({"class": "Directory",
                 "basename": key,
                 "listing": samples[key]});
    }
    return {"out": dirs};
  }
</pre>

* Avoid specifying resource requirements in CommandLineTool.  Prefer to specify them in the workflow.  You can provide a default resource requirement in the top-level @hints@ section, and individual steps can override it with their own resource requirement.

<pre>
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: step2/out
hints:
  ResourceRequirement:
    ramMin: 1000
    coresMin: 1
    tmpdirMin: 45000
steps:
  step1:
    in: {inp: inp}
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    out: [out]
    run: tool2.cwl
    hints:
      ResourceRequirement:
        ramMin: 2000
        coresMin: 2
        tmpdirMin: 90000
</pre>

* Available compute node types vary over time and across different cloud providers, so try to limit the RAM requirement to what the program actually needs.  However, if you need to target a specific compute node type, see this discussion on "calculating RAM request and choosing instance type for containers.":{{site.baseurl}}/api/execution.html#RAM

* Instead of scattering separate steps, prefer to scatter over a subworkflow.

With the following pattern, @step1@ must finish processing every sample before @step2@ can start on any of them.  This means a single long-running sample can prevent the rest of the workflow from moving on:

<pre>
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step3/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    scatter: inp
    out: [out]
    run: tool2.cwl
  step3:
    in: {inp: step2/out}
    scatter: inp
    out: [out]
    run: tool3.cwl
</pre>

Instead, scatter over a subworkflow.  In this pattern, a sample can proceed to @step2@ as soon as @step1@ is done, independently of any other samples.  In the following example the subworkflow is defined inline, but it can also be put in a separate file:

<pre>
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run:
      class: Workflow
      inputs:
        inp: File
      outputs:
        out:
          type: File
          outputSource: step3/out
      steps:
        step1:
          in: {inp: inp}
          out: [out]
          run: tool1.cwl
        step2:
          in: {inp: step1/out}
          out: [out]
          run: tool2.cwl
        step3:
          in: {inp: step2/out}
          out: [out]
          run: tool3.cwl
</pre>

h2(#migrate). Migrating CWL from the jobs API to the containers API

* When migrating from the jobs API (@--api=jobs@, sometimes referred to as "crunch v1") to the containers API (@--api=containers@, "crunch v2"), there are a few differences in behavior:
** The tool is limited to accessing only collections which are explicitly listed in the input, and further limited to only the subdirectories of collections listed in the input.  For example, given an explicit file input @/dir/subdir/file1.txt@, a tool will not be able to implicitly access the file @/dir/file2.txt@.  Use @secondaryFiles@ or a @Directory@ input to describe trees of files.
** Files listed in @InitialWorkDirRequirement@ appear in the output directory as normal files (not symlinks) but cannot be moved, renamed or deleted.  These files will be added to the output collection but without any additional copies of the underlying data.
** Tools are denied network access by default.  Tools which require network access must include @arv:APIRequirement: {}@ in their @requirements@ section.
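
As a sketch of how a migrated tool might look, the hypothetical tool below takes a whole @Directory@ as input so that every file under it is accessible, and declares @arv:APIRequirement@.  The @ls@ command, the input name @dataset@, and the output file name are placeholders; the @arv@ namespace URI shown is an assumption about how Arvados CWL extensions are declared, so check it against the documentation for your Arvados version:

<pre>
cwlVersion: v1.0
class: CommandLineTool
$namespaces:
  # Assumed namespace for Arvados CWL extensions.
  arv: "http://arvados.org/cwl#"
requirements:
  # Include this only in tools that really need network access.
  arv:APIRequirement: {}
baseCommand: [ls, -R]
inputs:
  dataset:
    # A Directory input makes the whole tree available, not a single file.
    type: Directory
    inputBinding: {position: 1}
outputs:
  listing:
    type: stdout
stdout: listing.txt
</pre>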