doc/user/cwl/cwl-style.html.textile.liquid

---
layout: default
navsection: userguide
title: Best Practices for writing CWL
...

* To run on Arvados, a workflow should provide a @DockerRequirement@ in the @hints@ section.
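
As a minimal sketch of what that looks like, the tool below declares its container image as a hint.  The image name @debian:stable-slim@ and the @echo@ command are illustrative assumptions, not something Arvados requires.

<pre>
cwlVersion: v1.0
class: CommandLineTool
# Declared under hints, so runners without Docker support may still try to run the tool
hints:
  DockerRequirement:
    dockerPull: debian:stable-slim   # assumed image, substitute your own
baseCommand: [echo, hello]
inputs: []
outputs: []
</pre>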

* Build a reusable library of components.  Share tool wrappers and subworkflows between projects.  Make use of and contribute to "community maintained workflows and tools":https://github.com/common-workflow-language/workflows and tool registries such as "Dockstore":http://dockstore.org .

* When combining a parameter value with a string, such as adding a filename extension, write @$(inputs.file.basename).ext@ instead of @$(inputs.file.basename + '.ext')@.  The first form is evaluated as a simple text substitution; the second form (using the @+@ operator) is evaluated as an arbitrary JavaScript expression and requires that you declare @InlineJavascriptRequirement@.
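
The substitution form can be sketched as follows.  The choice of @gzip@ and the @.gz@ suffix are assumptions made for illustration; the point is that the @stdout@ file name is built without declaring @InlineJavascriptRequirement@.

<pre>
cwlVersion: v1.0
class: CommandLineTool
baseCommand: gzip
arguments: ["-c"]
inputs:
  file:
    type: File
    inputBinding: {position: 1}
# Plain parameter reference followed by literal text, no JavaScript needed
stdout: $(inputs.file.basename).gz
outputs:
  compressed:
    type: stdout
</pre>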

* Avoid declaring @InlineJavascriptRequirement@ or @ShellCommandRequirement@ unless you specifically need them.  Don't include them "just in case" because they change the default behavior and may imply extra overhead.

* Don't write CWL scripts that access the Arvados SDK.  This is non-portable; a script that accesses Arvados directly won't work with @cwltool@ or crunch v2.

* CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value.  Use @secondaryFiles@ for scripts that consist of multiple files.  For example:

<pre>
cwlVersion: v1.0
class: CommandLineTool
baseCommand: python
inputs:
  script:
    type: File
    inputBinding: {position: 1}
    default:
      class: File
      location: bclfastq.py
      secondaryFiles:
        - class: File
          location: helper1.py
        - class: File
          location: helper2.py
  inputfile:
    type: File
    inputBinding: {position: 2}
outputs:
  out:
    type: File
    outputBinding:
      glob: "*.fastq"
</pre>

* You can get the designated temporary directory using @$(runtime.tmpdir)@ in your CWL file, or from the @$TMPDIR@ environment variable in your script.

* Similarly, you can get the designated output directory using @$(runtime.outdir)@, or from the @$HOME@ environment variable in your script.
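
A short sketch of both references in use, assuming a @sort@ implementation that accepts @-T@ (temporary directory) and @-o@ (output file), as GNU sort does.  The output name @sorted.txt@ is a hypothetical choice.

<pre>
cwlVersion: v1.0
class: CommandLineTool
baseCommand: sort
arguments:
  # Scratch space goes to the designated temporary directory
  - prefix: "-T"
    valueFrom: $(runtime.tmpdir)
  # The result is written into the designated output directory
  - prefix: "-o"
    valueFrom: $(runtime.outdir)/sorted.txt
inputs:
  inputfile:
    type: File
    inputBinding: {position: 1}
outputs:
  sorted:
    type: File
    outputBinding:
      glob: sorted.txt
</pre>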

* Use @ExpressionTool@ to efficiently rearrange input files between steps of a Workflow.  For example, the following expression accepts a directory containing files paired by @_R1_@ and @_R2_@ and produces an array of Directories, one per pair.

<pre>
class: ExpressionTool
cwlVersion: v1.0
inputs:
  inputdir: Directory
outputs:
  out: Directory[]
requirements:
  InlineJavascriptRequirement: {}
expression: |
  ${
    // Group files by the part of the name before _R1_ or _R2_
    var samples = {};
    for (var i = 0; i < inputs.inputdir.listing.length; i++) {
      var file = inputs.inputdir.listing[i];
      var groups = file.basename.match(/^(.+)(_R[12]_)(.+)$/);
      if (groups) {
        if (!samples[groups[1]]) {
          samples[groups[1]] = [];
        }
        samples[groups[1]].push(file);
      }
    }
    // Emit one Directory per sample, listing that sample's files
    var dirs = [];
    for (var key in samples) {
      dirs.push({"class": "Directory",
                 "basename": key,
                 "listing": samples[key]});
    }
    return {"out": dirs};
  }
</pre>
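
A tool like the one above might be wired into a workflow roughly as follows, so that a later step scatters over the grouped directories.  The file names @group-pairs.cwl@ (the expression tool) and @align-pair.cwl@ (a per-pair tool with a @pairdir@ input and an @out@ output) are hypothetical.

<pre>
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inputdir: Directory
outputs:
  results:
    type: File[]
    outputSource: align/out
steps:
  # Rearranges the input listing without launching a container
  group:
    in: {inputdir: inputdir}
    out: [out]
    run: group-pairs.cwl
  # Runs once per Directory produced by the grouping step
  align:
    in: {pairdir: group/out}
    scatter: pairdir
    out: [out]
    run: align-pair.cwl
</pre>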

* Avoid specifying resource requirements in CommandLineTool.  Prefer to specify them in the workflow.  You can provide a default resource requirement in the top level @hints@ section, and individual steps can override it with their own resource requirement.

<pre>
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: step2/out
hints:
  ResourceRequirement:
    ramMin: 1000
    coresMin: 1
    tmpdirMin: 45000
steps:
  step1:
    in: {inp: inp}
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    out: [out]
    run: tool2.cwl
    hints:
      ResourceRequirement:
        ramMin: 2000
        coresMin: 2
        tmpdirMin: 90000
</pre>

* Instead of scattering separate steps, prefer to scatter over a subworkflow.

With the following pattern, @step1@ has to finish processing every sample before @step2@ can start computing on any of them.  This means a single long-running sample can prevent the rest of the workflow from moving on:

<pre>
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step3/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    scatter: inp
    out: [out]
    run: tool2.cwl
  step3:
    in: {inp: step2/out}
    scatter: inp
    out: [out]
    run: tool3.cwl
</pre>

Instead, scatter over a subworkflow.  In this pattern, a sample can proceed to @step2@ as soon as @step1@ is done, independently of any other samples.
For example (the subworkflow can also be put in a separate file):

<pre>
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run:
      class: Workflow
      inputs:
        inp: File
      outputs:
        out:
          type: File
          outputSource: step3/out
      steps:
        step1:
          in: {inp: inp}
          out: [out]
          run: tool1.cwl
        step2:
          in: {inp: step1/out}
          out: [out]
          run: tool2.cwl
        step3:
          in: {inp: step2/out}
          out: [out]
          run: tool3.cwl
</pre>
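
If the subworkflow is kept in a separate file, the outer workflow reduces to a single scattered step.  This is a sketch only: @per-sample.cwl@ is a hypothetical file assumed to contain the three-step inner workflow, with a @cwlVersion@ line of its own.

<pre>
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run: per-sample.cwl   # hypothetical file holding the inner workflow
</pre>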

* When migrating from the crunch v1 API (@--api=jobs@) to the crunch v2 API (@--api=containers@), there are a few differences in behavior:
** A tool can access only the collections that are explicitly listed in its input, and within those collections only the subdirectories that are listed.  For example, given an explicit file input @/dir/subdir/file1.txt@, a tool will not be able to implicitly access the file @/dir/file2.txt@.  Use @secondaryFiles@ or a @Directory@ input to describe trees of files.
** Files listed in @InitialWorkDirRequirement@ appear in the output directory as normal files (not symlinks) but cannot be moved, renamed or deleted.  These files will be added to the output collection but without any additional copies of the underlying data.
** Tools are disallowed network access by default.  Tools which require network access must include @arv:APIRequirement: {}@ in their @requirements@ section.
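
The first and last points above can be sketched in one hypothetical tool.  The @Directory@ input makes the whole tree under @refdir@ visible to the command, and the @arv:@ prefix is bound with @$namespaces@.  The namespace URI shown is an assumption about the Arvados extension namespace, so check it against the Arvados CWL extensions documentation before relying on it.

<pre>
cwlVersion: v1.0
class: CommandLineTool
$namespaces:
  arv: "http://arvados.org/cwl#"   # assumed Arvados extension namespace
requirements:
  arv:APIRequirement: {}
baseCommand: [ls, "-R"]
inputs:
  # A Directory input exposes every file beneath it, not just one named file
  refdir:
    type: Directory
    inputBinding: {position: 1}
outputs: []
</pre>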