---
layout: default
navsection: userguide
title: Guidelines for Writing High-Performance Portable Workflows
...
{% comment %}
Copyright (C) The Arvados Authors. All rights reserved.

SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}

h2(#performance). Performance

To get the best performance from your workflows, be aware of the following Arvados features, behaviors, and best practices:

If you have a sequence of short-running steps (less than 1-2 minutes each), use the Arvados extension "arv:RunInSingleContainer":cwl-extensions.html#RunInSingleContainer to avoid scheduling and data transfer overhead by running all the steps together at once.  To use this feature, @cwltool@ must be installed in the container image.
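
As a minimal sketch, the hint might be attached to a workflow step that runs a subworkflow, as shown here.  The file name @short-steps.cwl@ and the step name @short_steps@ are hypothetical placeholders, and the container image used by the subworkflow is assumed to have @cwltool@ installed:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
$namespaces:
  arv: "http://arvados.org/cwl#"
requirements:
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: short_steps/out
steps:
  short_steps:
    in: {inp: inp}
    out: [out]
    # Hypothetical subworkflow made up of several short-running steps
    run: short-steps.cwl
    hints:
      # Ask Arvados to run every step of the subworkflow in one container
      arv:RunInSingleContainer: {}
{% endcodeblock %}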

Avoid declaring @InlineJavascriptRequirement@ or @ShellCommandRequirement@ unless you specifically need them.  Don't include them "just in case" because they change the default behavior and may add extra overhead.

When combining a parameter value with a string, such as adding a filename extension, write @$(inputs.file.basename).ext@ instead of @$(inputs.file.basename + '.ext')@.  The first form is evaluated as a simple text substitution; the second form (using the @+@ operator) is evaluated as an arbitrary JavaScript expression and requires that you declare @InlineJavascriptRequirement@.
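
The text-substitution form can be sketched in a complete tool as follows.  The choice of @gzip@ and the @.gz@ suffix are illustrative assumptions; the point is that no @InlineJavascriptRequirement@ is declared:

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [gzip, -c]
inputs:
  file:
    type: File
    inputBinding: {position: 1}
# Simple parameter reference followed by literal text: no JavaScript needed
stdout: $(inputs.file.basename).gz
outputs:
  compressed:
    type: stdout
{% endcodeblock %}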

Use @ExpressionTool@ to efficiently rearrange input files between steps of a Workflow.  For example, the following expression accepts a directory containing files paired by @_R1_@ and @_R2_@ and produces an array of Directories containing each pair.

{% codeblock as yaml %}
class: ExpressionTool
cwlVersion: v1.0
inputs:
  inputdir: Directory
outputs:
  out: Directory[]
requirements:
  InlineJavascriptRequirement: {}
expression: |
  ${
    // Group files by the part of the name that precedes _R1_ or _R2_
    var samples = {};
    for (var i = 0; i < inputs.inputdir.listing.length; i++) {
      var file = inputs.inputdir.listing[i];
      var groups = file.basename.match(/^(.+)(_R[12]_)(.+)$/);
      if (groups) {
        if (!samples[groups[1]]) {
          samples[groups[1]] = [];
        }
        samples[groups[1]].push(file);
      }
    }
    // Build one Directory per sample; "listing" is the array of its files
    var dirs = [];
    for (var key in samples) {
      dirs.push({"class": "Directory",
                 "basename": key,
                 "listing": samples[key]});
    }
    return {"out": dirs};
  }
{% endcodeblock %}
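
One way such an expression might be used is sketched below: a workflow runs the ExpressionTool once and then scatters a downstream tool over the resulting directories.  The file names @group-pairs.cwl@ (assumed to contain the ExpressionTool above) and @process-pair.cwl@, along with the @pairdir@ input of the downstream tool, are hypothetical:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  fastqdir: Directory
outputs:
  results:
    type: File[]
    outputSource: process/out
steps:
  group:
    in: {inputdir: fastqdir}
    out: [out]
    # Hypothetical file containing the ExpressionTool shown above
    run: group-pairs.cwl
  process:
    in: {pairdir: group/out}
    # One invocation per Directory produced by the expression
    scatter: pairdir
    out: [out]
    # Hypothetical tool that takes one Directory and produces one File
    run: process-pair.cwl
{% endcodeblock %}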

Available compute node types vary over time and across different cloud providers, so try to limit the RAM requirement to what the program actually needs.  However, if you need to target a specific compute node type, see this discussion on "calculating RAM request and choosing instance type for containers.":{{site.baseurl}}/api/execution.html#RAM

Instead of scattering separate steps, prefer to scatter over a subworkflow.

With the following pattern, @step2@ cannot start computing on any sample until @step1@ has completed for all samples.  This means a single long-running sample can prevent the rest of the workflow from moving on:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step3/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    scatter: inp
    out: [out]
    run: tool2.cwl
  step3:
    in: {inp: step2/out}
    scatter: inp
    out: [out]
    run: tool3.cwl
{% endcodeblock %}

Instead, scatter over a subworkflow.  In this pattern, a sample can proceed to @step2@ as soon as @step1@ is done, independently of any other samples.  For example (note that the subworkflow can also be put in a separate file):

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run:
      class: Workflow
      inputs:
        inp: File
      outputs:
        out:
          type: File
          outputSource: step3/out
      steps:
        step1:
          in: {inp: inp}
          out: [out]
          run: tool1.cwl
        step2:
          in: {inp: step1/out}
          out: [out]
          run: tool2.cwl
        step3:
          in: {inp: step2/out}
          out: [out]
          run: tool3.cwl
{% endcodeblock %}
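
The separate-file variant mentioned above might look like the following sketch.  It assumes the inner workflow has been saved as @per-sample.cwl@ (a hypothetical file name) with the same @inp@ input and @out@ output:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    # Hypothetical file holding the step1 -> step2 -> step3 subworkflow
    run: per-sample.cwl
{% endcodeblock %}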

h2. Portability

To write workflows that are easy to modify and portable across CWL runners (in the event you need to share your workflow with others), there are several best practices to follow:

Workflows should always provide @DockerRequirement@ in the @hints@ or @requirements@ section.
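
As a minimal sketch, a workflow-level hint could look like this.  The image name @debian:stable-slim@ is only a placeholder assumption; substitute the image that actually contains your tools:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
hints:
  DockerRequirement:
    # Placeholder image name, not a recommendation
    dockerPull: debian:stable-slim
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    out: [out]
    run: tool1.cwl
{% endcodeblock %}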

Build a reusable library of components.  Share tool wrappers and subworkflows between projects.  Make use of and contribute to "community-maintained workflows and tools":https://github.com/common-workflow-language/workflows and tool registries such as "Dockstore":http://dockstore.org .

CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value.  Use @secondaryFiles@ for scripts that consist of multiple files.  For example:

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
baseCommand: python
inputs:
  script:
    type: File
    inputBinding: {position: 1}
    default:
      class: File
      location: bclfastq.py
      secondaryFiles:
        - class: File
          location: helper1.py
        - class: File
          location: helper2.py
  inputfile:
    type: File
    inputBinding: {position: 2}
outputs:
  out:
    type: File
    outputBinding:
      glob: "*.fastq"
{% endcodeblock %}

You can get the designated temporary directory using @$(runtime.tmpdir)@ in your CWL file, or from the @$TMPDIR@ environment variable in your script.

Similarly, you can get the designated output directory using @$(runtime.outdir)@, or from the @$HOME@ environment variable in your script.
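
A small sketch of the CWL side, assuming a tool that accepts a scratch directory on its command line (here @sort@ with its @-T@ option, chosen only for illustration):

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
baseCommand: sort
arguments:
  # Point the tool's scratch space at the designated temporary directory
  - prefix: -T
    valueFrom: $(runtime.tmpdir)
inputs:
  inputfile:
    type: File
    inputBinding: {position: 1}
# Relative output paths are written to the designated output directory
stdout: sorted.txt
outputs:
  sorted:
    type: stdout
{% endcodeblock %}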

Avoid specifying resource requirements in a CommandLineTool.  Prefer to specify them in the workflow.  You can provide a default resource requirement in the top-level @hints@ section, and individual steps can override it with their own resource requirement.

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: step2/out
hints:
  ResourceRequirement:
    ramMin: 1000
    coresMin: 1
    tmpdirMin: 45000
steps:
  step1:
    in: {inp: inp}
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    out: [out]
    run: tool2.cwl
    hints:
      ResourceRequirement:
        ramMin: 2000
        coresMin: 2
        tmpdirMin: 90000
{% endcodeblock %}