doc/user/cwl/cwl-style.html.textile.liquid

---
layout: default
navsection: userguide
title: Guidelines for Writing High-Performance Portable Workflows
...
{% comment %}
Copyright (C) The Arvados Authors. All rights reserved.

SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}

h2(#performance). Performance

To get the best performance from your workflows, be aware of the following Arvados features, behaviors, and best practices.

h3. Does your application support NVIDIA GPU acceleration?

Use "cwltool:CUDARequirement":cwl-extensions.html#CUDARequirement to request nodes with GPUs.
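
As a minimal sketch, a tool might request one GPU with a hint like the following.  The version and compute capability values are placeholders for illustration, not recommendations; adjust them to what your application actually requires.

{% codeblock as yaml %}
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
hints:
  cwltool:CUDARequirement:
    # Placeholder values for illustration only.
    cudaVersionMin: "11.0"
    cudaComputeCapability: "8.0"
    cudaDeviceCountMin: 1
    cudaDeviceCountMax: 1
{% endcodeblock %}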

h3. Trying to reduce costs?

Try "using preemptible (spot) instances":cwl-run-options.html#preemptible .
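
For example, assuming your cluster is configured with preemptible instance types, a tool or workflow step can opt in with the @arv:UsePreemptible@ hint.  This is only a sketch; the linked page describes the command line options that control the same behavior.

{% codeblock as yaml %}
$namespaces:
  arv: "http://arvados.org/cwl#"
hints:
  arv:UsePreemptible:
    usePreemptible: true
{% endcodeblock %}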

h3. You have a sequence of short-running steps

If you have a sequence of short-running steps (less than 1-2 minutes each), use the Arvados extension "arv:RunInSingleContainer":cwl-extensions.html#RunInSingleContainer to avoid scheduling and data transfer overhead by running all the steps together in the same container on the same node.  To use this feature, @cwltool@ must be installed in the container image.  Example:

{% codeblock as yaml %}
class: Workflow
cwlVersion: v1.0
$namespaces:
  arv: "http://arvados.org/cwl#"
inputs:
  file: File
outputs: []
requirements:
  SubworkflowFeatureRequirement: {}
steps:
  subworkflow-with-short-steps:
    in:
      file: file
    out: [out]
    # This hint indicates that the subworkflow should be bundled and
    # run in a single container, instead of the normal behavior, which
    # is to run each step in a separate container.  This greatly
    # reduces overhead if you have a series of short jobs, without
    # requiring any changes to the CWL definition of the subworkflow.
    hints:
      - class: arv:RunInSingleContainer
    run: subworkflow-with-short-steps.cwl
{% endcodeblock %}

h3. Avoid declaring @InlineJavascriptRequirement@ or @ShellCommandRequirement@

Avoid declaring @InlineJavascriptRequirement@ or @ShellCommandRequirement@ unless you specifically need them.  Don't include them "just in case" because they change the default behavior and may add extra overhead.

h3. Prefer text substitution to Javascript

When combining a parameter value with a string, such as adding a filename extension, write @$(inputs.file.basename).ext@ instead of @$(inputs.file.basename + '.ext')@.  The first form is evaluated as a simple text substitution.  The second form (using the @+@ operator) is evaluated as an arbitrary Javascript expression and requires that you declare @InlineJavascriptRequirement@.
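
A minimal sketch of the first form in a complete tool, assuming a hypothetical wrapper around @gzip@ with an input named @file@.  No @InlineJavascriptRequirement@ is declared, because the output name is built by plain parameter substitution.

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [gzip, -c]
inputs:
  file:
    type: File
    inputBinding: {position: 1}
# Plain text substitution: an input named "sample.txt" is written to "sample.txt.gz".
stdout: $(inputs.file.basename).gz
outputs:
  out:
    type: stdout
{% endcodeblock %}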

h3. Use @ExpressionTool@ to efficiently rearrange input files

Use @ExpressionTool@ to efficiently rearrange input files between steps of a Workflow.  For example, the following expression accepts a directory containing files paired by @_R1_@ and @_R2_@ and produces an array of Directories, each containing one pair.

{% codeblock as yaml %}
class: ExpressionTool
cwlVersion: v1.0
inputs:
  inputdir: Directory
outputs:
  out: Directory[]
requirements:
  InlineJavascriptRequirement: {}
expression: |
  ${
    // Group files by the sample name that precedes the _R1_/_R2_ marker.
    var samples = {};
    for (var i = 0; i < inputs.inputdir.listing.length; i++) {
      var file = inputs.inputdir.listing[i];
      var groups = file.basename.match(/^(.+)(_R[12]_)(.+)$/);
      if (groups) {
        if (!samples[groups[1]]) {
          samples[groups[1]] = [];
        }
        samples[groups[1]].push(file);
      }
    }
    // Emit one Directory per sample, listing that sample's files.
    var dirs = [];
    for (var key in samples) {
      dirs.push({"class": "Directory",
                 "basename": key,
                 "listing": samples[key]});
    }
    return {"out": dirs};
  }
{% endcodeblock %}
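
The expression above could then feed a scatter step, as in the following sketch.  The file names @group-pairs.cwl@ (holding the ExpressionTool above) and @process-pair.cwl@ (a tool taking a Directory input named @pair@ and producing a File output named @out@) are hypothetical.

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inputdir: Directory
outputs:
  results:
    type: File[]
    outputSource: process/out
steps:
  group:
    in: {inputdir: inputdir}
    out: [out]
    run: group-pairs.cwl      # hypothetical: the ExpressionTool above
  process:
    in: {pair: group/out}
    scatter: pair
    out: [out]
    run: process-pair.cwl     # hypothetical tool, one run per pair
{% endcodeblock %}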

h3. Limit RAM requests to what you really need

Available compute node types vary over time and across different cloud providers, so it is important to limit the RAM requirement to what the program actually needs.  However, if you need to target a specific compute node type, see this discussion on "calculating RAM request and choosing instance type for containers.":{{site.baseurl}}/api/execution.html#RAM

h3. Avoid scattering step by step

Instead of a scatter step that feeds into another scatter step, prefer to scatter over a subworkflow.

With the following pattern, @step2@ cannot start computing on any sample until @step1@ has completed for all samples.  This means a single long-running sample can prevent the rest of the workflow from moving on:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step3/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    scatter: inp
    out: [out]
    run: tool2.cwl
  step3:
    in: {inp: step2/out}
    scatter: inp
    out: [out]
    run: tool3.cwl
{% endcodeblock %}

Instead, scatter over a subworkflow.  In this pattern, a sample can proceed to @step2@ as soon as @step1@ is done, independently of any other samples.  For example (the subworkflow can also be put in a separate file):

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run:
      class: Workflow
      inputs:
        inp: File
      outputs:
        out:
          type: File
          outputSource: step3/out
      steps:
        step1:
          in: {inp: inp}
          out: [out]
          run: tool1.cwl
        step2:
          in: {inp: step1/out}
          out: [out]
          run: tool2.cwl
        step3:
          in: {inp: step2/out}
          out: [out]
          run: tool3.cwl
{% endcodeblock %}


h2. Portability

To write workflows that are easy to modify and portable across CWL runners (in the event you need to share your workflow with others), there are several best practices to follow:

h3. Always provide @DockerRequirement@

Workflows should always provide @DockerRequirement@ in the @hints@ or @requirements@ section.
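
For instance, a tool might declare its container image as a hint.  The tool and the image name below are placeholders for illustration.

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [cat]
hints:
  DockerRequirement:
    # Placeholder image; name the image your tool actually needs.
    dockerPull: debian:stable-slim
inputs:
  infile:
    type: File
    inputBinding: {position: 1}
stdout: copy.txt
outputs:
  out:
    type: stdout
{% endcodeblock %}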

h3. Build a reusable library of components

Share tool wrappers and subworkflows between projects.  Make use of and contribute to "community maintained workflows and tools":https://github.com/common-workflow-library and tool registries such as "Dockstore":http://dockstore.org .

h3. Supply scripts as input parameters

CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value.  Use @secondaryFiles@ for scripts that consist of multiple files.  For example:

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
baseCommand: python
inputs:
  script:
    type: File
    inputBinding: {position: 1}
    default:
      class: File
      location: bclfastq.py
      secondaryFiles:
        - class: File
          location: helper1.py
        - class: File
          location: helper2.py
  inputfile:
    type: File
    inputBinding: {position: 2}
outputs:
  out:
    type: File
    outputBinding:
      glob: "*.fastq"
{% endcodeblock %}

h3. Getting the temporary and output directories

You can get the designated temporary directory using @$(runtime.tmpdir)@ in your CWL file, or from the @$TMPDIR@ environment variable in your script.

Similarly, you can get the designated output directory using @$(runtime.outdir)@ in your CWL file, or from the @$HOME@ environment variable in your script.
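
As an illustration, the following sketch passes the designated temporary directory to @sort@ through its @-T@ option.  The choice of tool and the parameter names are assumptions made for this example.

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
baseCommand: sort
arguments:
  # Tell sort to keep its scratch files in the designated temporary directory.
  - prefix: -T
    valueFrom: $(runtime.tmpdir)
inputs:
  unsorted:
    type: File
    inputBinding: {position: 1}
# stdout is captured to a file in the designated output directory.
stdout: sorted.txt
outputs:
  sorted:
    type: stdout
{% endcodeblock %}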

h3. Specifying @ResourceRequirement@

Avoid specifying resources in the @requirements@ section of a @CommandLineTool@; put them in the @hints@ section instead.  This enables you to override the tool resource hint with a workflow step level requirement:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: step2/out
steps:
  step1:
    in: {inp: inp}
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    out: [out]
    run: tool2.cwl
    requirements:
      ResourceRequirement:
        ramMin: 2000
        coresMin: 2
        tmpdirMin: 90000
{% endcodeblock %}
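
On the tool side, the corresponding hint could look like the following sketch.  The contents of @tool2.cwl@ are not shown in this guide, so the command, parameter names, and values here are assumptions for illustration.  The step level requirement in the workflow above takes precedence over this hint.

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]       # hypothetical command
hints:
  ResourceRequirement:
    ramMin: 1000            # MiB; a default that a workflow step can override
    coresMin: 1
inputs:
  inp:
    type: File
    inputBinding: {position: 1}
stdout: out.txt
outputs:
  out:
    type: stdout
{% endcodeblock %}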

h3. Importing data into Keep

You can use HTTP URLs as File input parameters and @arvados-cwl-runner@ will download them to Keep for you:

{% codeblock as yaml %}
fastq1:
  class: File
  location: https://example.com/genomes/sampleA_1.fastq
fastq2:
  class: File
  location: https://example.com/genomes/sampleA_2.fastq
{% endcodeblock %}

Files are downloaded and stored in Keep collections, with HTTP header information stored in metadata.  If a file was previously downloaded, @arvados-cwl-runner@ uses HTTP caching rules to decide whether the file should be downloaded again.

The default behavior is to transfer the files on the client, prior to submitting the workflow run.  This guarantees the data is available when the workflow is submitted.  However, if data transfer is time consuming and you are submitting multiple workflow runs in a row, or the node submitting the workflow has limited bandwidth, you can use the @--defer-download@ option to have the data transfer performed by the workflow runner process on a compute node, after the workflow is submitted.

@arvados-cwl-runner@ provides two additional options to control caching behavior.

* @--varying-url-params@ ignores the listed URL query parameters in HTTP URLs when checking whether a URL has already been downloaded to Keep.
* @--prefer-cached-downloads@ searches Keep for the previously downloaded URL and uses that copy if found, without checking the upstream resource. This means changes in the upstream resource won't be detected, but it also means the workflow will not fail if the upstream resource becomes inaccessible.

One use of these options is to import files from "AWS S3 signed URLs":https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html .

Here is an example usage.  The use of @--varying-url-params=AWSAccessKeyId,Signature,Expires@ is especially relevant: it removes these parameters from the cached URL, so if a new signed URL for the same object is generated later, the object can still be found in the cache.

{% codeblock as sh %}
arvados-cwl-runner --defer-download \
                   --varying-url-params=AWSAccessKeyId,Signature,Expires \
                   --prefer-cached-downloads \
                   workflow.cwl params.yml
{% endcodeblock %}