4 title: Guidelines for Writing High-Performance Portable Workflows
7 Copyright (C) The Arvados Authors. All rights reserved.
9 SPDX-License-Identifier: CC-BY-SA-3.0
12 h2(#performance). Performance
To get the best performance from your workflows, be aware of the following Arvados features, behaviors, and best practices.
16 h3. Does your application support NVIDIA GPU acceleration?
18 Use "cwltool:CUDARequirement":cwl-extensions.html#CUDARequirement to request nodes with GPUs.
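
As a minimal sketch, a tool might request a single GPU as shown here. The version number, compute capability, device counts, and the @nvidia-smi@ command are assumptions chosen for illustration; consult the linked extension reference for the values your application needs.

{% codeblock as yaml %}
cwlVersion: v1.2
class: CommandLineTool
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
hints:
  cwltool:CUDARequirement:
    # Illustrative values only; adjust to your application.
    cudaVersionMin: "11.0"
    cudaComputeCapability: "8.0"
    cudaDeviceCountMin: 1
    cudaDeviceCountMax: 1
baseCommand: nvidia-smi
inputs: []
outputs: []
{% endcodeblock %}
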
20 h3. Trying to reduce costs?
22 Try "using preemptible (spot) instances":cwl-run-options.html#preemptible .
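
Besides the run option linked above, preemptible scheduling can also be requested from inside the CWL document. The following sketch assumes the @arv:UsePreemptible@ hint described on the Arvados CWL extensions page; the @echo@ command is only a placeholder.

{% codeblock as yaml %}
cwlVersion: v1.2
class: CommandLineTool
$namespaces:
  arv: "http://arvados.org/cwl#"
hints:
  # Assumed Arvados extension: allow this tool to run on preemptible instances.
  arv:UsePreemptible:
    usePreemptible: true
baseCommand: [echo, "running on a preemptible instance"]
inputs: []
outputs: []
{% endcodeblock %}
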
24 h3. You have a sequence of short-running steps
26 If you have a sequence of short-running steps (less than 1-2 minutes each), use the Arvados extension "arv:RunInSingleContainer":cwl-extensions.html#RunInSingleContainer to avoid scheduling and data transfer overhead by running all the steps together in the same container on the same node. To use this feature, @cwltool@ must be installed in the container image. Example:
28 {% codeblock as yaml %}
32 arv: "http://arvados.org/cwl#"
37 SubworkflowFeatureRequirement: {}
39 subworkflow-with-short-steps:
# This hint indicates that the subworkflow should be bundled and
# run in a single container, instead of the normal behavior, which
# is to run each step in a separate container. This greatly
# reduces overhead if you have a series of short jobs, without
# requiring any changes to the CWL definition of the subworkflow.
49 - class: arv:RunInSingleContainer
50 run: subworkflow-with-short-steps.cwl
53 h3. Avoid declaring @InlineJavascriptRequirement@ or @ShellCommandRequirement@
55 Avoid declaring @InlineJavascriptRequirement@ or @ShellCommandRequirement@ unless you specifically need them. Don't include them "just in case" because they change the default behavior and may add extra overhead.
57 h3. Prefer text substitution to Javascript
When combining a parameter value with a string, such as adding a filename extension, write @$(inputs.file.basename).ext@ instead of @$(inputs.file.basename + '.ext')@. The first form is evaluated as a simple text substitution. The second form (using the @+@ operator) is evaluated as an arbitrary Javascript expression and requires that you declare @InlineJavascriptRequirement@.
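
For illustration, here is a small tool that names its output by text substitution alone, so it declares neither @InlineJavascriptRequirement@ nor @ShellCommandRequirement@. The choice of @gzip@ and the input and output names are assumptions made for this sketch.

{% codeblock as yaml %}
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [gzip, -c]
inputs:
  file:
    type: File
    inputBinding: {position: 1}
# Plain parameter reference followed by literal text: no Javascript is evaluated.
stdout: $(inputs.file.basename).gz
outputs:
  compressed:
    type: stdout
{% endcodeblock %}
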
61 h3. Use @ExpressionTool@ to efficiently rearrange input files
63 Use @ExpressionTool@ to efficiently rearrange input files between steps of a Workflow. For example, the following expression accepts a directory containing files paired by @_R1_@ and @_R2_@ and produces an array of Directories containing each pair.
65 {% codeblock as yaml %}
73 InlineJavascriptRequirement: {}
77 for (var i = 0; i < inputs.inputdir.listing.length; i++) {
78 var file = inputs.inputdir.listing[i];
79 var groups = file.basename.match(/^(.+)(_R[12]_)(.+)$/);
81 if (!samples[groups[1]]) {
82 samples[groups[1]] = [];
84 samples[groups[1]].push(file);
88 for (var key in samples) {
89 dirs.push({"class": "Directory",
"listing": samples[key]});
97 h3. Limit RAM requests to what you really need
Available compute node types vary over time and across different cloud providers, so it is important to limit the RAM requirement to what the program actually needs. However, if you need to target a specific compute node type, see this discussion on "calculating RAM request and choosing instance type for containers":{{site.baseurl}}/api/execution.html#RAM.
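
As a sketch of sizing the request to the program, the following tool caps a Java heap at 3 GiB and asks for slightly more RAM to leave headroom. The jar path @/opt/tool.jar@ is hypothetical and the numbers are assumptions. @ramMin@ is expressed in mebibytes.

{% codeblock as yaml %}
cwlVersion: v1.2
class: CommandLineTool
hints:
  ResourceRequirement:
    # 3 GiB heap plus headroom for the JVM itself (illustrative numbers).
    ramMin: 3584
    coresMin: 1
# /opt/tool.jar is a hypothetical program used only for this example.
baseCommand: [java, -Xmx3g, -jar, /opt/tool.jar]
inputs: []
outputs: []
{% endcodeblock %}
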
h3. Avoid scattering step by step
103 Instead of a scatter step that feeds into another scatter step, prefer to scatter over a subworkflow.
With the following pattern, @step2@ cannot start computing on any sample until @step1@ has completed for all samples. This means a single long-running sample can prevent the rest of the workflow from moving on:
107 {% codeblock as yaml %}
Instead, scatter over a subworkflow. In this pattern, a sample can proceed to @step2@ as soon as @step1@ is done, independently of any other samples.
Example (note that the subworkflow can also be put in a separate file):
133 {% codeblock as yaml %}
148 outputSource: step3/out
167 To write workflows that are easy to modify and portable across CWL runners (in the event you need to share your workflow with others), there are several best practices to follow:
169 h3. Always provide @DockerRequirement@
171 Workflows should always provide @DockerRequirement@ in the @hints@ or @requirements@ section.
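
A minimal sketch of a tool that declares its container image in @hints@ follows. The image name @debian:stable-slim@ and the command are assumptions chosen for illustration.

{% codeblock as yaml %}
cwlVersion: v1.2
class: CommandLineTool
hints:
  DockerRequirement:
    # Example image only; use the image that contains your tool.
    dockerPull: debian:stable-slim
baseCommand: [cat, /etc/os-release]
inputs: []
stdout: os-release.txt
outputs:
  release:
    type: stdout
{% endcodeblock %}
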
173 h3. Build a reusable library of components
175 Build a reusable library of components. Share tool wrappers and subworkflows between projects. Make use of and contribute to "community maintained workflows and tools":https://github.com/common-workflow-library and tool registries such as "Dockstore":http://dockstore.org .
177 h3. Supply scripts as input parameters
179 CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value. Use @secondaryFiles@ for scripts that consist of multiple files. For example:
181 {% codeblock as yaml %}
183 class: CommandLineTool
188 inputBinding: {position: 1}
191 location: bclfastq.py
199 inputBinding: {position: 2}
207 h3. Getting the temporary and output directories
209 You can get the designated temporary directory using @$(runtime.tmpdir)@ in your CWL file, or from the @$TMPDIR@ environment variable in your script.
Similarly, you can get the designated output directory using @$(runtime.outdir)@ in your CWL file, or from the @$HOME@ environment variable in your script.
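
As an illustrative sketch, the following tool passes both directories to GNU @sort@: @-T@ selects the directory for scratch files and @-o@ names the result file. The input and output names are assumptions made for this example.

{% codeblock as yaml %}
cwlVersion: v1.2
class: CommandLineTool
baseCommand: sort
arguments:
  # Scratch files for large sorts go to the designated temporary directory.
  - prefix: -T
    valueFrom: $(runtime.tmpdir)
  # The sorted result is written into the designated output directory.
  - prefix: -o
    valueFrom: $(runtime.outdir)/sorted.txt
inputs:
  unsorted:
    type: File
    inputBinding: {position: 1}
outputs:
  sorted:
    type: File
    outputBinding:
      glob: sorted.txt
{% endcodeblock %}
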
213 h3. Specifying @ResourceRequirement@
Avoid specifying resources in the @requirements@ section of a @CommandLineTool@. Put them in the @hints@ section instead. This enables you to override the tool resource hint with a workflow step level requirement:
217 {% codeblock as yaml %}