---
layout: default
navsection: userguide
title: Guidelines for Writing High-Performance Portable Workflows
...

{% comment %}
Copyright (C) The Arvados Authors. All rights reserved.

SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}

h2(#performance). Performance

To get the best performance from your workflows, be aware of the following Arvados features, behaviors, and best practices.

h3. Does your application support NVIDIA GPU acceleration?

Use "cwltool:CUDARequirement":cwl-extensions.html#CUDARequirement to request nodes with GPUs.

h3. Trying to reduce costs?

Try "using preemptible (spot) instances":cwl-run-options.html#preemptible .

h3. You have a sequence of short-running steps

If you have a sequence of short-running steps (less than 1-2 minutes each), use the Arvados extension "arv:RunInSingleContainer":cwl-extensions.html#RunInSingleContainer to avoid scheduling and data transfer overhead by running all the steps together in the same container on the same node.  To use this feature, @cwltool@ must be installed in the container image.  Example:

{% codeblock as yaml %}
class: Workflow
cwlVersion: v1.0
$namespaces:
  arv: "http://arvados.org/cwl#"
inputs:
  file: File
outputs: []
requirements:
  SubworkflowFeatureRequirement: {}
steps:
  subworkflow-with-short-steps:
    in:
      file: file
    out: [out]
    # This hint indicates that the subworkflow should be bundled and
    # run in a single container, instead of the normal behavior, which
    # is to run each step in a separate container.  This greatly
    # reduces overhead if you have a series of short jobs, without
    # requiring any changes to the CWL definition of the subworkflow.
    hints:
      - class: arv:RunInSingleContainer
    run: subworkflow-with-short-steps.cwl
{% endcodeblock %}

h3. Avoid declaring @InlineJavascriptRequirement@ or @ShellCommandRequirement@

Avoid declaring @InlineJavascriptRequirement@ or @ShellCommandRequirement@ unless you specifically need them.  Don't include them "just in case" because they change the default behavior and may add extra overhead.

h3. Prefer text substitution to Javascript

When combining a parameter value with a string, such as adding a filename extension, write @$(inputs.file.basename).ext@ instead of @$(inputs.file.basename + 'ext')@.  The first form is evaluated as a simple text substitution; the second form (using the @+@ operator) is evaluated as an arbitrary Javascript expression and requires that you declare @InlineJavascriptRequirement@.

h3. Use @ExpressionTool@ to efficiently rearrange input files

Use @ExpressionTool@ to efficiently rearrange input files between steps of a Workflow.  For example, the following expression accepts a directory containing files paired by @_R1_@ and @_R2_@ and produces an array of Directories containing each pair.

{% codeblock as yaml %}
class: ExpressionTool
cwlVersion: v1.0
inputs:
  inputdir: Directory
outputs:
  out: Directory[]
requirements:
  InlineJavascriptRequirement: {}
expression: |
  ${
    var samples = {};
    for (var i = 0; i < inputs.inputdir.listing.length; i++) {
      var file = inputs.inputdir.listing[i];
      var groups = file.basename.match(/^(.+)(_R[12]_)(.+)$/);
      if (groups) {
        if (!samples[groups[1]]) {
          samples[groups[1]] = [];
        }
        samples[groups[1]].push(file);
      }
    }
    var dirs = [];
    for (var key in samples) {
      dirs.push({"class": "Directory",
                 "basename": key,
                 "listing": samples[key]});
    }
    return {"out": dirs};
  }
{% endcodeblock %}
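
The directories produced by this expression can be fed straight into a scatter, so each pair is processed independently.  A minimal sketch of this usage is below; the file names @pair-samples.cwl@ (the ExpressionTool above) and @process-pair.cwl@ (assumed to take a single Directory input named @inp@ and produce a File output named @out@) are hypothetical:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inputdir: Directory
outputs:
  results:
    type: File[]
    outputSource: process/out
steps:
  pair:
    in: {inputdir: inputdir}
    out: [out]
    # the ExpressionTool shown above, saved to its own file
    run: pair-samples.cwl
  process:
    in: {inp: pair/out}
    # one invocation of the tool per paired-sample Directory
    scatter: inp
    out: [out]
    run: process-pair.cwl
{% endcodeblock %}
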
h3. Limit RAM requests to what you really need

Available compute node types vary over time and across different cloud providers, so it is important to limit the RAM requirement to what the program actually needs.  However, if you need to target a specific compute node type, see this discussion on "calculating RAM request and choosing instance type for containers.":{{site.baseurl}}/api/execution.html#RAM

h3. Avoid scattering step by step

Instead of a scatter step that feeds into another scatter step, prefer to scatter over a subworkflow.  With the following pattern, @step1@ has to wait for all samples to complete before @step2@ can start computing on any samples.  This means a single long-running sample can prevent the rest of the workflow from moving on:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step3/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    scatter: inp
    out: [out]
    run: tool2.cwl
  step3:
    in: {inp: step2/out}
    scatter: inp
    out: [out]
    run: tool3.cwl
{% endcodeblock %}

Instead, scatter over a subworkflow.  In this pattern, a sample can proceed to @step2@ as soon as @step1@ is done, independently of any other samples.  Example (note that the subworkflow can also be put in a separate file):

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  inp: File[]
outputs:
  out:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    in: {inp: inp}
    scatter: inp
    out: [out]
    run:
      class: Workflow
      inputs:
        inp: File
      outputs:
        out:
          type: File
          outputSource: step3/out
      steps:
        step1:
          in: {inp: inp}
          out: [out]
          run: tool1.cwl
        step2:
          in: {inp: step1/out}
          out: [out]
          run: tool2.cwl
        step3:
          in: {inp: step2/out}
          out: [out]
          run: tool3.cwl
{% endcodeblock %}

h2. Portability

To write workflows that are easy to modify and portable across CWL runners (in the event you need to share your workflow with others), there are several best practices to follow:

h3. Always provide @DockerRequirement@

Workflows should always provide @DockerRequirement@ in the @hints@ or @requirements@ section.

h3. Build a reusable library of components

Share tool wrappers and subworkflows between projects.  Make use of and contribute to "community maintained workflows and tools":https://github.com/common-workflow-library and tool registries such as "Dockstore":http://dockstore.org .

h3. Supply scripts as input parameters

CommandLineTools wrapping custom scripts should represent the script as an input parameter with the script file as a default value.  Use @secondaryFiles@ for scripts that consist of multiple files.  For example:

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
baseCommand: python
inputs:
  script:
    type: File
    inputBinding: {position: 1}
    default:
      class: File
      location: bclfastq.py
      secondaryFiles:
        - class: File
          location: helper1.py
        - class: File
          location: helper2.py
  inputfile:
    type: File
    inputBinding: {position: 2}
outputs:
  out:
    type: File
    outputBinding:
      glob: "*.fastq"
{% endcodeblock %}

h3. Getting the temporary and output directories

You can get the designated temporary directory using @$(runtime.tmpdir)@ in your CWL file, or from the @$TMPDIR@ environment variable in your script.  Similarly, you can get the designated output directory using @$(runtime.outdir)@, or from the @$HOME@ environment variable in your script.
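
For example, a tool that writes scratch data and final results to the runner-designated locations might look like the following minimal sketch; the @my-tool@ command and its @--scratch@ and @--output-dir@ flags are hypothetical stand-ins for your own program:

{% codeblock as yaml %}
cwlVersion: v1.0
class: CommandLineTool
# "my-tool" is a hypothetical command; substitute your own script or program
baseCommand: my-tool
arguments:
  # pass the designated scratch space to the tool
  - prefix: "--scratch"
    valueFrom: $(runtime.tmpdir)
  # have the tool write its results into the designated output directory
  - prefix: "--output-dir"
    valueFrom: $(runtime.outdir)
inputs:
  inputfile:
    type: File
    inputBinding: {position: 1}
outputs:
  out:
    type: File[]
    outputBinding:
      glob: "*.result"
{% endcodeblock %}

Note that @$(runtime.tmpdir)@ and @$(runtime.outdir)@ are plain parameter references, so a tool like this does not need to declare @InlineJavascriptRequirement@.
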
h3. Specifying @ResourceRequirement@

Avoid specifying resources in the @requirements@ section of a @CommandLineTool@; put them in the @hints@ section instead.  This enables you to override the tool resource hint with a workflow step level requirement:

{% codeblock as yaml %}
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
outputs:
  out:
    type: File
    outputSource: step2/out
steps:
  step1:
    in: {inp: inp}
    out: [out]
    run: tool1.cwl
  step2:
    in: {inp: step1/out}
    out: [out]
    run: tool2.cwl
    requirements:
      ResourceRequirement:
        ramMin: 2000
        coresMin: 2
        tmpdirMin: 90000
{% endcodeblock %}

h3. Importing data into Keep

You can use HTTP URLs as File input parameters and @arvados-cwl-runner@ will download them to Keep for you:

{% codeblock as yaml %}
fastq1:
  class: File
  location: https://example.com/genomes/sampleA_1.fastq
fastq2:
  class: File
  location: https://example.com/genomes/sampleA_2.fastq
{% endcodeblock %}

Files are downloaded and stored in Keep collections, with HTTP header information stored in metadata.  If a file was previously downloaded, @arvados-cwl-runner@ uses HTTP caching rules to decide whether it should be re-downloaded.

The default behavior is to transfer the files on the client, prior to submitting the workflow run.  This guarantees the data is available when the workflow is submitted.  However, if data transfer is time consuming and you are submitting multiple workflow runs in a row, or the node submitting the workflow has limited bandwidth, you can use the @--defer-download@ option to have the data transfer performed by the workflow runner process on a compute node, after the workflow is submitted.

@arvados-cwl-runner@ provides two additional options to control caching behavior.

* @--varying-url-params@ will ignore the listed URL query parameters in any HTTP URL when checking if that URL has already been downloaded to Keep.
* @--prefer-cached-downloads@ will search Keep for the previously downloaded URL and use that if found, without checking the upstream resource.  This means changes in the upstream resource won't be detected, but it also means the workflow will not fail if the upstream resource becomes inaccessible.

One use of this is to import files from "AWS S3 signed URLs":https://docs.aws.amazon.com/AmazonS3/latest/userguide/ShareObjectPreSignedURL.html

Here is an example usage.  The use of @--varying-url-params=AWSAccessKeyId,Signature,Expires@ is especially relevant: it removes these parameters from the cached URL, which means that if a new signed URL for the same object is generated later, the previously downloaded copy can still be found in the cache.

{% codeblock as sh %}
arvados-cwl-runner --defer-download \
                   --varying-url-params=AWSAccessKeyId,Signature,Expires \
                   --prefer-cached-downloads \
                   workflow.cwl params.yml
{% endcodeblock %}
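
For reference, the @params.yml@ in this example might contain signed URLs along the following lines; the bucket, object names, and query parameter values are illustrative:

{% codeblock as yaml %}
fastq1:
  class: File
  location: https://examplebucket.s3.amazonaws.com/sampleA_1.fastq?AWSAccessKeyId=AKIAEXAMPLE&Signature=abc123&Expires=1700000000
fastq2:
  class: File
  location: https://examplebucket.s3.amazonaws.com/sampleA_2.fastq?AWSAccessKeyId=AKIAEXAMPLE&Signature=def456&Expires=1700000000
{% endcodeblock %}

Because @--varying-url-params=AWSAccessKeyId,Signature,Expires@ strips those query parameters from the cache key, a later run with freshly signed URLs for the same objects will still match the previously downloaded copies in Keep.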