arv:WorkflowRunnerResources:
ramMin: 2048
coresMin: 2
+ arv:ClusterTarget:
+ cluster_id: clsr1
+ project_uuid: clsr1-j7d0g-qxc4jcji7n4lafx
</pre>
The one exception to this is @arv:APIRequirement@, see note below.
|_. Field |_. Type |_. Description |
|ramMin|int|RAM, in mebibytes, to reserve for the arvados-cwl-runner process. Default 1 GiB.|
|coresMin|int|Number of cores to reserve to the arvados-cwl-runner process. Default 1 core.|
+
+h2(#clustertarget). arv:ClusterTarget
+
+Specify which Arvados cluster should execute a container or subworkflow, and the parent project for the container request.
+
+table(table table-bordered table-condensed).
+|_. Field |_. Type |_. Description |
+|cluster_id|string|The five-character alphanumeric cluster id (uuid prefix) where a container or subworkflow will execute. May be an expression.|
+|project_uuid|string|The uuid of the project which will own the container request and the output of the container. May be an expression.|
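+
+For example, a step that should run on another cluster in the federation might add the hint as sketched below. This is a hypothetical fragment: the cluster id @clsr2@ and the project uuid are placeholder values, not real identifiers.
+
+<pre>
+steps:
+  step-on-remote-cluster:
+    hints:
+      arv:ClusterTarget:
+        # Placeholder cluster id and project uuid; substitute your own.
+        cluster_id: clsr2
+        project_uuid: clsr2-j7d0g-zzzzzzzzzzzzzzz
+</pre>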
SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}
-To support running analysis on geographically dispersed data (avoiding expensive data transfers by sending the computation to the data) and "hybrid cloud" configurations where an on-premise cluster can expand its capabilities by delegating work to a cloud-base cluster, Arvados supports federated workflows. In a federated workflow, different steps of a workflow may execute on different clusters. Arvados manages data transfer and delegation of credentials, so this as easy as simply adding cluster target hints to your existing workflow.
+To support running analysis on geographically dispersed data (avoiding expensive data transfers by sending the computation to the data) and "hybrid cloud" configurations where an on-premise cluster can expand its capabilities by delegating work to a cloud-based cluster, Arvados supports federated workflows. In a federated workflow, different steps of a workflow may execute on different clusters. Arvados manages data transfer and delegation of credentials; all that is required is adding "arv:ClusterTarget":cwl-extensions.html#clustertarget hints to your existing workflow.
-h2. Federated scatter/gather example
+!(full-width)federated-workflow.svg!
+
+h2. Get the example files
+
+The tutorial files are located in the "documentation section of the Arvados source repository":https://github.com/curoverse/arvados/tree/master/doc/user/cwl/federated
+<notextile>
+<pre><code>~$ <span class="userinput">git clone https://github.com/curoverse/arvados</span>
+~$ <span class="userinput">cd arvados/doc/user/cwl/federated</span>
+</code></pre>
+</notextile>
+
+h2. Federated scatter/gather example
+
+In the following example, an analysis task is executed on three different clusters with different data, and then the results are combined to produce the final output.
{% codeblock as yaml %}
{% include 'federated_cwl' %}
{% endcodeblock %}
+Example input document:
+
{% codeblock as yaml %}
{% include 'shards_yml' %}
{% endcodeblock %}
+#
+# Demonstrate Arvados federation features. This performs a parallel
+# scatter over some arbitrary number of files and federated clusters,
+# then joins the results.
+#
cwlVersion: v1.0
class: Workflow
$namespaces:
+  # When using Arvados extensions to CWL, you must declare the 'arv' namespace
arv: "http://arvados.org/cwl#"
+
requirements:
InlineJavascriptRequirement: {}
- DockerRequirement:
- dockerPull: arvados/fed-test:scatter-gather
ScatterFeatureRequirement: {}
StepInputExpressionRequirement: {}
+
+ DockerRequirement:
+ # Replace this with your own Docker container
+ dockerPull: arvados/jobs
+
+ # Define a record type so we can conveniently associate the input
+ # file, the cluster on which the file lives, and the project on that
+ # cluster that will own the container requests and intermediate
+ # outputs.
SchemaDefRequirement:
types:
- name: FileOnCluster
  type: record
  fields:
    file: File
    cluster: string
    project: string
+
inputs:
+ # Expect an array of FileOnCluster records (defined above)
+ # as our input.
shards:
type:
type: array
items: FileOnCluster
+
outputs:
+ # Will produce an output file with the results of the distributed
+ # analysis jobs joined together.
joined:
type: File
outputSource: gather-results/joined
+
steps:
distributed-analysis:
in:
- shards: shards
- inp: {valueFrom: $(inputs.shards.file)}
- scatter: shards
+ # Take the "shards" array as input; we scatter over it below.
+ shard: shards
+
+ # Use an expression to extract the "file" field to assign to the
+ # "inp" parameter of the tool.
+ inp: {valueFrom: $(inputs.shard.file)}
+
+ # Scatter over shards; this means creating a parallel job for each
+ # element in the "shards" array. Expressions are evaluated for
+ # each element.
+ scatter: shard
+
+ # Specify the cluster target for this job. This means each
+ # separate scatter job will execute on the cluster that was
+ # specified in the "cluster" field.
+ #
+ # Arvados handles streaming data between clusters, for example,
+ # the Docker image containing the code for a particular tool will
+ # be fetched on demand, as long as it is available somewhere in
+ # the federation.
hints:
arv:ClusterTarget:
- cluster_id: $(inputs.shards.cluster)
- project_uuid: $(inputs.shards.project)
+ cluster_id: $(inputs.shard.cluster)
+ project_uuid: $(inputs.shard.project)
+
out: [out]
run: md5sum.cwl
+
+ # Collect the results of the distributed step and join them into a
+ # single output file. Arvados handles streaming inputs,
+ # intermediate results, and outputs between clusters on demand.
gather-results:
in:
inp: distributed-analysis/out