From: Peter Amstutz Date: Wed, 21 Nov 2018 21:23:59 +0000 (-0500) Subject: 14440: Comments on how federated.cwl works X-Git-Tag: 1.3.0~21^2~4 X-Git-Url: https://git.arvados.org/arvados.git/commitdiff_plain/8ec9c69eab7b914c0ea6d71dc8648f9a1f1c8255 14440: Comments on how federated.cwl works Arvados-DCO-1.1-Signed-off-by: Peter Amstutz --- diff --git a/.licenseignore b/.licenseignore index 113bf4fa4e..ba4c2dc38b 100644 --- a/.licenseignore +++ b/.licenseignore @@ -13,6 +13,7 @@ build/package-test-dockerfiles/ubuntu1604/etc-apt-preferences.d-arvados *by-sa-3.0.txt *COPYING doc/fonts/* +doc/user/cwl/federated/* */docker_image docker/jobs/apt.arvados.org.list */en.bootstrap.yml diff --git a/doc/user/cwl/cwl-extensions.html.textile.liquid b/doc/user/cwl/cwl-extensions.html.textile.liquid index f9ecf7a534..c75b43a641 100644 --- a/doc/user/cwl/cwl-extensions.html.textile.liquid +++ b/doc/user/cwl/cwl-extensions.html.textile.liquid @@ -43,6 +43,9 @@ hints: arv:WorkflowRunnerResources: ramMin: 2048 coresMin: 2 + arv:ClusterTarget: + cluster_id: clsr1 + project_uuid: clsr1-j7d0g-qxc4jcji7n4lafx The one exception to this is @arv:APIRequirement@, see note below. @@ -134,3 +137,12 @@ table(table table-bordered table-condensed). |_. Field |_. Type |_. Description | |ramMin|int|RAM, in mebibytes, to reserve for the arvados-cwl-runner process. Default 1 GiB| |coresMin|int|Number of cores to reserve to the arvados-cwl-runner process. Default 1 core.| + +h2(#clustertarget). arv:ClusterTarget + +Specify which Arvados cluster should execute a container or subworkflow, and the parent project for the container request. + +table(table table-bordered table-condensed). +|_. Field |_. Type |_. Description | +|cluster_id|string|The five-character alphanumeric cluster id (uuid prefix) where a container or subworkflow will execute. May be an expression.| +|project_uuid|string|The uuid of the project which will own container request and output of the container. May be an expression.| diff --git a/doc/user/cwl/federated-workflows.html.textile.liquid b/doc/user/cwl/federated-workflows.html.textile.liquid index 5f96702d74..77155fed51 100644 --- a/doc/user/cwl/federated-workflows.html.textile.liquid +++ b/doc/user/cwl/federated-workflows.html.textile.liquid @@ -9,16 +9,30 @@ Copyright (C) The Arvados Authors. All rights reserved. SPDX-License-Identifier: CC-BY-SA-3.0 {% endcomment %} -To support running analysis on geographically dispersed data (avoiding expensive data transfers by sending the computation to the data) and "hybrid cloud" configurations where an on-premise cluster can expand its capabilities by delegating work to a cloud-base cluster, Arvados supports federated workflows. In a federated workflow, different steps of a workflow may execute on different clusters. Arvados manages data transfer and delegation of credentials, so this as easy as simply adding cluster target hints to your existing workflow. +To support running analysis on geographically dispersed data (avoiding expensive data transfers by sending the computation to the data) and "hybrid cloud" configurations where an on-premise cluster can expand its capabilities by delegating work to a cloud-base cluster, Arvados supports federated workflows. In a federated workflow, different steps of a workflow may execute on different clusters. Arvados manages data transfer and delegation of credentials, all that is required is adding "arv:ClusterTarget":cwl-extensions.html#clustertarget hints to your existing workflow. -h2. Federated scatter/gather example +!(full-width)federated-workflow.svg! + +h2. Get the example files + +The tutorial files are located in the "documentation section of the Arvados source repository:":https://github.com/curoverse/arvados/tree/master/doc/user/cwl/federated + +
~$ git clone https://github.com/curoverse/arvados
+~$ cd arvados/doc/user/cwl/federated
+
+
+h2. Federated scatter/gather example + +In this following example, an analysis task is executed on three different clusters with different data, then the results are combined to produce the final output. {% codeblock as yaml %} {% include 'federated_cwl' %} {% endcodeblock %} +Example input document: + {% codeblock as yaml %} {% include 'shards_yml' %} {% endcodeblock %} diff --git a/doc/user/cwl/federated/federated.cwl b/doc/user/cwl/federated/federated.cwl index 250f6d1c51..5314a7675b 100644 --- a/doc/user/cwl/federated/federated.cwl +++ b/doc/user/cwl/federated/federated.cwl @@ -1,13 +1,27 @@ +# +# Demonstrate Arvados federation features. This performs a parallel +# scatter over some arbitrary number of files and federated clusters, +# then joins the results. +# cwlVersion: v1.0 class: Workflow $namespaces: + # When using Arvados extensions to CWL, must declare the 'arv' namespace arv: "http://arvados.org/cwl#" + requirements: InlineJavascriptRequirement: {} - DockerRequirement: - dockerPull: arvados/fed-test:scatter-gather ScatterFeatureRequirement: {} StepInputExpressionRequirement: {} + + DockerRequirement: + # Replace this with your own Docker container + dockerPull: arvados/jobs + + # Define a record type so we can conveniently associate the input + # file, the cluster on which the file lives, and the project on that + # cluster that will own the container requests and intermediate + # outputs. SchemaDefRequirement: types: - name: FileOnCluster @@ -16,27 +30,56 @@ requirements: file: File cluster: string project: string + inputs: + # Expect an array of FileOnCluster records (defined above) + # as our input. shards: type: type: array items: FileOnCluster + outputs: + # Will produce an output file with the results of the distributed + # analysis jobs joined together. joined: type: File outputSource: gather-results/joined + steps: distributed-analysis: in: - shards: shards - inp: {valueFrom: $(inputs.shards.file)} - scatter: shards + # Take "shards" array as input, we scatter over it below. + shard: shards + + # Use an expression to extract the "file" field to assign to the + # "inp" parameter of the tool. + inp: {valueFrom: $(inputs.shard.file)} + + # Scatter over shards, this means creating a parallel job for each + # element in the "shards" array. Expressions are evaluated for + # each element. + scatter: shard + + # Specify the cluster target for this job. This means each + # separate scatter job will execute on the cluster that was + # specified in the "cluster" field. + # + # Arvados handles streaming data between clusters, for example, + # the Docker image containing the code for a particular tool will + # be fetched on demand, as long as it is available somewhere in + # the federation. hints: arv:ClusterTarget: - cluster_id: $(inputs.shards.cluster) - project_uuid: $(inputs.shards.project) + cluster_id: $(inputs.shard.cluster) + project_uuid: $(inputs.shard.project) + out: [out] run: md5sum.cwl + + # Collect the results of the distributed step and join them into a + # single output file. Arvados handles streaming inputs, + # intermediate results, and outputs between clusters on demand. gather-results: in: inp: distributed-analysis/out