--- layout: default navsection: userguide navmenu: Tutorials title: "Constructing a Crunch pipeline" ... h1. Constructing a Crunch pipeline A pipeline in Arvados is a collection of crunch scripts, in which the output from one script may be used as the input to another script. *This tutorial assumes that you are "logged into an Arvados VM instance":{{site.baseurl}}/user/getting_started/ssh-access.html#login, and have a "working environment.":{{site.baseurl}}/user/getting_started/check-environment.html* h2. Create a new script Our second script will filter the output of @parallel_hash.py@ and only include hashes that start with 0. Create a new script in @crunch_scripts/@ called @0-filter.py@:

{% include '0_filter_py' %}

Now add it to git:

$ git add 0-filter.py
$ git commit -m"zero filter"
$ git push origin master

h2. Create a pipeline template Next, create a file that contains the pipeline definition:

$ cat >the_pipeline <<EOF
{
  "name":"my_first_pipeline",
  "components":{
    "do_hash":{
      "script":"parallel-hash.py",
      "script_parameters":{
        "input": "887cd41e9c613463eab2f0d885c6dd96+83"
      },
      "script_version":"you:master"
    },
    "filter":{
      "script":"0-filter.py",
      "script_parameters":{
        "input":{
          "output_of":"do_hash"
        }
      },
      "script_version":"you:master"
    }
  }
}
EOF

* @"name"@ is a human-readable name for the pipeline * @"components"@ is a set of scripts that make up the pipeline * Each component is listed with a human-readable name (@"do_hash"@ and @"filter"@ in this example) * Each item in @"components"@ is a single Arvados job, and uses the same format that we saw previously with @arv job create@ * @"output_of"@ indicates that the @"input"@ of @"filter"@ is the @"output"@ of the @"do_hash"@ component. This is a _dependency_. Arvados uses the dependencies between jobs to automatically determine the correct order to run the jobs. Now, use @arv pipeline_template create@ tell Arvados about your pipeline template:

$ arv pipeline_template create --pipeline-template "$(cat the_pipeline)"
qr1hi-p5p6p-xxxxxxxxxxxxxxx

Your new pipeline template will appear on the Workbench %(rarr)→% Compute %(rarr)→% Pipeline templates page. h3. Running a pipeline Run the pipeline using @arv pipeline run@, using the UUID that you received from @arv pipeline create@:

$ arv pipeline run --template qr1hi-p5p6p-xxxxxxxxxxxxxxx
2013-12-16 14:08:40 +0000 -- pipeline_instance qr1hi-d1hrv-vxzkp38nlde9yyr
do_hash qr1hi-8i9sb-hoyc2u964ecv1s6 queued 2013-12-16T14:08:40Z
filter  -                           -
2013-12-16 14:08:51 +0000 -- pipeline_instance qr1hi-d1hrv-vxzkp38nlde9yyr
do_hash qr1hi-8i9sb-hoyc2u964ecv1s6 e2ccd204bca37c77c0ba59fc470cd0f7+162
filter  qr1hi-8i9sb-w5k40fztqgg9i2x queued 2013-12-16T14:08:50Z
2013-12-16 14:09:01 +0000 -- pipeline_instance qr1hi-d1hrv-vxzkp38nlde9yyr
do_hash qr1hi-8i9sb-hoyc2u964ecv1s6 e2ccd204bca37c77c0ba59fc470cd0f7+162
filter  qr1hi-8i9sb-w5k40fztqgg9i2x 735ac35adf430126cf836547731f3af6+56

This instantiates your pipeline and displays a live feed of its status. The new pipeline instance will also show up on the Workbench %(rarr)→% Compute %(rarr)→% Pipeline instances page. Arvados adds each pipeline component to the job queue as its dependencies are satisfied (or immediately if it has no dependencies) and finishes when all components are completed or failed and there is no more work left to do. The Keep locators of the output of each of @"do_hash"@ and @"filter"@ component are available from the output log shown above. The output is also available on the Workbench by navigating to %(rarr)→% Compute %(rarr)→% Pipeline instances %(rarr)→% pipeline uuid under the *id* column %(rarr)→% components.

$ arv keep get e2ccd204bca37c77c0ba59fc470cd0f7+162/md5sum.txt
0f1d6bcf55c34bed7f92a805d2d89bbf alice.txt
504938460ef369cd275e4ef58994cffe bob.txt
8f3b36aff310e06f3c5b9e95678ff77a carol.txt
$ arv keep get 735ac35adf430126cf836547731f3af6+56
0f1d6bcf55c34bed7f92a805d2d89bbf alice.txt

Indeed, the filter has picked out just the "alice" file as having a hash that starts with 0. h3. Running a pipeline with different parameters Notice that the pipeline definition explicitly specifies the Keep locator for the input:

...
    "do_hash":{
      "script_parameters":{
        "input": "887cd41e9c613463eab2f0d885c6dd96+83"
      },
    }
...

What if we want to run the pipeline on a different input block? One option is to define a new pipeline template, but would potentially result in clutter with many pipeline templates defined for one-off jobs. Instead, you can override values in the input of the component like this:

$ arv pipeline run --template qr1hi-d1hrv-vxzkp38nlde9yyr do_hash::input=33a9f3842b01ea3fdf27cc582f5ea2af+242
2013-12-17 20:31:24 +0000 -- pipeline_instance qr1hi-d1hrv-tlkq20687akys8e
do_hash qr1hi-8i9sb-rffhuay4jryl2n2 queued 2013-12-17T20:31:24Z
filter  -                           -
2013-12-17 20:31:34 +0000 -- pipeline_instance qr1hi-d1hrv-tlkq20687akys8e
do_hash qr1hi-8i9sb-rffhuay4jryl2n2 {:done=>1, :running=>1, :failed=>0, :todo=>0}
filter  -                           -
2013-12-17 20:31:44 +0000 -- pipeline_instance qr1hi-d1hrv-tlkq20687akys8e
do_hash qr1hi-8i9sb-rffhuay4jryl2n2 {:done=>1, :running=>1, :failed=>0, :todo=>0}
filter  -                           -
2013-12-17 20:31:55 +0000 -- pipeline_instance qr1hi-d1hrv-tlkq20687akys8e
do_hash qr1hi-8i9sb-rffhuay4jryl2n2 880b55fb4470b148a447ff38cacdd952+54
filter  qr1hi-8i9sb-j347g1sqovdh0op queued 2013-12-17T20:31:55Z
2013-12-17 20:32:05 +0000 -- pipeline_instance qr1hi-d1hrv-tlkq20687akys8e
do_hash qr1hi-8i9sb-rffhuay4jryl2n2 880b55fb4470b148a447ff38cacdd952+54
filter  qr1hi-8i9sb-j347g1sqovdh0op fb728f0ffe152058fa64b9aeed344cb5+54

Now check the output:

$ arv keep ls -s fb728f0ffe152058fa64b9aeed344cb5+54
0 0-filter.txt

Here the filter script output is empty, so none of the files in the collection have hash code that start with 0.