---
layout: default
navsection: userguide
navmenu: Tutorials
title: "Parallel Crunch tasks"
navorder: 15
---

h1. Tutorial: Parallel Crunch tasks

In the tutorial "writing a crunch script":tutorial-firstscript.html, our script used a "for" loop to compute the md5 hash of each file in sequence. This approach, while simple, cannot take advantage of a compute cluster's multiple nodes and cores to speed up computation by running tasks in parallel. This tutorial demonstrates how to create parallel Crunch tasks.

Start by entering the @crunch_scripts@ directory of your git repository:
$ cd you/crunch_scripts
Next, using your favorite text editor, create a new file called @parallel-hash.py@ in the @crunch_scripts@ directory. Add the following code to compute the md5 hash of each file in a collection:
{% include parallel_hash_script.py %}
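In outline, @parallel-hash.py@ runs in two phases: the first task (sequence 0) queues one new task per input file, and each queued task computes the md5 hash of its single file. A condensed sketch of that two-phase structure, using the Python SDK's @job_tasks@ API (the included listing above is the complete version):

#!/usr/bin/env python

import hashlib
import os
import arvados

this_task = arvados.current_task()

if this_task['sequence'] == 0:
    # Phase one: the first task does no hashing itself.  It reads the
    # input collection and asks the API server to queue one sequence-1
    # task per input file, then finishes with no output of its own.
    job_input = arvados.current_job()['script_parameters']['input']
    cr = arvados.CollectionReader(job_input)
    for stream in cr.all_streams():
        for f in stream.all_files():
            arvados.api().job_tasks().create(body={
                'job_uuid': arvados.current_job()['uuid'],
                'created_by_job_task_uuid': this_task['uuid'],
                'sequence': 1,
                # A one-file manifest naming just this input file
                'parameters': {'input': f.as_manifest()},
            }).execute()
    this_task.set_output(None)
else:
    # Phase two: each sequence-1 task hashes the single file named in
    # its own task parameters, as in the serial version of the script.
    collection = arvados.CollectionReader(this_task['parameters']['input'])
    input_file = list(collection.all_files())[0]
    output_path = os.path.normpath(os.path.join(input_file.stream_name(),
                                                input_file.name()))

    digestor = hashlib.new('md5')
    while True:
        buf = input_file.read(2**20)
        if len(buf) == 0:
            break
        digestor.update(buf)

    out = arvados.CollectionWriter()
    out.set_current_file_name("md5sum.txt")
    out.write("%s %s\n" % (digestor.hexdigest(), output_path))
    this_task.set_output(out.finish())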
Make the file executable:
$ chmod +x parallel-hash.py
Next, add the file to @git@ staging, commit and push:
$ git add parallel-hash.py
$ git commit -m"parallel hash"
$ git push origin master
You should now be able to run your new script using Crunch, with the @script@ field referring to our new @parallel-hash.py@ script. This example uses a different input from the previous ones: @887cd41e9c613463eab2f0d885c6dd96+83@, a collection consisting of three files, "alice.txt", "bob.txt" and "carol.txt" (the example collection used previously in "fetching data from Arvados using Keep":tutorial-keep.html).
$ cat >the_job <<EOF
{
 "script": "parallel-hash.py",
 "script_version": "you:master",
 "script_parameters":
 {
  "input": "887cd41e9c613463eab2f0d885c6dd96+83"
 }
}
EOF
$ arv -h job create --job "$(cat the_job)"
{
 ...
 "uuid":"qr1hi-xxxxx-xxxxxxxxxxxxxxx"
 ...
}
$ arv -h job get --uuid qr1hi-xxxxx-xxxxxxxxxxxxxxx
{
 ...
 "output":"e2ccd204bca37c77c0ba59fc470cd0f7+162+K@qr1hi",
 ...
}
Because the job ran in parallel, each task created a separate @md5sum.txt@ as its output. Arvados automatically collates these files into a single collection, which is the output of the job:
$ arv keep get e2ccd204bca37c77c0ba59fc470cd0f7+162+K@qr1hi
md5sum.txt
md5sum.txt
md5sum.txt
$ arv keep get e2ccd204bca37c77c0ba59fc470cd0f7+162+K@qr1hi/md5sum.txt
0f1d6bcf55c34bed7f92a805d2d89bbf alice.txt
504938460ef369cd275e4ef58994cffe bob.txt
8f3b36aff310e06f3c5b9e95678ff77a carol.txt
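The collated output can also be read with the Python SDK instead of the @arv@ CLI. A minimal sketch, assuming the output locator reported by @arv job get@ above:

import sys
import arvados

# The job's output collection, as reported by "arv job get" above
cr = arvados.CollectionReader('e2ccd204bca37c77c0ba59fc470cd0f7+162+K@qr1hi')

for f in cr.all_files():
    # Reading the three md5sum.txt entries in manifest order
    # reproduces the concatenated listing shown above.
    while True:
        buf = f.read(2**20)
        if len(buf) == 0:
            break
        sys.stdout.write(buf)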
h2. The one task per file pattern

This example demonstrates how to schedule a new task per file. Because this is a common pattern, the Crunch Python API contains a convenience function to "queue a task for each input file":{{site.basedoc}}/user/reference/crunch-utility-libraries.html#one_task_per_input which reduces the amount of boilerplate code required to handle parallel tasks.
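With that helper, the queueing phase of @parallel-hash.py@ collapses into a single call, and only the per-file hashing remains. A minimal sketch along those lines, assuming @arvados.job_setup.one_task_per_input_file@ with @if_sequence@ and @and_end_task@ keyword arguments as described in the reference above:

#!/usr/bin/env python

import hashlib
import os
import arvados

# If this is the sequence-0 task, queue one new task per input file and
# end; otherwise fall through and act as a per-file worker task.
arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True)

this_task = arvados.current_task()
collection = arvados.CollectionReader(this_task['parameters']['input'])
input_file = list(collection.all_files())[0]
output_path = os.path.normpath(os.path.join(input_file.stream_name(),
                                            input_file.name()))

digestor = hashlib.new('md5')
while True:
    buf = input_file.read(2**20)
    if len(buf) == 0:
        break
    digestor.update(buf)

out = arvados.CollectionWriter()
out.set_current_file_name("md5sum.txt")
out.write("%s %s\n" % (digestor.hexdigest(), output_path))
arvados.current_task().set_output(out.finish())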