In the previous tutorials, we used @arvados.job_setup.one_task_per_input_file()@ to automatically parallelize our jobs by creating a separate task per file. For some types of jobs, you may need to split the work up differently, for example by creating tasks to process different segments of a single large file. This tutorial demonstrates how to create Crunch tasks directly.
-Start by entering the @crunch_scripts@ directory of your git repository:
+Start by entering the @crunch_scripts@ directory of your Git repository:
<notextile>
<pre><code>~$ <span class="userinput">cd <b>you</b>/crunch_scripts</span>
notextile. <pre>~/<b>you</b>/crunch_scripts$ <code class="userinput">nano parallel-hash.py</code></pre>
-Add the following code to compute the md5 hash of each file in a collection:
+Add the following code to compute the MD5 hash of each file in a collection:
<notextile> {% code 'parallel_hash_script_py' as python %} </notextile>
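As a rough illustration of the per-file work each task performs, the sketch below computes an MD5 digest with Python's standard @hashlib@ module. It is not the @parallel-hash.py@ script itself and uses no Arvados APIs: the function names @md5_of_file@ and @md5sum_lines@ are hypothetical, and the code assumes files on the local filesystem rather than files in a collection.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in fixed-size chunks so that large files
    do not have to fit in memory. (Hypothetical helper, not part
    of the Arvados SDK.)
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5sum_lines(paths):
    """Return one "digest  path" line per file, in md5sum style.

    (Hypothetical helper: the tutorial's tasks each write their own
    md5sum.txt, which this sketch only approximates.)
    """
    return ["%s  %s" % (md5_of_file(p), p) for p in paths]
```

Reading in chunks keeps memory use constant regardless of file size, which matters when each task may be handed a large input file.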
notextile. <pre><code>~/<b>you</b>/crunch_scripts$ <span class="userinput">chmod +x parallel-hash.py</span></code></pre>
-Next, add the file to @git@ staging, commit and push:
+Add the file to the Git staging area, commit, and push:
<notextile>
<pre><code>~/<b>you</b>/crunch_scripts$ <span class="userinput">git add parallel-hash.py</span>
</code></pre>
</notextile>
-(Your shell should automatically fill in @$USER@ with your login name. The job JSON that gets saved should have @"repository"@ pointed at your personal git repository.)
+(Your shell should automatically fill in @$USER@ with your login name. The job JSON that gets saved should have @"repository"@ pointed at your personal Git repository.)
Because the job ran in parallel, each instance of parallel-hash creates a separate @md5sum.txt@ as output. Arvados automatically collates these files into a single collection, which is the output of the job: