title: "Tutorial: Construct a new pipeline"

h1. Tutorial: Construct a new pipeline

Here you will write two new crunch scripts, incorporate them into a new pipeline template, run the new pipeline a couple of times using different parameters, and compare the results. One of the new scripts will use the Arvados API to look up trait→human→data relations and use this information to compile a collection of data to analyze.
<!-- _Like the previous tutorial, this needs more of a basis in some actual
clinical/research question to motivate it_ -->
* Log in to a VM "using SSH":ssh-access.html
* Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
* Put the API host name in your @ARVADOS_API_HOST@ environment variable

If everything is set up correctly, the command @arv -h user current@ will display your account information.
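
If @arv@ reports an error, a quick first check is whether both variables are actually set in the current shell. The following is a minimal sketch in plain Python; the helper name @missing_arvados_settings@ is hypothetical and is not part of the Arvados SDK.

```python
import os

# The two variables named in the list above.
REQUIRED = ("ARVADOS_API_TOKEN", "ARVADOS_API_HOST")

def missing_arvados_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_arvados_settings()
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("Both variables are set.")
```

This only confirms that the variables exist; it does not check that the token is valid for the given host.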
h3. Git repository access

Pushing code to your git repository involves using your private key. There are a few ways to arrange this:

*Option 1:* Use an SSH agent, and log in to your VM with agent forwarding enabled. With Linux, BSD, MacOS, etc., this looks something like:

eval `ssh-agent` # (only if "ssh-add -l" said it could not open a connection)
ssh-add # (this adds your private key to the agent)
ssh-add -l # (run this in your VM account to confirm forwarding works)
<!-- _This discussion about ssh should probably go under the "ssh" [...]_ -->
With PuTTY under Windows, run "pageant", add your key to the agent, and turn on agent forwarding in your PuTTY settings.

*Option 2:* Edit code on your workstation and push code to your Arvados repository from there instead of your VM account. Depending on your @.ssh/config@ file, you will use names like @my_vm_name.arvados@ instead of @my_vm_name.{{ site.arvados_api_host }}@ in git and ssh commands.

git clone git@git.arvados:my_repo_name.git

*Option 3:* Edit code in your VM, and use git on your workstation as an intermediary.

git clone git@my_vm_name.arvados:my_repo_name.git
git remote add arvados git@git.arvados:my_repo_name.git
[...make edits and commits in your repository on the VM...]
git pull #(update local copy)
git push arvados master:master #(update arvados hosted copy)
Whichever setup you choose, if everything is working correctly, this command should give you a list of repositories you can access:

ssh -T git@git.{{ site.arvados_api_host }}

hello your_user_name, the gitolite version here is v2.0.2-17-g66f2065
the gitolite config gives you the following access:

<!-- _You need to have a git repository set up already, which is not
necessarily the case for new users, so this should link to the git
section about setting up a new repo_ -->
h3. Set some variables

Adjust these to match your login account name and the URL of your Arvados repository. The Access→VMs and Access→Repositories pages on Workbench will show the specifics.

repo_url=git@git.{{ site.arvados_api_host }}:my_repo_name.git
repo_name=my_repo_name
h3. Set up a new branch in your Arvados git repository

We will create a new empty branch called "pipeline-tutorial" and add our new crunch scripts there.

mkdir pipeline-tutorial
git checkout -b pipeline-tutorial
git remote add origin $repo_url
<!-- _Should explain each step_ -->

<!-- _Creating an empty branch in an empty repository makes git do weird
things, need to fix using
git branch --set-upstream pipeline_tutorial origin/pipeline_tutorial
but I don't know what this means._ -->
h3. Write the create-collection-by-trait script

mkdir -p crunch_scripts
touch crunch_scripts/create-collection-by-trait
chmod +x crunch_scripts/create-collection-by-trait
nano crunch_scripts/create-collection-by-trait

<!-- _the -p to mkdir isn't necessary here_ -->
#!/usr/bin/env python

import arvados
import re
import json

trait_name = arvados.current_job()['script_parameters']['trait_name']

# get UUIDs of all matching traits (case-insensitive regular expression match)
trait_uuids = [t['uuid']
               for t in arvados.service.traits().list(limit=1000).execute()['items']
               if re.search(trait_name, t['name'], re.IGNORECASE)]

# list humans linked to these conditions
trait_links = arvados.service.links().list(limit=10000, where=json.dumps({
    'link_class': 'human_trait',
    'tail_kind': 'arvados#human',
    'head_uuid': trait_uuids
})).execute()['items']
human_uuids = [l['tail_uuid'] for l in trait_links]

# find collections linked to these humans
provenance_links = arvados.service.links().list(where=json.dumps({
    "link_class": "provenance",
    "tail_uuid": human_uuids
})).execute()['items']
collection_uuids = [l['head_uuid'] for l in provenance_links]

# pick out all of the "var" files, and build a new collection
out_manifest = ''
for locator in collection_uuids:
    for f in arvados.CollectionReader(locator).all_files():
        if re.search(r'var-.*\.tsv\.bz2', f.name()):
            out_manifest += f.as_manifest()

# output the new collection
arvados.current_task().set_output(arvados.Keep.put(out_manifest))
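
The script follows two kinds of links: from matching traits to humans, and from those humans to collections. As an illustration only, the same traversal can be tried on toy data without an Arvados connection. Every UUID, record and file name below is made up, and @collections_for_trait@ and @is_var_file@ are hypothetical helpers, not SDK functions.

```python
import re

# Toy records standing in for API responses; all values are made up.
traits = [
    {"uuid": "trait-1", "name": "Non-melanoma skin cancer"},
    {"uuid": "trait-2", "name": "Asthma"},
]
links = [
    {"link_class": "human_trait", "tail_uuid": "human-1", "head_uuid": "trait-1"},
    {"link_class": "human_trait", "tail_uuid": "human-2", "head_uuid": "trait-2"},
    {"link_class": "provenance", "tail_uuid": "human-1", "head_uuid": "collection-1"},
    {"link_class": "provenance", "tail_uuid": "human-2", "head_uuid": "collection-2"},
]

def collections_for_trait(pattern, traits, links):
    """Follow trait -> human -> collection links for traits matching pattern."""
    trait_uuids = {t["uuid"] for t in traits
                   if re.search(pattern, t["name"], re.IGNORECASE)}
    human_uuids = {l["tail_uuid"] for l in links
                   if l["link_class"] == "human_trait"
                   and l["head_uuid"] in trait_uuids}
    return sorted(l["head_uuid"] for l in links
                  if l["link_class"] == "provenance"
                  and l["tail_uuid"] in human_uuids)

def is_var_file(name):
    """Apply the same kind of file name filter as the script."""
    return re.search(r"var-.*\.tsv\.bz2", name) is not None

print(collections_for_trait("cancer", traits, links))
print(is_var_file("var-example.tsv.bz2"))
```

Because the trait name is used as a regular expression, a short pattern such as "cancer" selects every trait whose name contains that word, regardless of case.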
h3. Write the find-dbsnp-id script

touch crunch_scripts/find-dbsnp-id
chmod +x crunch_scripts/find-dbsnp-id
nano crunch_scripts/find-dbsnp-id
#!/usr/bin/env python

import arvados
import re

# queue one task per input file, then end the initial (sequence 0) task
arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True)

this_job = arvados.current_job()
this_task = arvados.current_task()
this_task_input = this_task['parameters']['input']
# assumption: the expression ends with a word boundary (its last part is missing here)
dbsnp_search_pattern = re.compile("\\bdbsnp\\.\\d+:" +
                                  this_job['script_parameters']['dbsnp_id'] +
                                  "\\b")

input_file = list(arvados.CollectionReader(this_task_input).all_files())[0]
out = arvados.CollectionWriter()
out.set_current_file_name(input_file.decompressed_name())
out.set_current_stream_name(input_file.stream_name())
for line in input_file.readlines():
    if dbsnp_search_pattern.search(line):
        out.write(line)  # assumption: matching lines are copied to the output

this_task.set_output(out.finish())
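
To see what the search pattern does and does not match, it can be tried on a few strings outside of a job. This is a sketch under assumptions: the sample lines are made up and do not reproduce the real file layout, the trailing word boundary is assumed, and @re.escape@ is an extra precaution that the script itself does not use.

```python
import re

def build_dbsnp_pattern(dbsnp_id):
    """Build a pattern like the one in the script for a single dbSNP ID."""
    # re.escape and the trailing \b are assumptions made for this sketch.
    return re.compile(r"\bdbsnp\.\d+:" + re.escape(dbsnp_id) + r"\b")

# Made-up lines; real input files have a different layout and more columns.
sample_lines = [
    "chr11\t88911696\tsnp\tdbsnp.100:rs1126809\n",
    "chr11\t88911700\tsnp\tdbsnp.129:rs11268091\n",
    "chr2\t1234\tsnp\t\n",
]

pattern = build_dbsnp_pattern("rs1126809")
matches = [line for line in sample_lines if pattern.search(line)]
print(len(matches))
```

With the word boundary in place, the longer ID @rs11268091@ in the second line is not mistaken for @rs1126809@.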
<!-- _This should probably match the code we ran the user through in the
previous tutorial, with the only difference being that the prior
tutorial is interactive, and this tutorial is demonstrating how to [...]_ -->
h3. Commit your new code

git add crunch_scripts/create-collection-by-trait
git add crunch_scripts/find-dbsnp-id
git commit -m 'add scripts from tutorial'

h3. Push your new code to your Arvados git repository

Push the new "pipeline-tutorial" branch to your Arvados hosted repository.

git push origin pipeline-tutorial
h3. Note the commit ID of your latest code

Show the latest commit. The first line includes a 40-digit hexadecimal number that uniquely identifies this commit and, with it, the exact content of your git tree. You will specify this number in your pipeline template in the next step to ensure that Arvados uses the correct version of your code when running jobs.

commit 37c7faef1b066a2dcdb0667fbe82b7cdd7d0be93
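
When copying the commit ID by hand it is easy to drop a character. A small check such as the following can confirm that a value is a full 40-digit hexadecimal ID; the function name @commit_id_from_log_line@ is hypothetical, and the check is only a sketch.

```python
import re

# A full git commit ID written as 40 lowercase hexadecimal digits.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")

def commit_id_from_log_line(line):
    """Return the ID from a line such as 'commit <id>', or None."""
    parts = line.split()
    if len(parts) >= 2 and parts[0] == "commit" and FULL_SHA1.fullmatch(parts[1]):
        return parts[1]
    return None

print(commit_id_from_log_line("commit 37c7faef1b066a2dcdb0667fbe82b7cdd7d0be93"))
```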
h3. Write the pipeline template

Make a directory called @pipeline_templates@ and create a file called @find-dbsnp-by-trait.json@.

mkdir pipeline_templates
nano pipeline_templates/find-dbsnp-by-trait.json

Copy the following pipeline template into the file, replacing @YOUR_GIT_COMMIT_SHA1_HERE@ with the commit ID you noted in the previous step.
{
  "name":"find_dbsnp_by_trait",
  "components":{
    "create_collection":{
      "script":"create-collection-by-trait",
      "script_parameters":{
        "trait_name":"Non-melanoma skin cancer"
      },
      "script_version":"YOUR_GIT_COMMIT_SHA1_HERE"
    },
    "find_variant":{
      "script":"find-dbsnp-id",
      "script_parameters":{
        "input":{
          "output_of":"create_collection"
        },
        "dbsnp_id":"rs1126809"
      },
      "script_version":"YOUR_GIT_COMMIT_SHA1_HERE"
    }
  }
}
<!-- _This desperately needs to be explained, since this is the actual
pipeline definition_ -->
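
One way to read the template: each component names a script, the parameters to pass to it, and the git commit to run it from, and a parameter given as @"output_of":"create_collection"@ takes the output of that component as its value. The sketch below builds the same structure as a Python dictionary and fills in the commit ID. The @components@ and @input@ keys are assumptions based on how the scripts above read their parameters, and @pin_script_version@ is a hypothetical helper.

```python
import json

PLACEHOLDER = "YOUR_GIT_COMMIT_SHA1_HERE"

# Assumed layout: "components" and "input" are not confirmed by the text above.
template = {
    "name": "find_dbsnp_by_trait",
    "components": {
        "create_collection": {
            "script": "create-collection-by-trait",
            "script_parameters": {"trait_name": "Non-melanoma skin cancer"},
            "script_version": PLACEHOLDER,
        },
        "find_variant": {
            "script": "find-dbsnp-id",
            "script_parameters": {
                "input": {"output_of": "create_collection"},
                "dbsnp_id": "rs1126809",
            },
            "script_version": PLACEHOLDER,
        },
    },
}

def pin_script_version(template, commit_id):
    """Return a copy of the template with every script_version set to commit_id."""
    pinned = json.loads(json.dumps(template))  # deep copy via a JSON round trip
    for component in pinned["components"].values():
        component["script_version"] = commit_id
    return pinned

pinned = pin_script_version(template, "37c7faef1b066a2dcdb0667fbe82b7cdd7d0be93")
print(json.dumps(pinned, indent=2))
```

Generating the file this way also guarantees that it is valid JSON, which is easy to get wrong when editing braces and commas by hand.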
h3. Store the pipeline template in Arvados

read -rd $'\000' the_pipeline < pipeline_templates/find-dbsnp-by-trait.json
arv pipeline_template create --pipeline-template "$the_pipeline"

@arv@ will output the UUID of the new pipeline template.

qr1hi-p5p6p-uf9gi9nolgakm85

The new pipeline template will also appear on the Workbench→Compute→Pipeline templates page.
<!-- _Storing the pipeline in arvados as well as in git seems [...]_ -->
h3. Invoke the pipeline using "arv pipeline run"

Replace the UUID here with the UUID of your own new pipeline template:

arv pipeline run --template qr1hi-p5p6p-uf9gi9nolgakm85

This instantiates your pipeline template: it submits the first job, waits for it to finish, submits the next job, and so on until every component has run.
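
The order in which jobs are submitted follows from the @output_of@ references: a component cannot start until the component it depends on has produced its output. The following sketch shows one way to derive such an order from a set of components. It illustrates the idea only and is not how @arv pipeline run@ is implemented; @run_order@ is a hypothetical function.

```python
def run_order(components):
    """Return component names so that every output_of dependency comes first."""
    def dependencies(name):
        params = components[name].get("script_parameters", {})
        return [value["output_of"] for value in params.values()
                if isinstance(value, dict) and "output_of" in value]

    order = []

    def visit(name, active=()):
        if name in order:
            return
        if name in active:
            raise ValueError("circular dependency at " + name)
        for dep in dependencies(name):
            visit(dep, active + (name,))
        order.append(name)

    for name in sorted(components):
        visit(name)
    return order

components = {
    "find_variant": {"script_parameters": {
        "input": {"output_of": "create_collection"},
        "dbsnp_id": "rs1126809"}},
    "create_collection": {"script_parameters": {
        "trait_name": "Non-melanoma skin cancer"}},
}
print(run_order(components))
```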
h3. Monitor pipeline progress

The "arv pipeline run" command displays progress in your terminal until the pipeline instance is finished.

2013-07-17 05:06:15 +0000 -- pipeline_instance qr1hi-d1hrv-8i4tz440whvwf2o
create_collection qr1hi-8i9sb-haibhu51olihlwp 9e2e489a73e1a918de8ecfc6f59ae5a1+1803+K@qr1hi
find_variant qr1hi-8i9sb-sqduc932xb1tpff cad082ba4cb174ffbebf751bbe3ed77c+506+K@qr1hi

The new pipeline instance will also show up on your Workbench→Compute→Pipeline instances page.
h3. Find the output collection UUID

The output of the "find_variant" component is shown in your terminal with the last status update from the "arv pipeline run" command.

It is also displayed on the pipeline instance detail page: go to Workbench→Compute→Pipeline instances and click the UUID of your pipeline instance.
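
If you want to pick the output out of the terminal text with a script, the status lines can be split on whitespace. The sketch below assumes that a component line has exactly three fields (component name, job UUID, output) and that an output such as @cad082ba4cb174ffbebf751bbe3ed77c+506+K@qr1hi@ is made of a hash, a size and optional hints joined by @+@. Both function names are hypothetical.

```python
def parse_status_line(line):
    """Split a 'component job_uuid output' status line into named fields."""
    component, job_uuid, output = line.split()
    return {"component": component, "job_uuid": job_uuid, "output": output}

def parse_locator(locator):
    """Split 'hash+size+hint...' into its parts (assumed layout)."""
    fields = locator.split("+")
    digest = fields[0]
    has_size = len(fields) > 1 and fields[1].isdigit()
    size = int(fields[1]) if has_size else None
    hints = fields[2:] if has_size else fields[1:]
    return {"digest": digest, "size": size, "hints": hints}

status = parse_status_line(
    "find_variant qr1hi-8i9sb-sqduc932xb1tpff "
    "cad082ba4cb174ffbebf751bbe3ed77c+506+K@qr1hi")
print(status["output"])
print(parse_locator(status["output"]))
```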
<!-- _There needs to be an easier way to get the output from the [...]_ -->
h3. Compute a summary statistic from the output collection

For this step we will use Python to read the output manifest and count how many of the inputs produced hits.

Type @python@ at a command prompt and paste this script at the prompt:

import arvados

locator = 'cad082ba4cb174ffbebf751bbe3ed77c+506+K@qr1hi' # Use your result here!

collection = arvados.CollectionReader(locator)
# an output file is empty when its input contained no matching line
hits = len([f for f in collection.all_files() if f.size() > 0])
misses = len([f for f in collection.all_files() if f.size() == 0])
print("%d had the variant, %d did not." % (hits, misses))

4 had the variant, 3 did not.

<!-- _Explain each step_ -->
h3. Run the pipeline again using different parameters

We can use the same pipeline template to run the jobs again, this time overriding the "trait_name" parameter with a different value:

arv pipeline run --template qr1hi-p5p6p-uf9gi9nolgakm85 create_collection::trait_name=cancer

When this pipeline instance finishes, run the same Python script on the new output collection to summarize the results.
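
The override on the command line has the form @component::parameter=value@. To make the effect concrete, the sketch below applies such an override to a dictionary of components. It is an illustration under assumptions, not the parser that @arv@ uses, and @apply_override@ is a hypothetical function.

```python
import copy

def apply_override(components, override):
    """Apply one 'component::parameter=value' override and return a new dict."""
    target, _, value = override.partition("=")
    component, sep, parameter = target.partition("::")
    if not sep or not parameter or component not in components:
        raise ValueError("cannot apply override: " + override)
    updated = copy.deepcopy(components)  # leave the original untouched
    updated[component].setdefault("script_parameters", {})[parameter] = value
    return updated

components = {
    "create_collection": {
        "script": "create-collection-by-trait",
        "script_parameters": {"trait_name": "Non-melanoma skin cancer"},
    },
}
updated = apply_override(components, "create_collection::trait_name=cancer")
print(updated["create_collection"]["script_parameters"]["trait_name"])
```

Because create-collection-by-trait treats the trait name as a case-insensitive regular expression, the value "cancer" still matches "Non-melanoma skin cancer" and also selects any other trait whose name contains that word, so the second run can cover more input collections than the first.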