doc/user/tutorial-new-pipeline.textile

---
layout: default
navsection: userguide
title: "Tutorial: Construct a new pipeline"
navorder: 24
---

h1. Tutorial: Construct a new pipeline

Here you will write two new crunch scripts, incorporate them into a new pipeline template, run the new pipeline a couple of times using different parameters, and compare the results. One of the new scripts will use the Arvados API to look up trait&rarr;human&rarr;data relations and use this information to compile a collection of data to analyze.

h3. Prerequisites

* Log in to a VM "using SSH":ssh-access.html
* Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
* Put the API host name in your @ARVADOS_API_HOST@ environment variable

If everything is set up correctly, the command @arv -h user current@ will display your account information.
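If @arv@ reports an error, it can help to first confirm that both environment variables are present. The following is only a sketch: @missing_arvados_settings@ is a hypothetical helper, not part of the Arvados SDK, and it only checks that the variables are non-empty, not that the token or host name is valid.

```python
import os

# The two variables named in the prerequisites above.
REQUIRED_VARS = ("ARVADOS_API_TOKEN", "ARVADOS_API_HOST")


def missing_arvados_settings(environ=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; not part of the Arvados SDK.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_arvados_settings()
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("Both variables are set.")
```

Passing a plain dictionary instead of the real environment makes the check easy to try without changing your shell settings.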

h3. Git repository access

Pushing code to your git repository involves using your private key. There are a few ways to arrange this:

*Option 1:* Use an SSH agent, and log in to your VM with agent forwarding enabled. With Linux, BSD, macOS, etc., this looks something like:

<pre>
ssh-add -l
eval `ssh-agent` # (only if "ssh-add -l" said it could not open a connection)
ssh-add          # (this adds your private key to the agent)
ssh -A my_vm.arvados
ssh-add -l       # (run this in your VM account to confirm forwarding works)
</pre>

With PuTTY under Windows, run "pageant", add your key to the agent, and turn on agent forwarding in your PuTTY settings.

*Option 2:* Edit code on your workstation and push code to your Arvados repository from there instead of your VM account. Depending on your @.ssh/config@ file, you will use names like @my_vm_name.arvados@ instead of @my_vm_name.{{ site.arvados_api_host }}@ in git and ssh commands.

<pre>
git clone git@git.arvados:my_repo_name.git
cd my_repo_name
[...]
git push
</pre>

*Option 3:* Edit code in your VM, and use git on your workstation as an intermediary.

<pre>
git clone git@my_vm_name.arvados:my_repo_name.git
cd my_repo_name
git remote add arvados git@git.arvados:my_repo_name.git

[...make edits and commits in your repository on the VM...]

git pull                        #(update local copy)
git push arvados master:master  #(update arvados hosted copy)
</pre>

Whichever setup you choose, if everything is working correctly, this command should give you a list of repositories you can access:

<pre>
ssh git@git.{{ site.arvados_api_host }}
</pre>

&darr;

<pre>
hello your_user_name, the gitolite version here is v2.0.2-17-g66f2065
the gitolite config gives you the following access:
     R   W      your_repo_name
</pre>

h3. Set some variables

Adjust these to match your login account name and the URL of your Arvados repository. The Access&rarr;VMs and Access&rarr;Repositories pages on Workbench will show the specifics.

<pre>
repo_url=git@git.{{ site.arvados_api_host }}:my_repo_name.git
repo_name=my_repo_name
</pre>

h3. Set up a new branch in your Arvados git repository

We will create a new empty branch called "pipeline-tutorial" and add our new crunch scripts there.

<pre>
mkdir pipeline-tutorial
cd pipeline-tutorial
git init
git checkout -b pipeline-tutorial
git remote add origin $repo_url
</pre>

h3. Write the create-collection-by-trait script

<pre>
mkdir -p crunch_scripts
touch crunch_scripts/create-collection-by-trait
chmod +x crunch_scripts/create-collection-by-trait
nano crunch_scripts/create-collection-by-trait
</pre>

Here is the script:

<pre>
#!/usr/bin/env python

import arvados
import re
import json

trait_name = arvados.current_job()['script_parameters']['trait_name']

# get UUIDs of all traits whose name matches trait_name (case-insensitive)
all_traits = arvados.service.traits().list(limit=1000).execute()['items']
trait_uuids = [t['uuid'] for t in all_traits
               if re.search(trait_name, t['name'], re.IGNORECASE)]

# list humans linked to these traits
trait_links = arvados.service.links().list(limit=10000, where=json.dumps({
    'link_class': 'human_trait',
    'tail_kind': 'arvados#human',
    'head_uuid': trait_uuids
})).execute()['items']
human_uuids = [link['tail_uuid'] for link in trait_links]

# find collections linked to these humans
provenance_links = arvados.service.links().list(where=json.dumps({
    'link_class': 'provenance',
    'name': 'provided',
    'tail_uuid': human_uuids
})).execute()['items']
collection_uuids = [link['head_uuid'] for link in provenance_links]

# pick out all of the "var" files, and build a new collection
out_manifest = ''
for locator in collection_uuids:
    for f in arvados.CollectionReader(locator).all_files():
        if re.search(r'var-.*\.tsv\.bz2', f.name()):
            out_manifest += f.as_manifest()

# output the new collection
arvados.current_task().set_output(arvados.Keep.put(out_manifest))
</pre>
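The two filtering steps in this script (matching trait names and picking out "var" files) can be tried without access to the API. The sketch below uses made-up trait records and file names. @matching_trait_uuids@ and @is_var_file@ are hypothetical helper names, and the file pattern here escapes both dots so that they match literally.

```python
import re


def matching_trait_uuids(traits, trait_name):
    """Return UUIDs of traits whose name matches trait_name, ignoring case."""
    return [t['uuid'] for t in traits
            if re.search(trait_name, t['name'], re.IGNORECASE)]


def is_var_file(file_name):
    """Return True for names containing var-<anything>.tsv.bz2."""
    return re.search(r'var-.*\.tsv\.bz2', file_name) is not None


# Made-up records for illustration only; real ones come from the API.
sample_traits = [
    {'uuid': 'trait-1', 'name': 'Non-melanoma skin cancer'},
    {'uuid': 'trait-2', 'name': 'Asthma'},
    {'uuid': 'trait-3', 'name': 'Breast Cancer'},
]

print(matching_trait_uuids(sample_traits, 'cancer'))
print(is_var_file('var-GS000001.tsv.bz2'))
print(is_var_file('summary.txt'))
```

Because the trait name is used as a regular expression and matched anywhere in the name, a short value such as @cancer@ selects every trait whose name contains that word, while a longer value narrows the selection.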

h3. Write the find-dbsnp-id script

<pre>
touch crunch_scripts/find-dbsnp-id
chmod +x crunch_scripts/find-dbsnp-id
nano crunch_scripts/find-dbsnp-id
</pre>

Here is the script:

<pre>
#!/usr/bin/env python

import arvados
import re

# in the first task (sequence 0), queue one new task per input file and end
# that task; the code below then runs once in each per-file task
arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True)

this_job = arvados.current_job()
this_task = arvados.current_task()
this_task_input = this_task['parameters']['input']

# match "dbsnp.<digits>:<dbsnp_id>" as a whole word
dbsnp_search_pattern = re.compile("\\bdbsnp\\.\\d+:" +
                                  this_job['script_parameters']['dbsnp_id'] +
                                  "\\b")

# copy the matching lines into an output file named after the input file
input_file = list(arvados.CollectionReader(this_task_input).all_files())[0]
out = arvados.CollectionWriter()
out.set_current_file_name(input_file.decompressed_name())
out.set_current_stream_name(input_file.stream_name())
for line in input_file.readlines():
    if dbsnp_search_pattern.search(line):
        out.write(line)

this_task.set_output(out.finish())
</pre>
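To see what the search pattern accepts, here is a small sketch that builds the same regular expression and applies it to a few sample lines. The sample lines are made up for illustration and the layout of real "var" files may differ. @build_dbsnp_pattern@ is a hypothetical helper name.

```python
import re


def build_dbsnp_pattern(dbsnp_id):
    """Build the same pattern as find-dbsnp-id: dbsnp.<digits>:<id> as a whole word."""
    return re.compile("\\bdbsnp\\.\\d+:" + dbsnp_id + "\\b")


pattern = build_dbsnp_pattern("rs1126809")

# Made-up lines for illustration only.
sample_lines = [
    "chr11\t100\t101\tdbsnp.86:rs1126809\n",    # exact ID: matches
    "chr11\t200\t201\tdbsnp.100:rs11268090\n",  # longer ID: no match
    "chr11\t300\t301\t\n",                      # no dbSNP annotation: no match
]

hits = [line for line in sample_lines if pattern.search(line)]
print(len(hits))
```

The trailing @\b@ is what keeps @rs1126809@ from matching inside a longer identifier such as @rs11268090@.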

h3. Commit your new code

<pre>
git add crunch_scripts/create-collection-by-trait
git add crunch_scripts/find-dbsnp-id
git commit -m 'add scripts from tutorial'
</pre>

h3. Push your new code to your Arvados git repository

Push the new "pipeline-tutorial" branch to your Arvados hosted repository.

<pre>
git push origin pipeline-tutorial
</pre>

h3. Note the commit ID of your latest code

Show the latest commit. The first line includes a 40-digit hexadecimal number that uniquely identifies this commit, and therefore the exact content of your git tree. You will specify this in your pipeline template in the next step to ensure that Arvados uses the correct version of your git tree when running jobs.

<pre>
git show | head
</pre>

&darr;

<pre>
commit 37c7faef1b066a2dcdb0667fbe82b7cdd7d0be93
[...]
</pre>
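Since the commit ID is copied by hand into the template, a quick format check can catch a truncated or mistyped value. This is a minimal sketch: @is_full_commit_id@ and @commit_id_from_show@ are hypothetical helper names, and the check only confirms the format (40 lowercase hexadecimal digits), not that the commit exists in your repository.

```python
HEX_DIGITS = set("0123456789abcdef")


def is_full_commit_id(text):
    """Return True if text is exactly 40 lowercase hexadecimal digits."""
    return len(text) == 40 and all(c in HEX_DIGITS for c in text)


def commit_id_from_show(first_line):
    """Extract the ID from a line such as 'commit 37c7fa...', or return None."""
    parts = first_line.split()
    if len(parts) >= 2 and parts[0] == "commit" and is_full_commit_id(parts[1]):
        return parts[1]
    return None


print(commit_id_from_show("commit 37c7faef1b066a2dcdb0667fbe82b7cdd7d0be93"))
print(commit_id_from_show("commit 37c7fae"))
```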

h3. Write the pipeline template

Make a directory called @pipeline_templates@ and create a file called @find-dbsnp-by-trait.json@ in it.

<pre>
mkdir pipeline_templates
nano pipeline_templates/find-dbsnp-by-trait.json
</pre>

Copy the following pipeline template into the file, replacing both occurrences of @YOUR_GIT_COMMIT_SHA1_HERE@ with the commit ID you noted in the previous step.

<pre>
{
  "name":"find_dbsnp_by_trait",
  "components":{
    "create_collection":{
      "script":"create-collection-by-trait",
      "script_parameters":{
        "trait_name":"Non-melanoma skin cancer"
      },
      "script_version":"YOUR_GIT_COMMIT_SHA1_HERE"
    },
    "find_variant":{
      "script":"find-dbsnp-id",
      "script_parameters":{
        "input":{
          "output_of":"create_collection"
        },
        "dbsnp_id":"rs1126809"
      },
      "script_version":"YOUR_GIT_COMMIT_SHA1_HERE"
    }
  }
}
</pre>
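If you would rather not edit the placeholder by hand, the substitution can be scripted. The sketch below is one possible approach, not part of the Arvados tools: @set_script_version@ is a hypothetical helper, and the commit ID is the sample value shown earlier, so substitute your own.

```python
import json

PLACEHOLDER = "YOUR_GIT_COMMIT_SHA1_HERE"


def set_script_version(template, commit_id):
    """Return a copy of template with each placeholder script_version replaced."""
    result = json.loads(json.dumps(template))  # deep copy via a JSON round trip
    for component in result["components"].values():
        if component.get("script_version") == PLACEHOLDER:
            component["script_version"] = commit_id
    return result


template = json.loads("""
{
  "name": "find_dbsnp_by_trait",
  "components": {
    "create_collection": {
      "script": "create-collection-by-trait",
      "script_parameters": {"trait_name": "Non-melanoma skin cancer"},
      "script_version": "YOUR_GIT_COMMIT_SHA1_HERE"
    },
    "find_variant": {
      "script": "find-dbsnp-id",
      "script_parameters": {
        "input": {"output_of": "create_collection"},
        "dbsnp_id": "rs1126809"
      },
      "script_version": "YOUR_GIT_COMMIT_SHA1_HERE"
    }
  }
}
""")

# Sample commit ID from this tutorial; use your own.
filled = set_script_version(template, "37c7faef1b066a2dcdb0667fbe82b7cdd7d0be93")
print(json.dumps(filled, indent=2, sort_keys=True))
```

Loading the file with @json.loads@ also doubles as a syntax check: a missing comma or brace raises an error before the template is sent anywhere.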

h3. Store the pipeline template in Arvados

<pre>
read -rd $'\000' the_pipeline < pipeline_templates/find-dbsnp-by-trait.json

arv pipeline_template create --pipeline-template "$the_pipeline"
</pre>

@arv@ will output the UUID of the new pipeline template.

<pre>
qr1hi-p5p6p-uf9gi9nolgakm85
</pre>

The new pipeline template will also appear on the Workbench&rarr;Compute&rarr;Pipeline&nbsp;templates page.

h3. Invoke the pipeline using "arv pipeline run"

Replace the UUID here with the UUID of your own new pipeline template:

<pre>
arv pipeline run --template qr1hi-p5p6p-uf9gi9nolgakm85
</pre>

This instantiates your pipeline template: it submits the first job, waits for it to finish, submits the next job, etc.

h3. Monitor pipeline progress

The "arv pipeline run" command displays progress in your terminal until the pipeline instance is finished.

<pre>
2013-07-17 05:06:15 +0000 -- pipeline_instance qr1hi-d1hrv-8i4tz440whvwf2o
create_collection qr1hi-8i9sb-haibhu51olihlwp 9e2e489a73e1a918de8ecfc6f59ae5a1+1803+K@qr1hi
find_variant      qr1hi-8i9sb-sqduc932xb1tpff cad082ba4cb174ffbebf751bbe3ed77c+506+K@qr1hi
</pre>

The new pipeline instance will also show up on your Workbench&rarr;Compute&rarr;Pipeline&nbsp;instances page.

h3. Find the output collection UUID

The output of the "find_variant" component is shown in your terminal with the last status update from the "arv pipeline run" command.

It is also displayed on the pipeline instance detail page: go to Workbench&rarr;Compute&rarr;Pipeline&nbsp;instances and click the UUID of your pipeline instance.
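If you save the final status update as text, the output locator can be picked out programmatically. The sketch below assumes the three-column layout of the sample status lines shown earlier (component name, job UUID, output locator). Other status updates may look different, and @parse_component_status@ is a hypothetical helper name.

```python
def parse_component_status(line):
    """Split a 'component job_uuid output_locator' line into a dict, else None."""
    fields = line.split()
    if len(fields) != 3:
        return None  # e.g. the timestamp header line
    component, job_uuid, output = fields
    return {"component": component, "job_uuid": job_uuid, "output": output}


# The last status update from the sample run in this tutorial.
status_update = """\
2013-07-17 05:06:15 +0000 -- pipeline_instance qr1hi-d1hrv-8i4tz440whvwf2o
create_collection qr1hi-8i9sb-haibhu51olihlwp 9e2e489a73e1a918de8ecfc6f59ae5a1+1803+K@qr1hi
find_variant      qr1hi-8i9sb-sqduc932xb1tpff cad082ba4cb174ffbebf751bbe3ed77c+506+K@qr1hi
"""

outputs = {}
for line in status_update.splitlines():
    status = parse_component_status(line)
    if status is not None:
        outputs[status["component"]] = status["output"]

print(outputs["find_variant"])
```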

h3. Compute a summary statistic from the output collection

For this step we will use Python to read the output manifest and count how many of the inputs produced hits.

Type @python@ at a command prompt and paste this script at the prompt:

<pre>
import arvados

# locator of the find_variant output; use your own result here!
output_locator = 'cad082ba4cb174ffbebf751bbe3ed77c+506+K@qr1hi'

collection = arvados.CollectionReader(output_locator)
# an empty output file means no line of that input matched the variant
sizes = [f.size() for f in collection.all_files()]
hits = len([s for s in sizes if s > 0])
misses = len([s for s in sizes if s == 0])
print("%d had the variant, %d did not." % (hits, misses))
</pre>

&darr;

<pre>
4 had the variant, 3 did not.
</pre>

h3. Run the pipeline again using different parameters

We can use the same pipeline template to run the jobs again, this time overriding the "trait_name" parameter with a different value:

<pre>
arv pipeline run --template qr1hi-p5p6p-uf9gi9nolgakm85 create_collection::trait_name=cancer
</pre>
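The @component::parameter=value@ argument names a component of the template, one of its script parameters, and the value to use for this run. The sketch below only illustrates that mapping on a plain dictionary. It is not how @arv@ implements overrides, and @apply_override@ is a hypothetical helper name.

```python
import copy


def apply_override(components, override):
    """Apply one 'component::parameter=value' override to a copy of components."""
    target, value = override.split("=", 1)
    component_name, parameter = target.split("::", 1)
    result = copy.deepcopy(components)  # leave the original template untouched
    result[component_name]["script_parameters"][parameter] = value
    return result


components = {
    "create_collection": {
        "script": "create-collection-by-trait",
        "script_parameters": {"trait_name": "Non-melanoma skin cancer"},
    },
}

overridden = apply_override(components, "create_collection::trait_name=cancer")
print(overridden["create_collection"]["script_parameters"]["trait_name"])
```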

When this pipeline instance finishes, run the same Python script on the new output collection to summarize the results.