doc/user/tutorial-new-pipeline.textile

   1 ---
   2 layout: default
   3 navsection: userguide
   4 title: "Tutorial 5: Construct a new pipeline"
   5 navorder: 15
   6 ---
   7
   8 h1. Tutorial 5: Construct a new pipeline
   9
  10 Here you will write two new crunch scripts, incorporate them into a new pipeline template, run the new pipeline a couple of times using different parameters, and compare the results. One of the new scripts will use the Arvados API to look up trait&rarr;human&rarr;data relations and use this information to compile a collection of data to analyze.
  11
  12 _Like the previous tutorial, this needs more of a basis in some actual
  13 clinical/research question to motivate it_
  14
  15 h3. Prerequisites
  16
  17 * Log in to a VM "using SSH":ssh-access.html
  18 * Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
  19 * Put the API host name in your @ARVADOS_API_HOST@ environment variable
  20
  21 If everything is set up correctly, the command @arv -h user current@ will display your account information.
  22
  23 h3. Git repository access
  24
  25 Pushing code to your git repository involves using your private key. There are a few ways to arrange this:
  26
  27 *Option 1:* Use an SSH agent, and log in to your VM with agent forwarding enabled. With Linux, BSD, MacOS, etc., this looks something like:
  28
  29 <pre>
  30 ssh-add -l
  31 eval `ssh-agent` # (only if "ssh-add -l" said it could not open a connection)
  32 ssh-add          # (this adds your private key to the agent)
  33 ssh -A my_vm.arvados
  34 ssh-add -l       # (run this in your VM account to confirm forwarding works)
  35 </pre>
  36
  37 _This discussion about ssh should probably go under the "ssh" section_
  38
  39 With PuTTY under Windows, run "pageant", add your key to the agent, and turn on agent forwarding in your PuTTY settings.
  40
  41 *Option 2:* Edit code on your workstation and push code to your Arvados repository from there instead of your VM account. Depending on your @.ssh/config@ file, you will use names like @my_vm_name.arvados@ instead of @my_vm_name.{{ site.arvados_api_host }}@ in git and ssh commands.
  42
  43 <pre>
  44 git clone git@git.arvados:my_repo_name.git
  45 cd my_repo_name
  46 [...]
  47 git push
  48 </pre>
  49
  50 *Option 3:*  Edit code in your VM, and use git on your workstation as an intermediary.
  51
  52 <pre>
  53 git clone git@my_vm_name.arvados:my_repo_name.git
  54 cd my_repo_name
  55 git remote add arvados git@git.arvados:my_repo_name.git
  56
  57 [...make edits and commits in your repository on the VM...]
  58
  59 git pull                        #(update local copy)
  60 git push arvados master:master  #(update arvados hosted copy)
  61 </pre>
  62
  63 Whichever setup you choose, if everything is working correctly, this command should give you a list of repositories you can access:
  64
  65 <pre>
  66 ssh -T git@git.{{ site.arvados_api_host }}
  67 </pre>
  68
  69 &darr;
  70
  71 <pre>
  72 hello your_user_name, the gitolite version here is v2.0.2-17-g66f2065
  73 the gitolite config gives you the following access:
  74      R   W      your_repo_name
  75 </pre>
  76
  77 _You need to have a git repository set up already, which is not
  78 necessarily the case for new users, so this should link to the git
  79 section about setting up a new repo_
  80
  81 h3. Set some variables
  82
  83 Adjust these to match your login account name and the URL of your Arvados repository. The Access&rarr;VMs and Access&rarr;Repositories pages on Workbench will show the specifics.
  84
  85 <pre>
  86 repo_url=git@git.{{ site.arvados_api_host }}:my_repo_name.git
  87 repo_name=my_repo_name
  88 </pre>
  89
  90 h3. Set up a new branch in your Arvados git repository
  91
  92 We will create a new empty branch called "pipeline-tutorial" and add our new crunch scripts there.
  93
  94 <pre>
  95 mkdir pipeline-tutorial
  96 cd pipeline-tutorial
  97 git init
  98 git checkout -b pipeline-tutorial
  99 git remote add origin $repo_url
 100 </pre>
 101
 102 _Should explain each step_
 103 _Creating an empty branch in an empty repository makes git do weird
 104 things, need to fix using
 105 <pre>
 106 git branch --set-upstream pipeline_tutorial origin/pipeline_tutorial
 107 </pre>
 108 but I don't know what this means._
 109
 110 h3. Write the create-collection-by-trait script
 111
 112 <pre>
 113 mkdir -p crunch_scripts
 114 touch crunch_scripts/create-collection-by-trait
 115 chmod +x crunch_scripts/create-collection-by-trait
 116 nano crunch_scripts/create-collection-by-trait
 117 </pre>
 118
 119 _the -p to mkdir isn't necessary here_
 120
 121 Here is the script:
 122
 123 <pre>
 124 #!/usr/bin/env python
 125
 126 import arvados
 127 import re
 128 import json
 129
 130 trait_name = arvados.current_job()['script_parameters']['trait_name']
 131
 132 # get UUIDs of all matching traits
 133 trait_uuids = map(lambda t: t['uuid'],
 134                   filter(lambda t: re.search(trait_name, t['name'], re.IGNORECASE),
 135                          arvados.service.traits().list(limit=1000).execute()['items']))
 136
 137 # list humans linked to these conditions
 138 trait_links = arvados.service.links().list(limit=10000, where=json.dumps({
 139             'link_class': 'human_trait',
 140             'tail_kind': 'arvados#human',
 141             'head_uuid': trait_uuids
 142             })).execute()['items']
 143 human_uuids = map(lambda l: l['tail_uuid'], trait_links)
 144
 145 # find collections linked to these humans
 146 provenance_links = arvados.service.links().list(where=json.dumps({
 147     "link_class": "provenance",
 148     "name": "provided",
 149     "tail_uuid": human_uuids
 150   })).execute()['items']
 151 collection_uuids = map(lambda l: l['head_uuid'], provenance_links)
 152
 153 # pick out all of the "var" files, and build a new collection
 154 out_manifest = ''
 155 for locator in collection_uuids:
 156     for f in arvados.CollectionReader(locator).all_files():
 157         if re.search('var-.*\.tsv.bz2', f.name()):
 158             out_manifest += f.as_manifest()
 159
 160 # output the new collection
 161 arvados.current_task().set_output(arvados.Keep.put(out_manifest))
 162 </pre>
 163
 164 h3. Write the find-dbsnp-id script
 165
 166 <pre>
 167 touch crunch_scripts/find-dbsnp-id
 168 chmod +x crunch_scripts/find-dbsnp-id
 169 nano crunch_scripts/find-dbsnp-id
 170 </pre>
 171
 172 Here is the script:
 173
 174 <pre>
 175 #!/usr/bin/env python
 176
 177 import arvados
 178 import re
 179
 180 arvados.job_setup.one_task_per_input_file(if_sequence=0, and_end_task=True)
 181
 182 this_job = arvados.current_job()
 183 this_task = arvados.current_task()
 184 this_task_input = this_task['parameters']['input']
 185 dbsnp_search_pattern = re.compile("\\bdbsnp\\.\\d+:" +
 186                                   this_job['script_parameters']['dbsnp_id'] +
 187                                   "\\b")
 188
 189 input_file = list(arvados.CollectionReader(this_task_input).all_files())[0]
 190 out = arvados.CollectionWriter()
 191 out.set_current_file_name(input_file.decompressed_name())
 192 out.set_current_stream_name(input_file.stream_name())
 193 for line in input_file.readlines():
 194     if dbsnp_search_pattern.search(line):
 195         out.write(line)
 196
 197 this_task.set_output(out.finish())
 198 </pre>
 199
 200 _This should probably match the code we ran the user through in the
 201 previous tutorial, with the only difference being that the prior
 202 tutorial is interactive, and this tutorial is demonstrating how to
 203 create a job.
 204
 205 h3. Commit your new code
 206
 207 <pre>
 208 git add crunch_scripts/create-collection-by-trait
 209 git add crunch_scripts/find-dbsnp-id
 210 git commit -m 'add scripts from tutorial'
 211 </pre>
 212
 213 h3. Push your new code to your Arvados git repository
 214
 215 Push the new "pipeline-tutorial" branch to your Arvados hosted repository.
 216
 217 <pre>
 218 git push origin pipeline-tutorial
 219 </pre>
 220
 221 h3. Note the commit ID of your latest code
 222
 223 Show the latest commit. The first line includes a 40-digit hexadecimal number that uniquely identifies the content of your git tree. You will specify this in your pipeline template in the next step to ensure that Arvados uses the correct version of your git tree when running jobs.
 224
 225 <pre>
 226 git show | head
 227 </pre>
 228
 229 &darr;
 230
 231 <pre>
 232 commit 37c7faef1b066a2dcdb0667fbe82b7cdd7d0be93
 233 [...]
 234 </pre>
 235
 236 h3. Write the pipeline template
 237
 238 Make a directory called @pipeline_templates@ and create a file called @find-dbsnp-by-trait.json@.
 239
 240 <pre>
 241 mkdir pipeline_templates
 242 nano pipeline_templates/find-dbsnp-by-trait.json
 243 </pre>
 244
 245 Copy the following pipeline template.
 246
 247 <pre>
 248 {
 249   "name":"find_dbsnp_by_trait",
 250   "components":{
 251     "create_collection":{
 252       "script":"create-collection-by-trait",
 253       "script_parameters":{
 254         "trait_name":"Non-melanoma skin cancer"
 255       },
 256       "script_version":"YOUR_GIT_COMMIT_SHA1_HERE"
 257     },
 258     "find_variant":{
 259       "script":"find-dbsnp-id",
 260       "script_parameters":{
 261         "input":{
 262           "output_of":"create_collection"
 263         },
 264         "dbsnp_id":"rs1126809"
 265       },
 266       "script_version":"YOUR_GIT_COMMIT_SHA1_HERE"
 267     }
 268   }
 269 }
 270 </pre>
 271
 272 _This desparately needs to be explained, since this is the actual
 273 pipeline definition_
 274
 275 h3. Store the pipeline template in Arvados
 276
 277 <pre>
 278 read -rd $'\000' the_pipeline < pipeline_templates/find-dbsnp-by-trait.json
 279
 280 arv pipeline_template create --pipeline-template "$the_pipeline"
 281 </pre>
 282
 283 @arv@ will output the UUID of the new pipeline template.
 284
 285 <pre>
 286 qr1hi-p5p6p-uf9gi9nolgakm85
 287 </pre>
 288
 289 The new pipeline template will also appear on the Workbench&rarr;Compute&rarr;Pipeline&nbsp;templates page.
 290
 291 _Storing the pipeline in arvados as well as in git seems redundant_
 292
 293 h3. Invoke the pipeline using "arv pipeline run"
 294
 295 Replace the UUID here with the UUID of your own new pipeline template:
 296
 297 <pre>
 298 arv pipeline run --template qr1hi-p5p6p-uf9gi9nolgakm85
 299 </pre>
 300
 301 This instantiates your pipeline template: it submits the first job, waits for it to finish, submits the next job, etc.
 302
 303 h3. Monitor pipeline progress
 304
 305 The "arv pipeline run" command displays progress in your terminal until the pipeline instance is finished.
 306
 307 <pre>
 308 2013-07-17 05:06:15 +0000 -- pipeline_instance qr1hi-d1hrv-8i4tz440whvwf2o
 309 create_collection qr1hi-8i9sb-haibhu51olihlwp 9e2e489a73e1a918de8ecfc6f59ae5a1+1803+K@qr1hi
 310 find_variant      qr1hi-8i9sb-sqduc932xb1tpff cad082ba4cb174ffbebf751bbe3ed77c+506+K@qr1hi
 311 </pre>
 312
 313 The new pipeline instance will also show up on your Workbench&rarr;Compute&rarr;Pipeline&nbsp;instances page.
 314
 315 h3. Find the output collection UUID
 316
 317 The output of the "find_variant" component is shown in your terminal with the last status update from the "arv pipeline run" command.
 318
 319 It is also displayed on the pipeline instance detail page: go to Workbench&rarr;Compute&rarr;Pipeline&nbsp;instances and click the UUID of your pipeline instance.
 320
 321 _There needs to be an easier way to get the output from the workbench_
 322
 323 h3. Compute a summary statistic from the output collection
 324
 325 For this step we will use python to read the output manifest and count how many of the inputs produced hits.
 326
 327 Type @python@ at a command prompt and paste this script at the prompt:
 328
 329 <pre>
 330 import arvados
 331
 332 hash = 'cad082ba4cb174ffbebf751bbe3ed77c+506+K@qr1hi' # Use your result here!
 333
 334 collection = arvados.CollectionReader(hash)
 335 hits = len(filter(lambda f: f.size() > 0, collection.all_files()))
 336 misses = len(filter(lambda f: f.size() == 0, collection.all_files()))
 337 print "%d had the variant, %d did not." % (hits, misses)
 338 </pre>
 339
 340 &darr;
 341
 342 <pre>
 343 4 had the variant, 3 did not.
 344 </pre>
 345
 346 _Explain each step_
 347
 348 _h3. Run the pipeline again using different parameters
 349
 350 We can use the same pipeline template to run the jobs again, this time overriding the "trait_name" parameter with a different value:
 351
 352 <pre>
 353 arv pipeline run --template qr1hi-p5p6p-uf9gi9nolgakm85 create_collection::trait_name=cancer
 354 </pre>
 355
 356 When this template instance finishes, run the same Python script on the new output collection to summarize the results.