X-Git-Url: https://git.arvados.org/arvados.git/blobdiff_plain/e2362276ad6fbfdda73a9c67c880c3b6da2f7eaa..e5ab13b7c5049571b450df5485a22e82504b97a9:/doc/user/tutorial-new-pipeline.textile

diff --git a/doc/user/tutorial-new-pipeline.textile b/doc/user/tutorial-new-pipeline.textile
index 3375db38cc..1dca21f78b 100644
--- a/doc/user/tutorial-new-pipeline.textile
+++ b/doc/user/tutorial-new-pipeline.textile
@@ -9,6 +9,9 @@ h1. Tutorial: Construct a new pipeline
 Here you will write two new crunch scripts, incorporate them into a new pipeline template, run the new pipeline a couple of times using different parameters, and compare the results. One of the new scripts will use the Arvados API to look up trait→human→data relations and use this information to compile a collection of data to analyze.
+_Like the previous tutorial, this needs more of a basis in some actual
+clinical/research question to motivate it_
+
 h3. Prerequisites
 * Log in to a VM "using SSH":ssh-access.html
@@ -31,6 +34,8 @@ ssh -A my_vm.arvados
 ssh-add -l # (run this in your VM account to confirm forwarding works)
+_This discussion about ssh should probably go under the "ssh" section_
+
 With PuTTY under Windows, run "pageant", add your key to the agent, and turn on agent forwarding in your PuTTY settings.
 *Option 2:* Edit code on your workstation and push code to your Arvados repository from there instead of your VM account. Depending on your @.ssh/config@ file, you will use names like @my_vm_name.arvados@ instead of @my_vm_name.{{ site.arvados_api_host }}@ in git and ssh commands.
@@ -58,7 +63,7 @@ git push arvados master:master #(update arvados hosted copy)
 Whichever setup you choose, if everything is working correctly, this command should give you a list of repositories you can access:
-ssh git@git.{{ site.arvados_api_host }}
+ssh -T git@git.{{ site.arvados_api_host }}
 
↓
@@ -69,6 +74,10 @@ the gitolite config gives you the following access:
 R W your_repo_name
+_You need to have a git repository set up already, which is not
+necessarily the case for new users, so this should link to the git
+section about setting up a new repo_
+
 h3. Set some variables
 Adjust these to match your login account name and the URL of your Arvados repository. The Access→VMs and Access→Repositories pages on Workbench will show the specifics.
@@ -90,6 +99,14 @@ git checkout -b pipeline-tutorial
 git remote add origin $repo_url
+_Should explain each step_
+_Creating an empty branch in an empty repository makes git do weird
+things, need to fix using
+git branch --set-upstream pipeline-tutorial origin/pipeline-tutorial
+
+but I don't know what this means._
+
 h3. Write the create-collection-by-trait script
@@ -99,6 +116,8 @@ chmod +x crunch_scripts/create-collection-by-trait
 nano crunch_scripts/create-collection-by-trait
 
+_the -p to mkdir isn't necessary here_
+
 Here is the script:
@@ -178,6 +197,11 @@ for line in input_file.readlines():
 this_task.set_output(out.finish())
 
+_This should probably match the code we ran the user through in the
+previous tutorial, with the only difference being that the prior
+tutorial is interactive, and this tutorial is demonstrating how to
+create a job._
+
 h3. Commit your new code
@@ -245,10 +269,13 @@ Copy the following pipeline template.
 }
 
+_This desperately needs to be explained, since this is the actual
+pipeline definition_
+
 h3. Store the pipeline template in Arvados
-read -rd "\000" the_pipeline < pipeline_templates/find-dbsnp-by-trait.json
+read -rd $'\000' the_pipeline < pipeline_templates/find-dbsnp-by-trait.json
 
 arv pipeline_template create --pipeline-template "$the_pipeline"
 
@@ -261,6 +288,8 @@ qr1hi-p5p6p-uf9gi9nolgakm85
 The new pipeline template will also appear on the Workbench→Compute→Pipeline templates page.
+_Storing the pipeline in arvados as well as in git seems redundant_
+
 h3. Invoke the pipeline using "arv pipeline run"
 Replace the UUID here with the UUID of your own new pipeline template:
@@ -289,6 +318,8 @@ The output of the "find_variant" component is shown in your terminal with the la
 It is also displayed on the pipeline instance detail page: go to Workbench→Compute→Pipeline instances and click the UUID of your pipeline instance.
+_There needs to be an easier way to get the output from the workbench_
+
 h3. Compute a summary statistic from the output collection
 For this step we will use python to read the output manifest and count how many of the inputs produced hits.
@@ -312,12 +343,14 @@ print "%d had the variant, %d did not." % (hits, misses)
 4 had the variant, 3 did not.
-h3. Run the pipeline again using different parameters
+_Explain each step_
+
+_h3. Run the pipeline again using different parameters
 We can use the same pipeline template to run the jobs again, this time overriding the "trait_name" parameter with a different value:
-wh-run-pipeline-instance --template qr1hi-p5p6p-uf9gi9nolgakm85 create_collection::trait_name=cancer
+arv pipeline run --template qr1hi-p5p6p-uf9gi9nolgakm85 create_collection::trait_name=cancer
 
When this template instance finishes, run the same Python script on the new output collection to summarize the results.
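
Note on the @read -rd@ hunk: the change from @"\000"@ to @$'\000'@ looks like a behavioral fix rather than a cosmetic one. In bash, @"\000"@ is four literal characters, so @read -d@ takes the backslash as its delimiter and stops at the first backslash in the JSON. @$'\000'@ is ANSI-C quoting, which gives @read@ an empty delimiter, so it reads up to a NUL byte or end of file. The sketch below is only an illustration and assumes bash; the temporary file and its contents are hypothetical stand-ins for the pipeline template.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for pipeline_templates/find-dbsnp-by-trait.json.
# The content deliberately contains a backslash to expose the difference.
tmpfile=$(mktemp)
printf '{\n  "name": "demo\\template"\n}\n' > "$tmpfile"

# Old form: "\000" is the literal text \000, so the delimiter is a backslash
# and read stops at the first backslash it meets in the file.
read -rd "\000" old_form < "$tmpfile"

# New form: $'\000' yields an empty delimiter, so read consumes the input
# up to a NUL byte or end of file. read returns non-zero at end of file,
# but the variable is still assigned.
read -rd $'\000' new_form < "$tmpfile"

printf 'old form read %d characters\n' "${#old_form}"
printf 'new form read %d characters\n' "${#new_form}"

rm -f "$tmpfile"
```

Because @read@ exits non-zero when it reaches end of file without finding the delimiter, a script running under @set -e@ would need to tolerate that status (for example with @|| true@).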