Here you will write two new crunch scripts, incorporate them into a new pipeline template, run the new pipeline a couple of times using different parameters, and compare the results. One of the new scripts will use the Arvados API to look up trait→human→data relations and use this information to compile a collection of data to analyze.
+_Like the previous tutorial, this needs more of a basis in some actual
+clinical/research question to motivate it_
+
h3. Prerequisites
* Log in to a VM "using SSH":ssh-access.html
ssh-add -l # (run this in your VM account to confirm forwarding works)
</pre>
+_This discussion about ssh should probably go under the "ssh" section_
+
With PuTTY under Windows, run "pageant", add your key to the agent, and turn on agent forwarding in your PuTTY settings.
*Option 2:* Edit code on your workstation and push code to your Arvados repository from there instead of your VM account. Depending on your @.ssh/config@ file, you will use names like @my_vm_name.arvados@ instead of @my_vm_name.{{ site.arvados_api_host }}@ in git and ssh commands.
Whichever setup you choose, if everything is working correctly, this command should give you a list of repositories you can access:
<pre>
-ssh git@git.{{ site.arvados_api_host }}
+ssh -T git@git.{{ site.arvados_api_host }}
</pre>
↓
R W your_repo_name
</pre>
+_You need to have a git repository set up already, which is not
+necessarily the case for new users, so this should link to the git
+section about setting up a new repo_
+
h3. Set some variables
Adjust these to match your login account name and the URL of your Arvados repository. The Access→VMs and Access→Repositories pages on Workbench will show the specifics.
git remote add origin $repo_url
</pre>
+_Should explain each step_
+_Creating an empty branch in an empty repository makes git do weird
+things; this needs to be fixed using
+<pre>
+git branch --set-upstream pipeline_tutorial origin/pipeline_tutorial
+</pre>
+but I don't know what this means._
+
h3. Write the create-collection-by-trait script
<pre>
mkdir -p crunch_scripts
touch crunch_scripts/create-collection-by-trait
chmod +x crunch_scripts/create-collection-by-trait
-edit crunch_scripts/create-collection-by-trait
+nano crunch_scripts/create-collection-by-trait
</pre>
+_the -p to mkdir isn't necessary here_
+
Here is the script:
<pre>
<pre>
touch crunch_scripts/find-dbsnp-id
chmod +x crunch_scripts/find-dbsnp-id
-edit crunch_scripts/find-dbsnp-id
+nano crunch_scripts/find-dbsnp-id
</pre>
Here is the script:
this_task.set_output(out.finish())
</pre>
+_This should probably match the code we ran the user through in the
+previous tutorial, with the only difference being that the prior
+tutorial is interactive, and this tutorial is demonstrating how to
+create a job._
+
h3. Commit your new code
<pre>
<pre>
mkdir pipeline_templates
-edit pipeline_templates/find-dbsnp-by-trait.json
+nano pipeline_templates/find-dbsnp-by-trait.json
</pre>
Copy the following pipeline template.
<pre>
{
- "name":"find-dbsnp-by-trait",
+ "name":"find_dbsnp_by_trait",
"components":{
- "create-collection":{
+ "create_collection":{
"script":"create-collection-by-trait",
"script_parameters":{
"trait_name":"Non-melanoma skin cancer"
},
"script_version":"YOUR_GIT_COMMIT_SHA1_HERE"
},
- "find-variant":{
+ "find_variant":{
"script":"find-dbsnp-id",
"script_parameters":{
"input":{
- "output_of":"create-collection"
+ "output_of":"create_collection"
},
"dbsnp_id":"rs1126809"
},
}
</pre>
+_This desperately needs to be explained, since this is the actual
+pipeline definition_
+
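As a rough illustration of what the template expresses, the sketch below reads the two components and works out a run order from the @output_of@ reference. It is not how Arvados schedules jobs internally; @dependencies@ and @run_order@ are hypothetical helper names, and the embedded template is a trimmed copy with @script_version@ left out.

```python
import json

# Hypothetical, trimmed copy of the template above (script_version omitted).
TEMPLATE = json.loads("""
{
  "name": "find_dbsnp_by_trait",
  "components": {
    "create_collection": {
      "script": "create-collection-by-trait",
      "script_parameters": {"trait_name": "Non-melanoma skin cancer"}
    },
    "find_variant": {
      "script": "find-dbsnp-id",
      "script_parameters": {
        "input": {"output_of": "create_collection"},
        "dbsnp_id": "rs1126809"
      }
    }
  }
}
""")


def dependencies(component):
    """Return the names of components whose output this component consumes."""
    deps = []
    for value in component.get("script_parameters", {}).values():
        # A parameter of the form {"output_of": "<name>"} points at another component.
        if isinstance(value, dict) and "output_of" in value:
            deps.append(value["output_of"])
    return deps


def run_order(template):
    """Order components so each one comes after everything it depends on."""
    remaining = dict(template["components"])
    ordered = []
    while remaining:
        ready = sorted(
            name for name, comp in remaining.items()
            if all(dep in ordered for dep in dependencies(comp))
        )
        if not ready:
            raise ValueError("circular or unknown output_of reference")
        ordered.extend(ready)
        for name in ready:
            del remaining[name]
    return ordered


if __name__ == "__main__":
    print(run_order(TEMPLATE))
```

Read this way, @find_variant@ cannot start until @create_collection@ has produced an output collection, which is the only ordering constraint the template declares.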
h3. Store the pipeline template in Arvados
<pre>
-read -rd "\000" the_pipeline < pipeline_templates/find-dbsnp-by-trait.json
+read -rd $'\000' the_pipeline < pipeline_templates/find-dbsnp-by-trait.json
arv pipeline_template create --pipeline-template "$the_pipeline"
</pre>
The new pipeline template will also appear on the Workbench→Compute→Pipeline templates page.
+_Storing the pipeline in arvados as well as in git seems redundant_
+
h3. Invoke the pipeline using "arv pipeline run"
Replace the UUID here with the UUID of your own new pipeline template:
It is also displayed on the pipeline instance detail page: go to Workbench→Compute→Pipeline instances and click the UUID of your pipeline instance.
+_There needs to be an easier way to get the output from the workbench_
+
h3. Compute a summary statistic from the output collection
For this step we will use Python to read the output manifest and count how many of the inputs produced hits.
4 had the variant, 3 did not.
</pre>
-h3. Run the pipeline again using different parameters
+_Explain each step_
+
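The counting step can be sketched without the SDK by parsing manifest text directly. This is only an illustration under two assumptions that the tutorial does not state here: that the job writes one output file per input, and that an empty file means the variant was not found. The sample manifest, its block hash, and the file names are made up, and @count_hits@ is a hypothetical helper.

```python
# Hypothetical manifest: one stream, one 30-byte block, three output files.
# File tokens have the form position:size:name.
SAMPLE_MANIFEST = (
    ". 0123456789abcdef0123456789abcdef+30 "
    "0:10:human_a.txt 10:0:human_b.txt 10:20:human_c.txt\n"
)


def count_hits(manifest_text):
    """Count output files that are non-empty (hit) versus empty (no hit)."""
    had, had_not = 0, 0
    for line in manifest_text.splitlines():
        tokens = line.split()
        # tokens[0] is the stream name; block locators and file tokens follow.
        for token in tokens[1:]:
            parts = token.split(":", 2)
            is_file_token = (
                len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit()
            )
            if not is_file_token:
                continue  # block locator, not a file
            if int(parts[1]) > 0:
                had += 1
            else:
                had_not += 1
    return had, had_not


if __name__ == "__main__":
    had, had_not = count_hits(SAMPLE_MANIFEST)
    print("%d had the variant, %d did not." % (had, had_not))
```

If the scripts report hits differently (for example, by writing a file only when a match is found), the size test would need to change accordingly.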
+h3. Run the pipeline again using different parameters
We can use the same pipeline template to run the jobs again, this time overriding the "trait_name" parameter with a different value:
<pre>
-wh-run-pipeline-instance --template qr1hi-p5p6p-uf9gi9nolgakm85 create_collection::trait_name=cancer
+arv pipeline run --template qr1hi-p5p6p-uf9gi9nolgakm85 create_collection::trait_name=cancer
</pre>
When this pipeline instance finishes, run the same Python script on the new output collection to summarize the results.
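To make the @component::parameter=value@ syntax concrete, the sketch below applies such an override to a copy of a template held as a Python dict. It only illustrates the apparent meaning of the argument; it is not the actual @arv pipeline run@ implementation, and @apply_overrides@ is a hypothetical name.

```python
import copy


def apply_overrides(template, overrides):
    """Return a copy of template with component::parameter=value overrides applied."""
    result = copy.deepcopy(template)  # leave the stored template untouched
    for override in overrides:
        target, eq, value = override.partition("=")
        component, sep, parameter = target.partition("::")
        if not eq or not sep:
            raise ValueError("expected component::parameter=value, got %r" % override)
        result["components"][component]["script_parameters"][parameter] = value
    return result


if __name__ == "__main__":
    # Hypothetical, trimmed template with only the component being overridden.
    template = {
        "components": {
            "create_collection": {
                "script": "create-collection-by-trait",
                "script_parameters": {"trait_name": "Non-melanoma skin cancer"},
            }
        }
    }
    instance = apply_overrides(template, ["create_collection::trait_name=cancer"])
    print(instance["components"]["create_collection"]["script_parameters"])
```

Working on a deep copy mirrors the idea that the template stays as stored while each run gets its own parameter values.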