Notes on new user documentation

diff --git a/doc/user/tutorial-new-pipeline.textile b/doc/user/tutorial-new-pipeline.textile

index 8f1a3434587f91a97cbfb13e4472ebcd8aefad6c..1dca21f78bb0cd7296d972c4c85611f1504fa471 100644
--- a/doc/user/tutorial-new-pipeline.textile
+++ b/doc/user/tutorial-new-pipeline.textile
@@ -9,6 +9,9 @@ h1. Tutorial: Construct a new pipeline
  
  Here you will write two new crunch scripts, incorporate them into a new pipeline template, run the new pipeline a couple of times using different parameters, and compare the results. One of the new scripts will use the Arvados API to look up trait&rarr;human&rarr;data relations and use this information to compile a collection of data to analyze.
  
+_Like the previous tutorial, this one needs an actual clinical or
+research question to motivate it._
+
  h3. Prerequisites
  
  * Log in to a VM "using SSH":ssh-access.html
@@ -31,6 +34,8 @@ ssh -A my_vm.arvados
  ssh-add -l       # (run this in your VM account to confirm forwarding works)
  </pre>
  
+_This discussion about ssh should probably go under the "ssh" section_
+
  With PuTTY under Windows, run "pageant", add your key to the agent, and turn on agent forwarding in your PuTTY settings.
  
  *Option 2:* Edit code on your workstation and push code to your Arvados repository from there instead of your VM account. Depending on your @.ssh/config@ file, you will use names like @my_vm_name.arvados@ instead of @my_vm_name.{{ site.arvados_api_host }}@ in git and ssh commands.
@@ -58,7 +63,7 @@ git push arvados master:master  #(update arvados hosted copy)
  Whichever setup you choose, if everything is working correctly, this command should give you a list of repositories you can access:
  
  <pre>
-ssh git@git.{{ site.arvados_api_host }}
+ssh -T git@git.{{ site.arvados_api_host }}
  </pre>
  
  &darr;
@@ -69,6 +74,10 @@ the gitolite config gives you the following access:
       R   W      your_repo_name
  </pre>
  
+_You need to have a git repository set up already, which is not
+necessarily the case for new users, so this should link to the git
+section about setting up a new repo_
+
  h3. Set some variables
  
  Adjust these to match your login account name and the URL of your Arvados repository. The Access&rarr;VMs and Access&rarr;Repositories pages on Workbench will show the specifics.
@@ -90,15 +99,25 @@ git checkout -b pipeline-tutorial
  git remote add origin $repo_url
  </pre>
  
+_Should explain each step._
+_Creating an empty branch in an empty repository makes git do weird
+things; this needs to be fixed using
+<pre>
+git branch --set-upstream-to=origin/pipeline-tutorial pipeline-tutorial
+</pre>
+but I don't know what this means._
+
  h3. Write the create-collection-by-trait script
  
  <pre>
  mkdir -p crunch_scripts
  touch crunch_scripts/create-collection-by-trait
  chmod +x crunch_scripts/create-collection-by-trait
-edit crunch_scripts/create-collection-by-trait
+nano crunch_scripts/create-collection-by-trait
  </pre>
  
+_The -p flag to mkdir isn't necessary here._
+
  Here is the script:
  
  <pre>
@@ -147,7 +166,7 @@ h3. Write the find-dbsnp-id script
  <pre>
  touch crunch_scripts/find-dbsnp-id
  chmod +x crunch_scripts/find-dbsnp-id
-edit crunch_scripts/find-dbsnp-id
+nano crunch_scripts/find-dbsnp-id
  </pre>
  
  Here is the script:
@@ -178,6 +197,11 @@ for line in input_file.readlines():
  this_task.set_output(out.finish())
  </pre>
  
+_This should probably match the code we ran the user through in the
+previous tutorial, with the only difference being that the prior
+tutorial is interactive, and this tutorial demonstrates how to
+create a job._
+
  h3. Commit your new code
  
  <pre>
@@ -215,27 +239,27 @@ Make a directory called @pipeline_templates@ and create a file called @find-dbsn
  
  <pre>
  mkdir pipeline_templates
-edit pipeline_templates/find-dbsnp-by-trait.json
+nano pipeline_templates/find-dbsnp-by-trait.json
  </pre>
  
  Copy the following pipeline template.
  
  <pre>
  {
-  "name":"find-dbsnp-by-trait",
+  "name":"find_dbsnp_by_trait",
    "components":{
-    "create-collection":{
+    "create_collection":{
        "script":"create-collection-by-trait",
        "script_parameters":{
          "trait_name":"Non-melanoma skin cancer"
        },
        "script_version":"YOUR_GIT_COMMIT_SHA1_HERE"
      },
-    "find-variant":{
+    "find_variant":{
        "script":"find-dbsnp-id",
        "script_parameters":{
          "input":{
-          "output_of":"create-collection"
+          "output_of":"create_collection"
          },
          "dbsnp_id":"rs1126809"
        },
@@ -245,10 +269,13 @@ Copy the following pipeline template.
  }
  </pre>
  
+_This desperately needs to be explained, since this is the actual
+pipeline definition._
+
  h3. Store the pipeline template in Arvados
  
  <pre>
-read -rd "\000" the_pipeline < pipeline_templates/find-dbsnp-by-trait.json
+read -rd $'\000' the_pipeline < pipeline_templates/find-dbsnp-by-trait.json
  
  arv pipeline_template create --pipeline-template "$the_pipeline"
  </pre>
@@ -261,6 +288,8 @@ qr1hi-p5p6p-uf9gi9nolgakm85
  
  The new pipeline template will also appear on the Workbench&rarr;Compute&rarr;Pipeline&nbsp;templates page.
  
+_Storing the pipeline in Arvados as well as in git seems redundant._
+
  h3. Invoke the pipeline using "arv pipeline run"
  
  Replace the UUID here with the UUID of your own new pipeline template:
@@ -289,6 +318,8 @@ The output of the "find_variant" component is shown in your terminal with the la
  
  It is also displayed on the pipeline instance detail page: go to Workbench&rarr;Compute&rarr;Pipeline&nbsp;instances and click the UUID of your pipeline instance.
  
+_There needs to be an easier way to get the output from Workbench._
+
  h3. Compute a summary statistic from the output collection
  
  For this step we will use python to read the output manifest and count how many of the inputs produced hits.
@@ -312,12 +343,14 @@ print "%d had the variant, %d did not." % (hits, misses)
  4 had the variant, 3 did not.
  </pre>
  
-h3. Run the pipeline again using different parameters
+_Explain each step_
+
+h3. Run the pipeline again using different parameters
  
  We can use the same pipeline template to run the jobs again, this time overriding the "trait_name" parameter with a different value:
  
  <pre>
-wh-run-pipeline-instance --template qr1hi-p5p6p-uf9gi9nolgakm85 create_collection::trait_name=cancer
+arv pipeline run --template qr1hi-p5p6p-uf9gi9nolgakm85 create_collection::trait_name=cancer
  </pre>
  
  When this template instance finishes, run the same Python script on the new output collection to summarize the results.
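
Since the notes above ask for the pipeline definition to be explained, the following is a minimal sketch of how the template's two linking mechanisms can be read: @output_of@ makes one component depend on another, and an override such as @create_collection::trait_name=cancer@ replaces one entry in a component's @script_parameters@. This is an illustration only, not how Arvados implements either feature; the helper names @apply_override@ and @run_order@ are hypothetical, the template is a trimmed copy that omits @script_version@, and cyclic @output_of@ references are not handled.

```python
import copy
import json

# Trimmed copy of the template from the diff above; "script_version" is omitted.
TEMPLATE = json.loads("""
{
  "name": "find_dbsnp_by_trait",
  "components": {
    "create_collection": {
      "script": "create-collection-by-trait",
      "script_parameters": {"trait_name": "Non-melanoma skin cancer"}
    },
    "find_variant": {
      "script": "find-dbsnp-id",
      "script_parameters": {
        "input": {"output_of": "create_collection"},
        "dbsnp_id": "rs1126809"
      }
    }
  }
}
""")


def apply_override(template, override):
    """Hypothetical helper: apply one 'component::param=value' override to a copy."""
    target, value = override.split("=", 1)
    component, param = target.split("::", 1)
    result = copy.deepcopy(template)  # leave the stored template untouched
    result["components"][component]["script_parameters"][param] = value
    return result


def run_order(template):
    """Hypothetical helper: list components so each follows its output_of source."""
    components = template["components"]
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for value in components[name]["script_parameters"].values():
            # A parameter of the form {"output_of": "x"} depends on component x.
            if isinstance(value, dict) and "output_of" in value:
                visit(value["output_of"])
        ordered.append(name)

    for name in sorted(components):
        visit(name)
    return ordered


if __name__ == "__main__":
    patched = apply_override(TEMPLATE, "create_collection::trait_name=cancer")
    print(patched["components"]["create_collection"]["script_parameters"]["trait_name"])
    print(run_order(patched))
```

Working on a deep copy mirrors the idea that the stored template stays unchanged while each run gets its own parameter values.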