From: peter Date: Wed, 4 Dec 2013 21:32:39 +0000 (-0500) Subject: Notes on new user documentation X-Git-Tag: 1.1.0~2849^2~10 X-Git-Url: https://git.arvados.org/arvados.git/commitdiff_plain/e5ab13b7c5049571b450df5485a22e82504b97a9?hp=-c Notes on new user documentation --- e5ab13b7c5049571b450df5485a22e82504b97a9 diff --git a/doc/user/api-tokens.md b/doc/user/api-tokens.md index 7a02d21e15..de8c6d4b67 100644 --- a/doc/user/api-tokens.md +++ b/doc/user/api-tokens.md @@ -7,6 +7,10 @@ navorder: 1 # Getting an API token +> Needs a line or two to the effect of "an API token is a secret key that +> enables the command line client to access arvados with the proper +> permissions". + Open a browser and point it to the Workbench URL for your site. It will look something like this: @@ -16,6 +20,8 @@ Log in, if you haven't done that already. Click the "API tokens" link. +> Need to indicate that "API Tokens" is underneath "Access" + At the top of the "API tokens" page, you will see a few lines like this. ### Pasting the following lines at a shell prompt will allow Arvados SDKs @@ -33,13 +39,18 @@ to your terminal session. Now, `arv -h user current` will display your account info in JSON format. +> What does `-h` mean? + Optionally, copy those lines to your .bashrc file so you don't have to repeat this process each time you log in. ### SSL + development mode +> This section should go somewhere else, it is confusing to a new user. + If you are using a local development server with a self-signed certificate, you might need to bypass certificate verification. Don't do this if you are using a production service. export ARVADOS_API_HOST_INSECURE=yes + diff --git a/doc/user/index.md b/doc/user/index.md index 4555c9de9a..e9fc7699f0 100644 --- a/doc/user/index.md +++ b/doc/user/index.md @@ -5,6 +5,13 @@ title: Getting started navorder: 0 --- +> I'd like to see the user guide consist of the following sections: +> 1. 
background (general architecture/theory of operation from the user's perspective) +> 2. getting started / tutorials +> 3. how to (in depth topics) +> 4. tools reference (command line, workbench, etc) +> Currently the user guide is mostly just 2. + # Getting started As a new user, you should take a quick tour of Arvados environment. @@ -20,33 +27,57 @@ will look something like this: Depending on site policy, a site administrator might have to activate your account before you see any more good stuff. -### Browse shared data and pipelines +> "Good stuff" is vague. + +### Browse shared data angd pipelines On the Workbench home page, you should see some datasets, programs, jobs, and pipelines that you can explore. +> This would be a great place for a screenshot or at least a little +> more guidance on where to look (these things are all accessed +> through the menu bar) + ### Install the command line SDK on your workstation (Optional) +> Is this really optional? All the tutorials are about how to use +> the command line SDK + Most of the functionality in Arvados is exposed by the REST API. This means (depending on site policy and firewall) that you can do a lot of stuff with the command line client and other SDKs running on your own computer. +> "A lot of stuff" is vague. + Technically you can make all API calls using a generic web client like [curl](http://curl.haxx.se/docs/) but you will have a more enjoyable experience with the Arvados CLI client. +> I would mention this somewhere else, a new user isn't going to be using +> curl. + See [command line SDK](sdk-cli.html) for installation instructions. ### Request a virtual machine +> The purpose of this whole section is confusing, because after explaining that you +> can access arvados from your workstation with the client SDK, it then +> implies that you actually need to go and log into an arvados VM instance +> instead. 
+ It's more fun to do stuff with a virtual machine, especially if you know about [screen](http://www.gnu.org/software/screen/). +> Screen is cool, but not relevant here. + In order to get access to an Arvados VM, you need to: 1. Upload an SSH public key ([learn how](ssh-access.html)) 1. Request a new VM (or access to a shared VM) +> Needs some kind of discussion on how to request a new VM or discover +> the hostname of the shared VM + diff --git a/doc/user/intro-git.md b/doc/user/intro-git.md index 6996b905a3..37e847e959 100644 --- a/doc/user/intro-git.md +++ b/doc/user/intro-git.md @@ -13,8 +13,12 @@ Tutorials for Beginners](http://sixrevisions.com/resources/git-tutorials-beginners/)). Here we just cover the specifics of using git in the Arvados environment. +> "git is used in arvados for ..." + ### Find your repository +_This needs to have a section on creating repositories_ + Go to Workbench → Access → Repositories. [https://workbench.{{ site.arvados_api_host }}/repositories](https://workbench.{{ site.arvados_api_host }}/repositories) @@ -34,6 +38,9 @@ like this: ssh -A shell.q +> The .q is inconsistent with the earlier tutorial which sets up +> the .arvados configuration shortcut + At the shell prompt in the VM, type `ssh-add -l` to display a list of keys that can be used. You should see something like this: @@ -51,6 +58,8 @@ Log in to your VM (using `ssh -A`!) and type: (Replace "EXAMPLE" with your own repository's name, or just copy the usage example shown on the Repositories page.) +> The repositories page on the workbench under "access" + ### Commit to your repository This part works just like any other git tree. diff --git a/doc/user/intro-jobs.textile b/doc/user/intro-jobs.textile index 2a6762bab5..431caca838 100644 --- a/doc/user/intro-jobs.textile +++ b/doc/user/intro-jobs.textile @@ -9,8 +9,13 @@ h1. Intro: Jobs You can run MapReduce jobs by storing a job script in a git repository and creating a "job":../api/Jobs.html. 
+_Need to define MapReduce_ + Crunch jobs offer several advantages over running programs on your own local machine: +_This underplays it a bit, I would say it offers many, significant +advantages, not just "several"_ + * Increase concurrency by running tasks asynchronously, using many CPUs and network interfaces at once (especially beneficial for CPU-bound and I/O-bound tasks respectively). * Track inputs, outputs, and settings so you can verify that the inputs, settings, and sequence of programs you used to arrive at an output is really what you think it was. @@ -27,6 +32,10 @@ A job consists of a number of tasks which can be executed asynchronously. A single job program, or "crunch script", executes each task of a given job. The logic of a typical crunch script looks like this: +_This discussion of the structure of a job seems to miss the mark, +it's both too detailed for an introduction but not detailed enough to +be able to make use of the knowledge_ + * If this is the first task: examine the input, divide it into a number of asynchronous tasks, instruct Arvados to queue these tasks, output nothing, and indicate successful completion. * Otherwise, fetch a portion of the input from the cloud storage system, do some computation, store some output in the cloud, output a fragment of the output manifest, and indicate successful completion. @@ -37,8 +46,13 @@ If a job task fails, it is automatically re-attempted. If a task fails repeated h3. Developing and testing crunch scripts +_This seems like it should go in the tutorial section_ + Usually, it makes sense to test your script locally on small data sets. When you are satisfied that it works, commit it to the git repository and run it in Arvados. +_I'm confused. Is this example for running locally or running +remotely on arvados?_ + Save your job script (say, @foo@) in @{git-repo}/crunch_scripts/foo@. Make sure you have @ARVADOS_API_TOKEN@ and @ARVADOS_API_HOST@ set correctly ("more info":api-tokens.html). 
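Regarding the intro-jobs.textile note on the structure of a job: the two-branch crunch script logic quoted in that hunk might be easier to follow with a small illustration. The following is only a sketch under stated assumptions. It uses a toy in-memory queue instead of Arvados, runs tasks one after another rather than asynchronously, and the names `run_task`, `sequence`, and `input` are hypothetical, not taken from the Arvados SDK.

```python
from collections import deque


def run_task(task, queue, outputs):
    """Toy crunch-script logic: the first task splits the input, later tasks do the work.

    Hypothetical helper for illustration only; not part of any Arvados SDK.
    """
    if task["sequence"] == 0:
        # First task: divide the input and queue one new task per piece; output nothing.
        for piece in task["input"]:
            queue.append({"sequence": 1, "input": piece})
        return
    # Any other task: process its own piece and record a fragment of the output.
    outputs.append((task["input"], len(task["input"])))


# The job starts with a single task that holds the whole input.
queue = deque([{"sequence": 0, "input": ["aa", "bbb", "c"]}])
outputs = []

# Stand-in for the scheduler: run queued tasks until none are left.
while queue:
    run_task(queue.popleft(), queue, outputs)

print(outputs)
```

The point of the sketch is that one script serves both roles: it decides which branch to take by looking at the task it was handed.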
diff --git a/doc/user/intro-keep.md b/doc/user/intro-keep.md index 01dba9fbf0..a4a479ab3f 100644 --- a/doc/user/intro-keep.md +++ b/doc/user/intro-keep.md @@ -10,6 +10,9 @@ navorder: 3 Keep is a content-addressable storage system. Its semantics are inherently different from the POSIX-like file systems you're used to. +> Explain what "content-addressable" means more specifically. +> Define "locator" + Using Keep looks like this: 1. Write data. @@ -39,12 +42,20 @@ filesystem. It contains subdirectories and filenames, and indicates where to find the data blocks which comprise the files. It is encoded in plain text. +> Can a collection contain sub-collections? +> The "plain text" encoding is JSON, right? Either be specific or +> remove it because the user doesn't really need to know about the encoding +> at this level. + A data block contains between 1 byte and 64 MiB of data. Its locator is the MD5 checksum of the data, followed by a plus sign and its size in bytes (encoded as a decimal number). `acbd18db4cc2f85cedef654fccc4a4d8+3` +> What does this locator encode? Give an example so the astute +> reader could use "md5" herself to construct the id. + A locator may include additional "hints" to help the Keep store find a data block more quickly. For example, in the locator `acbd18db4cc2f85cedef654fccc4a4d8+3+K@{{ site.arvados_api_host }}` the @@ -65,6 +76,9 @@ delete unneeded data blocks. ### Tagging valuable data +> Now this goes from background introduction to tutorial, +> so this should probably be split up + Valuable data must be marked explicitly by creating a Collection in Arvados. Otherwise, the data blocks will be deleted during garbage collection. @@ -73,16 +87,22 @@ Use the arv(1) program to create a collection. For example: arv collection create --uuid "acbd18db4cc2f85cedef654fccc4a4d8+3" +> What does this actually do? + ## Getting started Write three bytes of data to Keep. echo -n foo | whput - +> What does "wh" stand for in the program name? 
+ Output: acbd18db4cc2f85cedef654fccc4a4d8+3+K@arv01 +> Explain that this is the locator that Keep has stored the data under + Retrieve the data. whget acbd18db4cc2f85cedef654fccc4a4d8+3+K@arv01 @@ -106,3 +126,4 @@ Output: ### Mounting a single collection as a POSIX filesystem +> Needs a yellow "this web page under construction" sign with a guy shoveling dirt. diff --git a/doc/user/sdk-cli.textile b/doc/user/sdk-cli.textile index 145ca4d478..55e477db57 100644 --- a/doc/user/sdk-cli.textile +++ b/doc/user/sdk-cli.textile @@ -13,6 +13,8 @@ If you are logged in to an Arvados VM, the command line SDK should be installed. @arv --help@ +_Help is not helpful. See bug #1667_ + h3. First... Set the ARVADOS_API_HOST environment variable. @@ -21,16 +23,22 @@ Set the ARVADOS_API_HOST environment variable. Log in to Workbench and get an API token for your account. Set the ARVADOS_API_TOKEN environment variable. -@export ARVADOS_API_TOKEN=c0vdbi8wp7f703lbthyadlvmaivgldxssy3l32isslmax93k9@ +@export +ARVADOS_API_TOKEN=c0vdbi8wp7f703lbthyadlvmaivgldxssy3l32isslmax93k9@ If you are using a development instance with an unverifiable SSL certificate, set the ARVADOS_API_HOST_INSECURE environment variable. @export ARVADOS_API_HOST_INSECURE=1@ +_This should link back to "api-tokens":api-tokens.html instead of +re-explaining it__ + h3. Usage @arv [global_options] resource_type resource_method [method_parameters]@ +_This is what arv --help really ought to print out_ + h3. Basic examples Get UUID of the current user @@ -53,6 +61,9 @@ Get list of groups (showing entire records) h3. Global options +_Move these up to before "basic examples", and give examples of what +these options do and how they might be useful._ + - @--json@, @-j@ := Output entire response as compact JSON. - @--pretty@, @--human@, @-h@ := Output entire response as JSON with whitespace for better human-readability. 
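Regarding the intro-keep.md note asking for an example that lets the reader construct a locator herself: a minimal sketch, assuming only Python's standard `hashlib`. The helper name `keep_locator` is hypothetical and is not an Arvados function. It builds only the basic form described in the doc (MD5 checksum, plus sign, size in bytes as a decimal number) and adds no hints such as `+K@arv01`.

```python
import hashlib


def keep_locator(data: bytes) -> str:
    """Return the basic locator for a data block: MD5 hex digest, '+', size in bytes.

    Hypothetical helper for illustration; hints like +K@... are not added.
    """
    digest = hashlib.md5(data).hexdigest()
    return f"{digest}+{len(data)}"


# The three bytes written by `echo -n foo | whput -` in the doc.
print(keep_locator(b"foo"))  # acbd18db4cc2f85cedef654fccc4a4d8+3
```

This matches the locator shown in the doc for the three-byte block `foo`, before the Keep store appends its own hint.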
diff --git a/doc/user/ssh-access.md b/doc/user/ssh-access.md index 9f8c0c025d..0e403dc799 100644 --- a/doc/user/ssh-access.md +++ b/doc/user/ssh-access.md @@ -23,6 +23,9 @@ tutorial](https://www.google.com/search?q=github+ssh+key+help) ### Associate your SSH public key with your Arvados Workbench account +> Maybe mention that the "Add a new authorized key" button will be at the bottom of the page + + Go to the `Keys` page in Arvados Workbench (under the `Access` tab) and click the

Add a new authorized key

@@ -47,6 +50,11 @@ file: Host *.arvados ProxyCommand ssh -p2222 turnout@switchyard.{{ site.arvados_api_host }} -x -a $SSH_PROXY_FLAGS %h +> This needs to be explained that it is adding an alias to make it easier to log into an +> arvados server on port 2222. This is not actually necessary if the user doesn't mind some typing. +> Actually, it might make sense to show the regular command line first, and then mention later that +> it can be shortened using ~/.ssh/config. + If you have access to an account `foo` on a VM called `blurfl` then you can log in like this: @@ -60,6 +68,9 @@ Some other convenient configuration options are `User` and User foo ForwardAgent yes +> This shortened *.arvados to *.a +> This should be consistent + Adding `User foo` will log you in to the VM as user `foo` by default, so you can just `ssh blurfl.a`. The `ForwardAgent yes` option turns on the `ssh -A` option to forward your SSH credentials (if you are diff --git a/doc/user/tutorial-gatk-variantfiltration.textile b/doc/user/tutorial-gatk-variantfiltration.textile index 39ed4973f6..0a5a9a4e5e 100644 --- a/doc/user/tutorial-gatk-variantfiltration.textile +++ b/doc/user/tutorial-gatk-variantfiltration.textile @@ -9,6 +9,13 @@ h1. Tutorial: GATK VariantFiltration Here you will use the GATK VariantFiltration program to assign pass/fail scores to variants in a VCF file. +_This should be motivated better using a specific biomedical research +or diagnostic question that involves this analysis_ + +_From conversation with Ward: We should link to a discussion of the +personal genome project and explain that it a freely available dataset +that any researcher can use, which makes it appropriate to be used in these examples._ + h3. Prerequisites * Log in to a VM "using SSH":ssh-access.html @@ -19,8 +26,16 @@ If everything is set up correctly, the command @arv -h user current@ will displa h3. Get the GATK binary distribution. 
+_Perhaps separate out this and the next sections and link to it so the user only +has to do this if they really don't have GATK installed. Also provide +a way to determine if they do have it._ + Download the GATK binary tarball[1] -- e.g., @GenomeAnalysisTK-2.6-4.tar.bz2@ -- and copy it to your Arvados VM. +_Is it necessary to copy it to the Arvados VM first, you could put it into +keep from your desktop and/or use the workbench? Also if we are +telling them to copy it to the VM, maybe we should mention scp?_ + Store it in Keep.
@@ -29,6 +44,8 @@ arv keep put --in-manifest GenomeAnalysisTK-2.6-4.tar.bz2
 
 ↓
 
+_Make the itty bitty down arrows bigger, and maybe center them_
+
 
 c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi
 
@@ -88,6 +105,8 @@ Note the job UUID in the API response. h3. Monitor job progress +_This was already covered in tutorial1_ + There are three ways to monitor job progress: # Go to Workbench, drop down the Compute menu, and click Jobs. The job you submitted should appear at the top of the list. Hit "Refresh" until it finishes. @@ -99,6 +118,9 @@ curl -s -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \ https://{{ site.arvados_api_host }}/arvados/v1/jobs/JOB_UUID_HERE/log_tail_follow
+ +_That's it? Say something about the output we're going to get_ + h3. Notes fn1. Download the GATK tools → http://www.broadinstitute.org/gatk/download diff --git a/doc/user/tutorial-job-debug.textile b/doc/user/tutorial-job-debug.textile index 1be74e31ea..09e160cf06 100644 --- a/doc/user/tutorial-job-debug.textile +++ b/doc/user/tutorial-job-debug.textile @@ -28,6 +28,8 @@ yourname@shell:~$ cd yourrepo yourname@shell:~/yourrepo$ mkdir -p crunch_scripts +_Should mention somewhere that it *must* be in a directory called crunch_scripts_ + {% include notebox-begin.html %} The process described here should work regardless of whether @yourrepo@ contains a git repository. But normally you would only ever edit code in a git tree -- especially crunch scripts, which can't be used to run regular Arvados jobs until they're committed and pushed. @@ -124,6 +126,8 @@ out.write('hello world') out_collection = out.finish() +_Explain better_ + The return value of @out.finish()@ is the content address (hash) of a collection stored in Keep. h3. Record successful completion @@ -138,6 +142,8 @@ The @set_output()@ method tells Arvados that * The hash specified by the @out_collection@ parameter is the output of this task, and * This task completed successfully. +_Putting this in a separate section from the code blob above is confusing!_ + h3. Run the working script.
diff --git a/doc/user/tutorial-job1.textile b/doc/user/tutorial-job1.textile
index ff5f6a1cce..0cc24e64c2 100644
--- a/doc/user/tutorial-job1.textile
+++ b/doc/user/tutorial-job1.textile
@@ -11,12 +11,18 @@ Here you will use the "arv" command line tool to run a simple Crunch script on s
 
 h3. Prerequisites
 
+_Needs a mention of going to Access->VMs on the workbench_
+
 * Log in to a VM "using SSH":ssh-access.html
 * Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
 * Put the API host name in your @ARVADOS_API_HOST@ environment variable
 
 If everything is set up correctly, the command @arv -h user current@ will display your account information.
 
+
+_If you are logged in to a fully provisioned VM, presumably the gems
+are already installed.  This discussion should go somewhere else._
+
 Arv depends on a few gems. It will tell you which ones to install, if they are not present yet. If you need to install the dependencies and are doing so as a non-root user, make sure you set GEM_HOME before you run gem install:
 
 
@@ -29,12 +35,19 @@ We will run the "hash" program, which computes the MD5 hash of each file in a co
 
 Pick a data collection. We'll use @33a9f3842b01ea3fdf27cc582f5ea2af@ here.
 
+_How do I know if I have this data?  Does it come as example data with
+the arvados distribution?  Is there something notable about it, like
+it is very large and spans multiple keep blocks?_
+
 
 the_collection=33a9f3842b01ea3fdf27cc582f5ea2af
 
Pick a code version. We'll use @5565778cf15ae9af22ad392053430213e9016631@ here. +_How do I know if I have this code version? What does this refer to? +A git revision? Or a keep id? In what repository?_ +
 the_version=5565778cf15ae9af22ad392053430213e9016631
 
@@ -54,7 +67,12 @@ read -rd $'\000' the_job < -(The @read -rd $'\000'@ part uses a bash feature to help us get a multi-line string with lots of double quotation marks into a shell variable.) +_Need to explain what the json fields mean, it is explained later but +there should be some mention up here._ + +(The @read -rd $'\000'@ part uses a bash feature to help us get a +multi-line string with lots of double quotation marks into a shell +variable.) Submit the job. @@ -79,11 +97,18 @@ arv -h job create --job "$the_job" }
+_What is this? An example of what "arv" returns? What do the fields mean?_ + h3. Monitor job progress +_And then the magic happens. There should be some more discussion of what +is going on in the background once the job is submitted from the +user's perspective. Is it queued, running, etc.?_ + Go to Workbench, drop down the Compute menu, and click Jobs. The job you submitted should appear at the top of the list. -Hit "Refresh" until it finishes. +Hit "Refresh" until it finishes. _We should really make the page +autorefresh or use a streamed-update framework_ You can also watch the log messages while the job runs: @@ -145,3 +170,9 @@ cd arvados git checkout $the_version less crunch_scripts/hash
+ +_If we're going to direct the user to open up the code, some +discussion of the python API is probably in order. If the hash +job is going to be the canonical first crunch map reduce program +for everybody, then we should break down the program line-by-line and +explain every step in detail._ diff --git a/doc/user/tutorial-new-pipeline.textile b/doc/user/tutorial-new-pipeline.textile index 0b42746c51..1dca21f78b 100644 --- a/doc/user/tutorial-new-pipeline.textile +++ b/doc/user/tutorial-new-pipeline.textile @@ -9,6 +9,9 @@ h1. Tutorial: Construct a new pipeline Here you will write two new crunch scripts, incorporate them into a new pipeline template, run the new pipeline a couple of times using different parameters, and compare the results. One of the new scripts will use the Arvados API to look up trait→human→data relations and use this information to compile a collection of data to analyze. +_Like the previous tutorial, this needs more of a basis in some actual +clinical/research question to motivate it_ + h3. Prerequisites * Log in to a VM "using SSH":ssh-access.html @@ -31,6 +34,8 @@ ssh -A my_vm.arvados ssh-add -l # (run this in your VM account to confirm forwarding works) +_This discussion about ssh should probably go under the "ssh" section_ + With PuTTY under Windows, run "pageant", add your key to the agent, and turn on agent forwarding in your PuTTY settings. *Option 2:* Edit code on your workstation and push code to your Arvados repository from there instead of your VM account. Depending on your @.ssh/config@ file, you will use names like @my_vm_name.arvados@ instead of @my_vm_name.{{ site.arvados_api_host }}@ in git and ssh commands. @@ -69,6 +74,10 @@ the gitolite config gives you the following access: R W your_repo_name +_You need to have a git repository set up already, which is not +necessarily the case for new users, so this should link to the git +section about setting up a new repo_ + h3. 
Set some variables Adjust these to match your login account name and the URL of your Arvados repository. The Access→VMs and Access→Repositories pages on Workbench will show the specifics. @@ -90,6 +99,14 @@ git checkout -b pipeline-tutorial git remote add origin $repo_url +_Should explain each step_ +_Creating an empty branch in an empty repository makes git do weird +things, need to fix using +
+git branch --set-upstream pipeline-tutorial origin/pipeline-tutorial
+
+but I don't know what this means._ + h3. Write the create-collection-by-trait script
@@ -99,6 +116,8 @@ chmod +x crunch_scripts/create-collection-by-trait
 nano crunch_scripts/create-collection-by-trait
 
+_the -p to mkdir isn't necessary here_ + Here is the script:
@@ -178,6 +197,11 @@ for line in input_file.readlines():
 this_task.set_output(out.finish())
 
+_This should probably match the code we ran the user through in the +previous tutorial, with the only difference being that the prior +tutorial is interactive, and this tutorial is demonstrating how to +create a job._ + h3. Commit your new code
@@ -245,6 +269,9 @@ Copy the following pipeline template.
 }
 
+_This desperately needs to be explained, since this is the actual +pipeline definition_ + h3. Store the pipeline template in Arvados
@@ -261,6 +288,8 @@ qr1hi-p5p6p-uf9gi9nolgakm85
 
 The new pipeline template will also appear on the Workbench→Compute→Pipeline templates page.
 
+_Storing the pipeline in arvados as well as in git seems redundant_
+
 h3. Invoke the pipeline using "arv pipeline run"
 
 Replace the UUID here with the UUID of your own new pipeline template:
@@ -289,6 +318,8 @@ The output of the "find_variant" component is shown in your terminal with the la
 
 It is also displayed on the pipeline instance detail page: go to Workbench→Compute→Pipeline instances and click the UUID of your pipeline instance.
 
+_There needs to be an easier way to get the output from the workbench_
+
 h3. Compute a summary statistic from the output collection
 
 For this step we will use python to read the output manifest and count how many of the inputs produced hits.
@@ -312,7 +343,9 @@ print "%d had the variant, %d did not." % (hits, misses)
 4 had the variant, 3 did not.
 
-h3. Run the pipeline again using different parameters +_Explain each step_ + +_h3. Run the pipeline again using different parameters We can use the same pipeline template to run the jobs again, this time overriding the "trait_name" parameter with a different value: diff --git a/doc/user/tutorial-trait-search.textile b/doc/user/tutorial-trait-search.textile index 92646a7364..d98ba2efd5 100644 --- a/doc/user/tutorial-trait-search.textile +++ b/doc/user/tutorial-trait-search.textile @@ -9,6 +9,16 @@ h1. Tutorial: Search PGP data by trait Here you will use the Python SDK to find public WGS data for people who have a certain medical condition. +_Define WGS_ + +_Explain the motivation in this example a little better. If I'm +reading this right, the workflow is +traits -> people with those traits -> presence of a specific genetic +variant in the people with the reported traits_ + +_Rather than having the user do this through the Python command line, +it might be easier to write a file that is going to do each step_ + h3. Prerequisites * Log in to a VM "using SSH":ssh-access.html @@ -46,6 +56,9 @@ for t in filter(lambda t: re.search('cancer', t['name']), +_Should break this down into steps instead of being clever and making +it a python one-liner_ + ↓
@@ -74,6 +87,8 @@ trait_links = arvados.service.links().list(limit=1000,where=json.dumps({
   })).execute()['items']
 
+_Same comment, break this out and describe each step_ + The "tail_uuid" attribute of each of these Links refers to a Human.
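Regarding the tutorial-trait-search.textile notes asking for the one-liners to be broken into steps: a minimal sketch of what the stepwise form of the `filter(lambda t: re.search('cancer', t['name']), ...)` expression could look like. It assumes a small hand-written list in place of the API response. Only the `name` key and the `'cancer'` pattern come from the tutorial; the `uuid` values and trait names are made up for illustration.

```python
import re

# Hypothetical stand-in for the trait records the tutorial fetches from the API.
traits = [
    {"uuid": "trait-1", "name": "Breast cancer"},
    {"uuid": "trait-2", "name": "Asthma"},
    {"uuid": "trait-3", "name": "Non-melanoma skin cancer"},
]

# Step 1: compile the search pattern once.
pattern = re.compile("cancer")

# Step 2: keep only the traits whose name matches the pattern.
matching = [t for t in traits if pattern.search(t["name"])]

# Step 3: show the UUID and name of each matching trait.
for t in matching:
    print(t["uuid"], t["name"])
```

Naming the intermediate list (`matching` here) also gives the tutorial text something concrete to refer to when it moves on to the Links lookup.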