# Getting an API token
+> Needs a line or two to the effect of "an API token is a secret key that
+> enables the command line client to access arvados with the proper
+> permissions".
+
Open a browser and point it to the Workbench URL for your site. It
will look something like this:
Click the "API tokens" link.
+> Need to indicate that "API Tokens" is underneath "Access"
+
At the top of the "API tokens" page, you will see a few lines like this.
### Pasting the following lines at a shell prompt will allow Arvados SDKs
Now, `arv -h user current` will display your account info in JSON
format.
+> What does `-h` mean?
+
Optionally, copy those lines to your .bashrc file so you don't have to
repeat this process each time you log in.
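For example, a minimal sketch (the host and token below are placeholders; paste the exact export lines shown on your own "API tokens" page):

```shell
# Append the export lines to ~/.bashrc so every new login shell is configured.
# Placeholder values -- substitute the lines from your "API tokens" page.
cat >> ~/.bashrc <<'EOF'
export ARVADOS_API_HOST={{ site.arvados_api_host }}
export ARVADOS_API_TOKEN=c0vdbi8wp7f703lbthyadlvmaivgldxssy3l32isslmax93k9
EOF
```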
### SSL + development mode
+> This section should go somewhere else, it is confusing to a new user.
+
If you are using a local development server with a self-signed
certificate, you might need to bypass certificate verification. Don't
do this if you are using a production service.
export ARVADOS_API_HOST_INSECURE=yes
+
navorder: 0
---
+> I'd like to see the user guide consist of the following sections:
+> 1. background (general architecture/theory of operation from the user's perspective)
+> 2. getting started / tutorials
+> 3. how to (in depth topics)
+> 4. tools reference (command line, workbench, etc)
+> Currently the user guide is mostly just 2.
+
# Getting started
As a new user, you should take a quick tour of the Arvados environment.
Depending on site policy, a site administrator might have to activate
your account before you see any more good stuff.
-### Browse shared data and pipelines
+> "Good stuff" is vague.
+
+### Browse shared data and pipelines
On the Workbench home page, you should see some datasets, programs,
jobs, and pipelines that you can explore.
+> This would be a great place for a screenshot or at least a little
+> more guidance on where to look (these things are all accessed
+> through the menu bar)
+
### Install the command line SDK on your workstation
(Optional)
+> Is this really optional? All the tutorials are about how to use
+> the command line SDK
+
Most of the functionality in Arvados is exposed by the REST API. This
means (depending on site policy and firewall) that you can do a lot of
stuff with the command line client and other SDKs running on your own
computer.
+> "A lot of stuff" is vague.
+
Technically you can make all API calls using a generic web client like
[curl](http://curl.haxx.se/docs/) but you will have a more enjoyable
experience with the Arvados CLI client.
+> I would mention this somewhere else, a new user isn't going to be using
+> curl.
+
See [command line SDK](sdk-cli.html) for installation instructions.
### Request a virtual machine
+> The purpose of this whole section is confusing, because after explaining that you
+> can access arvados from your workstation with the client SDK, it then
+> implies that you actually need to go and log into an arvados VM instance
+> instead.
+
It's more fun to do stuff with a virtual machine, especially if you
know about [screen](http://www.gnu.org/software/screen/).
+> Screen is cool, but not relevant here.
+
In order to get access to an Arvados VM, you need to:
1. Upload an SSH public key ([learn how](ssh-access.html))
1. Request a new VM (or access to a shared VM)
+> Needs some kind of discussion on how to request a new VM or discover
+> the hostname of the shared VM
+
Beginners](http://sixrevisions.com/resources/git-tutorials-beginners/)). Here
we just cover the specifics of using git in the Arvados environment.
+> "git is used in arvados for ..."
+
### Find your repository
+_This needs to have a section on creating repositories_
+
Go to Workbench → Access → Repositories.
[https://workbench.{{ site.arvados_api_host }}/repositories](https://workbench.{{ site.arvados_api_host }}/repositories)
ssh -A shell.q
+> The .q is inconsistent with the earlier tutorial which sets up
+> the .arvados configuration shortcut
+
At the shell prompt in the VM, type `ssh-add -l` to display a list of
keys that can be used. You should see something like this:
(Replace "EXAMPLE" with your own repository's name, or just copy the
usage example shown on the Repositories page.)
+> The repositories page on the workbench under "access"
+
### Commit to your repository
This part works just like any other git tree.
You can run MapReduce jobs by storing a job script in a git repository and creating a "job":../api/Jobs.html.
+_Need to define MapReduce_
+
Crunch jobs offer several advantages over running programs on your own local machine:
+_This underplays it a bit; I would say it offers many significant
+advantages, not just "several"_
+
* Increase concurrency by running tasks asynchronously, using many CPUs and network interfaces at once (especially beneficial for CPU-bound and I/O-bound tasks respectively).
* Track inputs, outputs, and settings so you can verify that the inputs, settings, and sequence of programs you used to arrive at an output is really what you think it was.
A single job program, or "crunch script", executes each task of a given job. The logic of a typical crunch script looks like this:
+_This discussion of the structure of a job seems to miss the mark,
+it's both too detailed for an introduction but not detailed enough to
+be able to make use of the knowledge_
+
* If this is the first task: examine the input, divide it into a number of asynchronous tasks, instruct Arvados to queue these tasks, output nothing, and indicate successful completion.
* Otherwise, fetch a portion of the input from the cloud storage system, do some computation, store some output in the cloud, output a fragment of the output manifest, and indicate successful completion.
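In pseudocode, the two cases above look roughly like this (this is not the real SDK API; the tutorials show working scripts):

<pre>
if this is the first task:
    examine the job input and split it into N chunks
    queue one new task per chunk
    emit no output; mark this task successful
else:
    fetch this task's chunk of the input from Keep
    compute over it and store the result in Keep
    emit the corresponding fragment of the output manifest
    mark this task successful
</pre>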
h3. Developing and testing crunch scripts
+_This seems like it should go in the tutorial section_
+
Usually, it makes sense to test your script locally on small data sets. When you are satisfied that it works, commit it to the git repository and run it in Arvados.
+_I'm confused. Is this example for running locally or running
+remotely on arvados?_
+
Save your job script (say, @foo@) in @{git-repo}/crunch_scripts/foo@.
Make sure you have @ARVADOS_API_TOKEN@ and @ARVADOS_API_HOST@ set correctly ("more info":api-tokens.html).
Keep is a content-addressable storage system. Its semantics are
inherently different from the POSIX-like file systems you're used to.
+> Explain what "content-addressable" means more specifically.
+> Define "locator"
+
Using Keep looks like this:
1. Write data.
where to find the data blocks which comprise the files. It is encoded
in plain text.
+> Can a collection contain sub-collections?
+> The "plain text" encoding is JSON, right? Either be specific or
+> remove it because the user doesn't really need to know about the encoding
+> at this level.
+
A data block contains between 1 byte and 64 MiB of data. Its locator
is the MD5 checksum of the data, followed by a plus sign and its size
in bytes (encoded as a decimal number).
`acbd18db4cc2f85cedef654fccc4a4d8+3`
+> What does this locator encode? Give an example so the astute
+> reader could use "md5" herself to construct the id.
+
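To make that concrete, the locator above can be reconstructed by hand with standard tools (a sketch assuming GNU `md5sum` is available):

```shell
# The locator is the MD5 of the 3-byte string "foo", a plus sign, and its size.
data="foo"
hash=$(printf '%s' "$data" | md5sum | cut -d' ' -f1)
locator="${hash}+${#data}"
echo "$locator"   # prints acbd18db4cc2f85cedef654fccc4a4d8+3
```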
A locator may include additional "hints" to help the Keep store find a
data block more quickly. For example, in the locator
`acbd18db4cc2f85cedef654fccc4a4d8+3+K@{{ site.arvados_api_host }}` the
### Tagging valuable data
+> Now this goes from background introduction to tutorial,
+> so this should probably be split up
+
Valuable data must be marked explicitly by creating a Collection in
Arvados. Otherwise, the data blocks will be deleted during garbage
collection.
arv collection create --uuid "acbd18db4cc2f85cedef654fccc4a4d8+3"
+> What does this actually do?
+
## Getting started
Write three bytes of data to Keep.
echo -n foo | whput -
+> What does "wh" stand for in the program name?
+
Output:
acbd18db4cc2f85cedef654fccc4a4d8+3+K@arv01
+> Explain that this is the locator that Keep has stored the data under
+
Retrieve the data.
whget acbd18db4cc2f85cedef654fccc4a4d8+3+K@arv01
### Mounting a single collection as a POSIX filesystem
+> Needs a yellow "this web page under construction" sign with a guy shoveling dirt.
@arv --help@
+_Help is not helpful. See bug #1667_
+
h3. First...
Set the ARVADOS_API_HOST environment variable.
Log in to Workbench and get an API token for your account. Set the ARVADOS_API_TOKEN environment variable.
-@export ARVADOS_API_TOKEN=c0vdbi8wp7f703lbthyadlvmaivgldxssy3l32isslmax93k9@
+@export
+ARVADOS_API_TOKEN=c0vdbi8wp7f703lbthyadlvmaivgldxssy3l32isslmax93k9@
If you are using a development instance with an unverifiable SSL certificate, set the ARVADOS_API_HOST_INSECURE environment variable.
@export ARVADOS_API_HOST_INSECURE=1@
+_This should link back to "api-tokens":api-tokens.html instead of
+re-explaining it_
+
h3. Usage
@arv [global_options] resource_type resource_method [method_parameters]@
+_This is what arv --help really ought to print out_
+
h3. Basic examples
Get UUID of the current user
h3. Global options
+_Move these up to before "basic examples", and give examples of what
+these options do and how they might be useful._
+
- @--json@, @-j@ := Output entire response as compact JSON.
- @--pretty@, @--human@, @-h@ := Output entire response as JSON with whitespace for better human-readability.
### Associate your SSH public key with your Arvados Workbench account
+> Maybe mention that the "Add a new authorized key" button will be at the bottom of the page
+
+
Go to the `Keys` page in Arvados Workbench (under the `Access` tab) and click the
<p style="margin-left: 4em"><span class="btn btn-primary disabled">Add a new authorized key</span></p>
Host *.arvados
ProxyCommand ssh -p2222 turnout@switchyard.{{ site.arvados_api_host }} -x -a $SSH_PROXY_FLAGS %h
+> This needs to be explained that it is adding an alias to make it easier to log into an
+> arvados server on port 2222. This is not actually necessary if the user doesn't mind some typing.
+> Actually, it might make sense to show the regular command line first, and then mention later that
+> it can be shortened using ~/.ssh/config.
+
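Without any `~/.ssh/config` shortcut, the same connection can be made directly; a sketch (the VM name `blurfl` is a placeholder used below, and the exact switchyard flags may vary by site):

<pre>
ssh -o 'ProxyCommand ssh -p2222 turnout@switchyard.{{ site.arvados_api_host }} -x -a %h' blurfl
</pre>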
If you have access to an account `foo` on a VM called `blurfl` then
you can log in like this:
User foo
ForwardAgent yes
+> This shortened *.arvados to *.a
+> This should be consistent
+
Adding `User foo` will log you in to the VM as user `foo` by default,
so you can just `ssh blurfl.a`. The `ForwardAgent yes` option turns on
the `ssh -A` option to forward your SSH credentials (if you are
Here you will use the GATK VariantFiltration program to assign pass/fail scores to variants in a VCF file.
+_This should be motivated better using a specific biomedical research
+or diagnostic question that involves this analysis_
+
+_From conversation with Ward: We should link to a discussion of the
+personal genome project and explain that it a freely available dataset
+that any researcher can use, which makes it appropriate to be used in these examples._
+
h3. Prerequisites
* Log in to a VM "using SSH":ssh-access.html
h3. Get the GATK binary distribution.
+_Perhaps separate out this and the next sections and link to it so the user only
+has to do this if they really don't have GATK installed. Also provide
+a way to determine if they do have it._
+
Download the GATK binary tarball[1] -- e.g., @GenomeAnalysisTK-2.6-4.tar.bz2@ -- and copy it to your Arvados VM.
+_Is it necessary to copy it to the Arvados VM first, you could put it into
+keep from your desktop and/or use the workbench? Also if we are
+telling them to copy it to the VM, maybe we should mention scp?_
+
Store it in Keep.
<pre>
↓
+_Make the itty bitty down arrows bigger, and maybe center them_
+
<pre>
c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi
</pre>
h3. Monitor job progress
+_This was already covered in tutorial1_
+
There are three ways to monitor job progress:
# Go to Workbench, drop down the Compute menu, and click Jobs. The job you submitted should appear at the top of the list. Hit "Refresh" until it finishes.
https://{{ site.arvados_api_host }}/arvados/v1/jobs/JOB_UUID_HERE/log_tail_follow
</pre>
+
+_That's it? Say something about the output we're going to get_
+
h3. Notes
fn1. Download the GATK tools → http://www.broadinstitute.org/gatk/download
yourname@shell:~/yourrepo$ mkdir -p crunch_scripts
</pre>
+_Should mention somewhere that it *must* be in a directory called crunch_scripts_
+
{% include notebox-begin.html %}
The process described here should work regardless of whether @yourrepo@ contains a git repository. But normally you would only ever edit code in a git tree -- especially crunch scripts, which can't be used to run regular Arvados jobs until they're committed and pushed.
out_collection = out.finish()
</pre>
+_Explain better_
+
The return value of @out.finish()@ is the content address (hash) of a collection stored in Keep.
h3. Record successful completion
* The hash specified by the @out_collection@ parameter is the output of this task, and
* This task completed successfully.
+_Putting this in a separate section from the code blob above is confusing!_
+
h3. Run the working script.
<pre>
h3. Prerequisites
+_Needs a mention of going to Access->VMs on the workbench_
+
* Log in to a VM "using SSH":ssh-access.html
* Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
* Put the API host name in your @ARVADOS_API_HOST@ environment variable
If everything is set up correctly, the command @arv -h user current@ will display your account information.
+
+_If you are logged in to a fully provisioned VM, presumably the gems
+are already installed. This discussion should go somewhere else._
+
Arv depends on a few gems. It will tell you which ones to install, if they are not present yet. If you need to install the dependencies and are doing so as a non-root user, make sure you set GEM_HOME before you run gem install:
<pre>
Pick a data collection. We'll use @33a9f3842b01ea3fdf27cc582f5ea2af@ here.
+_How do I know if I have this data? Does it come as example data with
+the arvados distribution? Is there something notable about it, like
+it is very large and spans multiple keep blocks?_
+
<pre>
the_collection=33a9f3842b01ea3fdf27cc582f5ea2af
</pre>
Pick a code version. We'll use @5565778cf15ae9af22ad392053430213e9016631@ here.
+_How do I know if I have this code version? What does this refer to?
+A git revision? Or a keep id? In what repository?_
+
<pre>
the_version=5565778cf15ae9af22ad392053430213e9016631
</pre>
EOF
</pre>
-(The @read -rd $'\000'@ part uses a bash feature to help us get a multi-line string with lots of double quotation marks into a shell variable.)
+_Need to explain what the json fields mean, it is explained later but
+there should be some mention up here._
+
+(The @read -rd $'\000'@ part uses a bash feature to help us get a
+multi-line string with lots of double quotation marks into a shell
+variable.)
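That trick can be tried on its own. In this sketch the JSON fields are placeholders, not a complete job record:

```shell
# read -rd $'\000' reads everything up to a NUL byte (here: end of input)
# into $the_job, preserving newlines and double quotes without escaping.
# read exits non-zero when it reaches end-of-input, so guard it with || true.
read -rd $'\000' the_job <<'EOF' || true
{
  "script": "hash",
  "script_version": "master"
}
EOF
echo "$the_job"
```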
Submit the job.
}
</pre>
+_What is this? An example of what "arv" returns? What do the fields mean?_
+
h3. Monitor job progress
+_And then the magic happens. There should be some more discussion of what
+is going on in the background once the job is submitted from the
+user's perspective. Is it queued, running, etc.?_
+
Go to Workbench, drop down the Compute menu, and click Jobs. The job you submitted should appear at the top of the list.
-Hit "Refresh" until it finishes.
+Hit "Refresh" until it finishes. _We should really make the page
+autorefresh or use a streamed-update framework_
You can also watch the log messages while the job runs:
git checkout $the_version
less crunch_scripts/hash
</pre>
+
+_If we're going to direct the user to open up the code, some
+discussion of the python API is probably in order. If the hash
+job is going to be the canonical first crunch map reduce program
+for everybody, than we should break down the program line-by-line and
+explain every step in detail._
Here you will write two new crunch scripts, incorporate them into a new pipeline template, run the new pipeline a couple of times using different parameters, and compare the results. One of the new scripts will use the Arvados API to look up trait→human→data relations and use this information to compile a collection of data to analyze.
+_Like the previous tutorial, this needs more of a basis in some actual
+clinical/research question to motivate it_
+
h3. Prerequisites
* Log in to a VM "using SSH":ssh-access.html
ssh-add -l # (run this in your VM account to confirm forwarding works)
</pre>
+_This discussion about ssh should probably go under the "ssh" section_
+
With PuTTY under Windows, run "pageant", add your key to the agent, and turn on agent forwarding in your PuTTY settings.
*Option 2:* Edit code on your workstation and push code to your Arvados repository from there instead of your VM account. Depending on your @.ssh/config@ file, you will use names like @my_vm_name.arvados@ instead of @my_vm_name.{{ site.arvados_api_host }}@ in git and ssh commands.
R W your_repo_name
</pre>
+_You need to have a git repository set up already, which is not
+necessarily the case for new users, so this should link to the git
+section about setting up a new repo_
+
h3. Set some variables
Adjust these to match your login account name and the URL of your Arvados repository. The Access→VMs and Access→Repositories pages on Workbench will show the specifics.
git remote add origin $repo_url
</pre>
+_Should explain each step_
+_Creating an empty branch in an empty repository makes git do weird
+things, need to fix using
+<pre>
+git branch --set-upstream pipeline_tutorial origin/pipeline_tutorial
+</pre>
+but I don't know what this means._
+
h3. Write the create-collection-by-trait script
<pre>
nano crunch_scripts/create-collection-by-trait
</pre>
+_the -p to mkdir isn't necessary here_
+
Here is the script:
<pre>
this_task.set_output(out.finish())
</pre>
+_This should probably match the code we ran the user through in the
+previous tutorial, with the only difference being that the prior
+tutorial is interactive, and this tutorial is demonstrating how to
+create a job._
+
h3. Commit your new code
<pre>
}
</pre>
+_This desperately needs to be explained, since this is the actual
+pipeline definition_
+
h3. Store the pipeline template in Arvados
<pre>
The new pipeline template will also appear on the Workbench→Compute→Pipeline templates page.
+_Storing the pipeline in arvados as well as in git seems redundant_
+
h3. Invoke the pipeline using "arv pipeline run"
Replace the UUID here with the UUID of your own new pipeline template:
It is also displayed on the pipeline instance detail page: go to Workbench→Compute→Pipeline instances and click the UUID of your pipeline instance.
+_There needs to be an easier way to get the output from the workbench_
+
h3. Compute a summary statistic from the output collection
For this step we will use python to read the output manifest and count how many of the inputs produced hits.
4 had the variant, 3 did not.
</pre>
-h3. Run the pipeline again using different parameters
+_Explain each step_
+
+h3. Run the pipeline again using different parameters
We can use the same pipeline template to run the jobs again, this time overriding the "trait_name" parameter with a different value:
Here you will use the Python SDK to find public WGS data for people who have a certain medical condition.
+_Define WGS_
+
+_Explain the motivation in this example a little better. If I'm
+reading this right, the workflow is
+traits -> people with those traits -> presence of a specific genetic
+variant in the people with the reported traits_
+
+_Rather than having the user do this through the Python command line,
+it might be easier to write a file that is going to do each step_
+
h3. Prerequisites
* Log in to a VM "using SSH":ssh-access.html
</pre>
+_Should break this down into steps instead of being clever and making
+it a python one-liner_
+
↓
<pre>
})).execute()['items']
</pre>
+_Same comment, break this out and describe each step_
+
The "tail_uuid" attribute of each of these Links refers to a Human.
<pre>