-*This tutorial assumes that you are logged into an Arvados VM instance (instructions for "Unix":{{site.baseurl}}/user/getting_started/ssh-access-unix.html#login or "Windows":{{site.baseurl}}/user/getting_started/ssh-access-windows.html#login), and have a "working environment.":{{site.baseurl}}/user/getting_started/check-environment.html*
+*This tutorial assumes that you are either logged into an Arvados VM instance (instructions for "Unix":{{site.baseurl}}/user/getting_started/ssh-access-unix.html#login or "Windows":{{site.baseurl}}/user/getting_started/ssh-access-windows.html#login) or have installed the Arvados "Command line SDK":{{site.baseurl}}/sdk/cli/index.html and "Python SDK":{{site.baseurl}}/sdk/python/index.html on your workstation, and that you have a "working environment.":{{site.baseurl}}/user/getting_started/check-environment.html*
---
layout: default
navsection: userguide
-title: "List of examples included with Arvados"
+title: "Scripts provided by Arvados"
...
Several crunch scripts are included with Arvados in the "/crunch_scripts directory":https://arvados.org/projects/arvados/repository/revisions/master/show/crunch_scripts. They are intended to provide examples and starting points for writing your own scripts.
title: "Checking your environment"
...
-First you should log into an Arvados VM instance ("Unix":{{site.baseurl}}/user/getting_started/ssh-access-unix.html#login or "Windows":{{site.baseurl}}/user/getting_started/ssh-access-windows.html#login) if you have not already done so.
+First, log into an Arvados VM instance (instructions for "Unix":{{site.baseurl}}/user/getting_started/ssh-access-unix.html#login or "Windows":{{site.baseurl}}/user/getting_started/ssh-access-windows.html#login) or install the Arvados "Command line SDK":{{site.baseurl}}/sdk/cli/index.html and "Python SDK":{{site.baseurl}}/sdk/python/index.html on your workstation.
-If @arv user current@ is able to access the API server, it will print out information about your account. Check that you are able to access the Arvados API server using the following command:
+Check that you are able to access the Arvados API server using @arv user current@. If it is able to access the API server, it will print out information about your account:
<notextile>
<pre><code>$ <span class="userinput">arv user current</span>
You may be asked to log in using a Google account. Arvados uses only your name and email address from Google services for identification, and will never access any personal information. If you are accessing Arvados for the first time, the Workbench may indicate your account status is *New / inactive*. If this is the case, contact the administrator of the Arvados instance to request activation of your account.
-Once your account is active, logging in to the Workbench will present you with an account dashboard. This gives a summary of your projects and recent activity in the Arvados instance. "You are now ready to run your first pipeline.":{{ site.baseurl }}/user/tutorials/tutorial-pipeline-workbench.html
+Once your account is active, logging in to the Workbench will present you with the *dashboard*. This gives a summary of your projects and recent activity in the Arvados instance. "You are now ready to run your first pipeline.":{{ site.baseurl }}/user/tutorials/tutorial-pipeline-workbench.html
!{{ site.baseurl }}/images/workbench-dashboard.png!
title: Welcome to Arvados!
...
-_If you are new to Arvados and want to get started quickly, go to "Run a pipeline using Workbench.":{{site.baseurl}}/user/getting_started/workbench.html_
+_If you are new to Arvados and want to get started quickly, go to "Accessing Arvados Workbench.":{{site.baseurl}}/user/getting_started/workbench.html_
This guide provides an introduction to using Arvados to solve big data bioinformatics problems, including:
-* Robust storage of very large files, such as whole genome sequences hundreds of gigabytes in size using the "Arvados Keep":{{site.baseurl}}/user/tutorials/tutorial-keep.html content-addressable cluster file system.
+* Robust storage of very large files, such as whole genome sequences, using the "Arvados Keep":{{site.baseurl}}/user/tutorials/tutorial-keep.html content-addressable cluster file system.
* Running compute-intensive genomic analysis pipelines, such as alignment and variant calls using the "Arvados Crunch":{{site.baseurl}}/user/tutorials/intro-crunch.html cluster compute engine.
-* Storing and querying metadata about genome sequence files, such as human subjects and their phenotypic traits using the "Arvados Metadata Database":{{site.baseurl}}/user/topics/tutorial-trait-search.html .
+* Storing and querying metadata about genome sequence files, such as human subjects and their phenotypic traits using the "Arvados Metadata Database.":{{site.baseurl}}/user/topics/tutorial-trait-search.html
* Accessing, organizing, and sharing data, pipelines and results using the "Arvados Workbench":{{site.baseurl}}/user/getting_started/workbench.html web application.
The examples in this guide use the Arvados instance located at <a href="https://{{ site.arvados_workbench_host }}/" target="_blank">https://{{ site.arvados_workbench_host }}</a>. If you are using a different Arvados instance replace @{{ site.arvados_workbench_host }}@ with your private instance in all of the examples in this guide.
title: "How Keep works"
...
-The Arvados distributed file system is called *Keep*. Keep is a content-addressable file system. This means that files are managed using special unique identifiers derived from the _contents_ of the file, rather than human-assigned file names (specifically, the MD5 hash). This has a number of advantages:
+The Arvados distributed file system is called *Keep*. Keep is a content-addressable file system. This means that files are managed using special unique identifiers derived from the _contents_ of the file (specifically, the MD5 hash), rather than human-assigned file names. This has a number of advantages:
* Files can be stored and replicated across a cluster of servers without requiring a central name server.
* Both the server and client systematically validate data integrity because the checksum is built into the identifier.
* Data duplication is minimized: two files with the same contents will have the same identifier, and will not be stored twice.
In order to reassemble the file, Keep stores a *collection* data block which lists in sequence the data blocks that make up the original file. A collection data block may store the information for multiple files, including a directory structure.
-In this example we will use @c1bad4b39ca5a924e481008009d94e32+210@ which we added to Keep in "the first Keep tutorial":{{ site.baseurl }}/user/tutorials/tutorial-keep.html. First let us examine the contents of this collection using @arv keep get@:
+In this example we will use @c1bad4b39ca5a924e481008009d94e32+210@, which we added to Keep in "how to upload data":{{ site.baseurl }}/user/tutorials/tutorial-keep.html. First let us examine the contents of this collection using @arv keep get@:
<notextile>
<pre><code>~$ <span class="userinput">arv keep get c1bad4b39ca5a924e481008009d94e32+210</span>
title: "Running a pipeline on the command line"
...
-{% include 'tutorial_expectations' %}
-
This tutorial demonstrates how to use the command line to run the same pipeline as described in "running a pipeline using Workbench.":{{site.baseurl}}/user/tutorials/tutorial-pipeline-workbench.html
+{% include 'tutorial_expectations' %}
+
When you use the command line, you must use Arvados unique identifiers to refer to objects. The identifiers in this example correspond to the following Arvados objects:
* <i class="fa fa-fw fa-gear"></i> "Tutorial align using bwa mem (qr1hi-p5p6p-itzkwxblfermlwv)":https://{{ site.arvados_workbench_host }}/pipeline_templates/qr1hi-p5p6p-itzkwxblfermlwv
title: Introduction to Crunch
...
-h2. Prerequisites
-
The Arvados "Crunch" framework is designed to support processing very large data batches (gigabytes to terabytes) efficiently, and provides the following benefits:
* Increase concurrency by running tasks asynchronously, using many CPUs and network interfaces at once (especially beneficial for CPU-bound and I/O-bound tasks respectively).
* Track inputs, outputs, and settings so you can verify that the inputs, settings, and sequence of programs you used to arrive at an output is really what you think it was.
* Interrupt and resume long-running jobs consisting of many short tasks.
* Maintain timing statistics automatically, so they're there when you want them.
-To get the most value out of this guide, you should be comfortable with the following:
+h2. Prerequisites
+
+To get the most value out of this section, you should be comfortable with the following:
# Using a secure shell client such as SSH or PuTTY to log on to a remote server
# Using the Unix command line shell, Bash
---
layout: default
navsection: userguide
-title: "Writing a pipeline"
+title: "Writing a pipeline template"
...
This tutorial demonstrates how to construct a two stage pipeline template that uses the "bwa mem":http://bio-bwa.sourceforge.net/ tool to produce a "Sequence Alignment/Map (SAM)":https://samtools.github.io/ file, then uses the "Picard SortSam tool":http://picard.sourceforge.net/command-line-overview.shtml#SortSam to produce a BAM (Binary Alignment/Map) file.
{% include 'tutorial_expectations' %}
-Use the following command to create a new empty template using @arv pipeline_template create@ and then open the template record in an interactive text editor (as specified by $EDITOR or $VISUAL, otherwise defaults to @nano@) using @arv edit@.
+Use the following command to create a new empty template using @arv pipeline_template create@, then open the template record in an interactive text editor (as specified by $EDITOR or $VISUAL, otherwise defaults to @nano@) using @arv edit@.
<notextile>
<pre><code>~$ <span class="userinput">arv edit $(arv --format=uuid pipeline_template create --pipeline-template '{}') name components </span></code></pre>
layout: default
navsection: userguide
navmenu: Tutorials
-title: "Writing a script"
+title: "Writing a Crunch script"
...
This tutorial demonstrates how to create a new Arvados pipeline using the Arvados Python SDK. The Arvados SDK supports access to advanced features not available using the @run-command@ wrapper, such as scheduling parallel tasks across nodes.
---
layout: default
navsection: userguide
-title: "Getting data from Keep"
+title: "Downloading data"
...
-This tutorial covers using @arv-ls@ and @arv-get@ to access Keep from the command line. It is also possible to download a file from a collection from the Workbench page for the collection, covered in "running a pipeline using Workbench":{{site.baseurl}}/user/tutorials/tutorial-pipeline-workbench.html
+This tutorial describes how to list and download Arvados data collections using the command line tools @arv-ls@ and @arv-get@. It is also possible to download files from a collection using the collection's Workbench page, as covered in "running a pipeline using Workbench.":{{site.baseurl}}/user/tutorials/tutorial-pipeline-workbench.html
{% include 'tutorial_expectations' %}
title: "Mounting Keep as a filesystem"
...
-This tutoral describes how to use @arv-mount@ to mount Keep as a read-only file system access it using traditional filesystem tools.
+This tutorial describes how to access Arvados collections using traditional filesystem tools by mounting Keep as a read-only file system using @arv-mount@.
+
+{% include 'tutorial_expectations' %}
+
+h2. Arv-mount
@arv-mount@ provides several features:
* It is easy for existing tools to access files in Keep.
* Data is downloaded on demand. It is not necessary to download an entire file or collection to start processing.
-{% include 'tutorial_expectations' %}
-
-The default mode permits browsing any collection in Arvados as a subdirectory under the mount directory. To avoid having to fetch a potentially very large list of all collections, collection directories only come into existence when explicitly accessed by their keep locator.
+The default mode permits browsing any collection in Arvados as a subdirectory under the mount directory. To avoid having to fetch a potentially large list of all collections, collection directories only come into existence when explicitly accessed by their keep locator.
<notextile>
<pre><code>~$ <span class="userinput">mkdir -p keep</span>
---
layout: default
navsection: userguide
-title: "Putting data into Keep"
+title: "Uploading data"
...
-This tutorial describes how to upload data to the Arvados file storage system, Keep. This example uses a freely available TSV file containing variant annotations from "Personal Genome Project (PGP)":http://www.personalgenomes.org subject "hu599905":https://my.personalgenomes.org/profile/hu599905 and demonstrates how to use @arv-put@ to add it to Keep.
+This tutorial describes how to upload new Arvados data collections using the command line tool @arv-put@. This example uses a freely available TSV file containing variant annotations from "Personal Genome Project (PGP)":http://www.personalgenomes.org subject "hu599905.":https://my.personalgenomes.org/profile/hu599905
notextile. <div class="spaced-out">
-# Begin by installing the "Arvados Python SDK":{{site.baseurl}}/sdk/python/sdk-python.html on the system from which you will upload the data (such as your workstation, or a server containing data from your sequencer). This will install the Arvados file upload tool, @arv-put@. Alternately, you can log into an Arvados VM (instructions for "Unix":{{site.baseurl}}/user/getting_started/ssh-access-unix.html#login or "Windows":{{site.baseurl}}/user/getting_started/ssh-access-windows.html#login) and skip to step 4).
-# On system from which you will upload data, configure the environment with the Arvados instance host name and authentication token as decribed in "Getting an API token.":{{site.baseurl}}/user/reference/api-tokens.html
-# Check to see if @arv-put@ is installed correctly by putting an empty file. If you receive an error such as "ARVADOS_API_HOST is not set", your environment is not properly configured.
-<notextile>
-<pre><code>~$ <span class="userinput">arv-put < /dev/null</span>
-0
-4c62f0a0d2344608e5f197894beb6fb5+47
-</code></pre>
-</notextile>
-# Download the following example file. (If you are uploading your own data, you can skip this step)
+# Begin by installing the "Arvados Python SDK":{{site.baseurl}}/sdk/python/sdk-python.html on the system from which you will upload the data (such as your workstation, or a server containing data from your sequencer). This will install the Arvados file upload tool, @arv-put@. Alternately, you can log into an Arvados VM (instructions for "Unix":{{site.baseurl}}/user/getting_started/ssh-access-unix.html#login or "Windows":{{site.baseurl}}/user/getting_started/ssh-access-windows.html#login).
+# On the system from which you will upload data, configure the environment with the Arvados instance host name and authentication token as described in "Getting an API token.":{{site.baseurl}}/user/reference/api-tokens.html (If you are logged into an Arvados VM, you can skip this step.)
+# Download the following example file. (If you are uploading your own data, you can skip this step.)
<notextile>
<pre><code>~$ <span class="userinput">curl -o var-GS000016015-ASM.tsv.bz2 'https://warehouse.personalgenomes.org/warehouse/f815ec01d5d2f11cb12874ab2ed50daa+234+K@ant/var-GS000016015-ASM.tsv.bz2'</span>
% Total % Received % Xferd Average Speed Time Time Time Current
100 216M 100 216M 0 0 10.0M 0 0:00:21 0:00:21 --:--:-- 9361k
</code></pre>
</notextile>
-# Use @arv-put@ actually add the file to Keep:
+# Now upload the file to Keep using @arv-put@:
<notextile>
<pre><code>~$ <span class="userinput">arv-put var-GS000016015-ASM.tsv.bz2</span>
216M / 216M 100.0%
</code></pre>
</notextile>
-The output value @c1bad4b39ca5a924e481008009d94e32+210@ is the Arvados collection locator that uniquely describes this file. In order to place your newly uploaded file into a Project, visit the workbench page for your new collection: <a href="https://{{ site.arvados_workbench_host }}/collections/c1bad4b39ca5a924e481008009d94e32+210" target="_blank">https://{{ site.arvados_workbench_host }}/collections/c1bad4b39ca5a924e481008009d94e32+210</a>. On that page, click on <span class="btn btn-xs btn-primary" ><i class="fa fa-fw fa-folder"></i> Choose a project...</span> to open a modal dialog allowing you to select a destination project for your collection.
+* The output value @c1bad4b39ca5a924e481008009d94e32+210@ is the Arvados collection locator that uniquely describes this file.
+
+Now go to the Workbench collections page: <a href="https://{{ site.arvados_workbench_host }}/collections" target="_blank">https://{{ site.arvados_workbench_host }}/collections</a>. Your newly uploaded collection should appear near the top, with the value in the *uuid* column matching the Arvados collection locator that was printed by @arv-put@. Click on the *<i class="fa fa-fw fa-archive"></i> Show* button to go to the Workbench page for your collection. Alternately, you can paste the Arvados collection locator into the *Search* box of the collections page to find your collection.
+
+The show collection page allows you to view the contents of the collection, download files from the collection, and set sharing options. To put your collection into a project, click on <span class="btn btn-xs btn-primary" ><i class="fa fa-fw fa-folder"></i> Choose a project...</span>. This will open a modal dialog allowing you to select a destination project for your collection.
notextile. </div>
notextile. <div class="spaced-out">
-# Starting from the Arvados Dashboard, click on <span class="btn btn-sm btn-primary" > <i class="fa fa-fw fa-plus"></i> Add new project</span>. This will direct you to the page for the new project.
-# Click on the pencil icon <i class="fa fa-fw fa-pencil"></i> next to *New project* to pop up a text box and change the project title to *Tutorial output*. Click on <span class="btn btn-xs btn-primary" ><i class="glyphicon glyphicon-ok"></i></span> to save the new name.
+# Start from the *Workbench Dashboard*. You can return to the dashboard by clicking on *<i class="fa fa-lg fa-fw fa-home"></i> Home* in the upper left corner of any Workbench page.
+# Click on <span class="btn btn-sm btn-primary" > <i class="fa fa-fw fa-plus"></i> Add new project</span> on the "My projects" panel. This will direct you to the page for the new project.
+# On the new project page, click on the pencil icon <i class="fa fa-fw fa-pencil"></i> next to *New project* to pop up a text box and change the project title to *Tutorial output*. Click on <span class="btn btn-xs btn-primary" ><i class="glyphicon glyphicon-ok"></i></span> to save the new name.
# Click on <span class="btn btn-sm btn-primary"><i class="fa fa-fw fa-gear"></i> Run a pipeline...</span> This will open a modal dialog box titled *Choose a pipeline to run*.
# Click on *<i class="fa fa-lg fa-fw fa-home"></i> Projects <span class="caret"></span>*. Under *Projects shared with me* select *<i class="fa fa-fw fa-share-alt"></i> Arvados Tutorial*.
# Select *<i class="fa fa-fw fa-gear"></i> Tutorial align using bwa mem* and click on <span class="btn btn-sm btn-primary" >Next: choose inputs <i class="fa fa-fw fa-arrow-circle-right"></i></span>. This will load a new page where you will supply the inputs for the pipeline.