X-Git-Url: https://git.arvados.org/arvados.git/blobdiff_plain/2183113c4c357e07719251854e3d249cdcd394dd..d5ba0e97f8522ba3ce6ad36edf099c661a43f6b7:/doc/user/tutorials/tutorial-keep.textile
diff --git a/doc/user/tutorials/tutorial-keep.textile b/doc/user/tutorials/tutorial-keep.textile
index a65665f12d..6683498e86 100644
--- a/doc/user/tutorials/tutorial-keep.textile
+++ b/doc/user/tutorials/tutorial-keep.textile
@@ -1,47 +1,136 @@
 ---
 layout: default
 navsection: userguide
-title: "Adding Data to Keep"
-navorder: 120
+navmenu: Tutorials
+title: "Storing and Retrieving data using Arvados Keep"
+navorder: 11
 ---
-h1. Tutorial: Adding Data to Keep
+h1. Storing and Retrieving data using Arvados Keep
-Now that you've run a Crunch job on sample data, we'll walk you through the process of uploading your own research data into Keep, the distributed storage service.
+This tutorial introduces you to the Arvados file storage system.
-h2. Prerequisites
-You should have already "run your first job":tutorial-job1.html using sample data on an Arvados shell VM. If you haven't, go do that first.
+*This tutorial assumes that you are "logged into an Arvados VM instance":{{site.basedoc}}/user/getting_started/ssh-access.html#login, and have a "working environment.":{{site.basedoc}}/user/getting_started/check-environment.html*
-h2. Adding Data to Keep
+The Arvados distributed file system is called *Keep*. Keep is a content-addressable file system. This means that files are managed using special unique identifiers derived from the _contents_ of the file (specifically, the md5 hash), rather than human-assigned file names. This has a number of advantages:
+* Files can be stored and replicated across a cluster of servers without requiring a central name server.
+* Systematic validation of data integrity by both server and client because the checksum is built into the identifier.
+* Minimizes data duplication (two files with the same contents will result in the same identifier, and will not be stored twice.)
+* Avoids data race conditions (an identifier always points to the same data.)
-Let's suppose you have a VCF file, @MyExome.vcf@ and want to run an Arvados pipeline on this data. Copy it to the Arvados shell VM with @rsync@:
+h1. Putting Data into Keep
-bc. rsync MyExome.vcf shell.arvados:MyExome.vcf
+We will start with downloading a freely available VCF file from the "Personal Genome Project (PGP)":http://www.personalgenomes.org subject "hu599905":https://my.personalgenomes.org/profile/hu599905 to a staging directory on the VM, and then add it to Keep.
-If you don't already have VCF data ready to go, you can download a VCF exome from "PersonalGenomes.org":http://www.personalgenomes.org (["example":https://my.personalgenomes.org/user_file/download/825]).
+First, log into the Arvados VM instance and set up the staging area:
-bc.. $ ssh shell.arvados
+notextile. $ mkdir /scratch/you
-shell.arvados$ wget -O MyExome.vcf https://my.personalgenomes.org/user_file/download/825
---2013-12-10 21:25:18-- https://my.personalgenomes.org/user_file/download/825
-Resolving my.personalgenomes.org (my.personalgenomes.org)... 134.174.150.6
-Connecting to my.personalgenomes.org (my.personalgenomes.org)|134.174.150.6|:443... connected.
-...
-HTTP request sent, awaiting response... 200 OK
-Length: 39814813 (38M) [text/x-vcard]
-Saving to: ‘MyExome.vcf’
+Next, download the file:
-100% [=====================================>] 39,814,813 193KB/s in 4m 42s
+$ mkdir /scratch/you
+$ cd /scratch/you
+$ curl -o var-GS000016015-ASM.tsv.bz2 'https://warehouse.personalgenomes.org/warehouse/f815ec01d5d2f11cb12874ab2ed50daa+234+K@ant/var-GS000016015-ASM.tsv.bz2'
+ % Total % Received % Xferd Average Speed Time Time Time Current
+ Dload Upload Total Spent Left Speed
+100 216M 100 216M 0 0 10.0M 0 0:00:21 0:00:21 --:--:-- 9361k
+
+$ scp MyData.vcf you@shell.arvados:/scratch/you/MyData.vcf
-Use the @arv keep@ command to add your VCF data to Keep:
+Now use @arv keep put@ to add your VCF data to Keep:
-bc. shell.arvados$ arv keep put MyExome.vcf
-9845d870ebe27036ba101a3bee10fb3f+234+K@ant
+$ cd /scratch/you
+$ arv keep put var-GS000016015-ASM.tsv.bz2
+c1bad4b39ca5a924e481008009d94e32+210
+
+$ mkdir tmp
+$ echo "hello alice" > tmp/alice.txt
+$ echo "hello bob" > tmp/bob.txt
+$ echo "hello carol" > tmp/carol.txt
+$ arv keep put tmp
+0M / 0M 100.0%
+887cd41e9c613463eab2f0d885c6dd96+83
+
+$ arv keep get c1bad4b39ca5a924e481008009d94e32+210
+. 204e43b8a1185621ca55a94839582e6f+67108864 b9677abbac956bd3e86b1deb28dfac03+67108864 fc15aff2a762b13f521baf042140acec+67108864 323d2a3ce20370c4ca1d3462a344f8fd+25885655 0:227212247:var-GS000016015-ASM.tsv.bz2
+
+The manifest lists four data blocks: @204e43b8a1185621ca55a94839582e6f+67108864@, @b9677abbac956bd3e86b1deb28dfac03+67108864@, @fc15aff2a762b13f521baf042140acec+67108864@, and @323d2a3ce20370c4ca1d3462a344f8fd+25885655@.
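The manifest line printed by @arv keep get@ above can be taken apart with a few lines of Python. This is a minimal sketch, not the Arvados SDK: the function name @parse_manifest_line@ is hypothetical, and it assumes the simple layout seen in this example (a stream name, then block identifiers of the form hash+size, then file tokens of the form position:size:name, all separated by spaces). It also shows that the block sizes add up to the file size recorded in the file token.

```python
def parse_manifest_line(line):
    """Split one manifest line into stream name, blocks, and file tokens.

    Simplified sketch: assumes space-separated tokens laid out as in the
    example output above; it is not the Arvados SDK parser.
    """
    tokens = line.split()
    stream_name, rest = tokens[0], tokens[1:]
    blocks = []
    files = []
    for token in rest:
        parts = token.split(":", 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            # File token: start position in the stream, size in bytes, name.
            files.append((int(parts[0]), int(parts[1]), parts[2]))
        else:
            # Block token: md5 hash, then a size hint after the first "+".
            md5_hex, size = token.split("+")[:2]
            blocks.append((md5_hex, int(size)))
    return stream_name, blocks, files


# The manifest line shown in the tutorial output.
MANIFEST = (
    ". 204e43b8a1185621ca55a94839582e6f+67108864 "
    "b9677abbac956bd3e86b1deb28dfac03+67108864 "
    "fc15aff2a762b13f521baf042140acec+67108864 "
    "323d2a3ce20370c4ca1d3462a344f8fd+25885655 "
    "0:227212247:var-GS000016015-ASM.tsv.bz2"
)

stream, blocks, files = parse_manifest_line(MANIFEST)
total = sum(size for _, size in blocks)
print(len(blocks), total)  # 4 227212247
print(files)  # [(0, 227212247, 'var-GS000016015-ASM.tsv.bz2')]
```

Three full 67108864-byte blocks plus one 25885655-byte block give 227212247 bytes, which matches the size in the @0:227212247:var-GS000016015-ASM.tsv.bz2@ token.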
+
+Let's use @arv keep get@ to download the first datablock:
+
+notextile. $ arv keep get 204e43b8a1185621ca55a94839582e6f+67108864 > block1
+
+Let's look at the size and compute the md5 hash of @block1@:
+
+$ ls -l block1
+-rw-r--r-- 1 you group 67108864 Dec 9 20:14 block1
+$ md5sum block1
+204e43b8a1185621ca55a94839582e6f block1
+
+The block identifier @204e43b8a1185621ca55a94839582e6f+67108864@ consists of:
+* the md5 hash @204e43b8a1185621ca55a94839582e6f@ which matches the md5 hash of @block1@
+* a size hint @67108864@ which matches the size of @block1@
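The two parts listed above (md5 hash and size hint) can be reproduced locally. The following is a minimal sketch using only Python's standard @hashlib@ module; the helper name @block_identifier@ is hypothetical, and additional suffixes such as the @+K@ant@ seen elsewhere on this page are not modeled.

```python
import hashlib


def block_identifier(data: bytes) -> str:
    """Build an identifier of the form <md5 hex digest>+<size in bytes>."""
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"


# Identical contents always give the same identifier, so a second copy
# of the same data would not need to be stored again.
first = block_identifier(b"hello alice\n")
second = block_identifier(b"hello alice\n")
other = block_identifier(b"hello bob\n")

print(first == second)  # True
print(first == other)   # False
print(first.split("+")[1])  # 12
```

Running @block_identifier@ on the contents of @block1@ should, under these assumptions, reproduce the hash and size reported by @md5sum@ and @ls -l@ above.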
+
+Next, let's use @arv keep get@ to download and reassemble @var-GS000016015-ASM.tsv.bz2@ using the following command:
+
+notextile. $ arv keep get c1bad4b39ca5a924e481008009d94e32+210/var-GS000016015-ASM.tsv.bz2 .
+
+This downloads the file @var-GS000016015-ASM.tsv.bz2@ described by collection @c1bad4b39ca5a924e481008009d94e32+210@ from Keep and places it into the local directory. Now that we have the file, we can compute the md5 hash of the complete file:
+
+$ md5sum var-GS000016015-ASM.tsv.bz2
+44b8ae3fde7a8a88d2f7ebd237625b4f var-GS000016015-ASM.tsv.bz2
+
+$ arv keep ls c1bad4b39ca5a924e481008009d94e32+210
+var-GS000016015-ASM.tsv.bz2
+$ arv keep ls -s c1bad4b39ca5a924e481008009d94e32+210
+221887 var-GS000016015-ASM.tsv.bz2
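The size reported by @arv keep ls -s@ is consistent with the manifest's byte count expressed in 1024-byte units. The sketch below is only an arithmetic check: the unit and the rounding direction are assumptions inferred from the numbers on this page, not taken from the tool's documentation, and the helper name @size_in_kib@ is hypothetical.

```python
def size_in_kib(size_bytes: int) -> int:
    """Convert a byte count to 1024-byte units, rounding up (assumed)."""
    return -(-size_bytes // 1024)


# File size in bytes, taken from the manifest's file token.
print(size_in_kib(227212247))  # 221887
```

Rounding to the nearest unit gives the same value here, so this example cannot tell the two rounding rules apart.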
+
+