X-Git-Url: https://git.arvados.org/arvados.git/blobdiff_plain/956093639196f92f3ad4b9e0795bcc7e30520e0d..0456c65364c2189ac64775a40ac6279f8ef61802:/doc/user/tutorials/tutorial-keep.html.textile.liquid?ds=sidebyside

diff --git a/doc/user/tutorials/tutorial-keep.html.textile.liquid b/doc/user/tutorials/tutorial-keep.html.textile.liquid
index 01ef78ad12..fac3530373 100644
--- a/doc/user/tutorials/tutorial-keep.html.textile.liquid
+++ b/doc/user/tutorials/tutorial-keep.html.textile.liquid
@@ -1,136 +1,166 @@
 ---
 layout: default
 navsection: userguide
-navmenu: Tutorials
-title: "Storing and Retrieving data using Arvados Keep"
-
+title: "Storing and Retrieving data using Keep"
 ...
-h1. Storing and Retrieving data using Arvados Keep
-
 This tutorial introduces you to the Arvados file storage system.
-*This tutorial assumes that you are "logged into an Arvados VM instance":{{site.basedoc}}/user/getting_started/ssh-access.html#login, and have a "working environment.":{{site.basedoc}}/user/getting_started/check-environment.html*
+*This tutorial assumes that you are "logged into an Arvados VM instance":{{site.baseurl}}/user/getting_started/ssh-access.html#login, and have a "working environment.":{{site.baseurl}}/user/getting_started/check-environment.html*
-The Arvados distributed file system is called *Keep*. Keep is a content-addressable file system. This means that files are managed using special unique identifiers derived from the _contents_ of the file, rather than human-assigned file names (specifically, the md5 hash). This has a number of advantages:
+The Arvados distributed file system is called *Keep*. Keep is a content-addressable file system. This means that files are managed using special unique identifiers derived from the _contents_ of the file, rather than human-assigned file names (specifically, the MD5 hash). This has a number of advantages:
 * Files can be stored and replicated across a cluster of servers without requiring a central name server.
-* Systematic validation of data integrity by both server and client because the checksum is built into the identifier.
-* Minimizes data duplication (two files with the same contents will result in the same identifier, and will not be stored twice.)
-* Avoids data race conditions (an identifier always points to the same data.)
+* Both the server and client systematically validate data integrity because the checksum is built into the identifier.
+* Data duplication is minimized: two files with the same contents will have the same identifier, and will not be stored twice.
+* It avoids data race conditions, since an identifier always points to the same data.
 h1. Putting Data into Keep
-We will start with downloading a freely available VCF file from the "Personal Genome Project (PGP)":http://www.personalgenomes.org subject "hu599905":https://my.personalgenomes.org/profile/hu599905 to a staging directory on the VM, and then add it to Keep.
+We will start by downloading a freely available VCF file from "Personal Genome Project (PGP)":http://www.personalgenomes.org subject "hu599905":https://my.personalgenomes.org/profile/hu599905 to a staging directory on the VM, and adding it to Keep.
 In the following commands, replace *@you@* with your login name.
-First, log into the Arvados VM instance and set up the staging area:
+First, log into your Arvados VM and set up the staging area:
-notextile. $ mkdir /scratch/you
+notextile. ~$ mkdir /scratch/you
Next, download the file:
$ mkdir /scratch/you
-$ cd /scratch/you
-$ curl -o var-GS000016015-ASM.tsv.bz2 'https://warehouse.personalgenomes.org/warehouse/f815ec01d5d2f11cb12874ab2ed50daa+234+K@ant/var-GS000016015-ASM.tsv.bz2'
+~$ cd /scratch/you
+/scratch/you$ curl -o var-GS000016015-ASM.tsv.bz2 'https://warehouse.personalgenomes.org/warehouse/f815ec01d5d2f11cb12874ab2ed50daa+234+K@ant/var-GS000016015-ASM.tsv.bz2'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 216M 100 216M 0 0 10.0M 0 0:00:21 0:00:21 --:--:-- 9361k
~$ scp MyData.vcf you@shell.arvados:/scratch/you/MyData.vcf
-notextile. $ scp MyData.vcf you@shell.arvados:/scratch/you/MyData.vcf
+{% include 'notebox_end' %}
-Now use @arv keep put@ to add your VCF data to Keep:
+Now use @arv keep put@ to add your VCF data to Keep, then delete the local copy of the file:
$ cd /scratch/you
-$ arv keep put var-GS000016015-ASM.tsv.bz2
+/scratch/you$ arv keep put var-GS000016015-ASM.tsv.bz2
c1bad4b39ca5a924e481008009d94e32+210
+/scratch/you$ rm var-GS000016015-ASM.tsv.bz2
$ mkdir tmp
-$ echo "hello alice" > tmp/alice.txt
-$ echo "hello bob" > tmp/bob.txt
-$ echo "hello carol" > tmp/carol.txt
-$ arv keep put tmp
-0M / 0M 100.0%
+/scratch/you$ mkdir tmp
+/scratch/you$ echo "hello alice" > tmp/alice.txt
+/scratch/you$ echo "hello bob" > tmp/bob.txt
+/scratch/you$ echo "hello carol" > tmp/carol.txt
+/scratch/you$ arv keep put tmp
+0M / 0M 100.0%
887cd41e9c613463eab2f0d885c6dd96+83
$ arv keep get c1bad4b39ca5a924e481008009d94e32+210
-. 204e43b8a1185621ca55a94839582e6f+67108864 b9677abbac956bd3e86b1deb28dfac03+67108864 fc15aff2a762b13f521baf042140acec+67108864 323d2a3ce20370c4ca1d3462a344f8fd+25885655 0:227212247:var-GS000016015-ASM.tsv.bz2
+/scratch/you$ arv keep ls c1bad4b39ca5a924e481008009d94e32+210
+var-GS000016015-ASM.tsv.bz2
-
204e43b8a1185621ca55a94839582e6f+67108864, b9677abbac956bd3e86b1deb28dfac03+67108864, fc15aff2a762b13f521baf042140acec+67108864, 323d2a3ce20370c4ca1d3462a344f8fd+25885655.
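
The manifest line printed by the old @arv keep get@ example above appears to follow a simple layout: a stream name, then the block locators, then file tokens of the form @position:size:name@. The Python sketch below splits such a line and checks that the block sizes add up to the file size. It is illustrative only: @parse_manifest_line@ and @block_size@ are hypothetical helpers, not part of the Arvados SDK, and the sketch ignores escaped characters in names.

```python
import re

# "<32 hex chars>+<size>", optionally followed by further "+hint" parts.
LOCATOR = re.compile(r"^([0-9a-f]{32})\+(\d+)(\+\S+)*$")


def block_size(locator):
    """Return the size hint (in bytes) carried by a block locator."""
    return int(LOCATOR.match(locator).group(2))


def parse_manifest_line(line):
    """Split one manifest line into (stream name, block locators, file tokens)."""
    tokens = line.split()
    stream, rest = tokens[0], tokens[1:]
    blocks = []
    # Block locators come first; stop at the first token that is not one.
    while rest and LOCATOR.match(rest[0]):
        blocks.append(rest.pop(0))
    files = []
    for token in rest:
        position, size, name = token.split(":", 2)
        files.append((int(position), int(size), name))
    return stream, blocks, files


# The manifest line shown in the transcript above.
line = (". 204e43b8a1185621ca55a94839582e6f+67108864 "
        "b9677abbac956bd3e86b1deb28dfac03+67108864 "
        "fc15aff2a762b13f521baf042140acec+67108864 "
        "323d2a3ce20370c4ca1d3462a344f8fd+25885655 "
        "0:227212247:var-GS000016015-ASM.tsv.bz2")

stream, blocks, files = parse_manifest_line(line)
total = sum(block_size(b) for b in blocks)
print(stream, len(blocks), total, files)
```

Three blocks of 67108864 bytes (64 MiB) plus a final block of 25885655 bytes total 227212247 bytes, which matches the size recorded in the file token.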
+/scratch/you$ arv keep ls 887cd41e9c613463eab2f0d885c6dd96+83
+alice.txt
+bob.txt
+carol.txt
+
+
-Let's use @arv keep get@ to download the first datablock:
+Use @-s@ to print file sizes rounded up to the nearest kilobyte:
-notextile. $ arv keep get 204e43b8a1185621ca55a94839582e6f+67108864 > block1
+/scratch/you$ arv keep ls -s c1bad4b39ca5a924e481008009d94e32+210
+221887 var-GS000016015-ASM.tsv.bz2
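
As a quick check of the rounding: the manifest for this collection records the file as 227212247 bytes. The minimal Python sketch below rounds that up to whole kilobytes, assuming a kilobyte here means 1024 bytes (an assumption about @-s@, though it agrees with the output above). The helper name @size_in_kib@ is hypothetical.

```python
def size_in_kib(num_bytes):
    """Round a byte count up to whole 1024-byte units (integer ceiling division)."""
    return -(-num_bytes // 1024)


print(size_in_kib(227212247))  # 221887
```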
+
+$ ls -l block1
--rw-r--r-- 1 you group 67108864 Dec 9 20:14 block1
-$ md5sum block1
-204e43b8a1185621ca55a94839582e6f block1
+/scratch/you$ arv keep get c1bad4b39ca5a924e481008009d94e32+210/ .
+/scratch/you$ ls var-GS000016015-ASM.tsv.bz2
+var-GS000016015-ASM.tsv.bz2
204e43b8a1185621ca55a94839582e6f+67108864
of:
-* the md5 hash @204e43b8a1185621ca55a94839582e6f@ which matches the md5 hash of @block1@
-* a size hint @67108864@ which matches the size of @block1@
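
The two parts of a block locator listed above can be reproduced locally. The following Python sketch is illustrative only: @block_locator@ is a hypothetical helper, not an Arvados API, and it ignores any extra hints a real locator may carry.

```python
import hashlib


def block_locator(data):
    """Return "<md5 hex digest>+<size in bytes>" for a block of bytes."""
    return f"{hashlib.md5(data).hexdigest()}+{len(data)}"


# Identical contents always yield the identical locator, which is why
# storing the same data twice does not create a second copy.
first = block_locator(b"hello world")
second = block_locator(b"hello world")
print(first)
print(first == second)
```

Given the @md5sum@ and @ls -l@ output shown for @block1@, calling @block_locator(open("block1", "rb").read())@ should reproduce @204e43b8a1185621ca55a94839582e6f+67108864@.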
+You can also download individual files:
-Next, let's use @arv keep get@ to download and reassemble @var-GS000016015-ASM.tsv.bz2@ using the following command:
-
-notextile. $ arv keep get c1bad4b39ca5a924e481008009d94e32+210/var-GS000016015-ASM.tsv.bz2 .
+/scratch/you$ arv keep get 887cd41e9c613463eab2f0d885c6dd96+83/alice.txt .
+
+$ md5sum var-GS000016015-ASM.tsv.bz2
+/scratch/you$ md5sum var-GS000016015-ASM.tsv.bz2
44b8ae3fde7a8a88d2f7ebd237625b4f var-GS000016015-ASM.tsv.bz2
$ arv keep ls c1bad4b39ca5a924e481008009d94e32+210
+/scratch/you$ mkdir -p mnt
+/scratch/you$ arv-mount --collection c1bad4b39ca5a924e481008009d94e32+210 mnt &
+/scratch/you$ cd mnt
+/scratch/you/mnt$ ls
var-GS000016015-ASM.tsv.bz2
-$ arv keep ls -s c1bad4b39ca5a924e481008009d94e32+210
-221887 var-GS000016015-ASM.tsv.bz2
+/scratch/you/mnt$ md5sum var-GS000016015-ASM.tsv.bz2
+44b8ae3fde7a8a88d2f7ebd237625b4f var-GS000016015-ASM.tsv.bz2
+/scratch/you/mnt$ cd ..
+/scratch/you$ fusermount -u mnt
/scratch/you$ mkdir -p mnt
+/scratch/you$ arv-mount mnt &
+/scratch/you$ cd mnt/c1bad4b39ca5a924e481008009d94e32+210
+/scratch/you/mnt/c1bad4b39ca5a924e481008009d94e32+210$ ls
+var-GS000016015-ASM.tsv.bz2
+/scratch/you/mnt/c1bad4b39ca5a924e481008009d94e32+210$ md5sum var-GS000016015-ASM.tsv.bz2
+44b8ae3fde7a8a88d2f7ebd237625b4f var-GS000016015-ASM.tsv.bz2
+/scratch/you/mnt/c1bad4b39ca5a924e481008009d94e32+210$ cd ../..
+/scratch/you$ fusermount -u mnt
+
+