---
layout: default
navsection: userguide
title: "Storing and Retrieving data using Keep"
...

This tutorial introduces you to the Arvados file storage system.

*This tutorial assumes that you are "logged into an Arvados VM instance":{{site.baseurl}}/user/getting_started/ssh-access.html#login, and have a "working environment.":{{site.baseurl}}/user/getting_started/check-environment.html*

The Arvados distributed file system is called *Keep*. Keep is a content-addressable file system. This means that files are managed using special unique identifiers derived from the _contents_ of the file (specifically, the md5 hash), rather than human-assigned file names. This has a number of advantages:

* Files can be stored and replicated across a cluster of servers without requiring a central name server.
* Both the server and client systematically validate data integrity because the checksum is built into the identifier.
* Data duplication is minimized: two files with the same contents will have the same identifier, and will not be stored twice.
* It avoids data race conditions, since an identifier always points to the same data.

h1. Putting Data into Keep

We will start by downloading a freely available VCF file from "Personal Genome Project (PGP)":http://www.personalgenomes.org subject "hu599905":https://my.personalgenomes.org/profile/hu599905 to a staging directory on the VM, and adding it to Keep. In the following commands, replace *@you@* with your login name.

First, log into your Arvados VM and set up the staging area:

notextile.
~$ mkdir /scratch/you
Next, download the file:
~$ cd /scratch/you
/scratch/you$ curl -o var-GS000016015-ASM.tsv.bz2 'https://warehouse.personalgenomes.org/warehouse/f815ec01d5d2f11cb12874ab2ed50daa+234+K@ant/var-GS000016015-ASM.tsv.bz2'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  216M  100  216M    0     0  10.0M      0  0:00:21  0:00:21 --:--:-- 9361k
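Keep stores whatever bytes it is given, so it can be worth checking that the download is intact before adding it. The sketch below is an optional step that is not part of the original tutorial. It assumes the standard @bzip2@ tool is installed on the VM; @check_bz2@ is a hypothetical helper name, and the demo runs on throwaway files rather than the PGP data.

```shell
# Hypothetical helper: report whether a .bz2 file decompresses cleanly.
# "bzip2 -t" tests integrity without writing any decompressed output.
check_bz2() {
    if bzip2 -t "$1" 2>/dev/null; then
        echo "ok: $1"
    else
        echo "corrupt or incomplete: $1"
        return 1
    fi
}

# Self-contained demo on throwaway files.
demo_dir=$(mktemp -d)
printf 'hello alice\n' | bzip2 > "$demo_dir/good.bz2"
printf 'not a bzip2 stream\n' > "$demo_dir/bad.bz2"

check_bz2 "$demo_dir/good.bz2"
check_bz2 "$demo_dir/bad.bz2" || true   # expected to fail; keep going
```

On the VM, the equivalent call would be @check_bz2 var-GS000016015-ASM.tsv.bz2@.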
{% include 'notebox_begin' %}

If you have your own data, for example @MyData.vcf@, you can use @scp@ or @rsync@ to copy it from your local workstation to the shell VM (run this on your local workstation):

notextile.
~$ scp MyData.vcf you@shell.arvados:/scratch/you/MyData.vcf
{% include 'notebox_end' %}

Now use @arv keep put@ to add your VCF data to Keep, then delete the local copy of the file:
/scratch/you$ arv keep put var-GS000016015-ASM.tsv.bz2
c1bad4b39ca5a924e481008009d94e32+210
/scratch/you$ rm var-GS000016015-ASM.tsv.bz2
The output value @c1bad4b39ca5a924e481008009d94e32+210@ from @arv keep put@ is the Keep locator. This enables you to access the file you just uploaded, and is explained in the next section.

h2(#dir). Putting a directory

You can also use @arv keep put@ to add an entire directory:
/scratch/you$ mkdir tmp
/scratch/you$ echo "hello alice" > tmp/alice.txt
/scratch/you$ echo "hello bob" > tmp/bob.txt
/scratch/you$ echo "hello carol" > tmp/carol.txt
/scratch/you$ arv keep put tmp
0M / 0M 100.0%
887cd41e9c613463eab2f0d885c6dd96+83
The locator @887cd41e9c613463eab2f0d885c6dd96+83@ represents a collection with multiple files.

h1. Getting Data from Keep

h2. Using Workbench

You may access collections through the "Collections section of Arvados Workbench":https://{{ site.arvados_workbench_host }}/collections at *Data* %(rarr)→% *Collections (data files)*. You can also access individual files within a collection. Some examples:

* "https://{{ site.arvados_workbench_host }}/collections/c1bad4b39ca5a924e481008009d94e32+210":https://{{ site.arvados_workbench_host }}/collections/c1bad4b39ca5a924e481008009d94e32+210
* "https://{{ site.arvados_workbench_host }}/collections/887cd41e9c613463eab2f0d885c6dd96+83/alice.txt":https://{{ site.arvados_workbench_host }}/collections/887cd41e9c613463eab2f0d885c6dd96+83/alice.txt

h2(#arv-get). Using the command line

You can view the contents of a collection using @arv keep ls@:
/scratch/you$ arv keep ls c1bad4b39ca5a924e481008009d94e32+210
var-GS000016015-ASM.tsv.bz2
/scratch/you$ arv keep ls 887cd41e9c613463eab2f0d885c6dd96+83
alice.txt
bob.txt
carol.txt
Use @-s@ to print file sizes rounded up to the nearest kilobyte:
/scratch/you$ arv keep ls -s c1bad4b39ca5a924e481008009d94e32+210
221887 var-GS000016015-ASM.tsv.bz2
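The reported size can be cross-checked with shell arithmetic. This is a minimal sketch, assuming that a kilobyte here means 1024 bytes (the tutorial does not say so); @kb_round_up@ is a hypothetical helper name.

```shell
# Round a byte count up to whole kilobytes.
# Assumption: 1 KB = 1024 bytes.
kb_round_up() {
    echo $(( ($1 + 1023) / 1024 ))
}

kb_round_up 1       # a 1-byte file is reported as 1 KB
kb_round_up 1024    # exactly 1 KB stays 1 KB
kb_round_up 1025    # one byte over rounds up to 2 KB

# Convert the 221887 KB shown above to whole megabytes (integer division).
echo $(( 221887 / 1024 ))   # prints 216
```

The result, 216, agrees with the @216M@ that @curl@ reported during the download.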
Use @arv keep get@ to download the contents of a collection into the directory given as the second argument (in this example, @.@ for the current directory):
/scratch/you$ arv keep get c1bad4b39ca5a924e481008009d94e32+210/ .
/scratch/you$ ls var-GS000016015-ASM.tsv.bz2
var-GS000016015-ASM.tsv.bz2
You can also download individual files:
/scratch/you$ arv keep get 887cd41e9c613463eab2f0d885c6dd96+83/alice.txt .
With a local copy of the file, we can do some computation, for example computing the md5 hash of the complete file:
/scratch/you$ md5sum var-GS000016015-ASM.tsv.bz2
44b8ae3fde7a8a88d2f7ebd237625b4f  var-GS000016015-ASM.tsv.bz2
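The same tool illustrates the content-addressing idea from the introduction: a checksum depends only on the bytes in a file, not on its name. The following sketch uses throwaway files and a hypothetical helper, @content_id@. It shows the principle only and does not reproduce how Keep computes its locators.

```shell
# Hypothetical helper: print the md5 hash of a file's contents.
content_id() {
    md5sum "$1" | cut -d' ' -f1
}

demo_dir=$(mktemp -d)
echo "hello alice" > "$demo_dir/alice.txt"
cp "$demo_dir/alice.txt" "$demo_dir/renamed-copy.txt"   # same bytes, new name
echo "hello bob" > "$demo_dir/bob.txt"                  # different bytes

id_alice=$(content_id "$demo_dir/alice.txt")
id_copy=$(content_id "$demo_dir/renamed-copy.txt")
id_bob=$(content_id "$demo_dir/bob.txt")

# Identical contents give identical identifiers, whatever the file name.
[ "$id_alice" = "$id_copy" ] && echo "same contents, same identifier"
# Different contents give different identifiers.
[ "$id_alice" != "$id_bob" ] && echo "different contents, different identifier"
```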
h2. Using arv-mount

Use @arv-mount@ to mount a Keep collection and access it using traditional filesystem tools.
/scratch/you$ mkdir -p mnt
/scratch/you$ arv-mount --collection c1bad4b39ca5a924e481008009d94e32+210 mnt &
/scratch/you$ cd mnt
/scratch/you/mnt$ ls
var-GS000016015-ASM.tsv.bz2
/scratch/you/mnt$ md5sum var-GS000016015-ASM.tsv.bz2
44b8ae3fde7a8a88d2f7ebd237625b4f  var-GS000016015-ASM.tsv.bz2
/scratch/you/mnt$ cd ..
/scratch/you$ fusermount -u mnt
You can also mount the entire Keep namespace in "magic directory" mode:
/scratch/you$ mkdir -p mnt
/scratch/you$ arv-mount mnt &
/scratch/you$ cd mnt/c1bad4b39ca5a924e481008009d94e32+210
/scratch/you/mnt/c1bad4b39ca5a924e481008009d94e32+210$ ls
var-GS000016015-ASM.tsv.bz2
/scratch/you/mnt/c1bad4b39ca5a924e481008009d94e32+210$ md5sum var-GS000016015-ASM.tsv.bz2
44b8ae3fde7a8a88d2f7ebd237625b4f  var-GS000016015-ASM.tsv.bz2
/scratch/you/mnt/c1bad4b39ca5a924e481008009d94e32+210$ cd ../..
/scratch/you$ fusermount -u mnt
@arv-mount@ provides several features:

* You can browse, open and read Keep entries as if they were regular files.
* It is easy for existing tools to access files in Keep.
* Data is downloaded on demand. It is not necessary to download an entire file or collection to start processing.
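Because a mounted collection behaves like ordinary files, standard streaming tools work on it. The sketch below assumes @bzip2@ is installed and uses a hypothetical helper, @peek_bz2@, which decompresses only as much of a file as is needed to print its first few lines. The demo runs on a throwaway file, and how much data @arv-mount@ actually fetches for such a partial read is not covered here.

```shell
# Hypothetical helper: print the first N lines of a bzip2-compressed file.
# head exits after N lines; bzip2 then stops once its output pipe is closed.
peek_bz2() {
    bzip2 -dc "$1" 2>/dev/null | head -n "$2"
}

# Self-contained demo on a throwaway file.
demo_dir=$(mktemp -d)
printf 'line1\nline2\nline3\n' | bzip2 > "$demo_dir/sample.tsv.bz2"

peek_bz2 "$demo_dir/sample.tsv.bz2" 2
```

With a collection mounted as shown earlier, the equivalent call would be @peek_bz2 mnt/var-GS000016015-ASM.tsv.bz2 2@.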