From: Ward Vandewege
Date: Tue, 23 Mar 2021 20:27:35 +0000 (-0400)
Subject: 17495: Merge branch 'master' into 17495-document-dedup-report
X-Git-Tag: 2.2.0~93^2
X-Git-Url: https://git.arvados.org/arvados.git/commitdiff_plain/5aa7b2ef565348e637af8dfd9351f82c8cc5b5e6?hp=9e806b0c387cf7146e0d0cf3169fd227963c19b9

17495: Merge branch 'master' into 17495-document-dedup-report

Arvados-DCO-1.1-Signed-off-by: Ward Vandewege
---

diff --git a/doc/_config.yml b/doc/_config.yml
index 0d957eb2aa..191016ec43 100644
--- a/doc/_config.yml
+++ b/doc/_config.yml
@@ -191,6 +191,7 @@ navbar:
  - admin/workbench2-vocabulary.html.textile.liquid
  - admin/storage-classes.html.textile.liquid
  - admin/keep-recovering-data.html.textile.liquid
+ - admin/keep-measuring-deduplication.html.textile.liquid
  - Cloud:
  - admin/spot-instances.html.textile.liquid
  - admin/cloudtest.html.textile.liquid
diff --git a/doc/admin/keep-measuring-deduplication.html.textile.liquid b/doc/admin/keep-measuring-deduplication.html.textile.liquid
new file mode 100644
index 0000000000..76b477d096
--- /dev/null
+++ b/doc/admin/keep-measuring-deduplication.html.textile.liquid
@@ -0,0 +1,80 @@
+---
+layout: default
+navsection: admin
+title: "Measuring deduplication"
+...
+
+{% comment %}
+Copyright (C) The Arvados Authors. All rights reserved.
+
+SPDX-License-Identifier: CC-BY-SA-3.0
+{% endcomment %}
+
+The @arvados-client@ tool can be used to generate a deduplication report across an arbitrary number of collections. It can be installed from packages (@apt install arvados-client@ or @yum install arvados-client@).
+
+h2(#syntax). Syntax
+
+
~$ arvados-client deduplication-report -h
+Usage:
+  arvados-client deduplication-report [options ...]   ...
+
+  arvados-client deduplication-report [options ...] , \
+     , ...
+
+  This program analyzes the overlap in blocks used by 2 or more collections. It
+  prints a deduplication report that shows the nominal space used by the
+  collections, as well as the actual size and the amount of space that is saved
+  by Keep's deduplication.
+
+  The list of collections may be provided in two ways. A list of collection
+  uuids is sufficient. Alternatively, the PDH for each collection may also be
+  provided. This will greatly speed up operation when the list contains
+  multiple collections with the same PDH.
+
+  Exit status will be zero if there were no errors generating the report.
+
+Example:
+
+  Use the 'arv' and 'jq' commands to get the list of the 100
+  largest collections and generate the deduplication report:
+
+  arv collection list --order 'file_size_total desc' --limit 100 | \
+    jq -r '.items[] | [.portable_data_hash,.uuid] |@csv' | \
+    sed -e 's/"//g'|tr '\n' ' ' | \
+    xargs arvados-client deduplication-report
+
+Options:
+  -config file
+      Site configuration file (default may be overridden by setting an ARVADOS_CONFIG environment variable) (default "/etc/arvados/config.yml")
+  -log-level string
+      logging level (debug, info, ...) (default "info")
+
+
+
+
+The usual environment variables (@ARVADOS_API_HOST@ and @ARVADOS_API_TOKEN@) need to be set for the deduplication report to be generated. To get cluster-wide results, an admin token will need to be supplied. Users can also run this report, but only collections their token is able to read will be included.
+
+Example output (with uuids and portable data hashes obscured) from a small Arvados cluster:
+
+
~$ arv collection list --order 'file_size_total desc' --limit 10 | jq -r '.items[] | [.portable_data_hash,.uuid] |@csv' |sed -e 's/"//g'|tr '\n' ' ' |xargs arvados-client deduplication-report
+Collection _____-_____-_______________: pdh ________________________________+5003343; nominal size 7382073267640 (6.7 TiB); file count 2796
+Collection _____-_____-_______________: pdh ________________________________+4961919; nominal size 6989909625775 (6.4 TiB); file count 5592
+Collection _____-_____-_______________: pdh ________________________________+1903643; nominal size 2677933564052 (2.4 TiB); file count 2796
+Collection _____-_____-_______________: pdh ________________________________+1903643; nominal size 2677933564052 (2.4 TiB); file count 2796
+Collection _____-_____-_______________: pdh ________________________________+137710; nominal size 191858151583 (179 GiB); file count 201
+Collection _____-_____-_______________: pdh ________________________________+137636; nominal size 191858101962 (179 GiB); file count 200
+Collection _____-_____-_______________: pdh ________________________________+135350; nominal size 191715427388 (178 GiB); file count 201
+Collection _____-_____-_______________: pdh ________________________________+135276; nominal size 191715384167 (178 GiB); file count 200
+Collection _____-_____-_______________: pdh ________________________________+135350; nominal size 191707276684 (178 GiB); file count 201
+Collection _____-_____-_______________: pdh ________________________________+135276; nominal size 191707233463 (178 GiB); file count 200
+
+Collections:                              10
+Nominal size of stored data:  20878411596766 bytes (19 TiB)
+Actual size of stored data:   17053104444050 bytes (16 TiB)
+Saved by Keep deduplication:   3825307152716 bytes (3.5 TiB)
+
+
+
+
diff --git a/lib/deduplicationreport/report.go b/lib/deduplicationreport/report.go
index 8bb3fc4e57..8759df080c 100644
--- a/lib/deduplicationreport/report.go
+++ b/lib/deduplicationreport/report.go
@@ -60,7 +60,7 @@ Example:
 
   arv collection list --order 'file_size_total desc' --limit 100 | \
     jq -r '.items[] | [.portable_data_hash,.uuid] |@csv' | \
-    tail -n+2 |sed -e 's/"//g'|tr '\n' ' ' | \
+    sed -e 's/"//g'|tr '\n' ' ' | \
     xargs %s
 
 Options: