title: "Measuring deduplication"

Copyright (C) The Arvados Authors. All rights reserved.

SPDX-License-Identifier: CC-BY-SA-3.0
The @arvados-client@ tool can generate a deduplication report across an arbitrary number of collections. It can be installed from packages (@apt install arvados-client@ or @yum install arvados-client@).
<pre><code>~$ <span class="userinput">arvados-client deduplication-report -h</span>
arvados-client deduplication-report [options ...] <collection-uuid> <collection-uuid> ...

arvados-client deduplication-report [options ...] <collection-pdh>,<collection_uuid> \
<collection-pdh>,<collection_uuid> ...

This program analyzes the overlap in blocks used by 2 or more collections. It
prints a deduplication report that shows the nominal space used by the
collections, as well as the actual size and the amount of space that is saved
by Keep's deduplication.

The list of collections may be provided in two ways. A list of collection
uuids is sufficient. Alternatively, the PDH for each collection may also be
provided. This will greatly speed up operation when the list contains
multiple collections with the same PDH.

Exit status will be zero if there were no errors generating the report.

Use the 'arv' and 'jq' commands to get the list of the 100
largest collections and generate the deduplication report:

arv collection list --order 'file_size_total desc' --limit 100 | \
jq -r '.items[] | [.portable_data_hash,.uuid] |@csv' | \
sed -e 's/"//g'|tr '\n' ' ' | \
xargs arvados-client deduplication-report

logging level (debug, info, ...) (default "info")
</code></pre>
The usual environment variables (@ARVADOS_API_HOST@ and @ARVADOS_API_TOKEN@) must be set for the deduplication report to be generated. To get cluster-wide results, supply an admin token. Regular users can also run this report, but it will include only the collections their token can read.
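Because the report depends on both variables, it can help to check them before starting a long run. The following is a minimal sketch, not part of Arvados: @check_arvados_env@ is a hypothetical helper name, and it only verifies that the variables are non-empty, not that the token is valid.

```shell
# Hypothetical helper (not shipped with Arvados): report which of the
# required API settings are missing and return non-zero if any are.
check_arvados_env() {
    missing=0
    for name in ARVADOS_API_HOST ARVADOS_API_TOKEN; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Once both variables are exported, a call such as @check_arvados_env && arvados-client deduplication-report ...@ starts the report only when the settings are present.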
Example output (with uuids and portable data hashes obscured) from a small Arvados cluster:
<pre><code>~$ <span class="userinput">arv collection list --order 'file_size_total desc' --limit 10 | jq -r '.items[] | [.portable_data_hash,.uuid] |@csv' |sed -e 's/"//g'|tr '\n' ' ' |xargs arvados-client deduplication-report</span>
Collection _____-_____-_______________: pdh ________________________________+5003343; nominal size 7382073267640 (6.7 TiB); file count 2796
Collection _____-_____-_______________: pdh ________________________________+4961919; nominal size 6989909625775 (6.4 TiB); file count 5592
Collection _____-_____-_______________: pdh ________________________________+1903643; nominal size 2677933564052 (2.4 TiB); file count 2796
Collection _____-_____-_______________: pdh ________________________________+1903643; nominal size 2677933564052 (2.4 TiB); file count 2796
Collection _____-_____-_______________: pdh ________________________________+137710; nominal size 191858151583 (179 GiB); file count 201
Collection _____-_____-_______________: pdh ________________________________+137636; nominal size 191858101962 (179 GiB); file count 200
Collection _____-_____-_______________: pdh ________________________________+135350; nominal size 191715427388 (178 GiB); file count 201
Collection _____-_____-_______________: pdh ________________________________+135276; nominal size 191715384167 (178 GiB); file count 200
Collection _____-_____-_______________: pdh ________________________________+135350; nominal size 191707276684 (178 GiB); file count 201
Collection _____-_____-_______________: pdh ________________________________+135276; nominal size 191707233463 (178 GiB); file count 200

Nominal size of stored data: 20878411596766 bytes (19 TiB)
Actual size of stored data: 17053104444050 bytes (16 TiB)
Saved by Keep deduplication: 3825307152716 bytes (3.5 TiB)
</code></pre>
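The three summary figures are related by simple arithmetic: in this example, the saved amount equals the nominal size minus the actual size. As a quick check, the sketch below recomputes it from the byte counts above and also derives a percentage, which the report itself does not print. It assumes a shell with 64-bit integer arithmetic, and the variable names are illustrative only.

```shell
# Byte counts copied from the example report above.
nominal=20878411596766
actual=17053104444050

# Space saved is the nominal size minus what is actually stored.
saved=$((nominal - actual))

# Share of the nominal size that was saved, in hundredths of a percent,
# computed with integer arithmetic only (the result is truncated).
hundredths=$((saved * 10000 / nominal))

printf 'saved: %d bytes (%d.%02d%% of nominal)\n' \
    "$saved" "$((hundredths / 100))" "$((hundredths % 100))"
```

For the example cluster this gives 3825307152716 bytes, matching the last line of the report, or roughly 18% of the nominal size.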