~$ arvados-client deduplication-report -h
Usage:
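arvados-client deduplication-report [options ...] <collection-uuid> <collection-uuid> ...
arvados-client deduplication-report [options ...] <collection-pdh>,<collection-uuid> \
<collection-pdh>,<collection-uuid> ...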
This program analyzes the overlap in blocks used by 2 or more collections. It
prints a deduplication report that shows the nominal space used by the
collections, as well as the actual size and the amount of space that is saved
by Keep's deduplication.
The list of collections may be provided in two ways. A list of collection
uuids is sufficient. Alternatively, the PDH for each collection may also be
provided. This will greatly speed up operation when the list contains
multiple collections with the same PDH.
Exit status will be zero if there were no errors generating the report.
Example:
Use the 'arv' and 'jq' commands to get the list of the 100
largest collections and generate the deduplication report:
arv collection list --order 'file_size_total desc' --limit 100 | \
jq -r '.items[] | [.portable_data_hash,.uuid] |@csv' | \
sed -e 's/"//g'|tr '\n' ' ' | \
xargs arvados-client deduplication-report
Options:
-log-level string
logging level (debug, info, ...) (default "info")
The usual environment variables (@ARVADOS_API_HOST@ and @ARVADOS_API_TOKEN@) need to be set for the deduplication report to be generated. To get cluster-wide results, an admin token must be supplied. Non-admin users can also run this report, but only collections their token is able to read will be included.
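Because the report depends on both environment variables, it can be convenient to check for them before invoking the command. The following is a minimal sketch, not part of the Arvados tooling: @check_arvados_env@ and @run_dedup_report@ are hypothetical helper names, and the arguments passed to @run_dedup_report@ are whatever collection UUIDs (or PDH,UUID pairs) you would otherwise give to @arvados-client deduplication-report@.

```shell
# Hypothetical helper: succeeds only if both API variables are set and non-empty.
check_arvados_env() {
    missing=0
    for name in ARVADOS_API_HOST ARVADOS_API_TOKEN; do
        # Look up the variable whose name is stored in $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "error: $name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical wrapper: run the report only when the environment check passes.
# All arguments are forwarded unchanged to arvados-client.
run_dedup_report() {
    check_arvados_env || return 1
    arvados-client deduplication-report "$@"
}
```

With the variables exported for your cluster, @run_dedup_report@ followed by two or more collection UUIDs behaves like the plain command; without them it prints which variable is missing and returns a non-zero status instead of contacting the API.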
Example output (with uuids and portable data hashes obscured) from a small Arvados cluster: