---
layout: default
navsection: userguide
title: "Querying the Metadata Database"
navorder: 116
---
h1. Tutorial: Querying the Metadata Database
This tutorial introduces the Arvados Metadata Database. The Metadata Database stores information about files in Keep. This example will use the Python SDK to find public WGS (Whole Genome Sequencing) data for people who have reported a certain medical condition.
*This tutorial assumes that you are "logged into an Arvados VM instance":{{site.basedoc}}/user/getting_started/ssh-access.html#login, and have a "working environment.":{{site.basedoc}}/user/getting_started/check-environment.html*
In tutorial example, three angle brackets (>>>) will be used to denote code to enter at the Python prompt.
Start by running Python.
$ python
Python 2.7.3 (default, Jan 2 2013, 13:56:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import arvados
This tutorial will also use the regular expression (re) and json python modules:
>>> import re
>>> import json
>>> all_traits = arvados.api().traits().list(limit=1000).execute()
* @arvados.api()@ gets an object that provides access to the Arvados API server
* @.traits()@ gets an object that provides access to the "traits" resource on the Arvados API server
* @.list(limit=1000)@ constructs a query to list all elements of the "traits" resource, with a limit of 1000 entries returned
* @.execute()@ executes the query and returns the result, which we assign to "all_traits"
notextile. >>> cancer_traits = filter(lambda t: re.search('cancer', t['name']), all_traits['items'])
* @lambda t: re.search('cancer', t['name'])@ is an inline function that takes a parameter @t@ and uses a simple regular expression to test if @t['name']@ contains the substring 'cancer'
* @all_traits['items']@ is the input sequence of traits
* @filter@ tests each element @t@ and constructs a new sequence consisting only of the elements that pass the filter
* @cancer_traits@ gets the result of @filter@
>>> for t in cancer_traits: print(t['uuid'], t['name'])
...
qr1hi-q1cn2-8q57g2diohwnzm0 Cervical cancer
qr1hi-q1cn2-vqp4243janpjbyj Breast cancer
qr1hi-q1cn2-v6usijujcpwqrn1 Non-melanoma skin cancer
...
>>> non_melanoma_cancer = 'qr1hi-q1cn2-v6usijujcpwqrn1'
h2. Finding humans with the selected trait
We query the "links" resource to find humans that report the selected trait. Links are directional connections between Arvados data items, for example, from a human to their reported traits.
>>> trait_query = json.dumps({
'link_class': 'human_trait',
'tail_kind': 'arvados#human',
'head_uuid': non_melanoma_cancer
})
>>> trait_links = arvados.api().links().list(limit=1000, where=trait_query).execute()
>>> human_uuids = map(lambda l: l['tail_uuid'], trait_links['items'])
>>> human_uuids
[u'1h9kt-7a9it-c0uqa4kcdh29wdf', u'1h9kt-7a9it-x4tru6mn40hc6ah',
u'1h9kt-7a9it-yqb8m5s9cpy88i8', u'1h9kt-7a9it-46sm75w200ngwny',
u'1h9kt-7a9it-gx85a4tdkpzsg3w', u'1h9kt-7a9it-8cvlaa8909lgeo9',
u'1h9kt-7a9it-as37qum2pq8vizb', u'1h9kt-7a9it-14fph66z2baqxb9',
u'1h9kt-7a9it-e9zc7i4crmw3v69', u'1h9kt-7a9it-np7f35hlijlxdmt',
u'1h9kt-7a9it-j9hqyjwbvo9cojn', u'1h9kt-7a9it-lqxdtm1gynmsv13',
u'1h9kt-7a9it-zkhhxjfg2o22ywq', u'1h9kt-7a9it-nsjoxqd33lzldw9',
u'1h9kt-7a9it-ytect4smzcgd4kg', u'1h9kt-7a9it-y6tl353b3jc4tos',
u'1h9kt-7a9it-98f8qave4f8vbs5', u'1h9kt-7a9it-gd72sh15q0p4wq3',
u'1h9kt-7a9it-zlx25dscak94q9h', u'1h9kt-7a9it-8gronw4rbgmim01',
u'1h9kt-7a9it-wclfkjcb23tr5es', u'1h9kt-7a9it-rvp2qe7szfz4dy6',
u'1h9kt-7a9it-50iffhmpzsktwjm', u'1h9kt-7a9it-ul412id5y31a5o8',
u'1h9kt-7a9it-732kwkfzylmt4ik', u'1h9kt-7a9it-v9zqxegpblsbtai',
u'1h9kt-7a9it-kmaraqduit1v5wd', u'1h9kt-7a9it-t1nwtlo1hru5vvq',
u'1h9kt-7a9it-q3w6j9od4ibpoyl', u'1h9kt-7a9it-qz8vzkuuz97ezwv',
u'1h9kt-7a9it-t1v8sjz6dm9jmjf', u'1h9kt-7a9it-qe8wrbyvuqs5jew']
>>> human_query = json.dumps({
"link_class": "identifier",
"head_uuid": human_uuids
})
>>> pgpid_links = arvados.service.links().list(limit=1000, where=human_query).execute()
>>> map(lambda l: l['name'], pgpid_links['items'])
[u'hu01024B', u'hu11603C', u'hu15402B', u'hu174334', u'hu1BD549', u'hu237A50',
u'hu34A921', u'hu397733', u'hu414115', u'hu43860C', u'hu474789', u'hu553620',
u'hu56B3B6', u'hu5917F3', u'hu599905', u'hu5E55F5', u'hu602487', u'hu633787',
u'hu68F245', u'hu6C3F34', u'hu7260DD', u'hu7A2F1D', u'hu94040B', u'hu9E356F',
u'huAB8707', u'huB1FD55', u'huB4883B', u'huD09050', u'huD09534', u'huD3A569',
u'huDF04CC', u'huE2E371']
>>> provenance_links = arvados.api().links().list(limit=1000, where=json.dumps({
"link_class": "provenance",
"name": "provided",
"tail_uuid": human_uuids
})).execute()
collection_uuids = map(lambda l: l['head_uuid'], provenance_links['items'])
# build map of human uuid -> PGP ID
pgpid = {}
for pgpid_link in pgpid_links['items']:
pgpid[pgpid_link['head_uuid']] = pgpid_link['name']
# build map of collection uuid -> PGP ID
for p_link in provenance_links['items']:
pgpid[p_link['head_uuid']] = pgpid[p_link['tail_uuid']]
# get details (e.g., list of files) of each collection
collections = arvados.service.collections().list(where=json.dumps({
"uuid": collection_uuids
})).execute()
# print PGP public profile links with file locators
for c in collections['items']:
for f in c['files']:
print "https://my.personalgenomes.org/profile/%s %s %s%s" % (pgpid[c['uuid']], c['uuid'], ('' if f[0] == '.' else f[0]+'/'), f[1])
https://my.personalgenomes.org/profile/hu43860C a58dca7609fa84c8c38a7e926a97b2fc+302+K@qr1hi var-GS00253-DNA_A01_200_37-ASM.tsv.bz2
https://my.personalgenomes.org/profile/huB1FD55 ea30eb9e46eedf7f05ed6e348c2baf5d+291+K@qr1hi var-GS000010320-ASM.tsv.bz2
https://my.personalgenomes.org/profile/huDF04CC 4ab0df8f22f595d1747a22c476c05873+242+K@qr1hi var-GS000010427-ASM.tsv.bz2
https://my.personalgenomes.org/profile/hu7A2F1D 756d0ada29b376140f64e7abfe6aa0e7+242+K@qr1hi var-GS000014566-ASM.tsv.bz2
https://my.personalgenomes.org/profile/hu553620 7ed4e425bb1c7cc18387cbd9388181df+242+K@qr1hi var-GS000015272-ASM.tsv.bz2
https://my.personalgenomes.org/profile/huD09534 542112e210daff30dd3cfea4801a9f2f+242+K@qr1hi var-GS000016374-ASM.tsv.bz2
https://my.personalgenomes.org/profile/hu599905 33a9f3842b01ea3fdf27cc582f5ea2af+242+K@qr1hi var-GS000016015-ASM.tsv.bz2
https://my.personalgenomes.org/profile/hu599905 d6e2e57cd60ba5979006d0b03e45e726+81+K@qr1hi Witch_results.zip
https://my.personalgenomes.org/profile/hu553620 ea4f2d325592a1272f989d141a917fdd+85+K@qr1hi Devenwood_results.zip
https://my.personalgenomes.org/profile/hu7A2F1D 4580f6620bb15b25b18373766e14e4a7+85+K@qr1hi Innkeeper_results.zip
https://my.personalgenomes.org/profile/huD09534 fee37be9440b912eb90f5e779f272416+82+K@qr1hi Hallet_results.zip
>>> job = {}
for c in collections['items']:
if [] != filter(lambda f: re.search('^var-.*\.tsv\.bz2', f[1]), c['files']):
job[c['uuid']] = arvados.service.jobs().create(body={
'script': 'grep',
'script_parameters': {'input': c['uuid'], 'pattern': "rs1126809\\b"},
'script_version': 'e7aeb42'
}).execute()
print "%s %s" % (pgpid[c['uuid']], job[c['uuid']]['uuid'])
hu43860C qr1hi-8i9sb-wbf3uthbhkcy8ji
huB1FD55 qr1hi-8i9sb-scklkiy8dc27dab
huDF04CC qr1hi-8i9sb-pg0w4rfrwfd9srg
hu7A2F1D qr1hi-8i9sb-n7u0u0rj8b47168
hu553620 qr1hi-8i9sb-k7gst7vyhg20pt1
huD09534 qr1hi-8i9sb-4w65pm48123fte5
hu599905 qr1hi-8i9sb-wmwa5b5r3eghnev
hu43860C qr1hi-8i9sb-j1mngmakdh8iv9o
huB1FD55 qr1hi-8i9sb-4j6ehiatcolaoxb
huDF04CC qr1hi-8i9sb-n6lcmcr3lowqr5u
hu7A2F1D qr1hi-8i9sb-0hwsdtojfcxjo40
hu553620 qr1hi-8i9sb-cvvqzqea7jhwb0i
huD09534 qr1hi-8i9sb-d0y0qtzuwzbrjj0
hu599905 qr1hi-8i9sb-i9ec9g8d7rt70xg
>>> map(lambda j: arvados.service.jobs().get(uuid=j['uuid']).execute()['success'], job.values())
[None, True, None, None, None, None, None, None, None, None, None, None, None, None]
>>> total = 0
for collection_uuid in job:
job_uuid = job[collection_uuid]['uuid']
job_output = arvados.service.jobs().get(uuid=job_uuid).execute()['output']
output_files = arvados.service.collections().get(uuid=job_output).execute()['files']
# Test the output size. If greater than zero, that means 'grep' found the variant
if output_files[0][2] > 0:
print "%s has variant rs1126809" % (pgpid[collection_uuid])
total += 1
else:
print "%s does not have variant rs1126809" % (pgpid[collection_uuid])
print "rs1126809 is found in", total, "out of", len(job), "participants reporting non-melanoma skin cancer"
hu599905 80 5644238bfb2a1925d423f2c264819cfb+75+K@qr1hi
huD09534 80 f98f92573cf521333607910d320cc33b+75+K@qr1hi
huB1FD55 0 c10e07d8d90b51ee7f3b0a5855dc77c3+65+K@qr1hi
hu7A2F1D 80 922c4ce8d3dab3268edf8b9312cc63d4+75+K@qr1hi
hu553620 0 66da988f45a7ee16b6058fcbe9859d69+65+K@qr1hi
huDF04CC 80 bbe919451a437dde236a561d4e469ad2+75+K@qr1hi
hu43860C 0 45797e38410de9b9ddef2f4f0ec41a93+76+K@qr1hi