--- layout: default navsection: userguide title: "Tutorial 4: Search PGP data by trait" navorder: 14 --- h1. Tutorial 4: Search PGP data by trait Here you will use the Python SDK to find public WGS data for people who have a certain medical condition. h3. Prerequisites * Log in to a VM "using SSH":ssh-access.html * Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable * Put the API host name in your @ARVADOS_API_HOST@ environment variable * Run the @python@ interactive shell. If everything is set up correctly, you will be able to import the arvados SDK:
import arvados...and display your account information:
arvados.service.users().current().execute()h3. More prerequisites
import re import jsonh3. Find traits. List traits containing the term "cancer":
for t in filter(lambda t: re.search('cancer', t['name']), arvados.service.traits().list(limit=1000).execute()['items']): print t['uuid'], t['name']%(darr)↓%
... qr1hi-q1cn2-8q57g2diohwnzm0 Cervical cancer qr1hi-q1cn2-vqp4243janpjbyj Breast cancer qr1hi-q1cn2-v6usijujcpwqrn1 Non-melanoma skin cancer ...We will use the "Non-melanoma skin cancer" trait with uuid @qr1hi-q1cn2-v6usijujcpwqrn1@.
trait_uuid = 'qr1hi-q1cn2-v6usijujcpwqrn1'h3. Find humans. List humans who report this condition:
trait_links = arvados.service.links().list(limit=1000,where=json.dumps({ 'link_class': 'human_trait', 'tail_kind': 'arvados#human', 'head_uuid': trait_uuid })).execute()['items']The "tail_uuid" attribute of each of these Links refers to a Human.
map(lambda l: l['tail_uuid'], trait_links)%(darr)↓%
[u'1h9kt-7a9it-c0uqa4kcdh29wdf', u'1h9kt-7a9it-x4tru6mn40hc6ah', u'1h9kt-7a9it-yqb8m5s9cpy88i8', u'1h9kt-7a9it-46sm75w200ngwny', u'1h9kt-7a9it-gx85a4tdkpzsg3w', u'1h9kt-7a9it-8cvlaa8909lgeo9', u'1h9kt-7a9it-as37qum2pq8vizb', u'1h9kt-7a9it-14fph66z2baqxb9', u'1h9kt-7a9it-e9zc7i4crmw3v69', u'1h9kt-7a9it-np7f35hlijlxdmt', u'1h9kt-7a9it-j9hqyjwbvo9cojn', u'1h9kt-7a9it-lqxdtm1gynmsv13', u'1h9kt-7a9it-zkhhxjfg2o22ywq', u'1h9kt-7a9it-nsjoxqd33lzldw9', u'1h9kt-7a9it-ytect4smzcgd4kg', u'1h9kt-7a9it-y6tl353b3jc4tos', u'1h9kt-7a9it-98f8qave4f8vbs5', u'1h9kt-7a9it-gd72sh15q0p4wq3', u'1h9kt-7a9it-zlx25dscak94q9h', u'1h9kt-7a9it-8gronw4rbgmim01', u'1h9kt-7a9it-wclfkjcb23tr5es', u'1h9kt-7a9it-rvp2qe7szfz4dy6', u'1h9kt-7a9it-50iffhmpzsktwjm', u'1h9kt-7a9it-ul412id5y31a5o8', u'1h9kt-7a9it-732kwkfzylmt4ik', u'1h9kt-7a9it-v9zqxegpblsbtai', u'1h9kt-7a9it-kmaraqduit1v5wd', u'1h9kt-7a9it-t1nwtlo1hru5vvq', u'1h9kt-7a9it-q3w6j9od4ibpoyl', u'1h9kt-7a9it-qz8vzkuuz97ezwv', u'1h9kt-7a9it-t1v8sjz6dm9jmjf', u'1h9kt-7a9it-qe8wrbyvuqs5jew']h3. Find PGP IDs. For now we don't need to look up the Human objects themselves. As an aside, we will look up "identifier" links to find PGP-assigned participant identifiers:
human_uuids = map(lambda l: l['tail_uuid'], trait_links) pgpid_links = arvados.service.links().list(limit=1000,where=json.dumps({ "link_class": "identifier", "head_uuid": human_uuids })).execute()['items'] map(lambda l: l['name'], pgpid_links)%(darr)↓%
[u'hu01024B', u'hu11603C', u'hu15402B', u'hu174334', u'hu1BD549', u'hu237A50', u'hu34A921', u'hu397733', u'hu414115', u'hu43860C', u'hu474789', u'hu553620', u'hu56B3B6', u'hu5917F3', u'hu599905', u'hu5E55F5', u'hu602487', u'hu633787', u'hu68F245', u'hu6C3F34', u'hu7260DD', u'hu7A2F1D', u'hu94040B', u'hu9E356F', u'huAB8707', u'huB1FD55', u'huB4883B', u'huD09050', u'huD09534', u'huD3A569', u'huDF04CC', u'huE2E371']These PGP IDs let us find public profiles: * "https://my.personalgenomes.org/profile/huE2E371":https://my.personalgenomes.org/profile/huE2E371 * "https://my.personalgenomes.org/profile/huDF04CC":https://my.personalgenomes.org/profile/huDF04CC * ... h3. Find data. Find Collections that were provided by these Humans.
provenance_links = arvados.service.links().list(where=json.dumps({ "link_class": "provenance", "name": "provided", "tail_uuid": human_uuids })).execute()['items'] collection_uuids = map(lambda l: l['head_uuid'], provenance_links) # build map of human uuid -> PGP ID pgpid = {} for pgpid_link in pgpid_links: pgpid[pgpid_link['head_uuid']] = pgpid_link['name'] # build map of collection uuid -> PGP ID for p_link in provenance_links: pgpid[p_link['head_uuid']] = pgpid[p_link['tail_uuid']] # get details (e.g., list of files) of each collection collections = arvados.service.collections().list(where=json.dumps({ "uuid": collection_uuids })).execute()['items'] # print PGP public profile links with file locators for c in collections: for f in c['files']: print "https://my.personalgenomes.org/profile/%s %s %s%s" % (pgpid[c['uuid']], c['uuid'], ('' if f[0] == '.' else f[0]+'/'), f[1])%(darr)↓%
https://my.personalgenomes.org/profile/hu43860C a58dca7609fa84c8c38a7e926a97b2fc+302+K@qr1hi var-GS00253-DNA_A01_200_37-ASM.tsv.bz2 https://my.personalgenomes.org/profile/huB1FD55 ea30eb9e46eedf7f05ed6e348c2baf5d+291+K@qr1hi var-GS000010320-ASM.tsv.bz2 https://my.personalgenomes.org/profile/huDF04CC 4ab0df8f22f595d1747a22c476c05873+242+K@qr1hi var-GS000010427-ASM.tsv.bz2 https://my.personalgenomes.org/profile/hu7A2F1D 756d0ada29b376140f64e7abfe6aa0e7+242+K@qr1hi var-GS000014566-ASM.tsv.bz2 https://my.personalgenomes.org/profile/hu553620 7ed4e425bb1c7cc18387cbd9388181df+242+K@qr1hi var-GS000015272-ASM.tsv.bz2 https://my.personalgenomes.org/profile/huD09534 542112e210daff30dd3cfea4801a9f2f+242+K@qr1hi var-GS000016374-ASM.tsv.bz2 https://my.personalgenomes.org/profile/hu599905 33a9f3842b01ea3fdf27cc582f5ea2af+242+K@qr1hi var-GS000016015-ASM.tsv.bz2 https://my.personalgenomes.org/profile/hu599905 d6e2e57cd60ba5979006d0b03e45e726+81+K@qr1hi Witch_results.zip https://my.personalgenomes.org/profile/hu553620 ea4f2d325592a1272f989d141a917fdd+85+K@qr1hi Devenwood_results.zip https://my.personalgenomes.org/profile/hu7A2F1D 4580f6620bb15b25b18373766e14e4a7+85+K@qr1hi Innkeeper_results.zip https://my.personalgenomes.org/profile/huD09534 fee37be9440b912eb90f5e779f272416+82+K@qr1hi Hallet_results.ziph3. Search for a variant. Look for variant rs1126809 in each of the "var" files (these contain variant calls from WGS data).
job = {} for c in collections: if [] != filter(lambda f: re.search('^var-.*\.tsv\.bz2', f[1]), c['files']): job[c['uuid']] = arvados.service.jobs().create(job=json.dumps({ 'script': 'grep', 'script_parameters': {'input': c['uuid'], 'pattern': "rs1126809\\b"}, 'script_version': 'e7aeb42' })).execute() print "%s %s" % (pgpid[c['uuid']], job[c['uuid']]['uuid'])↓
hu43860C qr1hi-8i9sb-wyqq2eji4ehiwkq huB1FD55 qr1hi-8i9sb-ep68uf0jkj3je7q huDF04CC qr1hi-8i9sb-4ts4cvx6mbtcrsk hu7A2F1D qr1hi-8i9sb-5lkiu9sh7vdgven hu553620 qr1hi-8i9sb-nu4p6hjmziic022 huD09534 qr1hi-8i9sb-bt9389e9g3ff0m1 hu599905 qr1hi-8i9sb-ocg0i8r75luvke3Monitor job progress by refreshing the Jobs page in Workbench, or by using the API:
map(lambda j: arvados.service.jobs().get(uuid=j['uuid']).execute()['success'], job.values())%(darr)↓%
[True, True, True, True, True, True, True](Unfinished jobs will appear as None, failed jobs as False, and completed jobs as True.) After the jobs have completed, check output file sizes.
for collection_uuid in job: job_uuid = job[collection_uuid]['uuid'] job_output = arvados.service.jobs().get(uuid=job_uuid).execute()['output'] output_files = arvados.service.collections().get(uuid=job_output).execute()['files'] print "%s %3d %s" % (pgpid[collection_uuid], output_files[0][2], job_output)%(darr)↓%
hu599905 80 5644238bfb2a1925d423f2c264819cfb+75+K@qr1hi huD09534 80 f98f92573cf521333607910d320cc33b+75+K@qr1hi huB1FD55 0 c10e07d8d90b51ee7f3b0a5855dc77c3+65+K@qr1hi hu7A2F1D 80 922c4ce8d3dab3268edf8b9312cc63d4+75+K@qr1hi hu553620 0 66da988f45a7ee16b6058fcbe9859d69+65+K@qr1hi huDF04CC 80 bbe919451a437dde236a561d4e469ad2+75+K@qr1hi hu43860C 0 45797e38410de9b9ddef2f4f0ec41a93+76+K@qr1hiThus, of the 7 WGS results available for PGP participants reporting non-melanoma skin cancer, 4 include the rs1126809 / TYR-R402Q variant.