---
layout: default
navsection: userguide
title: "Tutorial: Search PGP data by trait"
navorder: 14
---

h1. Tutorial: Search PGP data by trait

Here you will use the Python SDK to find public WGS data for people who have a certain medical condition.

<!-- _Define WGS_ -->

<!-- _Explain the motivation in this example a little better.  If I'm
reading this right, the workflow is 
traits -> people with those traits -> presense of a specific genetic
variant in the people with the reported traits_ -->

<!-- _Rather than having the user do this through the Python command line,
it might be easier to write a file that is going to do each step_ -->


h3. Prerequisites

* Log in to a VM "using SSH":ssh-access.html
* Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
* Put the API host name in your @ARVADOS_API_HOST@ environment variable
* Run the @python@ interactive shell.

If everything is set up correctly, you will be able to import the arvados SDK:

<pre>
import arvados
</pre>

...and display your account information:

<pre>
arvados.service.users().current().execute()
</pre>

h3. More prerequisites

<pre>
import re
import json
</pre>

h3. Find traits.

List traits containing the term "cancer":

<pre>
for t in filter(lambda t: re.search('cancer', t['name']),
                arvados.service.traits().list(limit=1000).execute()['items']):
  print t['uuid'], t['name']

</pre>

<!-- _Should break this down into steps instead of being clever and making
it a python one-liner_ -->

%(darr)&darr;%

<pre>
...
qr1hi-q1cn2-8q57g2diohwnzm0 Cervical cancer
qr1hi-q1cn2-vqp4243janpjbyj Breast cancer
qr1hi-q1cn2-v6usijujcpwqrn1 Non-melanoma skin cancer
...
</pre>

We will use the "Non-melanoma skin cancer" trait with uuid @qr1hi-q1cn2-v6usijujcpwqrn1@.

<pre>
trait_uuid = 'qr1hi-q1cn2-v6usijujcpwqrn1'
</pre>

h3. Find humans.

List humans who report this condition:

<pre>
trait_links = arvados.service.links().list(limit=1000,where=json.dumps({
    'link_class': 'human_trait',
    'tail_kind': 'arvados#human',
    'head_uuid': trait_uuid
  })).execute()['items']
</pre>

<!-- _Same comment, break this out and describe each step_ -->

The "tail_uuid" attribute of each of these Links refers to a Human.

<pre>
map(lambda l: l['tail_uuid'], trait_links)
</pre>

%(darr)&darr;%

<pre>
[u'1h9kt-7a9it-c0uqa4kcdh29wdf', u'1h9kt-7a9it-x4tru6mn40hc6ah',
u'1h9kt-7a9it-yqb8m5s9cpy88i8', u'1h9kt-7a9it-46sm75w200ngwny',
u'1h9kt-7a9it-gx85a4tdkpzsg3w', u'1h9kt-7a9it-8cvlaa8909lgeo9',
u'1h9kt-7a9it-as37qum2pq8vizb', u'1h9kt-7a9it-14fph66z2baqxb9',
u'1h9kt-7a9it-e9zc7i4crmw3v69', u'1h9kt-7a9it-np7f35hlijlxdmt',
u'1h9kt-7a9it-j9hqyjwbvo9cojn', u'1h9kt-7a9it-lqxdtm1gynmsv13',
u'1h9kt-7a9it-zkhhxjfg2o22ywq', u'1h9kt-7a9it-nsjoxqd33lzldw9',
u'1h9kt-7a9it-ytect4smzcgd4kg', u'1h9kt-7a9it-y6tl353b3jc4tos',
u'1h9kt-7a9it-98f8qave4f8vbs5', u'1h9kt-7a9it-gd72sh15q0p4wq3',
u'1h9kt-7a9it-zlx25dscak94q9h', u'1h9kt-7a9it-8gronw4rbgmim01',
u'1h9kt-7a9it-wclfkjcb23tr5es', u'1h9kt-7a9it-rvp2qe7szfz4dy6',
u'1h9kt-7a9it-50iffhmpzsktwjm', u'1h9kt-7a9it-ul412id5y31a5o8',
u'1h9kt-7a9it-732kwkfzylmt4ik', u'1h9kt-7a9it-v9zqxegpblsbtai',
u'1h9kt-7a9it-kmaraqduit1v5wd', u'1h9kt-7a9it-t1nwtlo1hru5vvq',
u'1h9kt-7a9it-q3w6j9od4ibpoyl', u'1h9kt-7a9it-qz8vzkuuz97ezwv',
u'1h9kt-7a9it-t1v8sjz6dm9jmjf', u'1h9kt-7a9it-qe8wrbyvuqs5jew']
</pre>

h3. Find PGP IDs.

For now we don't need to look up the Human objects themselves.

As an aside, we will look up "identifier" links to find PGP-assigned participant identifiers:

<pre>
human_uuids = map(lambda l: l['tail_uuid'], trait_links)
pgpid_links = arvados.service.links().list(limit=1000,where=json.dumps({
    "link_class": "identifier",
    "head_uuid": human_uuids
  })).execute()['items']
map(lambda l: l['name'], pgpid_links)
</pre>

%(darr)&darr;%

<pre>
[u'hu01024B', u'hu11603C', u'hu15402B', u'hu174334', u'hu1BD549', u'hu237A50',
 u'hu34A921', u'hu397733', u'hu414115', u'hu43860C', u'hu474789', u'hu553620',
 u'hu56B3B6', u'hu5917F3', u'hu599905', u'hu5E55F5', u'hu602487', u'hu633787',
 u'hu68F245', u'hu6C3F34', u'hu7260DD', u'hu7A2F1D', u'hu94040B', u'hu9E356F',
 u'huAB8707', u'huB1FD55', u'huB4883B', u'huD09050', u'huD09534', u'huD3A569',
 u'huDF04CC', u'huE2E371']
</pre>

These PGP IDs let us find public profiles:

* "https://my.personalgenomes.org/profile/huE2E371":https://my.personalgenomes.org/profile/huE2E371
* "https://my.personalgenomes.org/profile/huDF04CC":https://my.personalgenomes.org/profile/huDF04CC
* ...

h3. Find data.

Find Collections that were provided by these Humans.

<pre>
provenance_links = arvados.service.links().list(where=json.dumps({
    "link_class": "provenance",
    "name": "provided",
    "tail_uuid": human_uuids
  })).execute()['items']
collection_uuids = map(lambda l: l['head_uuid'], provenance_links)

# build map of human uuid -> PGP ID
pgpid = {}
for pgpid_link in pgpid_links:
  pgpid[pgpid_link['head_uuid']] = pgpid_link['name']

# build map of collection uuid -> PGP ID
for p_link in provenance_links:
  pgpid[p_link['head_uuid']] = pgpid[p_link['tail_uuid']]

# get details (e.g., list of files) of each collection
collections = arvados.service.collections().list(where=json.dumps({
    "uuid": collection_uuids
  })).execute()['items']

# print PGP public profile links with file locators
for c in collections:
  for f in c['files']:
    print "https://my.personalgenomes.org/profile/%s %s %s%s" % (pgpid[c['uuid']], c['uuid'], ('' if f[0] == '.' else f[0]+'/'), f[1])

</pre>

%(darr)&darr;%

<pre>
https://my.personalgenomes.org/profile/hu43860C a58dca7609fa84c8c38a7e926a97b2fc+302+K@qr1hi var-GS00253-DNA_A01_200_37-ASM.tsv.bz2
https://my.personalgenomes.org/profile/huB1FD55 ea30eb9e46eedf7f05ed6e348c2baf5d+291+K@qr1hi var-GS000010320-ASM.tsv.bz2
https://my.personalgenomes.org/profile/huDF04CC 4ab0df8f22f595d1747a22c476c05873+242+K@qr1hi var-GS000010427-ASM.tsv.bz2
https://my.personalgenomes.org/profile/hu7A2F1D 756d0ada29b376140f64e7abfe6aa0e7+242+K@qr1hi var-GS000014566-ASM.tsv.bz2
https://my.personalgenomes.org/profile/hu553620 7ed4e425bb1c7cc18387cbd9388181df+242+K@qr1hi var-GS000015272-ASM.tsv.bz2
https://my.personalgenomes.org/profile/huD09534 542112e210daff30dd3cfea4801a9f2f+242+K@qr1hi var-GS000016374-ASM.tsv.bz2
https://my.personalgenomes.org/profile/hu599905 33a9f3842b01ea3fdf27cc582f5ea2af+242+K@qr1hi var-GS000016015-ASM.tsv.bz2
https://my.personalgenomes.org/profile/hu599905 d6e2e57cd60ba5979006d0b03e45e726+81+K@qr1hi Witch_results.zip
https://my.personalgenomes.org/profile/hu553620 ea4f2d325592a1272f989d141a917fdd+85+K@qr1hi Devenwood_results.zip
https://my.personalgenomes.org/profile/hu7A2F1D 4580f6620bb15b25b18373766e14e4a7+85+K@qr1hi Innkeeper_results.zip
https://my.personalgenomes.org/profile/huD09534 fee37be9440b912eb90f5e779f272416+82+K@qr1hi Hallet_results.zip
</pre>

h3. Search for a variant.

Look for variant rs1126809 in each of the "var" files (these contain variant calls from WGS data).

<pre>
job = {}
for c in collections:
  if [] != filter(lambda f: re.search('^var-.*\.tsv\.bz2', f[1]), c['files']):
    job[c['uuid']] = arvados.service.jobs().create(body={
      'script': 'grep',
      'script_parameters': {'input': c['uuid'], 'pattern': "rs1126809\\b"},
      'script_version': 'e7aeb42'
    }).execute()
    print "%s %s" % (pgpid[c['uuid']], job[c['uuid']]['uuid'])

</pre>

&darr;

<pre>
hu43860C qr1hi-8i9sb-wyqq2eji4ehiwkq
huB1FD55 qr1hi-8i9sb-ep68uf0jkj3je7q
huDF04CC qr1hi-8i9sb-4ts4cvx6mbtcrsk
hu7A2F1D qr1hi-8i9sb-5lkiu9sh7vdgven
hu553620 qr1hi-8i9sb-nu4p6hjmziic022
huD09534 qr1hi-8i9sb-bt9389e9g3ff0m1
hu599905 qr1hi-8i9sb-ocg0i8r75luvke3
</pre>

Monitor job progress by refreshing the Jobs page in Workbench, or by using the API:

<pre>
map(lambda j: arvados.service.jobs().get(uuid=j['uuid']).execute()['success'], job.values())
</pre>

%(darr)&darr;%

<pre>
[True, True, True, True, True, True, True]
</pre>

(Unfinished jobs will appear as None, failed jobs as False, and completed jobs as True.)

After the jobs have completed, check output file sizes.

<pre>
for collection_uuid in job:
  job_uuid = job[collection_uuid]['uuid']
  job_output = arvados.service.jobs().get(uuid=job_uuid).execute()['output']
  output_files = arvados.service.collections().get(uuid=job_output).execute()['files']
  print "%s %3d %s" % (pgpid[collection_uuid], output_files[0][2], job_output)

</pre>

%(darr)&darr;%

<pre>
hu599905  80 5644238bfb2a1925d423f2c264819cfb+75+K@qr1hi
huD09534  80 f98f92573cf521333607910d320cc33b+75+K@qr1hi
huB1FD55   0 c10e07d8d90b51ee7f3b0a5855dc77c3+65+K@qr1hi
hu7A2F1D  80 922c4ce8d3dab3268edf8b9312cc63d4+75+K@qr1hi
hu553620   0 66da988f45a7ee16b6058fcbe9859d69+65+K@qr1hi
huDF04CC  80 bbe919451a437dde236a561d4e469ad2+75+K@qr1hi
hu43860C   0 45797e38410de9b9ddef2f4f0ec41a93+76+K@qr1hi
</pre>

Thus, of the 7 WGS results available for PGP participants reporting non-melanoma skin cancer, 4 include the rs1126809 / TYR-R402Q variant.