extend pgp data tutorial
[arvados.git] / doc / user / tutorial-trait-search.textile
1 ---
2 layout: default
3 navsection: userguide
4 title: "Tutorial: Search PGP data by trait"
5 navorder: 22
6 ---
7
8 h1. Tutorial: Search PGP data by trait
9
10 Here you will use the Python SDK to find public WGS data for people who have a certain medical condition.
11
12 h3. Prerequisites
13
14 * Log in to a VM "using SSH":ssh-access.html
15 * Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
16 * Put the API host name in your @ARVADOS_API_HOST@ environment variable
17 * Run the @python@ interactive shell.
18
19 If everything is set up correctly, you will be able to import the arvados SDK:
20
21 <pre>
22 import arvados
23 </pre>
24
25 ...and display your account information:
26
27 <pre>
28 arvados.service.users().current().execute()
29 </pre>
30
31 h3. More prerequisites
32
33 <pre>
34 import re
35 import json
36 </pre>
37
38 h3. Find traits.
39
40 List traits containing the term "cancer":
41
42 <pre>
43 for t in filter(lambda t: re.search('cancer', t['name']),
44                 arvados.service.traits().list(limit=1000).execute()['items']):
45   print t['uuid'], t['name']
46
47 </pre>
48
49 &darr;
50
51 <pre>
52 ...
53 qr1hi-q1cn2-8q57g2diohwnzm0 Cervical cancer
54 qr1hi-q1cn2-vqp4243janpjbyj Breast cancer
55 qr1hi-q1cn2-v6usijujcpwqrn1 Non-melanoma skin cancer
56 ...
57 </pre>
58
59 We will use the "Non-melanoma skin cancer" trait with uuid @qr1hi-q1cn2-v6usijujcpwqrn1@.
60
61 <pre>
62 trait_uuid = 'qr1hi-q1cn2-v6usijujcpwqrn1'
63 </pre>
64
65 h3. Find humans.
66
67 List humans who report this condition:
68
69 <pre>
70 trait_links = arvados.service.links().list(limit=1000,where=json.dumps({
71     'link_class': 'human_trait',
72     'tail_kind': 'arvados#human',
73     'head_uuid': trait_uuid
74   })).execute()['items']
75 </pre>
76
77 The "tail_uuid" attribute of each of these Links refers to a Human.
78
79 <pre>
80 map(lambda l: l['tail_uuid'], trait_links)
81 </pre>
82
83 &darr;
84
85 <pre>
86 [u'1h9kt-7a9it-c0uqa4kcdh29wdf', u'1h9kt-7a9it-x4tru6mn40hc6ah',
87 u'1h9kt-7a9it-yqb8m5s9cpy88i8', u'1h9kt-7a9it-46sm75w200ngwny',
88 u'1h9kt-7a9it-gx85a4tdkpzsg3w', u'1h9kt-7a9it-8cvlaa8909lgeo9',
89 u'1h9kt-7a9it-as37qum2pq8vizb', u'1h9kt-7a9it-14fph66z2baqxb9',
90 u'1h9kt-7a9it-e9zc7i4crmw3v69', u'1h9kt-7a9it-np7f35hlijlxdmt',
91 u'1h9kt-7a9it-j9hqyjwbvo9cojn', u'1h9kt-7a9it-lqxdtm1gynmsv13',
92 u'1h9kt-7a9it-zkhhxjfg2o22ywq', u'1h9kt-7a9it-nsjoxqd33lzldw9',
93 u'1h9kt-7a9it-ytect4smzcgd4kg', u'1h9kt-7a9it-y6tl353b3jc4tos',
94 u'1h9kt-7a9it-98f8qave4f8vbs5', u'1h9kt-7a9it-gd72sh15q0p4wq3',
95 u'1h9kt-7a9it-zlx25dscak94q9h', u'1h9kt-7a9it-8gronw4rbgmim01',
96 u'1h9kt-7a9it-wclfkjcb23tr5es', u'1h9kt-7a9it-rvp2qe7szfz4dy6',
97 u'1h9kt-7a9it-50iffhmpzsktwjm', u'1h9kt-7a9it-ul412id5y31a5o8',
98 u'1h9kt-7a9it-732kwkfzylmt4ik', u'1h9kt-7a9it-v9zqxegpblsbtai',
99 u'1h9kt-7a9it-kmaraqduit1v5wd', u'1h9kt-7a9it-t1nwtlo1hru5vvq',
100 u'1h9kt-7a9it-q3w6j9od4ibpoyl', u'1h9kt-7a9it-qz8vzkuuz97ezwv',
101 u'1h9kt-7a9it-t1v8sjz6dm9jmjf', u'1h9kt-7a9it-qe8wrbyvuqs5jew']
102 </pre>
103
104 h3. Find PGP IDs.
105
106 For now we don't need to look up the Human objects themselves.
107
108 As an aside, we will look up "identifier" links to find PGP-assigned participant identifiers:
109
110 <pre>
111 human_uuids = map(lambda l: l['tail_uuid'], trait_links)
112 pgpid_links = arvados.service.links().list(limit=1000,where=json.dumps({
113     "link_class": "identifier",
114     "head_uuid": human_uuids
115   })).execute()['items']
116 map(lambda l: l['name'], pgpid_links)
117 </pre>
118
119 &darr;
120
121 <pre>
122 [u'hu01024B', u'hu11603C', u'hu15402B', u'hu174334', u'hu1BD549', u'hu237A50',
123  u'hu34A921', u'hu397733', u'hu414115', u'hu43860C', u'hu474789', u'hu553620',
124  u'hu56B3B6', u'hu5917F3', u'hu599905', u'hu5E55F5', u'hu602487', u'hu633787',
125  u'hu68F245', u'hu6C3F34', u'hu7260DD', u'hu7A2F1D', u'hu94040B', u'hu9E356F',
126  u'huAB8707', u'huB1FD55', u'huB4883B', u'huD09050', u'huD09534', u'huD3A569',
127  u'huDF04CC', u'huE2E371']
128 </pre>
129
130 These PGP IDs let us find public profiles:
131
132 * "https://my.personalgenomes.org/profile/huE2E371":https://my.personalgenomes.org/profile/huE2E371
133 * "https://my.personalgenomes.org/profile/huDF04CC":https://my.personalgenomes.org/profile/huDF04CC
134 * ...
135
136 h3. Find data.
137
138 Find Collections that were provided by these Humans.
139
140 <pre>
141 provenance_links = arvados.service.links().list(where=json.dumps({
142     "link_class": "provenance",
143     "name": "provided",
144     "tail_uuid": human_uuids
145   })).execute()['items']
146 collection_uuids = map(lambda l: l['head_uuid'], provenance_links)
147
148 # build map of human uuid -> PGP ID
149 pgpid = {}
150 for pgpid_link in pgpid_links:
151   pgpid[pgpid_link['head_uuid']] = pgpid_link['name']
152
153 # build map of collection uuid -> PGP ID
154 for p_link in provenance_links:
155   pgpid[p_link['head_uuid']] = pgpid[p_link['tail_uuid']]
156
157 # get details (e.g., list of files) of each collection
158 collections = arvados.service.collections().list(where=json.dumps({
159     "uuid": collection_uuids
160   })).execute()['items']
161
162 # print PGP public profile links with file locators
163 for c in collections:
164   for f in c['files']:
165     print "https://my.personalgenomes.org/profile/%s %s %s%s" % (pgpid[c['uuid']], c['uuid'], ('' if f[0] == '.' else f[0]+'/'), f[1])
166
167 </pre>
168
169 &darr;
170
171 <pre>
172 https://my.personalgenomes.org/profile/hu43860C a58dca7609fa84c8c38a7e926a97b2fc+302+K@qr1hi var-GS00253-DNA_A01_200_37-ASM.tsv.bz2
173 https://my.personalgenomes.org/profile/huB1FD55 ea30eb9e46eedf7f05ed6e348c2baf5d+291+K@qr1hi var-GS000010320-ASM.tsv.bz2
174 https://my.personalgenomes.org/profile/huDF04CC 4ab0df8f22f595d1747a22c476c05873+242+K@qr1hi var-GS000010427-ASM.tsv.bz2
175 https://my.personalgenomes.org/profile/hu7A2F1D 756d0ada29b376140f64e7abfe6aa0e7+242+K@qr1hi var-GS000014566-ASM.tsv.bz2
176 https://my.personalgenomes.org/profile/hu553620 7ed4e425bb1c7cc18387cbd9388181df+242+K@qr1hi var-GS000015272-ASM.tsv.bz2
177 https://my.personalgenomes.org/profile/huD09534 542112e210daff30dd3cfea4801a9f2f+242+K@qr1hi var-GS000016374-ASM.tsv.bz2
178 https://my.personalgenomes.org/profile/hu599905 33a9f3842b01ea3fdf27cc582f5ea2af+242+K@qr1hi var-GS000016015-ASM.tsv.bz2
179 https://my.personalgenomes.org/profile/hu599905 d6e2e57cd60ba5979006d0b03e45e726+81+K@qr1hi Witch_results.zip
180 https://my.personalgenomes.org/profile/hu553620 ea4f2d325592a1272f989d141a917fdd+85+K@qr1hi Devenwood_results.zip
181 https://my.personalgenomes.org/profile/hu7A2F1D 4580f6620bb15b25b18373766e14e4a7+85+K@qr1hi Innkeeper_results.zip
182 https://my.personalgenomes.org/profile/huD09534 fee37be9440b912eb90f5e779f272416+82+K@qr1hi Hallet_results.zip
183 </pre>
184
185 h3. Search for a variant.
186
187 Look for variant rs1126809 in each of the "var" files (these contain variant calls from WGS data).
188
189 <pre>
190 job = {}
191 for c in collections:
192   if [] != filter(lambda f: re.search('^var-.*\.tsv\.bz2', f[1]), c['files']):
193     job[c['uuid']] = arvados.service.jobs().create(job=json.dumps({
194       'script': 'grep',
195       'script_parameters': {'input': c['uuid'], 'pattern': "rs1126809\\b"},
196       'script_version': 'e7aeb42'
197     })).execute()
198     print "%s %s" % (pgpid[c['uuid']], job[c['uuid']]['uuid'])
199
200 </pre>
201
202 &darr;
203
204 <pre>
205 hu43860C qr1hi-8i9sb-wyqq2eji4ehiwkq
206 huB1FD55 qr1hi-8i9sb-ep68uf0jkj3je7q
207 huDF04CC qr1hi-8i9sb-4ts4cvx6mbtcrsk
208 hu7A2F1D qr1hi-8i9sb-5lkiu9sh7vdgven
209 hu553620 qr1hi-8i9sb-nu4p6hjmziic022
210 huD09534 qr1hi-8i9sb-bt9389e9g3ff0m1
211 hu599905 qr1hi-8i9sb-ocg0i8r75luvke3
212 </pre>
213
214 Check job progress:
215
216 <pre>
217 map(lambda j: arvados.service.jobs().get(uuid=j['uuid']).execute()['success'], job.values())
218 </pre>
219
220 &darr;
221
222 <pre>
223 [True, True, True, True, True, True, True]
224 </pre>
225
226 After the jobs have completed, check output file sizes.
227
228 <pre>
229 for collection_uuid in job:
230   job_uuid = job[collection_uuid]['uuid']
231   job_output = arvados.service.jobs().get(uuid=job_uuid).execute()['output']
232   output_files = arvados.service.collections().get(uuid=job_output).execute()['files']
233   print "%s %3d %s" % (pgpid[collection_uuid], output_files[0][2], job_output)
234
235 </pre>
236
237 &darr;
238
239 <pre>
240 hu599905  80 5644238bfb2a1925d423f2c264819cfb+75+K@qr1hi
241 huD09534  80 f98f92573cf521333607910d320cc33b+75+K@qr1hi
242 huB1FD55   0 c10e07d8d90b51ee7f3b0a5855dc77c3+65+K@qr1hi
243 hu7A2F1D  80 922c4ce8d3dab3268edf8b9312cc63d4+75+K@qr1hi
244 hu553620   0 66da988f45a7ee16b6058fcbe9859d69+65+K@qr1hi
245 huDF04CC  80 bbe919451a437dde236a561d4e469ad2+75+K@qr1hi
246 hu43860C   0 45797e38410de9b9ddef2f4f0ec41a93+76+K@qr1hi
247 </pre>
248
249 Thus, of the 7 WGS results available for PGP participants reporting non-melanoma skin cancer, 4 include the rs1126809 / TYR-R402Q variant.