---
layout: default
navsection: userguide
title: "Using GATK with Arvados"
navorder: 116
---

h1. Tutorial: Using GATK with Arvados

This tutorial demonstrates how to use the Genome Analysis Toolkit (GATK) with Arvados. In this example we will install GATK and then create a VariantFiltration job to assign pass/fail scores to variants in a VCF file.

*This tutorial assumes that you are "logged into an Arvados VM instance":{{site.basedoc}}/user/getting_started/ssh-access.html#login, and have a "working environment.":{{site.basedoc}}/user/getting_started/check-environment.html*

h2. Installing GATK

Download the GATK binary tarball[1] -- e.g., @GenomeAnalysisTK-2.6-4.tar.bz2@ -- and "copy it to your Arvados VM":tutorial-keep.html. Then store the tarball in Keep. The command prints the locator of the new collection, which you will need when you submit the job.

<notextile>
<pre><code>$ <span class="userinput">arv keep put GenomeAnalysisTK-2.6-4.tar.bz2</span>
c905c8d8443a9c44274d98b7c6cfaa32+94
</code></pre>
</notextile>

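The locators shown on this page all share one shape: 32 hexadecimal digits, a @+@, a decimal size, and sometimes further @+@-separated hints. As a minimal sketch, assuming that shape holds for your installation, a small shell function can catch a mistyped locator before it ends up in a job description. The name @is_keep_locator@ is hypothetical and not part of the @arv@ command line tool.

```shell
# Hypothetical helper (not part of the arv CLI): succeed if the argument
# looks like the locators shown in this tutorial, i.e. 32 lowercase hex
# digits, "+", a decimal size, then optional "+hint" suffixes.
is_keep_locator() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{32}\+[0-9]+(\+[A-Za-z0-9@_-]+)*$'
}
```

For example, @is_keep_locator c905c8d8443a9c44274d98b7c6cfaa32+94@ succeeds, while passing the tarball's file name instead of its locator fails. The check only looks at the format; it does not confirm that the collection exists in Keep.
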
Next, you need the GATK Resource Bundle[2]. This may already be available in Arvados. If not, you will need to download the files listed below and put them into Keep.

<notextile>
<pre><code>$ <span class="userinput">arv keep ls -s d237a90bae3870b3b033aea1e99de4a9+10820</span>
  50342 1000G_omni2.5.b37.vcf.gz
      1 1000G_omni2.5.b37.vcf.gz.md5
    464 1000G_omni2.5.b37.vcf.idx.gz
      1 1000G_omni2.5.b37.vcf.idx.gz.md5
  43981 1000G_phase1.indels.b37.vcf.gz
      1 1000G_phase1.indels.b37.vcf.gz.md5
    326 1000G_phase1.indels.b37.vcf.idx.gz
      1 1000G_phase1.indels.b37.vcf.idx.gz.md5
 537210 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz
      1 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz.md5
   3473 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.idx.gz
      1 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.idx.gz.md5
  19403 Mills_and_1000G_gold_standard.indels.b37.vcf.gz
      1 Mills_and_1000G_gold_standard.indels.b37.vcf.gz.md5
    536 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
      1 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz.md5
  29291 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.gz.md5
    565 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.idx.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.idx.gz.md5
  37930 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.gz.md5
    592 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.idx.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.idx.gz.md5
5898484 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam
    112 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam.bai.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam.bai.gz.md5
      1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam.md5
   3837 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.gz.md5
     65 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.idx.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.idx.gz.md5
 275757 dbsnp_137.b37.excluding_sites_after_129.vcf.gz
      1 dbsnp_137.b37.excluding_sites_after_129.vcf.gz.md5
   3735 dbsnp_137.b37.excluding_sites_after_129.vcf.idx.gz
      1 dbsnp_137.b37.excluding_sites_after_129.vcf.idx.gz.md5
 998153 dbsnp_137.b37.vcf.gz
      1 dbsnp_137.b37.vcf.gz.md5
   3890 dbsnp_137.b37.vcf.idx.gz
      1 dbsnp_137.b37.vcf.idx.gz.md5
  58418 hapmap_3.3.b37.vcf.gz
      1 hapmap_3.3.b37.vcf.gz.md5
    999 hapmap_3.3.b37.vcf.idx.gz
      1 hapmap_3.3.b37.vcf.idx.gz.md5
      3 human_g1k_v37.dict.gz
      1 human_g1k_v37.dict.gz.md5
      2 human_g1k_v37.fasta.fai.gz
      1 human_g1k_v37.fasta.fai.gz.md5
 849537 human_g1k_v37.fasta.gz
      1 human_g1k_v37.fasta.gz.md5
      1 human_g1k_v37.stats.gz
      1 human_g1k_v37.stats.gz.md5
      3 human_g1k_v37_decoy.dict.gz
      1 human_g1k_v37_decoy.dict.gz.md5
      2 human_g1k_v37_decoy.fasta.fai.gz
      1 human_g1k_v37_decoy.fasta.fai.gz.md5
 858592 human_g1k_v37_decoy.fasta.gz
      1 human_g1k_v37_decoy.fasta.gz.md5
      1 human_g1k_v37_decoy.stats.gz
      1 human_g1k_v37_decoy.stats.gz.md5
</code></pre>
</notextile>

h2. Submit a GATK job

The Arvados distribution includes an example crunch script ("crunch_scripts/GATK2-VariantFiltration":https://arvados.org/projects/arvados/repository/revisions/master/entry/crunch_scripts/GATK2-VariantFiltration) that runs the GATK VariantFiltration tool with some default settings. The following commands write a job description for this script to a file named @the_job@.

<notextile>
<pre><code>$ <span class="userinput">src_version=76588bfc57f33ea1b36b82ca7187f465b73b4ca4</span>
$ <span class="userinput">vcf_input=5ee633fe2569d2a42dd81b07490d5d13+82</span>
$ <span class="userinput">gatk_binary=c905c8d8443a9c44274d98b7c6cfaa32+94</span>
$ <span class="userinput">gatk_bundle=d237a90bae3870b3b033aea1e99de4a9+10820</span>
$ <span class="userinput">cat &gt;the_job &lt;&lt;EOF
{
 "script":"GATK2-VariantFiltration",
 "script_version":"$src_version",
 "script_parameters":
 {
  "input":"$vcf_input",
  "gatk_binary_tarball":"$gatk_binary",
  "gatk_bundle":"$gatk_bundle"
 }
}
EOF</span>
</code></pre>
</notextile>

* @"input"@ is a Keep collection containing the source VCF data. Here we are using an exome report from PGP participant hu34D5B9.
* @"gatk_binary_tarball"@ is a Keep collection containing the GATK 2 binary distribution[1] tar file.
* @"gatk_bundle"@ is a Keep collection containing the GATK resource bundle[2].

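The job description above depends on four shell variables being set before the here-document is expanded. As a hedged sketch, the same JSON can be produced by a function that takes the four values as explicit arguments, which makes it harder to submit a job with an empty parameter by accident. The function name @make_gatk_job@ is hypothetical, and the sketch assumes the values need no JSON escaping, which holds for commit hashes and Keep locators.

```shell
# Hypothetical helper: print the job description used in this tutorial.
# Arguments: script version, input VCF collection, GATK tarball
# collection, GATK resource bundle collection.
# Assumes the values need no JSON escaping (true for commit hashes and
# Keep locators).
make_gatk_job() {
  # Refuse to build a job description with a missing parameter.
  if [ "$#" -ne 4 ]; then
    echo "usage: make_gatk_job VERSION INPUT TARBALL BUNDLE" >&2
    return 1
  fi
  cat <<EOF
{
 "script":"GATK2-VariantFiltration",
 "script_version":"$1",
 "script_parameters":
 {
  "input":"$2",
  "gatk_binary_tarball":"$3",
  "gatk_bundle":"$4"
 }
}
EOF
}
```

With the variables from the example, @make_gatk_job "$src_version" "$vcf_input" "$gatk_binary" "$gatk_bundle" >the_job@ would write the same file as the here-document.
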
Now start a job:

<notextile>
<pre><code>$ <span class="userinput">arv -h job create --job "$(cat the_job)"</span>
{
 "href":"https://qr1hi.arvadosapi.com/arvados/v1/jobs/qr1hi-8i9sb-n9k7qyp7bs5b9d4",
 "kind":"arvados#job",
 "etag":"9j99n1feoxw3az448f8ises12",
 "uuid":"qr1hi-8i9sb-n9k7qyp7bs5b9d4",
 "owner_uuid":"qr1hi-tpzed-9zdpkpni2yddge6",
 "created_at":"2013-12-17T19:02:15Z",
 "modified_by_client_uuid":"qr1hi-ozdt8-obw7foaks3qjyej",
 "modified_by_user_uuid":"qr1hi-tpzed-9zdpkpni2yddge6",
 "modified_at":"2013-12-17T19:02:15Z",
 "updated_at":"2013-12-17T19:02:15Z",
 "submit_id":null,
 "priority":null,
 "script":"GATK2-VariantFiltration",
 "script_parameters":{
  "input":"5ee633fe2569d2a42dd81b07490d5d13+82",
  "gatk_binary_tarball":"c905c8d8443a9c44274d98b7c6cfaa32+94",
  "gatk_bundle":"d237a90bae3870b3b033aea1e99de4a9+10820"
 },
 "script_version":"76588bfc57f33ea1b36b82ca7187f465b73b4ca4",
 "cancelled_at":null,
 "cancelled_by_client_uuid":null,
 "cancelled_by_user_uuid":null,
 "started_at":null,
 "finished_at":null,
 "output":null,
 "success":null,
 "running":null,
 "is_locked_by_uuid":null,
 "log":null,
 "runtime_constraints":{},
 "tasks_summary":{},
 "dependencies":[
  "5ee633fe2569d2a42dd81b07490d5d13+82",
  "c905c8d8443a9c44274d98b7c6cfaa32+94",
  "d237a90bae3870b3b033aea1e99de4a9+10820"
 ],
 "log_stream_href":"https://qr1hi.arvadosapi.com/arvados/v1/jobs/qr1hi-8i9sb-n9k7qyp7bs5b9d4/log_tail_follow"
}
$ <span class="userinput">arv job log_tail_follow --uuid qr1hi-8i9sb-n9k7qyp7bs5b9d4</span>
Tue Dec 17 19:02:16 2013 salloc: Granted job allocation 1251
Tue Dec 17 19:02:17 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  check slurm allocation
Tue Dec 17 19:02:17 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  node compute13 - 8 slots
Tue Dec 17 19:02:17 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  start
Tue Dec 17 19:02:17 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  Install revision 76588bfc57f33ea1b36b82ca7187f465b73b4ca4
Tue Dec 17 19:02:18 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  Clean-work-dir exited 0
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  Install exited 0
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  script GATK2-VariantFiltration
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  script_version 76588bfc57f33ea1b36b82ca7187f465b73b4ca4
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  script_parameters {"input":"5ee633fe2569d2a42dd81b07490d5d13+82","gatk_bundle":"d237a90bae3870b3b033aea1e99de4a9+10820","gatk_binary_tarball":"c905c8d8443a9c44274d98b7c6cfaa32+94"}
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  runtime_constraints {"max_tasks_per_node":0}
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  start level 0
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  status: 0 done, 0 running, 1 todo
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 job_task qr1hi-ot0gb-d3sjxerucfbvyev
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 child 4946 started on compute13.1
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  status: 0 done, 1 running, 0 todo
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 child 4946 on compute13.1 exit 0 signal 0 success=true
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 success in 1 seconds
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 output
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  wait for last 0 children to finish
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  status: 1 done, 0 running, 1 todo
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  start level 1
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  status: 1 done, 0 running, 1 todo
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 job_task qr1hi-ot0gb-w8ujbnulxjaamxf
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 child 4984 started on compute13.1
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  status: 1 done, 1 running, 0 todo
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 child 4984 on compute13.1 exit 0 signal 0 success=true
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 success in 110 seconds
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 output bedd6ff56b3ae9f90d873b1fcb72f9a3+91+K@qr1hi
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  wait for last 0 children to finish
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  status: 2 done, 0 running, 0 todo
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  release job allocation
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  Freeze not implemented
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  collate
Tue Dec 17 19:04:10 2013 salloc: Job allocation 1251 has been revoked.
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  output bedd6ff56b3ae9f90d873b1fcb72f9a3+91+K@qr1hi
Tue Dec 17 19:04:11 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  finish
Tue Dec 17 19:04:12 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867  log manifest is 1e77aaceee2df499e14dc5dde5c3d328+91+K@qr1hi
</code></pre>
</notextile>

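The locator of the job's output appears near the end of the log, on the job-level @output@ line. As a rough sketch, assuming the log layout shown above (which may differ between Arvados versions), the locator can be pulled out of a saved copy of the log with @awk@. The function name @job_output_locator@ is hypothetical.

```shell
# Hypothetical helper: read a crunch job log on stdin and print the
# locator from the last job-level "output" line.
# Assumes the layout shown above: five timestamp fields, job UUID,
# process ID, then "output" and the locator. Task-level lines carry a
# task number before "output" and are therefore skipped.
job_output_locator() {
  awk '$8 == "output" && NF >= 9 { loc = $9 } END { if (loc != "") print loc }'
}
```

If the log were saved to a file, say @job.log@ (a hypothetical name), then @job_output_locator <job.log@ would print the locator to pass to @arv keep ls@. Nothing is printed when no job-level @output@ line is present.
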
Once the job completes, the output can be found in @hu34D5B9-exome-filtered.vcf@:

<notextile>
<pre><code>$ <span class="userinput">arv keep ls bedd6ff56b3ae9f90d873b1fcb72f9a3+91+K@qr1hi</span>
hu34D5B9-exome-filtered.vcf
</code></pre>
</notextile>

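In a VCF file, the pass/fail result for each variant is stored in the FILTER column, the seventh tab-separated field of each record: @PASS@ marks a record that passed every filter, and any other value names the filters it failed. As a minimal sketch for checking the result, the function below tallies the FILTER values in a VCF file read from standard input. The name @count_vcf_filters@ is hypothetical, and the sketch assumes an uncompressed, tab-separated file.

```shell
# Hypothetical helper: read a VCF file on stdin and print how many
# records carry each FILTER value (column 7 of the tab-separated
# records; header lines starting with "#" are ignored).
# Output lines have the form "COUNT VALUE", sorted by VALUE.
count_vcf_filters() {
  awk -F '\t' '!/^#/ && NF >= 7 { n[$7]++ } END { for (f in n) print n[f], f }' | sort -k2
}
```

Given a local copy of the output file, @count_vcf_filters <hu34D5B9-exome-filtered.vcf@ would print one line per distinct FILTER value with the number of records that carry it.
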
h2. Notes

fn1. "Download the GATK tools":http://www.broadinstitute.org/gatk/download

fn2. "Information about the GATK resource bundle":http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it and "direct download link":ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37/ (if prompted, submit an empty password)