title: "Using GATK with Arvados"
This tutorial demonstrates how to use the Genome Analysis Toolkit (GATK) with Arvados. In this example, we install GATK and then create a VariantFiltration job that marks each variant in a VCF file as passing or failing its filters.
*This tutorial assumes that you are "logged into an Arvados VM instance":{{site.baseurl}}/user/getting_started/ssh-access.html#login, and have a "working environment.":{{site.baseurl}}/user/getting_started/check-environment.html*
Download the GATK binary tarball[1] (for example, @GenomeAnalysisTK-2.6-4.tar.bz2@) and "copy it to your Arvados VM":{{site.baseurl}}/user/tutorials/tutorial-keep.html. Then store the tarball in Keep:
<pre><code>~$ <span class="userinput">arv keep put GenomeAnalysisTK-2.6-4.tar.bz2</span>
c905c8d8443a9c44274d98b7c6cfaa32+94
</code></pre>
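The value printed by @arv keep put@ is the locator used to refer to the tarball in later steps. As a minimal sketch, assuming the locator has the form of 32 hexadecimal digits followed by @+@ and a size (optionally followed by further @+@-separated hints), a script can validate it before passing it to a job. The names @KeepLocator@ and @parse_locator@ are hypothetical helpers, not part of the Arvados SDK.

```python
import re
from typing import NamedTuple


class KeepLocator(NamedTuple):
    """Hypothetical container for the parts of a locator string."""
    md5: str      # 32 hexadecimal digits
    size: int     # the number after the first "+"
    hints: tuple  # any further "+"-separated hints (often empty)


# Assumed shape: <32 hex digits>+<size>[+hint...]
_LOCATOR_RE = re.compile(r"^([0-9a-f]{32})\+(\d+)((?:\+[^+\s]+)*)$")


def parse_locator(text: str) -> KeepLocator:
    """Split a locator string into its parts, or raise ValueError."""
    match = _LOCATOR_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a Keep locator: {text!r}")
    hints = tuple(h for h in match.group(3).split("+") if h)
    return KeepLocator(match.group(1), int(match.group(2)), hints)


loc = parse_locator("c905c8d8443a9c44274d98b7c6cfaa32+94")
print(loc.md5, loc.size)
```

Checking the string early helps catch a truncated copy and paste before a job is submitted with an invalid input.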
Next, you need the GATK Resource Bundle[2]. It may already be available in Arvados. If not, download the files listed below and put them into Keep.
<pre><code>~$ <span class="userinput">arv keep ls -s d237a90bae3870b3b033aea1e99de4a9+10820</span>
50342 1000G_omni2.5.b37.vcf.gz
1 1000G_omni2.5.b37.vcf.gz.md5
464 1000G_omni2.5.b37.vcf.idx.gz
1 1000G_omni2.5.b37.vcf.idx.gz.md5
43981 1000G_phase1.indels.b37.vcf.gz
1 1000G_phase1.indels.b37.vcf.gz.md5
326 1000G_phase1.indels.b37.vcf.idx.gz
1 1000G_phase1.indels.b37.vcf.idx.gz.md5
537210 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz
1 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz.md5
3473 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.idx.gz
1 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.idx.gz.md5
19403 Mills_and_1000G_gold_standard.indels.b37.vcf.gz
1 Mills_and_1000G_gold_standard.indels.b37.vcf.gz.md5
536 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
1 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz.md5
29291 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.gz
1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.gz.md5
565 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.idx.gz
1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.idx.gz.md5
37930 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.gz
1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.gz.md5
592 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.idx.gz
1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.idx.gz.md5
5898484 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam
112 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam.bai.gz
1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam.bai.gz.md5
1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam.md5
3837 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.gz
1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.gz.md5
65 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.idx.gz
1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.idx.gz.md5
275757 dbsnp_137.b37.excluding_sites_after_129.vcf.gz
1 dbsnp_137.b37.excluding_sites_after_129.vcf.gz.md5
3735 dbsnp_137.b37.excluding_sites_after_129.vcf.idx.gz
1 dbsnp_137.b37.excluding_sites_after_129.vcf.idx.gz.md5
998153 dbsnp_137.b37.vcf.gz
1 dbsnp_137.b37.vcf.gz.md5
3890 dbsnp_137.b37.vcf.idx.gz
1 dbsnp_137.b37.vcf.idx.gz.md5
58418 hapmap_3.3.b37.vcf.gz
1 hapmap_3.3.b37.vcf.gz.md5
999 hapmap_3.3.b37.vcf.idx.gz
1 hapmap_3.3.b37.vcf.idx.gz.md5
3 human_g1k_v37.dict.gz
1 human_g1k_v37.dict.gz.md5
2 human_g1k_v37.fasta.fai.gz
1 human_g1k_v37.fasta.fai.gz.md5
849537 human_g1k_v37.fasta.gz
1 human_g1k_v37.fasta.gz.md5
1 human_g1k_v37.stats.gz
1 human_g1k_v37.stats.gz.md5
3 human_g1k_v37_decoy.dict.gz
1 human_g1k_v37_decoy.dict.gz.md5
2 human_g1k_v37_decoy.fasta.fai.gz
1 human_g1k_v37_decoy.fasta.fai.gz.md5
858592 human_g1k_v37_decoy.fasta.gz
1 human_g1k_v37_decoy.stats.gz
1 human_g1k_v37_decoy.stats.gz.md5
</code></pre>
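If you upload the bundle yourself, it can be useful to confirm that nothing is missing before submitting a job. The sketch below assumes the listing has been captured as text with one @size name@ pair per line, as in the output above. The function names and the choice of files to require are illustrative assumptions only; adjust the list to whatever your job actually reads.

```python
def parse_listing(text):
    """Map file name -> size column from `arv keep ls -s`-style text."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip prompts, blank lines, and anything not shaped like "<size> <name>".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        sizes[parts[1].strip()] = int(parts[0])
    return sizes


def missing_files(listing, required):
    """Return the required names that do not appear in the listing."""
    return sorted(set(required) - set(listing))


# A few lines copied from the listing above.
sample = """\
3 human_g1k_v37.dict.gz
2 human_g1k_v37.fasta.fai.gz
849537 human_g1k_v37.fasta.gz
998153 dbsnp_137.b37.vcf.gz
"""

# Hypothetical requirement list, for illustration only.
required = ["human_g1k_v37.fasta.gz", "hapmap_3.3.b37.vcf.gz"]

print(missing_files(parse_listing(sample), required))
```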
The Arvados distribution includes an example crunch script ("crunch_scripts/GATK2-VariantFiltration":https://arvados.org/projects/arvados/repository/revisions/master/entry/crunch_scripts/GATK2-VariantFiltration) that runs the GATK VariantFiltration tool with some default settings.
<pre><code>~$ <span class="userinput">src_version=76588bfc57f33ea1b36b82ca7187f465b73b4ca4</span>
~$ <span class="userinput">vcf_input=5ee633fe2569d2a42dd81b07490d5d13+82</span>
~$ <span class="userinput">gatk_binary=c905c8d8443a9c44274d98b7c6cfaa32+94</span>
~$ <span class="userinput">gatk_bundle=d237a90bae3870b3b033aea1e99de4a9+10820</span>
~$ <span class="userinput">cat >the_job <<EOF
{
 "script":"GATK2-VariantFiltration",
 "repository":"arvados",
 "script_version":"$src_version",
 "script_parameters":
 {
  "input":"$vcf_input",
  "gatk_binary_tarball":"$gatk_binary",
  "gatk_bundle":"$gatk_bundle"
 }
}
EOF</span>
</code></pre>
* @"input"@ is a collection containing the source VCF data. Here we are using an exome report from PGP participant hu34D5B9.
* @"gatk_binary_tarball"@ is a Keep collection containing the GATK 2 binary distribution[1] tar file.
* @"gatk_bundle"@ is a Keep collection containing the GATK resource bundle[2].
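The same job description can also be generated programmatically, which avoids quoting mistakes when the locators come from variables. This is only a sketch: @build_job@ is a hypothetical helper, and nesting the three inputs under @"script_parameters"@ follows the job record shown later in this tutorial.

```python
import json


def build_job(src_version, vcf_input, gatk_binary, gatk_bundle):
    """Return the job description as a plain dictionary."""
    return {
        "script": "GATK2-VariantFiltration",
        "repository": "arvados",
        "script_version": src_version,
        "script_parameters": {
            "input": vcf_input,
            "gatk_binary_tarball": gatk_binary,
            "gatk_bundle": gatk_bundle,
        },
    }


job = build_job(
    src_version="76588bfc57f33ea1b36b82ca7187f465b73b4ca4",
    vcf_input="5ee633fe2569d2a42dd81b07490d5d13+82",
    gatk_binary="c905c8d8443a9c44274d98b7c6cfaa32+94",
    gatk_bundle="d237a90bae3870b3b033aea1e99de4a9+10820",
)

# Print the JSON text; redirect it to a file such as the_job if desired.
print(json.dumps(job, indent=1))
```

Serializing with @json.dumps@ ensures the result is valid JSON even if a value contains characters that would need escaping.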
<pre><code>~$ <span class="userinput">arv job create --job "$(cat the_job)"</span>
 "href":"https://qr1hi.arvadosapi.com/arvados/v1/jobs/qr1hi-8i9sb-n9k7qyp7bs5b9d4",
 "kind":"arvados#job",
 "etag":"9j99n1feoxw3az448f8ises12",
 "uuid":"qr1hi-8i9sb-n9k7qyp7bs5b9d4",
 "owner_uuid":"qr1hi-tpzed-9zdpkpni2yddge6",
 "created_at":"2013-12-17T19:02:15Z",
 "modified_by_client_uuid":"qr1hi-ozdt8-obw7foaks3qjyej",
 "modified_by_user_uuid":"qr1hi-tpzed-9zdpkpni2yddge6",
 "modified_at":"2013-12-17T19:02:15Z",
 "updated_at":"2013-12-17T19:02:15Z",
 "script":"GATK2-VariantFiltration",
 "script_parameters":{
  "input":"5ee633fe2569d2a42dd81b07490d5d13+82",
  "gatk_binary_tarball":"c905c8d8443a9c44274d98b7c6cfaa32+94",
  "gatk_bundle":"d237a90bae3870b3b033aea1e99de4a9+10820"
 "script_version":"76588bfc57f33ea1b36b82ca7187f465b73b4ca4",
 "cancelled_by_client_uuid":null,
 "cancelled_by_user_uuid":null,
 "is_locked_by_uuid":null,
 "runtime_constraints":{},
  "5ee633fe2569d2a42dd81b07490d5d13+82",
  "c905c8d8443a9c44274d98b7c6cfaa32+94",
  "d237a90bae3870b3b033aea1e99de4a9+10820"
 "log_stream_href":"https://qr1hi.arvadosapi.com/arvados/v1/jobs/qr1hi-8i9sb-n9k7qyp7bs5b9d4/log_tail_follow"
~$ <span class="userinput">arv job log_tail_follow --uuid qr1hi-8i9sb-n9k7qyp7bs5b9d4</span>
Tue Dec 17 19:02:16 2013 salloc: Granted job allocation 1251
Tue Dec 17 19:02:17 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 check slurm allocation
Tue Dec 17 19:02:17 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 node compute13 - 8 slots
Tue Dec 17 19:02:17 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 start
Tue Dec 17 19:02:17 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 Install revision 76588bfc57f33ea1b36b82ca7187f465b73b4ca4
Tue Dec 17 19:02:18 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 Clean-work-dir exited 0
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 Install exited 0
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 script GATK2-VariantFiltration
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 script_version 76588bfc57f33ea1b36b82ca7187f465b73b4ca4
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 script_parameters {"input":"5ee633fe2569d2a42dd81b07490d5d13+82","gatk_bundle":"d237a90bae3870b3b033aea1e99de4a9+10820","gatk_binary_tarball":"c905c8d8443a9c44274d98b7c6cfaa32+94"}
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 runtime_constraints {"max_tasks_per_node":0}
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 start level 0
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 status: 0 done, 0 running, 1 todo
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 job_task qr1hi-ot0gb-d3sjxerucfbvyev
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 child 4946 started on compute13.1
Tue Dec 17 19:02:19 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 status: 0 done, 1 running, 0 todo
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 child 4946 on compute13.1 exit 0 signal 0 success=true
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 success in 1 seconds
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 output
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 wait for last 0 children to finish
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 status: 1 done, 0 running, 1 todo
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 start level 1
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 status: 1 done, 0 running, 1 todo
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 job_task qr1hi-ot0gb-w8ujbnulxjaamxf
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 child 4984 started on compute13.1
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 status: 1 done, 1 running, 0 todo
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 child 4984 on compute13.1 exit 0 signal 0 success=true
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 success in 110 seconds
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 output bedd6ff56b3ae9f90d873b1fcb72f9a3+91
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 wait for last 0 children to finish
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 status: 2 done, 0 running, 0 todo
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 release job allocation
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 Freeze not implemented
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 collate
Tue Dec 17 19:04:10 2013 salloc: Job allocation 1251 has been revoked.
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 output bedd6ff56b3ae9f90d873b1fcb72f9a3+91
Tue Dec 17 19:04:11 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 finish
Tue Dec 17 19:04:12 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 log manifest is 1e77aaceee2df499e14dc5dde5c3d328+91
</code></pre>
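The log reports the job's output locator on a line of the form @<timestamp> <job uuid> <pid> output <locator>@, while per-task lines carry an extra task number before the word @output@. As a rough sketch that relies on this layout (an observation from the log above, not a documented format), a script could pick out the job-level output like this; @job_output@ is a hypothetical helper name.

```python
def job_output(log_text):
    """Return the locator from the last job-level "output" line, or None."""
    result = None
    for line in log_text.splitlines():
        tokens = line.split()
        # Five timestamp tokens, job uuid, pid, the word "output", the locator.
        if len(tokens) == 9 and tokens[7] == "output":
            result = tokens[8]
    return result


# Lines copied from the log above.
sample_log = """\
Tue Dec 17 19:02:20 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 0 output
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 1 output bedd6ff56b3ae9f90d873b1fcb72f9a3+91
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 collate
Tue Dec 17 19:04:10 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 output bedd6ff56b3ae9f90d873b1fcb72f9a3+91
Tue Dec 17 19:04:12 2013 qr1hi-8i9sb-n9k7qyp7bs5b9d4 4867 log manifest is 1e77aaceee2df499e14dc5dde5c3d328+91
"""

print(job_output(sample_log))
```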
Once the job completes, the output collection contains a single file, @hu34D5B9-exome-filtered.vcf@:
<notextile><pre><code>~$ <span class="userinput">arv keep ls bedd6ff56b3ae9f90d873b1fcb72f9a3+91</span>
hu34D5B9-exome-filtered.vcf
</code></pre></notextile>
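In a VCF file the seventh tab-separated column is FILTER: it holds @PASS@ for a record that passed every filter, or the names of the filters it failed. The sketch below tallies that column to summarize a filtered file. The sample records and the filter name @LowQual@ are made up for illustration and are not taken from the tutorial's data; @filter_counts@ is a hypothetical helper.

```python
from collections import Counter


def filter_counts(vcf_lines):
    """Count the values of the FILTER column across all data records."""
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines and header lines (which start with "#").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1  # FILTER is the 7th column
    return counts


# Made-up records for illustration.
sample_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t.\tA\tG\t50\tPASS\t.\n",
    "1\t2000\t.\tC\tT\t12\tLowQual\t.\n",
    "2\t3000\t.\tG\tA\t99\tPASS\t.\n",
]

for name, count in sorted(filter_counts(sample_vcf).items()):
    print(name, count)
```

To apply this to real output, pass an open file object in place of the list, since iterating over a text file yields its lines.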
fn1. "Download the GATK tools":http://www.broadinstitute.org/gatk/download

fn2. "Information about the GATK resource bundle":http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it and "direct download link":ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37/ (if prompted, submit an empty password)