---
layout: default
navsection: userguide
title: "Using GATK with Arvados"
...
{% comment %}
Copyright (C) The Arvados Authors. All rights reserved.

SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}

This tutorial demonstrates how to use the Genome Analysis Toolkit (GATK) with Arvados. In this example, we will install GATK and then create a VariantFiltration job to assign pass/fail scores to variants in a VCF file.

{% include 'tutorial_expectations' %}

h2. Installing GATK

Download the GATK binary tarball[1] (for example, @GenomeAnalysisTK-2.6-4.tar.bz2@) and "copy it to your Arvados VM":{{site.baseurl}}/user/tutorials/tutorial-keep.html. Then store it in Keep:

<notextile>
<pre><code>~$ <span class="userinput">arv keep put GenomeAnalysisTK-2.6-4.tar.bz2</span>
c905c8d8443a9c44274d98b7c6cfaa32+94
</code></pre>
</notextile>
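
The locator printed by @arv keep put@ is the value that the job submission later uses as @gatk_binary@, so it can be convenient to capture it in a shell variable and check that it looks plausible. The following is a minimal sketch, not part of the Arvados tooling: the helper name @is_locator@ is hypothetical, and the pattern it checks (32 hexadecimal digits, a plus sign, and a number) is only inferred from the locators shown in this tutorial.

```shell
# Hypothetical helper: succeed if the argument looks like the locators
# shown in this tutorial (32 hex digits, "+", then a decimal number).
# The pattern is inferred from the examples here, not from a specification.
is_locator() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{32}\+[0-9]+$'
}

# Example use (requires the arv command, so it is left commented out):
#   gatk_binary=$(arv keep put GenomeAnalysisTK-2.6-4.tar.bz2)
#   is_locator "$gatk_binary" || echo "unexpected output: $gatk_binary" >&2
```

A check like this only catches obvious mistakes, such as pasting a file name where a locator is expected. It does not confirm that the collection exists in Keep.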

Next, you need the GATK Resource Bundle[2]. This may already be available in Arvados. If not, you will need to download the files listed below and put them into Keep.

<notextile>
<pre><code>~$ <span class="userinput">arv keep ls -s d237a90bae3870b3b033aea1e99de4a9+10820</span>
  50342 1000G_omni2.5.b37.vcf.gz
      1 1000G_omni2.5.b37.vcf.gz.md5
    464 1000G_omni2.5.b37.vcf.idx.gz
      1 1000G_omni2.5.b37.vcf.idx.gz.md5
  43981 1000G_phase1.indels.b37.vcf.gz
      1 1000G_phase1.indels.b37.vcf.gz.md5
    326 1000G_phase1.indels.b37.vcf.idx.gz
      1 1000G_phase1.indels.b37.vcf.idx.gz.md5
 537210 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz
      1 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.gz.md5
   3473 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.idx.gz
      1 CEUTrio.HiSeq.WGS.b37.bestPractices.phased.b37.vcf.idx.gz.md5
  19403 Mills_and_1000G_gold_standard.indels.b37.vcf.gz
      1 Mills_and_1000G_gold_standard.indels.b37.vcf.gz.md5
    536 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
      1 Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz.md5
  29291 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.gz.md5
    565 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.idx.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.sites.vcf.idx.gz.md5
  37930 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.gz.md5
    592 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.idx.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf.idx.gz.md5
5898484 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam
    112 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam.bai.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam.bai.gz.md5
      1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.bam.md5
   3837 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.gz.md5
     65 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.idx.gz
      1 NA12878.HiSeq.WGS.bwa.cleaned.recal.b37.20.vcf.idx.gz.md5
 275757 dbsnp_137.b37.excluding_sites_after_129.vcf.gz
      1 dbsnp_137.b37.excluding_sites_after_129.vcf.gz.md5
   3735 dbsnp_137.b37.excluding_sites_after_129.vcf.idx.gz
      1 dbsnp_137.b37.excluding_sites_after_129.vcf.idx.gz.md5
 998153 dbsnp_137.b37.vcf.gz
      1 dbsnp_137.b37.vcf.gz.md5
   3890 dbsnp_137.b37.vcf.idx.gz
      1 dbsnp_137.b37.vcf.idx.gz.md5
  58418 hapmap_3.3.b37.vcf.gz
      1 hapmap_3.3.b37.vcf.gz.md5
    999 hapmap_3.3.b37.vcf.idx.gz
      1 hapmap_3.3.b37.vcf.idx.gz.md5
      3 human_g1k_v37.dict.gz
      1 human_g1k_v37.dict.gz.md5
      2 human_g1k_v37.fasta.fai.gz
      1 human_g1k_v37.fasta.fai.gz.md5
 849537 human_g1k_v37.fasta.gz
      1 human_g1k_v37.fasta.gz.md5
      1 human_g1k_v37.stats.gz
      1 human_g1k_v37.stats.gz.md5
      3 human_g1k_v37_decoy.dict.gz
      1 human_g1k_v37_decoy.dict.gz.md5
      2 human_g1k_v37_decoy.fasta.fai.gz
      1 human_g1k_v37_decoy.fasta.fai.gz.md5
 858592 human_g1k_v37_decoy.fasta.gz
      1 human_g1k_v37_decoy.fasta.gz.md5
      1 human_g1k_v37_decoy.stats.gz
      1 human_g1k_v37_decoy.stats.gz.md5
</code></pre>
</notextile>

h2. Submit a GATK job

The Arvados distribution includes an example crunch script ("crunch_scripts/GATK2-VariantFiltration":https://dev.arvados.org/projects/arvados/repository/revisions/master/entry/crunch_scripts/GATK2-VariantFiltration) that runs the GATK VariantFiltration tool with some default settings. The commands below store the job's parameters in shell variables and write a job description to the file @the_job@:

<notextile>
<pre><code>~$ <span class="userinput">src_version=76588bfc57f33ea1b36b82ca7187f465b73b4ca4</span>
~$ <span class="userinput">vcf_input=5ee633fe2569d2a42dd81b07490d5d13+82</span>
~$ <span class="userinput">gatk_binary=c905c8d8443a9c44274d98b7c6cfaa32+94</span>
~$ <span class="userinput">gatk_bundle=d237a90bae3870b3b033aea1e99de4a9+10820</span>
~$ <span class="userinput">cat &gt;the_job &lt;&lt;EOF
{
 "script":"GATK2-VariantFiltration",
 "repository":"arvados",
 "script_version":"$src_version",
 "script_parameters":
 {
  "input":"$vcf_input",
  "gatk_binary_tarball":"$gatk_binary",
  "gatk_bundle":"$gatk_bundle"
 }
}
EOF</span>
</code></pre>
</notextile>

* @"input"@ is a Keep collection containing the source VCF data. Here we are using an exome report from PGP participant hu34D5B9.
* @"gatk_binary_tarball"@ is a Keep collection containing the GATK 2 binary distribution[1] tar file.
* @"gatk_bundle"@ is a Keep collection containing the GATK resource bundle[2].
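
Because the here-document expands shell variables, an unset variable silently becomes an empty string in the job description. As a minimal sketch of one way to guard against that, the same JSON can be produced by a small shell function that refuses to run when a parameter is missing or empty. The function name @make_gatk_job@ and its argument order are hypothetical and are not part of Arvados.

```shell
# Hypothetical helper: print a job description like the one written to
# the_job above. Arguments, in order: script version, input VCF collection,
# GATK binary tarball collection, GATK resource bundle collection.
make_gatk_job() {
  # Refuse to build a job description with a missing parameter.
  if [ "$#" -ne 4 ]; then
    echo "usage: make_gatk_job VERSION INPUT GATK_BINARY GATK_BUNDLE" >&2
    return 1
  fi
  # Refuse empty parameters, e.g. from an unset shell variable.
  for arg in "$@"; do
    [ -n "$arg" ] || { echo "make_gatk_job: empty parameter" >&2; return 1; }
  done
  cat <<EOF
{
 "script":"GATK2-VariantFiltration",
 "repository":"arvados",
 "script_version":"$1",
 "script_parameters":
 {
  "input":"$2",
  "gatk_binary_tarball":"$3",
  "gatk_bundle":"$4"
 }
}
EOF
}
```

With the variables defined earlier, it could be used as @make_gatk_job "$src_version" "$vcf_input" "$gatk_binary" "$gatk_bundle" >the_job@.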

Now start a job:

<notextile>
<pre><code>~$ <span class="userinput">arv job create --job "$(cat the_job)"</span>
{
 "href":"https://qr1hi.arvadosapi.com/arvados/v1/jobs/qr1hi-8i9sb-n9k7qyp7bs5b9d4",
 "kind":"arvados#job",
 "etag":"9j99n1feoxw3az448f8ises12",
 "uuid":"qr1hi-8i9sb-n9k7qyp7bs5b9d4",
 "owner_uuid":"qr1hi-tpzed-9zdpkpni2yddge6",
 "created_at":"2013-12-17T19:02:15Z",
 "modified_by_client_uuid":"qr1hi-ozdt8-obw7foaks3qjyej",
 "modified_by_user_uuid":"qr1hi-tpzed-9zdpkpni2yddge6",
 "modified_at":"2013-12-17T19:02:15Z",
 "updated_at":"2013-12-17T19:02:15Z",
 "submit_id":null,
 "priority":null,
 "script":"GATK2-VariantFiltration",
 "script_parameters":{
  "input":"5ee633fe2569d2a42dd81b07490d5d13+82",
  "gatk_binary_tarball":"c905c8d8443a9c44274d98b7c6cfaa32+94",
  "gatk_bundle":"d237a90bae3870b3b033aea1e99de4a9+10820"
 },
 "script_version":"76588bfc57f33ea1b36b82ca7187f465b73b4ca4",
 "cancelled_at":null,
 "cancelled_by_client_uuid":null,
 "cancelled_by_user_uuid":null,
 "started_at":null,
 "finished_at":null,
 "output":null,
 "success":null,
 "running":null,
 "is_locked_by_uuid":null,
 "log":null,
 "runtime_constraints":{},
 "tasks_summary":{}
}
</code></pre>
</notextile>
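
The @"uuid"@ field in this response identifies the new job. As a rough sketch, it can be pulled out of the response with @sed@. The helper name @job_uuid@ is hypothetical, and the approach assumes the one-field-per-line layout shown above; a real JSON parser would be more robust.

```shell
# Hypothetical helper: read a job record on standard input and print the
# value of its "uuid" field. The anchored pattern skips fields such as
# "owner_uuid", which start with a different name. This assumes one field
# per line, as in the response shown above.
job_uuid() {
  sed -n 's/^ *"uuid":"\([^"]*\)".*/\1/p'
}

# Example use (requires the arv command, so it is left commented out):
#   job=$(arv job create --job "$(cat the_job)" | job_uuid)
```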

Once the job completes, the output can be found in @hu34D5B9-exome-filtered.vcf@ in the job's output collection:

<notextile><pre><code>~$ <span class="userinput">arv keep ls bedd6ff56b3ae9f90d873b1fcb72f9a3+91</span>
hu34D5B9-exome-filtered.vcf
</code></pre>
</notextile>

h2. Notes

fn1. "Download the GATK tools":http://www.broadinstitute.org/gatk/download

fn2. "Information about the GATK resource bundle":http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it and "direct download link":ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37/ (if prompted, submit an empty password)