---
layout: default
navsection: userguide
title: "Tutorial: GATK VariantFiltration"
navorder: 22
---

h1. Tutorial: GATK VariantFiltration

In this tutorial you will use the GATK VariantFiltration program to mark each variant in a VCF file as passing or failing a set of filters.

h3. Prerequisites

* Log in to a VM "using SSH":ssh-access.html
* Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
* Put the API host name in your @ARVADOS_API_HOST@ environment variable

If everything is set up correctly, the command @arv -h user current@ will display your account information.
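
Before calling @arv@, it can help to confirm that both variables are actually set. The following is a minimal sketch, not part of the Arvados tools: @check_arvados_env@ is a hypothetical helper name, and it only tests that the variables are non-empty, not that the token or host is valid.

```shell
# check_arvados_env is a hypothetical helper, not part of the arv CLI.
# It reports which of the required Arvados environment variables are unset or empty.
check_arvados_env() {
  missing=""
  for name in ARVADOS_API_TOKEN ARVADOS_API_HOST; do
    # Indirect lookup via eval keeps this portable to plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}
```

For example, @check_arvados_env && arv -h user current@ runs the account check only when both variables are present.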

h3. Get the GATK binary distribution

Download the GATK binary tarball[1] -- e.g., @GenomeAnalysisTK-2.6-4.tar.bz2@ -- and copy it to your Arvados VM.

Store it in Keep.

<pre>
arv keep put --in-manifest GenomeAnalysisTK-2.6-4.tar.bz2
</pre>

&darr;

<pre>
c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi
</pre>

h3. Get the GATK resource bundle

The bundle can take a while to download, but it should already be available in Arvados. For now, just list the files and their sizes to confirm that the collection ID is correct.

<pre>
arv keep ls -s d237a90bae3870b3b033aea1e99de4a9+10820+K@qr1hi
</pre>

&darr;

<pre>
  50342 1000G_omni2.5.b37.vcf.gz
      1 1000G_omni2.5.b37.vcf.gz.md5
    464 1000G_omni2.5.b37.vcf.idx.gz
      1 1000G_omni2.5.b37.vcf.idx.gz.md5
  43981 1000G_phase1.indels.b37.vcf.gz
...
</pre>
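
For a long listing, a quick summary can be easier to check than the full output. This sketch assumes the two-column "size, file name" layout shown above; @summarize_listing@ is a hypothetical helper name, and the total is in whatever unit @arv keep ls -s@ reports.

```shell
# summarize_listing is a hypothetical helper: it reads "SIZE NAME" lines on stdin
# and prints the number of files and the sum of the size column.
# Lines whose first field is not a plain integer (such as "...") are ignored.
summarize_listing() {
  awk '$1 ~ /^[0-9]+$/ { files++; total += $1 }
       END { printf "%d files, %d total\n", files, total }'
}
```

It could be used as @arv keep ls -s COLLECTION_ID | summarize_listing@.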

h3. Submit a job

The Arvados distribution includes an example crunch script ("crunch_scripts/GATK2-VariantFiltration":https://arvados.org/projects/arvados/repository/revisions/master/entry/crunch_scripts/GATK2-VariantFiltration) that runs the GATK VariantFiltration tool with some default settings.

We will pass it the following parameters:

* input -- a collection containing the source VCF data. Here we will use an exome report from PGP participant hu34D5B9.
* gatk_binary_tarball -- a collection containing the GATK 2 tarball.
* gatk_bundle -- a collection containing the GATK resource bundle[2].

<pre>
src_version=76588bfc57f33ea1b36b82ca7187f465b73b4ca4
vcf_input=5ee633fe2569d2a42dd81b07490d5d13+82+K@qr1hi
gatk_binary=c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi
gatk_bundle=d237a90bae3870b3b033aea1e99de4a9+10820+K@qr1hi

read -rd $'\000' the_job <<EOF
{
 "script":"GATK2-VariantFiltration",
 "script_version":"$src_version",
 "script_parameters":
 {
  "input":"$vcf_input",
  "gatk_binary_tarball":"$gatk_binary",
  "gatk_bundle":"$gatk_bundle"
 }
}
EOF

arv -h job create --job "$the_job"
</pre>

Note the job UUID in the API response. You will need it to monitor the job.
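
To capture the UUID in a script rather than copying it by hand, something like the following may work. It is only a sketch: it assumes the response is a JSON object containing a @"uuid"@ key, and @job_uuid_from_response@ is a hypothetical helper name. A real JSON parser is more robust if one is available.

```shell
# job_uuid_from_response is a hypothetical helper: it reads an API response on stdin
# and prints the value of the first "uuid" key it finds.
# Assumption: the response is JSON with a "uuid" key; keys such as "owner_uuid"
# are not matched because the pattern requires a quote directly before "uuid".
job_uuid_from_response() {
  sed -n 's/.*"uuid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

For example, @job_uuid=$(arv -h job create --job "$the_job" | job_uuid_from_response)@ would store the UUID for the monitoring steps.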

h3. Monitor job progress

There are three ways to monitor job progress:

# Go to Workbench, open the Compute menu, and click Jobs. The job you submitted should appear at the top of the list. Click "Refresh" until it finishes.
# Run @arv -h job get --uuid JOB_UUID_HERE@ to see the job details, notably the @tasks_summary@ attribute, which indicates how many tasks are done/running/todo.
# Watch the crunch log messages and stderr from the job tasks:

<pre>
curl -s -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \
  https://{{ site.arvados_api_host }}/arvados/v1/jobs/JOB_UUID_HERE/log_tail_follow
</pre>
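
If you follow logs for several jobs, the URL can be assembled from the environment instead of being typed each time. This is a sketch under the assumption that @ARVADOS_API_HOST@ holds the same host name used in the URL above; @job_log_url@ is a hypothetical helper name.

```shell
# job_log_url is a hypothetical helper: it prints the log_tail_follow URL for a job UUID,
# taking the host from ARVADOS_API_HOST (set in the prerequisites).
job_log_url() {
  if [ -z "${1:-}" ] || [ -z "${ARVADOS_API_HOST:-}" ]; then
    echo "usage: job_log_url JOB_UUID (requires ARVADOS_API_HOST)" >&2
    return 1
  fi
  printf 'https://%s/arvados/v1/jobs/%s/log_tail_follow\n' "$ARVADOS_API_HOST" "$1"
}
```

It could then be combined with the command above as @curl -s -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" "$(job_log_url JOB_UUID_HERE)"@.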

h3. Notes

fn1. Download the GATK tools &rarr; http://www.broadinstitute.org/gatk/download

fn2. Information about the GATK resource bundle &rarr; http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it