---
layout: default
navsection: userguide
title: "Tutorial: GATK VariantFiltration"
navorder: 22
---

h1. Tutorial: GATK VariantFiltration

Here you will use the GATK VariantFiltration program to assign pass/fail scores to variants in a VCF file.

h3. Prerequisites

* Log in to a VM "using SSH":ssh-access.html
* Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
* Put the API host name in your @ARVADOS_API_HOST@ environment variable

If everything is set up correctly, the command @arv -h user current@ will display your account information.

h3. Get the GATK binary distribution.

Download the GATK binary tarball[1] -- e.g., @GenomeAnalysisTK-2.6-4.tar.bz2@ -- and copy it to your Arvados VM. Store it in Keep.

<pre>
arv keep put --in-manifest GenomeAnalysisTK-2.6-4.tar.bz2
c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi
</pre>
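
The string printed by @arv keep put@ is the locator that later steps refer to. As a minimal sketch, assuming the usual Keep layout of a 32-digit hex digest, a size in bytes, and optional hints (here @K@qr1hi@) joined by @+@, the parts can be pulled apart like this. @parse_locator@ is a hypothetical helper written for this tutorial, not part of the @arv@ command line tool.

```shell
#!/usr/bin/env bash
# parse_locator is a hypothetical helper, not an arv command.
# It splits "<digest>+<size>[+<hint>...]" on "+" and validates the first two fields.
parse_locator() {
  local locator=$1
  local digest size hints
  # With three variables, read puts everything after the second "+" into hints.
  IFS='+' read -r digest size hints <<< "$locator"
  # Assumed layout: 32 lowercase hex digits, then a decimal size.
  if [[ ! $digest =~ ^[0-9a-f]{32}$ || ! $size =~ ^[0-9]+$ ]]; then
    echo "not a valid locator: $locator" >&2
    return 1
  fi
  printf 'digest=%s size=%s hints=%s\n' "$digest" "$size" "$hints"
}

parse_locator 'c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi'
```

A check like this can catch a locator that was truncated when copied between terminals before it is passed to a job.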

h3. Get the GATK resource bundle.

This can take a while to download, and should already be available in Arvados. For now, just list the files and sizes to make sure you have the correct collection ID.

<pre>
arv keep ls -s d237a90bae3870b3b033aea1e99de4a9+10820+K@qr1hi
  50342 1000G_omni2.5.b37.vcf.gz
      1 1000G_omni2.5.b37.vcf.gz.md5
    464 1000G_omni2.5.b37.vcf.idx.gz
      1 1000G_omni2.5.b37.vcf.idx.gz.md5
  43981 1000G_phase1.indels.b37.vcf.gz
...
</pre>
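
One way to confirm the collection ID without reading the listing by eye is to test for a file you expect to find. The following is a sketch only: @has_file@ is a hypothetical helper, and it assumes each line of the listing is a size followed by a file name, as in the output above. The sample lines are copied from that output so the example runs without access to Keep.

```shell
#!/usr/bin/env bash
# has_file is a hypothetical helper, not an arv command.
# It reads "size name" lines on stdin and succeeds if NAME is among them.
has_file() {
  local wanted=$1 size name
  while read -r size name; do
    if [[ $name == "$wanted" ]]; then
      return 0
    fi
  done
  return 1
}

# Sample lines copied from the listing above; in practice, pipe in the
# output of the "arv keep ls -s" command instead.
listing='  50342 1000G_omni2.5.b37.vcf.gz
      1 1000G_omni2.5.b37.vcf.gz.md5
    464 1000G_omni2.5.b37.vcf.idx.gz'

if has_file 1000G_omni2.5.b37.vcf.gz <<< "$listing"; then
  echo "found"
else
  echo "missing"
fi
```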

h3. Submit a job.

The Arvados distribution includes an example crunch script ("crunch_scripts/GATK2-VariantFiltration":https://arvados.org/projects/arvados/repository/revisions/master/entry/crunch_scripts/GATK2-VariantFiltration) that runs the GATK VariantFiltration tool with some default settings. We will pass it the following parameters:

* input -- a collection containing the source VCF data. Here we will use an exome report from PGP participant hu34D5B9.
* gatk_binary_tarball -- a collection containing the GATK 2 tarball.
* gatk_bundle -- a collection containing the GATK resource bundle[2].

<pre>
src_version=76588bfc57f33ea1b36b82ca7187f465b73b4ca4
vcf_input=5ee633fe2569d2a42dd81b07490d5d13+82+K@qr1hi
gatk_binary=c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi
gatk_bundle=d237a90bae3870b3b033aea1e99de4a9+10820+K@qr1hi

read -rd $'\000' the_job <
</pre>

Note the job UUID in the API response.
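
If you want to keep the UUID in a shell variable rather than copy it by hand, something like the following may help. It is a rough sketch: @extract_uuid@ is a hypothetical helper, and the response text is a made-up fragment with placeholder values, on the assumption that the API response is JSON with a @uuid@ key. Where @jq@ is installed, @jq -r .uuid@ is a more robust choice than @sed@.

```shell
#!/usr/bin/env bash
# extract_uuid is a hypothetical helper, not an arv command.
# It prints the value of the first key named exactly "uuid" in JSON text on stdin.
extract_uuid() {
  sed -n 's/.*"uuid"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Made-up response fragment for illustration only; a real response has more
# fields and real identifiers in place of these placeholders.
response='{
 "owner_uuid":"OWNER_UUID_HERE",
 "uuid":"JOB_UUID_HERE"
}'

job_uuid=$(extract_uuid <<< "$response")
echo "$job_uuid"
```

The pattern requires a double quote directly before @uuid@, so keys that merely end in @_uuid@ are skipped.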

h3. Monitor job progress

There are three ways to monitor job progress:

# Go to Workbench, drop down the Compute menu, and click Jobs. The job you submitted should appear at the top of the list. Hit "Refresh" until it finishes.
# Run @arv -h job get --uuid JOB_UUID_HERE@ to see the job particulars, notably the "tasks_summary" attribute which indicates how many tasks are done/running/todo.
# Watch the crunch log messages and stderr from the job tasks:

<pre>
curl -s -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \
  https://{{ site.arvados_api_host }}/arvados/v1/jobs/JOB_UUID_HERE/log_tail_follow
</pre>
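
The log request above can be wrapped in a small function so that the host and token come from the environment variables set in the prerequisites. This is a sketch under that assumption: @job_log_url@ and @follow_job_log@ are hypothetical helpers, and @api.example.invalid@ is a placeholder host used only to show the URL shape.

```shell
#!/usr/bin/env bash
# job_log_url and follow_job_log are hypothetical helpers, not arv commands.

# Build the log_tail_follow URL from an API host name and a job UUID.
job_log_url() {
  local host=$1 uuid=$2
  printf 'https://%s/arvados/v1/jobs/%s/log_tail_follow\n' "$host" "$uuid"
}

# Stream the log for one job, returning early if the prerequisites are missing.
follow_job_log() {
  local uuid=$1
  if [[ -z ${ARVADOS_API_HOST:-} || -z ${ARVADOS_API_TOKEN:-} ]]; then
    echo "ARVADOS_API_HOST and ARVADOS_API_TOKEN must be set" >&2
    return 1
  fi
  curl -s -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \
    "$(job_log_url "$ARVADOS_API_HOST" "$uuid")"
}

# Print the URL for a placeholder host without contacting any server.
job_log_url api.example.invalid JOB_UUID_HERE
```

With both variables exported, @follow_job_log JOB_UUID_HERE@ issues the same request as the @curl@ command above.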

h3. Notes

fn1. Download the GATK tools → http://www.broadinstitute.org/gatk/download

fn2. Information about the GATK resource bundle → http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it