---
layout: default
navsection: userguide
title: "Tutorial: GATK VariantFiltration"
navorder: 22
---

h1. Tutorial: GATK VariantFiltration

Here you will use the GATK VariantFiltration program to assign pass/fail scores to variants in a VCF file.

h3. Prerequisites

* Log in to a VM "using SSH":ssh-access.html
* Put an "API token":api-tokens.html in your @ARVADOS_API_TOKEN@ environment variable
* Put the API host name in your @ARVADOS_API_HOST@ environment variable

If everything is set up correctly, the command @arv -h user current@ will display your account information.

h3. Get the GATK binary distribution.

Download the GATK binary tarball[1] -- e.g., @GenomeAnalysisTK-2.6-4.tar.bz2@ -- and copy it to your Arvados VM.

Store it in Keep.

<pre>
arv keep put --in-manifest GenomeAnalysisTK-2.6-4.tar.bz2
</pre>

↓

<pre>
c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi
</pre>

h3. Get the GATK resource bundle.

This can take a while to download, and should already be available in Arvados. For now let's just list the files and sizes, to make sure we have the correct collection ID.

<pre>
arv keep ls -s d237a90bae3870b3b033aea1e99de4a9+10820+K@qr1hi
</pre>

↓

<pre>
50342 1000G_omni2.5.b37.vcf.gz
1 1000G_omni2.5.b37.vcf.gz.md5
464 1000G_omni2.5.b37.vcf.idx.gz
1 1000G_omni2.5.b37.vcf.idx.gz.md5
43981 1000G_phase1.indels.b37.vcf.gz
...
</pre>

h3. Submit a job.

The Arvados distribution includes an example crunch script ("crunch_scripts/GATK2-VariantFiltration":https://arvados.org/projects/arvados/repository/revisions/master/entry/crunch_scripts/GATK2-VariantFiltration) that runs the GATK VariantFiltration tool with some default settings.

We will pass it the following parameters:

* input -- a collection containing the source VCF data. Here we will use an exome report from PGP participant hu34D5B9.
* gatk_binary_tarball -- a collection containing the GATK 2 tarball.
* gatk_bundle -- a collection containing the GATK resource bundle[2].

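Each of these parameters is a Keep collection locator, so a quick shape check before submitting can catch a truncated copy and paste. The following is a minimal sketch and not part of the Arvados tools. It assumes a locator looks like the ones used in this tutorial (a 32-character lowercase hex digest, a size, and optional hints, joined by @+@), and @is_keep_locator@ is a hypothetical helper name. It checks the format only, not whether the collection exists in Keep.

```shell
# Hypothetical helper (not an Arvados command): succeed if $1 has the shape
# <32 hex digits>+<size>[+<hint>...], as assumed from this tutorial's examples.
is_keep_locator() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{32}\+[0-9]+(\+[^+]+)*$'
}

# Locators used in this tutorial.
vcf_input='5ee633fe2569d2a42dd81b07490d5d13+82+K@qr1hi'
gatk_binary='c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi'
gatk_bundle='d237a90bae3870b3b033aea1e99de4a9+10820+K@qr1hi'

# Report each locator's status; malformed ones go to stderr.
for locator in "$vcf_input" "$gatk_binary" "$gatk_bundle"; do
  if is_keep_locator "$locator"; then
    echo "ok: $locator"
  else
    echo "unexpected format: $locator" >&2
  fi
done
```

A locator that fails this check is worth re-copying from the @arv keep put@ or @arv keep ls@ step before the job is created.
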
<pre>
# Version (git commit) of the crunch script, and the three collections described above
src_version=76588bfc57f33ea1b36b82ca7187f465b73b4ca4
vcf_input=5ee633fe2569d2a42dd81b07490d5d13+82+K@qr1hi
gatk_binary=c905c8d8443a9c44274d98b7c6cfaa32+94+K@qr1hi
gatk_bundle=d237a90bae3870b3b033aea1e99de4a9+10820+K@qr1hi

# Read the whole here-document into $the_job.
# (read exits non-zero when it reaches end of input; that is expected here.)
read -rd $'\000' the_job <<EOF
{
 "script":"GATK2-VariantFiltration",
 "script_version":"$src_version",
 "script_parameters":
 {
  "input":"$vcf_input",
  "gatk_binary_tarball":"$gatk_binary",
  "gatk_bundle":"$gatk_bundle"
 }
}
EOF

# Submit the job
arv -h job create --job "$the_job"
</pre>

Note the job UUID in the API response. You will need it to monitor the job.

h3. Monitor job progress

There are three ways to monitor job progress:

# Go to Workbench, drop down the Compute menu, and click Jobs. The job you submitted should appear at the top of the list. Hit "Refresh" until it finishes.
# Run @arv -h job get --uuid JOB_UUID_HERE@ to see the job particulars, notably the "tasks_summary" attribute, which indicates how many tasks are done/running/todo.
# Watch the crunch log messages and stderr from the job tasks:

<pre>
curl -s -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \
  https://{{ site.arvados_api_host }}/arvados/v1/jobs/JOB_UUID_HERE/log_tail_follow
</pre>

h3. Notes

fn1. Download the GATK tools → http://www.broadinstitute.org/gatk/download

fn2. Information about the GATK resource bundle → http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it