---
layout: default
navsection: userguide
title: "Processing Whole Genome Sequences"
...
{% comment %}
Copyright (C) The Arvados Authors. All rights reserved.

SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}
$ git clone https://github.com/arvados/arvados-tutorial.git
$ cd arvados-tutorial/WGS-processing
Recall that CWL is a way to describe command line tools and connect them together to create workflows. YML files can be used to specify input values for these individual command line tools or for overarching workflows.
The tutorial directories are as follows:
* @cwl@ - contains CWL descriptions of workflows and command line tools for the tutorial
* @yml@ - contains YML files specifying inputs for the main workflow, or for testing subworkflows and individual command line tools
* @src@ - contains any source code necessary for the tutorial
* @docker@ - contains dockerfiles necessary to re-create any needed docker images used in the tutorial
Before we run the WGS processing workflow, we want to adjust the inputs to match those in your new project. The workflow that we want to submit is described by the file @/cwl/wgs-processing-wf.cwl@ and the inputs are given by the file @/yml/wgs-processing-wf.yml@. Note: while all the CWL files are needed to describe the full workflow, only the single YML file with the workflow inputs is needed to run the workflow. The additional YML files (in the helper folder) are provided for testing purposes, or in case you want to run an underlying subworkflow or an individual command line tool by itself.
Several of the inputs in the YML file point to the original content addresses of the collections that you made copies of in your new project. These still work because, even though we copied the collections into your new project, we haven't changed their underlying contents. In general, though, editing this YML file is how you would alter the inputs for a given workflow.
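For illustration, an input entry in the YML file that points to a collection in Keep might look like the following sketch. The input name @fastqdir@ is a hypothetical example rather than a value from this tutorial's files; @arvados-cwl-runner@ accepts @keep:@ locations for File and Directory inputs, and the content address shown here is the FASTQ collection that appears later in this tutorial's logs.

fastqdir:
  class: Directory
  location: keep:a146a06222f9a66b7d141e078fc67660+376237

Changing such a @location@ to the content address or UUID of a different collection is how you would point the workflow at different input data.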
The command to submit to the Arvados Playground Cluster is @arvados-cwl-runner@.
To submit the WGS processing workflow, you need to run the following command, replacing YOUR_PROJECT_UUID with the UUID of the new project you created for this tutorial.
$ arvados-cwl-runner --no-wait --project-uuid YOUR_PROJECT_UUID ./cwl/wgs-processing-wf.cwl ./yml/wgs-processing-wf.yml
The @--no-wait@ option will submit the workflow to Arvados, print out the UUID of the submitted workflow process to standard output, and exit immediately instead of waiting until the job is finished to return the command prompt.
The @--project-uuid@ option specifies the project you want the workflow to run in; the workflow process, along with its output and log collections, will be saved in that project.
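Because @--no-wait@ prints the UUID of the submitted workflow process to standard output (log messages go to standard error), you can capture that UUID in a shell variable for later use. This is an optional convenience, sketched below:

$ CR_UUID=$(arvados-cwl-runner --no-wait --project-uuid YOUR_PROJECT_UUID ./cwl/wgs-processing-wf.cwl ./yml/wgs-processing-wf.yml)
$ echo $CR_UUID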
If the workflow submitted successfully, you should see the following at the end of the output to your screen:
INFO Final process status is success
Now, you are ready to check the state of your submitted workflow.
h2. 5. Checking the State Of a Submitted Workflow
Once you have submitted your workflow, you can examine its state interactively using the Arvados Workbench. If you aren't already viewing your workflow process on the Workbench, there are several ways to get to your submitted workflow. Here are two of the simplest:
* Via the Dashboard: It should be listed at the top of the list of “Recent Processes”. Just click on the name of your submitted workflow and it will take you to the submitted workflow information.
* Via Your Project: You will want to go back to your new project, using the Projects pulldown menu or searching for the project name. Note: You can mark a Project as a favorite (if/when you have multiple Projects) to make it easier to find on the pulldown menu using the star next to the project name on the project page.
The process you are looking for will be titled “WGS processing workflow scattered over samples” (if you submitted via the command line) or NAME OF REGISTERED WORKFLOW container (if you submitted via the Registered Workflow).
Once you have found your workflow, you can see the state of the overall workflow and of each underlying step, listed below it by label.
Common states you will see are as follows:
* Queued - Workflow or step is waiting to run
* Running or Active - Workflow is currently running
* Complete - Workflow or step has successfully completed
* Failed - Workflow or step has not successfully completed
* Failing - Workflow is running but has steps that have failed
* Cancelled - Workflow or step has been cancelled, either manually or by Arvados due to an issue
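You can also check the state of a submitted workflow from the command line. The following is a sketch, assuming the Arvados command line tools are installed and configured, and that YOUR_CR_UUID is the container request UUID printed at submission time:

# Look up the submitted workflow (container request); the returned JSON
# includes a "container_uuid" field identifying the running container
$ arv container_request get --uuid YOUR_CR_UUID
# The container record's "state" field shows Queued, Running, Complete, etc.
$ arv container get --uuid YOUR_CONTAINER_UUID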
Because Arvados Crunch reuses steps and workflows when possible, this workflow should run relatively quickly: it has been run before, and you have access to those previously run steps. You may notice an initial period where the top level job shows the option of canceling while the other steps are filled in with already finished steps.
h2. 6. Examining a Finished Workflow
Once your workflow has finished, you can see how long the workflow took to run, see scaling information, and examine the logs and outputs. Outputs are only available for steps that completed successfully. Outputs are saved for every step in the workflow, as well as for the workflow itself, and are stored in collections. You can access each collection by clicking on the link corresponding to the output.
!{width: 100%}{{ site.baseurl }}/images/wgs-tutorial/image5.png!
_*Figure 6*: Screenshot of a completed workflow process in Arvados as viewed via the Arvados Workbench. You can click on the outputs link (highlighted in yellow) to view the outputs. Outputs of a workflow are stored in a collection._
If we click on the outputs of the workflow, we will see the output collection.
Contained in this collection are the GVCF, the tabix index file, and the HTML ClinVar report for each analyzed sample (i.e. each set of FASTQs). By clicking on the download button to the right of a file, you can download it to your local machine. You can also use the command line to download single files or whole collections to your machine. You can examine the outputs of a step similarly by using the arrow to expand its panel to see more details.
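For example, the @arv-get@ tool from the Arvados client packages can download files from a collection by its UUID or content address. The following is a sketch; the collection identifier and filename are placeholders, not values from this tutorial:

# Download a single file from a collection to the current directory
$ arv-get su92l-4zz18-xxxxxxxxxxxxxxx/sample1.gvcf.gz .
# Download the contents of an entire collection into a local directory
$ arv-get su92l-4zz18-xxxxxxxxxxxxxxx/ ./outputs/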
Logs for the main process can be found in the Log tab. There are several logs available, so here is a basic summary of what some of the more commonly used logs contain. Let's first define a few terms that will help us understand what the logs are tracking.
As you may recall, Arvados Crunch manages the running of workflows. A _container request_ is an order sent to Arvados Crunch to perform some computational work. Crunch fulfils a request either by choosing a worker node to execute a container, or by finding an identical or equivalent container that has already run. The terms _container request_ and _container_ distinguish between a work order as submitted and the work as actually running or already run. So our container request in this case is just the submitted workflow we sent to the Arvados cluster.
A _node_ is a compute resource where Arvados can schedule work. In our case, since the Arvados Playground runs in the cloud, our nodes are virtual machines. @arvados-cwl-runner@ (acr) executes CWL workflows by submitting the individual parts to Arvados as containers, and @crunch-run@ is an internal component that runs on nodes and executes containers.
* @stderr.txt@
** Captures everything written to standard error by the programs run by the executing container
* @node-info.txt@ and @node.json@
** Contains information about the nodes that executed this container. For the Arvados Playground, this gives information about the virtual machine instance that ran the container.
** @node.json@ gives a high level overview of the instance, such as name, price, and RAM, while @node-info.txt@ gives more detailed information about the virtual machine (e.g. the CPU of each processor)
* @crunch-run.txt@ and @crunchstat.txt@
** @crunch-run.txt@ has info about how the container's execution environment was set up (e.g., time spent loading the docker image) and timing/results of copying output data to Keep (if applicable)
** @crunchstat.txt@ has info about resource consumption (RAM, cpu, disk, network) by the container while it was running.
* @container.json@
** Describes the container (unit of work to be done), contains CWL code, runtime constraints (RAM, vcpus) amongst other details
* @arv-mount.txt@
** Contains information about the Arvados Keep mount on the node executing the container
* @hoststat.txt@
** Contains information about resource consumption (RAM, cpu, disk, network) on the node while the container was running
** This differs from @crunchstat.txt@ because it includes resource consumption of Arvados components that run on the node outside the container, such as @crunch-run@ and other processes related to the Keep file system
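If you prefer to inspect these logs from the command line, the @arv-ls@ tool from the Arvados client packages lists the files in a collection, and @arv-get@ (shown earlier) can download them. A sketch, with a placeholder log collection UUID:

# List the files in a log collection (UUID is a placeholder)
$ arv-ls su92l-4zz18-xxxxxxxxxxxxxxx
# Download just the stderr log for inspection
$ arv-get su92l-4zz18-xxxxxxxxxxxxxxx/stderr.txt .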
The highest level logs track the container that ran the @arvados-cwl-runner@ process, which you can think of as the “mastermind” behind the workflow: it tracks which parts of the CWL workflow need to run when, which have already run, what order they need to run in, and which can run simultaneously, and then sends out the corresponding container requests. Each step then has its own logs related to the container running that CWL step of the workflow, including a log of standard error that contains the standard error of the code run in that step. Those logs can be found by expanding the steps and clicking on the link to the log collection.
Let’s take a peek at a few of these logs to get you more familiar with them. First, we can look at the @stderr.txt@ of the highest level process. Again, recall this comes from the “mastermind” @arvados-cwl-runner@ process. You can click on the log to download it to your local machine, and when you look at the contents you should see something like the following:
2020-06-22T20:30:04.737703197Z INFO /usr/bin/arvados-cwl-runner 2.0.3, arvados-python-client 2.0.3, cwltool 1.0.20190831161204
2020-06-22T20:30:04.743250012Z INFO Resolved '/var/lib/cwl/workflow.json#main' to 'file:///var/lib/cwl/workflow.json#main'
2020-06-22T20:30:20.749884298Z INFO Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
[removing some log contents here for brevity]
2020-06-22T20:30:35.629783939Z INFO Running inside container su92l-dz642-uaqhoebfh91zsfd
2020-06-22T20:30:35.741778080Z INFO [workflow WGS processing workflow] start
2020-06-22T20:30:35.741778080Z INFO [workflow WGS processing workflow] starting step getfastq
2020-06-22T20:30:35.741778080Z INFO [step getfastq] start
2020-06-22T20:30:36.085839313Z INFO [step getfastq] completed success
2020-06-22T20:30:36.212789670Z INFO [workflow WGS processing workflow] starting step bwamem-gatk-report
2020-06-22T20:30:36.213545871Z INFO [step bwamem-gatk-report] start
2020-06-22T20:30:36.234224197Z INFO [workflow bwamem-gatk-report] start
2020-06-22T20:30:36.234892498Z INFO [workflow bwamem-gatk-report] starting step fastqc
2020-06-22T20:30:36.235154798Z INFO [step fastqc] start
2020-06-22T20:30:36.237328201Z INFO Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
Here you can see the output of all the work that @arvados-cwl-runner@ does to manage the execution of the CWL workflow and all the underlying steps and subworkflows.
Now, let’s explore the logs for a step in the workflow. Remember that those logs can be found by expanding the steps and clicking on the link to the log collection. Let’s look at the log for the step that does the alignment. That step is named bwamem-samtools-view. We can see there are 10 of them because we are aligning 10 genomes. Let’s look at bwamem-samtools-view2.
We click the arrow to open up the step, and then click on the log collection to access the logs. You may notice there are two sets of seemingly identical logs: one listed under a directory named for a container, and one up in the main directory. This is done in case your step had to be automatically re-run due to any issues, and gives the logs of each re-run; the logs in the main directory are the logs for the successful run. In most cases a re-run does not happen, so you will see just one directory whose logs match the logs in the main directory. Let’s open the logs labeled @node-info.txt@ and @stderr.txt@.
@node-info.txt@ gives us detailed information about the virtual machine this step was run on. The tail end of the log should look like the following:
Memory Information
MemTotal: 64465820 kB
MemFree: 61617620 kB
MemAvailable: 62590172 kB
Buffers: 15872 kB
Cached: 1493300 kB
SwapCached: 0 kB
Active: 1070868 kB
Inactive: 1314248 kB
Active(anon): 873716 kB
Inactive(anon): 8444 kB
Active(file): 197152 kB
Inactive(file): 1305804 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 952 kB
Writeback: 0 kB
AnonPages: 874968 kB
Mapped: 115352 kB
Shmem: 8604 kB
Slab: 251844 kB
SReclaimable: 106580 kB
SUnreclaim: 145264 kB
KernelStack: 5584 kB
PageTables: 3832 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 32232908 kB
Committed_AS: 2076668 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
Percpu: 5120 kB
AnonHugePages: 743424 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 155620 kB
DirectMap2M: 6703104 kB
DirectMap1G: 58720256 kB
Disk Space
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/nvme1n1p1 7874 1678 5778 23% /
/dev/mapper/tmp 381746 1496 380251 1% /tmp
Disk INodes
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme1n1p1 516096 42253 473843 9% /
/dev/mapper/tmp 195549184 44418 195504766 1% /tmp
We can see all the details of the virtual machine used for this step, including that it has 16 cores and 64 GiB of RAM.
@stderr.txt@ gives us everything written to standard error by the programs run in this step. This step ran successfully, so we don’t currently need this log for debugging; we are just taking a look for practice.
The tail end of our log should be similar to the following:
2020-08-04T04:37:19.674225566Z [main] CMD: /bwa-0.7.17/bwa mem -M -t 16 -R @RG\tID:sample\tSM:sample\tLB:sample\tPL:ILLUMINA\tPU:sample1 -c 250 /keep/18657d75efb4afd31a14bb204d073239+13611/GRCh38_no_alt_plus_hs38d1_analysis_set.fna /keep/a146a06222f9a66b7d141e078fc67660+376237/ERR2122554_1.fastq.gz /keep/a146a06222f9a66b7d141e078fc67660+376237/ERR2122554_2.fastq.gz
2020-08-04T04:37:19.674225566Z [main] Real time: 35859.344 sec; CPU: 553120.701 sec
These lines show the command used to invoke bwa-mem, and the scaling achieved by running bwa-mem multi-threaded across 16 cores: dividing the CPU time by the real time (553120.701 sec / 35859.344 sec) gives a speedup of roughly 15.4x.
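If you want to verify that scaling figure yourself, a quick shell one-liner reproduces the arithmetic from the log lines above:

# CPU seconds divided by wall-clock seconds gives the effective speedup
$ echo "scale=1; 553120.701 / 35859.344" | bc
15.4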
We hope that, now that you have a bit more familiarity with the logs, you can continue to use them to debug and optimize your own workflows as you move forward with Arvados in your own work.
h2. 7. Conclusion
Thank you for working through this walkthrough tutorial. We hope it has helped you get a feel for working with Arvados. This tutorial covered just the basic capabilities of Arvados; there are many more to explore. Please see the links featured at the end of Section 1 for ways to learn more about Arvados or to get help while you are working with it.
If you would like help setting up your own production instance of Arvados, please contact us at "info@curii.com":mailto:info@curii.com.