exercises: 0
questions:
- "What is CWL?"
+- "What are the requirements for this training?"
- "What is the goal of this training?"
objectives:
-- "First learning objective. (FIXME)"
+- "Understand how the training will be motivated by an example analysis."
keypoints:
-- "First key point. Brief Answer to questions. (FIXME)"
+- "Common Workflow Language is a standard for describing data analysis workflows."
+- "This training assumes some basic familiarity with editing text files, the Unix command line, and Unix shell scripts."
+- "We will use a bioinformatics RNA-seq analysis as an example workflow, but the training does not require in-depth knowledge of biology."
+- "After completing this training, you should be able to begin writing workflows for your own analysis, and know where to learn more."
---
# Introduction to Common Workflow Language
The Common Workflow Language (CWL) is an open standard for describing
-analysis workflows and tools in a way that makes them portable and
-scalable across a variety of software and hardware environments, from
-workstations to cluster, cloud, and high performance computing (HPC)
-environments. CWL is designed to meet the needs of data-intensive
-science, such as Bioinformatics, Medical Imaging, Astronomy, High
-Energy Physics, and Machine Learning.
+automated, batch data analysis workflows. Unlike many programming
+languages, CWL is a declarative language. This means it describes
+_what_ should happen, but not _how_ it should happen. This enables
+workflows written in CWL to be portable and scalable across a variety
+of software and hardware environments, from workstations to cluster,
+cloud, and high performance computing (HPC) environments. As a
+standard with multiple implementations, CWL is particularly well
+suited for research collaboration, publishing, and high-throughput
+production data analysis.
# Introduction to this training
The goal of this training is to walk the student through the
development of a best-practices CWL workflow, starting from an
-existing shell script that performs a common bioinformatics analysis.
+existing shell script that performs a simple RNA-seq bioinformatics
+analysis. At the conclusion of this training, you should have a grasp
+of the essential components of a workflow, and have a basis for
+learning more.
+
+This training assumes some basic familiarity with editing text files,
+the Unix command line, and Unix shell scripts.
Specific knowledge of the biology of RNA-seq is *not* a prerequisite
-for these lessons. CWL is not domain specific to bioinformatics. We
-hope that you will find this training useful even if you work in some
-other field of research.
+for these lessons. Although originally developed to solve big data
+problems in genomics, CWL is not domain specific to bioinformatics,
+and is used in a number of other fields including medical imaging,
+astronomy, geospatial analysis, and machine learning. We hope that you will
+find this training useful regardless of your area of research.
These lessons are based on [Introduction to RNA-seq using
high-performance computing
* Counting reads associated with genes
In this training, we are not attempting to develop the analysis from
-scratch, instead we we will be starting from an analysis written as a
-shell script. We will be using the following shell script as a guide to build
-our workflow.
-
-rnaseq_analysis_on_input_file.sh
-
-```
-#!/bin/bash
-
-# Based on
-# https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html
-#
-
-# This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
-# USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
-
-set -e
-
-# initialize a variable with an intuitive name to store the name of the input fastq file
-fq=$1
-
-# grab base of filename for naming outputs
-base=`basename $fq .subset.fq`
-echo "Sample name is $base"
-
-# specify the number of cores to use
-cores=4
-
-# directory with genome reference FASTA and index files + name of the gene annotation file
-genome=rnaseq/reference_data
-gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
-
-# make all of the output directories
-# The -p option means mkdir will create the whole path if it
-# does not exist and refrain from complaining if it does exist
-mkdir -p rnaseq/results/fastqc
-mkdir -p rnaseq/results/STAR
-mkdir -p rnaseq/results/counts
-
-# set up output filenames and locations
-fastqc_out=rnaseq/results/fastqc
-align_out=rnaseq/results/STAR/${base}_
-counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
-counts=rnaseq/results/counts/${base}_featurecounts.txt
-
-echo "Processing file $fq"
-
-# Run FastQC and move output to the appropriate folder
-fastqc $fq
-
-# Run STAR
-STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard
-
-# Create BAM index
-samtools index $counts_input_bam
-
-# Count mapped reads
-featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
-```
-
+scratch; instead, we will be starting from an analysis already
+written as a shell script, which will be supplied in lesson 2.
{% include links.md %}