---
title: "Introduction"
teaching: 10
exercises: 0
questions:
- "What is CWL?"
- "What are the requirements for this training?"
- "What is the goal of this training?"
objectives:
- "Understand how the training will be motivated by an example analysis."
keypoints:
- "Common Workflow Language is a standard for describing data analysis workflows."
- "This training assumes some basic familiarity with editing text files, the Unix command line, and Unix shell scripts."
- "We will use a bioinformatics RNA-seq analysis as an example workflow, but the training does not require in-depth knowledge of biology."
- "After completing this training, you should be able to begin writing workflows for your own analysis, and know where to learn more."
---

# Introduction to Common Workflow Language

The Common Workflow Language (CWL) is an open standard for describing
automated, batch data analysis workflows. Unlike many programming
languages, CWL is a declarative language. This means it describes
_what_ should happen, but not _how_ it should happen. This enables
workflows written in CWL to be portable and scalable across a variety
of software and hardware environments, from workstations to cluster,
cloud, and high-performance computing (HPC) environments. As a standard
with multiple implementations, CWL is particularly well suited for
research collaboration, publishing, and high-throughput production
data analysis.

# Introduction to this training

The goal of this training is to walk you through the development of a
best-practices CWL workflow, starting from an existing shell script
that performs a simple RNA-seq bioinformatics analysis. At the
conclusion of this training, you should have a grasp of the essential
components of a workflow, and have a basis for learning more.

This training assumes some basic familiarity with editing text files,
the Unix command line, and Unix shell scripts. Specific knowledge of
the biology of RNA-seq is *not* a prerequisite for these lessons.

Although originally developed to solve big data problems in genomics,
CWL is not specific to bioinformatics, and is used in a number of
other fields including medical imaging, astronomy, geospatial
analysis, and machine learning. We hope that you will find this
training useful regardless of your area of research.

These lessons are based on [Introduction to RNA-seq using
high-performance computing (HPC)](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2)
lessons developed by members of the teaching team at the Harvard Chan
Bioinformatics Core (HBC). The original training, which includes
additional lectures about the biology of RNA-seq, can be found at that
link.

# Introduction to the example analysis

RNA-seq is the process of sequencing RNA present in a biological
sample. From the sequence reads, we want to measure the relative
numbers of different RNA molecules appearing in the sample that were
produced by particular genes. This analysis is called "differential
gene expression".

The entire process looks like this:

![](/assets/img/RNAseqWorkflow.png){: height="400px"}

For this training, we are only concerned with the middle analytical
steps (skipping adapter trimming):

* Quality control (FASTQC)
* Alignment (mapping)
* Counting reads associated with genes

In this training, we are not attempting to develop the analysis from
scratch. Instead, we will start from an analysis already written as a
shell script, which will be supplied in lesson 2.

{% include links.md %}