_episodes/01-introduction.md

   1 ---
   2 title: "Introduction"
   3 teaching: 10
   4 exercises: 0
   5 questions:
   6 - "What is CWL?"
   7 - "What are the requirements for this training?"
   8 - "What is the goal of this training?"
   9 objectives:
  10 - "Understand how the training will be motivated by an example analysis."
  11 keypoints:
  12 - "Common Workflow Language is a standard for describing data analysis workflows."
  13 - "This training assumes some basic familiarity with editing text files, the Unix command line, and Unix shell scripts."
  14 - "We will use a bioinformatics RNA-seq analysis as an example workflow, but the training does not require in-depth knowledge of biology."
  15 - "After completing this training, you should be able to begin writing workflows for your own analysis, and know where to learn more."
  16 ---
  17
  18 # Introduction to Common Workflow Language
  19
  20 The Common Workflow Language (CWL) is an open standard for describing
  21 automated, batch data analysis workflows.  Unlike many programming
  22 languages, CWL is a declarative language.  This means it describes
  23 _what_ should happen, but not _how_ it should happen.  This enables
  24 workflows written in CWL to be portable and scalable across a variety
  25 of software and hardware environments, from workstations to cluster,
  26 cloud, and high performance computing (HPC) environments.  As a
  27 standard with multiple implementations, CWL is particularly well
  28 suited for research collaboration, publishing, and high-throughput
  29 production data analysis.
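
To make the distinction between _what_ and _how_ concrete, here is a minimal sketch of a CWL tool description. It is not part of the RNA-seq example used in this training. The input name `message` and the choice of `echo` as the command are illustrative assumptions. The file declares a command, its inputs, and its outputs, and says nothing about where or how the command is executed; that decision is left to the workflow runner.

```yaml
# Illustrative sketch only: a tool description that wraps the `echo` command.
cwlVersion: v1.2
class: CommandLineTool
baseCommand: echo      # what to run
inputs:
  message:             # hypothetical input name chosen for this example
    type: string
    inputBinding:
      position: 1      # pass the value as the first command-line argument
outputs: []            # this tool declares no output files
```

Assuming this is saved under a hypothetical file name such as `echo-tool.cwl`, a CWL runner such as the reference implementation `cwltool` could run it with `cwltool echo-tool.cwl --message "Hello"`.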
  30
  31 # Introduction to this training
  32
  33 The goal of this training is to walk you through the
  34 development of a best-practices CWL workflow, starting from an
  35 existing shell script that performs a simple RNA-seq bioinformatics
  36 analysis.  At the conclusion of this training, you should have a grasp
  37 of the essential components of a workflow, and have a basis for
  38 learning more.
  39
  40 This training assumes some basic familiarity with editing text files,
  41 the Unix command line, and Unix shell scripts.
  42
  43 Specific knowledge of the biology of RNA-seq is *not* a prerequisite
  44 for these lessons.  Although originally developed to solve big data
  45 problems in genomics, CWL is not specific to bioinformatics,
  46 and is used in a number of other fields including medical imaging,
  47 astronomy, geospatial analysis, and machine learning.  We hope that you will
  48 find this training useful regardless of your area of research.
  49
  50 These lessons are based on [Introduction to RNA-seq using
  51 high-performance computing
  52 (HPC)](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2) lessons
  53 developed by members of the teaching team at the Harvard Chan
  54 Bioinformatics Core (HBC).  The original training, which includes
  55 additional lectures about the biology of RNA-seq, can be found at that
  56 link.
  57
  58 # Introduction to the example analysis
  59
  60 RNA-seq is the process of sequencing RNA present in a biological
  61 sample.  From the sequence reads, we want to measure the relative
  62 numbers of different RNA molecules appearing in the sample that were
  63 produced by particular genes.  This analysis is called "differential
  64 gene expression".
  65
  66 The entire process looks like this:
  67
  68 ![](/assets/img/RNAseqWorkflow.png){: height="400px"}
  69
  70 For this training, we are only concerned with the middle analytical
  71 steps (skipping adapter trimming):
  72
  73 * Quality control (FASTQC)
  74 * Alignment (mapping)
  75 * Counting reads associated with genes
  76
  77 In this training, we are not attempting to develop the analysis from
  78 scratch.  Instead, we will start from an analysis already
  79 written in a shell script, which will be supplied in lesson 2.
  80
  81 {% include links.md %}