_episodes/01-introduction.md

   1 ---
   2 title: "Introduction"
   3 teaching: 10
   4 exercises: 0
   5 questions:
   6 - "What is CWL?"
   7 - "What is the goal of this training?"
   8 objectives:
   9 - "First learning objective. (FIXME)"
  10 keypoints:
  11 - "First key point. Brief Answer to questions. (FIXME)"
  12 ---
  13
  14 # Introduction to Common Worklow Language
  15
  16 The Common Workflow Language (CWL) is an open standard for describing
  17 analysis workflows and tools in a way that makes them portable and
  18 scalable across a variety of software and hardware environments, from
  19 workstations to cluster, cloud, and high performance computing (HPC)
  20 environments. CWL is designed to meet the needs of data-intensive
  21 science, such as Bioinformatics, Medical Imaging, Astronomy, High
  22 Energy Physics, and Machine Learning.
  23
  24 # Introduction to this training
  25
  26 The goal of this training is to walk the student through the
  27 development of a best-practices CWL workflow, starting from an
  28 existing shell script that performs a common bioinformatics analysis.
  29
  30 Specific knowledge of the biology of RNA-seq is *not* a prerequisite
  31 for these lessons.  CWL is not domain specific to bioinformatics.  We
  32 hope that you will find this training useful even if you work in some
  33 other field of research.
  34
  35 These lessons are based on [Introduction to RNA-seq using
  36 high-performance computing
  37 (HPC)](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2) lessons
  38 developed by members of the teaching team at the Harvard Chan
  39 Bioinformatics Core (HBC).  The original training, which includes
  40 additional lectures about the biology of RNA-seq, can be found at that
  41 link.
  42
  43 # Introduction to the example analysis
  44
  45 RNA-seq is the process of sequencing RNA present in a biological
  46 sample.  From the sequence reads, we want to measure the relative
  47 numbers of different RNA molecules appearing in the sample that were
  48 produced by particular genes.  This analysis is called "differential
  49 gene expression".
  50
  51 The entire process looks like this:
  52
  53 ![](/assets/img/RNAseqWorkflow.png){: height="400px"}
  54
  55 For this training, we are only concerned with the middle analytical
  56 steps (skipping adapter trimming).
  57
  58 * Quality control (FASTQC)
  59 * Alignment (mapping)
  60 * Counting reads associated with genes
  61
  62 In this training, we are not attempting to develop the analysis from
  63 scratch, instead we we will be starting from an analysis written as a
  64 shell script.  We will be using the following shell script as a guide to build
  65 our workflow.
  66
  67 rnaseq_analysis_on_input_file.sh
  68
  69 ```
  70 #!/bin/bash
  71
  72 # Based on
  73 # https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html
  74 #
  75
  76 # This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
  77 # USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
  78
  79 set -e
  80
  81 # initialize a variable with an intuitive name to store the name of the input fastq file
  82 fq=$1
  83
  84 # grab base of filename for naming outputs
  85 base=`basename $fq .subset.fq`
  86 echo "Sample name is $base"
  87
  88 # specify the number of cores to use
  89 cores=4
  90
  91 # directory with genome reference FASTA and index files + name of the gene annotation file
  92 genome=rnaseq/reference_data
  93 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
  94
  95 # make all of the output directories
  96 # The -p option means mkdir will create the whole path if it
  97 # does not exist and refrain from complaining if it does exist
  98 mkdir -p rnaseq/results/fastqc
  99 mkdir -p rnaseq/results/STAR
 100 mkdir -p rnaseq/results/counts
 101
 102 # set up output filenames and locations
 103 fastqc_out=rnaseq/results/fastqc
 104 align_out=rnaseq/results/STAR/${base}_
 105 counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
 106 counts=rnaseq/results/counts/${base}_featurecounts.txt
 107
 108 echo "Processing file $fq"
 109
 110 # Run FastQC and move output to the appropriate folder
 111 fastqc $fq
 112
 113 # Run STAR
 114 STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard
 115
 116 # Create BAM index
 117 samtools index $counts_input_bam
 118
 119 # Count mapped reads
 120 featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
 121 ```
 122
 123
 124 {% include links.md %}