_episodes/01-introduction.md

   1 ---
   2 title: "Introduction"
   3 teaching: 0
   4 exercises: 0
   5 questions:
   6 - "Key question (FIXME)"
   7 objectives:
   8 - "First learning objective. (FIXME)"
   9 keypoints:
  10 - "First key point. Brief Answer to questions. (FIXME)"
  11 ---
  12
  13 ## Introduction
  14
  15 The goal of this training is to walk through the development of a
  16 best-practices CWL workflow by translating an existing bioinformatics
  17 shell script into CWL.  Specific knowledge of the biology of RNA-seq
  18 is *not* a prerequisite for these lessons.
  19
  20 These lessons are based on "Introduction to RNA-seq using
  21 high-performance computing (HPC)" lessons developed by members of the
  22 teaching team at the Harvard Chan Bioinformatics Core (HBC).  The
  23 original training, which includes additional lectures about the
  24 biology of RNA-seq can be found here:
  25
  26 https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2
  27
  28 ## Background
  29
  30 RNA-seq is the process of sequencing RNA in a biological sample.  From
  31 the sequence reads, we want to measure the relative number of RNA
  32 molecules appearing in the sample that were produced by particular
  33 genes.  This analysis is called "differential gene expression".
  34
  35 The entire process looks like this:
  36
  37 ![](/assets/img/RNAseqWorkflow.png)
  38
  39 For this training, we are only concerned with the middle analytical
  40 steps (skipping adapter trimming).
  41
  42 * Quality control (FASTQC)
  43 * Alignment (mapping)
  44 * Counting reads associated with genes
  45
  46 ## Analysis shell script
  47
  48 This analysis is already available as a Unix shell script, which we
  49 will refer to in order to build the workflow.
  50
  51 Some of the reasons to use CWL over a plain shell script: portability,
  52 scalability, ability to run on platforms that are not traditional HPC.
  53
  54 rnaseq_analysis_on_input_file.sh
  55
  56 ```
  57 #!/bin/bash
  58
  59 # Based on
  60 # https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html
  61 #
  62
  63 # This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
  64 # USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
  65
  66 set -e
  67
  68 # initialize a variable with an intuitive name to store the name of the input fastq file
  69 fq=$1
  70
  71 # grab base of filename for naming outputs
  72 base=`basename $fq .subset.fq`
  73 echo "Sample name is $base"
  74
  75 # specify the number of cores to use
  76 cores=4
  77
  78 # directory with genome reference FASTA and index files + name of the gene annotation file
  79 genome=rnaseq/reference_data
  80 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
  81
  82 # make all of the output directories
  83 # The -p option means mkdir will create the whole path if it
  84 # does not exist and refrain from complaining if it does exist
  85 mkdir -p rnaseq/results/fastqc
  86 mkdir -p rnaseq/results/STAR
  87 mkdir -p rnaseq/results/counts
  88
  89 # set up output filenames and locations
  90 fastqc_out=rnaseq/results/fastqc
  91 align_out=rnaseq/results/STAR/${base}_
  92 counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
  93 counts=rnaseq/results/counts/${base}_featurecounts.txt
  94
  95 echo "Processing file $fq"
  96
  97 # Run FastQC and move output to the appropriate folder
  98 fastqc $fq
  99
 100 # Run STAR
 101 STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard
 102
 103 # Create BAM index
 104 samtools index $counts_input_bam
 105
 106 # Count mapped reads
 107 featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
 108 ```
 109
 110
 111 {% include links.md %}