lesson1/lesson1.md

   1 # Turning a shell script into a workflow by composing existing tools
   2
   3 ## Introduction
   4
   5 The goal of this training is to walk through the development of a
   6 best-practices CWL workflow by translating an existing bioinformatics
   7 shell script into CWL.  Specific knowledge of the biology of RNA-seq
   8 is *not* a prerequisite for these lessons.
   9
  10 These lessons are based on "Introduction to RNA-seq using
  11 high-performance computing (HPC)" lessons developed by members of the
  12 teaching team at the Harvard Chan Bioinformatics Core (HBC).  The
  13 original training, which includes additional lectures about the
  14 biology of RNA-seq can be found here:
  15
  16 https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2
  17
  18 ## Background
  19
  20 RNA-seq is the process of sequencing RNA in a biological sample.  From
  21 the sequence reads, we want to measure the relative number of RNA
  22 molecules appearing in the sample that were produced by particular
  23 genes.  This analysis is called "differential gene expression".
  24
  25 The entire process looks like this:
  26
  27 ![](RNAseqWorkflow.png)
  28
  29 For this training, we are only concerned with the middle analytical
  30 steps (skipping adapter trimming).
  31
  32 * Quality control (FASTQC)
  33 * Alignment (mapping)
  34 * Counting reads associated with genes
  35
  36 ## Analysis shell script
  37
  38 This analysis is already available as a Unix shell script, which we
  39 will refer to in order to build the workflow.
  40
  41 Some of the reasons to use CWL over a plain shell script: portability,
  42 scalability, ability to run on platforms that are not traditional HPC.
  43
  44 rnaseq_analysis_on_input_file.sh
  45
  46 ```
  47 #!/bin/bash
  48
  49 # Based on
  50 # https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html
  51 #
  52
  53 # This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
  54 # USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
  55
  56 set -e
  57
  58 # initialize a variable with an intuitive name to store the name of the input fastq file
  59 fq=$1
  60
  61 # grab base of filename for naming outputs
  62 base=`basename $fq .subset.fq`
  63 echo "Sample name is $base"
  64
  65 # specify the number of cores to use
  66 cores=4
  67
  68 # directory with genome reference FASTA and index files + name of the gene annotation file
  69 genome=rnaseq/reference_data
  70 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
  71
  72 # make all of the output directories
  73 # The -p option means mkdir will create the whole path if it
  74 # does not exist and refrain from complaining if it does exist
  75 mkdir -p rnaseq/results/fastqc
  76 mkdir -p rnaseq/results/STAR
  77 mkdir -p rnaseq/results/counts
  78
  79 # set up output filenames and locations
  80 fastqc_out=rnaseq/results/fastqc
  81 align_out=rnaseq/results/STAR/${base}_
  82 counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
  83 counts=rnaseq/results/counts/${base}_featurecounts.txt
  84
  85 echo "Processing file $fq"
  86
  87 # Run FastQC and move output to the appropriate folder
  88 fastqc $fq
  89
  90 # Run STAR
  91 STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard
  92
  93 # Create BAM index
  94 samtools index $counts_input_bam
  95
  96 # Count mapped reads
  97 featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
  98 ```
  99
 100 ## Setting up
 101
 102 We will create a new git repository and import a library of existing
 103 tool definitions that will help us build our workflow.
 104
 105 Create a new git repository to hold our workflow with this command:
 106
 107 ```
 108 git init rnaseq-cwl-training-exercises
 109 ```
 110
 111 On Arvados use this:
 112
 113 ```
 114 git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
 115 ```
 116
 117 Next, import bio-cwl-tools with this command:
 118
 119 ```
 120 git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
 121 ```
 122
 123 ## Writing the workflow
 124
 125 ### 1. File header
 126
 127 Create a new file "main.cwl"
 128
 129 Start with this header.
 130
 131
 132 ```
 133 cwlVersion: v1.2
 134 class: Workflow
 135 label: RNAseq CWL practice workflow
 136 ```
 137
 138 ### 2. Workflow Inputs
 139
 140 The purpose of a workflow is to consume some input parameters, run a
 141 series of steps, and produce output values.
 142
 143 For this analysis, the input parameters are the fastq file and the reference data required by STAR.
 144
 145 In the original shell script, the following variables are declared:
 146
 147 ```
 148 # initialize a variable with an intuitive name to store the name of the input fastq file
 149 fq=$1
 150
 151 # directory with genome reference FASTA and index files + name of the gene annotation file
 152 genome=rnaseq/reference_data
 153 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
 154 ```
 155
 156 In CWL, we will declare these variables in the `inputs` section.
 157
 158 The inputs section lists each input parameter and its type.  Valid
 159 types include `File`, `Directory`, `string`, `boolean`, `int`, and
 160 `float`.
 161
 162 In this case, the fastq and gene annotation file are individual files.  The STAR index is a directory.  We can describe these inputs in CWL like this:
 163
 164 ```
 165 inputs:
 166   fq: File
 167   genome: Directory
 168   gtf: File
 169 ```
 170
 171 ### 3. Workflow Steps
 172
 173 A workflow consists of one or more steps.  This is the `steps` section.
 174
 175 Now we need to describe the first step of the workflow.  This step is to run `fastqc`.
 176
 177 A workflow step consists of the name of the step, the tool to `run`,
 178 the input parameters to be passed to the tool in `in`, and the output
 179 parameters expected from the tool in `out`.
 180
 181 The value of `run` references the tool file.  Tip: while typing the
 182 file name, you can get suggestions and auto-completion on a partial
 183 name using control+space.
 184
 185 The `in` block lists input parameters to the tool and the workflow
 186 parameters that will be assigned to those inputs.
 187
 188 The `out` block lists output parameters to the tool that are used
 189 later in the workflow.
 190
 191 You need to know which input and output parameters are available for
 192 each tool.  In vscode, click on the value of `run` and select "Go to
 193 definition" to open the tool file.  Look for the `inputs` and
 194 `outputs` sections of the tool file to find out what parameters are
 195 defined.
 196
 197 ```
 198 steps:
 199   fastqc:
 200     run: bio-cwl-tools/fastqc/fastqc_2.cwl
 201     in:
 202       reads_file: fq
 203     out: [html_file]
 204 ```
 205
 206 ### 4. Running alignment with STAR
 207
 208 STAR has more parameters.  Sometimes we want to provide input values
 209 to a step without making them as workflow-level inputs.  We can do
 210 this with `{default: N}`
 211
 212
 213 ```
 214   STAR:
 215     requirements:
 216       ResourceRequirement:
 217         ramMin: 6000
 218     run: bio-cwl-tools/STAR/STAR-Align.cwl
 219     in:
 220       RunThreadN: {default: 4}
 221       GenomeDir: genome
 222       ForwardReads: fq
 223       OutSAMtype: {default: BAM}
 224       OutSAMunmapped: {default: Within}
 225     out: [alignment]
 226 ```
 227
 228 ### 5. Running samtools
 229
 230 The third step is to generate an index for the aligned BAM.
 231
 232 For this step, we need to use the output of a previous step as input
 233 to this step.  We refer the output of a step by with name of the step
 234 (STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment`
 235
 236 This creates a dependency between steps.  This means the `samtools`
 237 step will not run until the `STAR` step has completed successfully.
 238
 239 ```
 240   samtools:
 241     run: bio-cwl-tools/samtools/samtools_index.cwl
 242     in:
 243       bam_sorted: STAR/alignment
 244     out: [bam_sorted_indexed]
 245 ```
 246
 247 ### 6. featureCounts
 248
 249 As of this writing, the `subread` package that provides
 250 `featureCounts` is not available in bio-cwl-tools (and if it has been
 251 added since writing this, let's pretend that it isn't there.)  We will
 252 go over how to write a CWL wrapper for a command line tool in
 253 lesson 3.  For now, we will leave off the final step.
 254
 255 ### 7. Workflow Outputs
 256
 257 The last thing to do is declare the workflow outputs in the `outputs` section.
 258
 259 For each output, we need to declare the type of output, and what
 260 parameter has the output value.
 261
 262 Output types are the same as input types, valid types include `File`,
 263 `Directory`, `string`, `boolean`, `int`, and `float`.
 264
 265 The `outputSource` field refers the a step output in the same way that
 266 the `in` block does, the name of the step, a slash, and the name of
 267 the output parameter.
 268
 269 For our final outputs, we want the results from fastqc and the
 270 aligned, sorted and indexed BAM file.
 271
 272 ```
 273 outputs:
 274   qc_html:
 275     type: File
 276     outputSource: fastqc/html_file
 277   bam_sorted_indexed:
 278     type: File
 279     outputSource: samtools/bam_sorted_indexed
 280 ```