lesson1/lesson1.md

   1 # Turning a shell script into a workflow using existing tools
   2
   3 In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow.
   4
   5 ## Setting up
   6
   7 We will create a new git repository and import a library of existing
   8 tool definitions that will help us build our workflow.
   9
  10 Create a new git repository to hold our workflow with this command:
  11
  12 ```
  13 git init rnaseq-cwl-training-exercises
  14 ```
  15
  16 On Arvados use this:
  17
  18 ```
  19 git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
  20 ```
  21
  22 Next, import bio-cwl-tools with this command:
  23
  24 ```
  25 git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
  26 ```
  27
  28 ## The shell script
  29
  30 ```
  31 #!/bin/bash
  32
  33 # Based on
  34 # https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html
  35 #
  36
  37 # This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
  38 # USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
  39
  40 set -e
  41
  42 # initialize a variable with an intuitive name to store the name of the input fastq file
  43 fq=$1
  44
  45 # grab base of filename for naming outputs
  46 base=`basename $fq .subset.fq`
  47 echo "Sample name is $base"
  48
  49 # specify the number of cores to use
  50 cores=4
  51
  52 # directory with genome reference FASTA and index files + name of the gene annotation file
  53 genome=rnaseq/reference_data
  54 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
  55
  56 # make all of the output directories
  57 # The -p option means mkdir will create the whole path if it
  58 # does not exist and refrain from complaining if it does exist
  59 mkdir -p rnaseq/results/fastqc
  60 mkdir -p rnaseq/results/STAR
  61 mkdir -p rnaseq/results/counts
  62
  63 # set up output filenames and locations
  64 fastqc_out=rnaseq/results/fastqc
  65 align_out=rnaseq/results/STAR/${base}_
  66 counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
  67 counts=rnaseq/results/counts/${base}_featurecounts.txt
  68
  69 echo "Processing file $fq"
  70
  71 # Run FastQC and move output to the appropriate folder
  72 fastqc $fq
  73
  74 # Run STAR
  75 STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard
  76
  77 # Create BAM index
  78 samtools index $counts_input_bam
  79
  80 # Count mapped reads
  81 featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
  82 ```
  83
  84 ## Writing the workflow
  85
  86 ### 1. File header
  87
  88 Create a new file "main.cwl"
  89
  90 Start with this header.
  91
  92
  93 ```
  94 cwlVersion: v1.2
  95 class: Workflow
  96 label: RNAseq CWL practice workflow
  97 ```
  98
  99 ### 2. Workflow Inputs
 100
 101 The purpose of a workflow is to consume some input parameters, run a
 102 series of steps, and produce output values.
 103
 104 For this analysis, the input parameters are the fastq file and the reference data required by STAR.
 105
 106 In the original shell script, the following variables are declared:
 107
 108 ```
 109 # initialize a variable with an intuitive name to store the name of the input fastq file
 110 fq=$1
 111
 112 # directory with genome reference FASTA and index files + name of the gene annotation file
 113 genome=rnaseq/reference_data
 114 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
 115 ```
 116
 117 In CWL, we will declare these variables in the `inputs` section.
 118
 119 The inputs section lists each input parameter and its type.  Valid
 120 types include `File`, `Directory`, `string`, `boolean`, `int`, and
 121 `float`.
 122
 123 In this case, the fastq and gene annotation file are individual files.  The STAR index is a directory.  We can describe these inputs in CWL like this:
 124
 125 ```
 126 inputs:
 127   fq: File
 128   genome: Directory
 129   gtf: File
 130 ```
 131
 132 ### 3. Workflow Steps
 133
 134 A workflow consists of one or more steps.  This is the `steps` section.
 135
 136 Now we need to describe the first step of the workflow.  This step is to run `fastqc`.
 137
 138 A workflow step consists of the name of the step, the tool to `run`,
 139 the input parameters to be passed to the tool in `in`, and the output
 140 parameters expected from the tool in `out`.
 141
 142 The value of `run` references the tool file.  Tip: while typing the
 143 file name, you can get suggestions and auto-completion on a partial
 144 name using control+space.
 145
 146 The `in` block lists input parameters to the tool and the workflow
 147 parameters that will be assigned to those inputs.
 148
 149 The `out` block lists output parameters to the tool that are used
 150 later in the workflow.
 151
 152 You need to know which input and output parameters are available for
 153 each tool.  In vscode, click on the value of `run` and select "Go to
 154 definition" to open the tool file.  Look for the `inputs` and
 155 `outputs` sections of the tool file to find out what parameters are
 156 defined.
 157
 158 ```
 159 steps:
 160   fastqc:
 161     run: bio-cwl-tools/fastqc/fastqc_2.cwl
 162     in:
 163           reads_file: fq
 164     out: [html_file]
 165 ```
 166
 167 ### 4. Running alignment with STAR
 168
 169 STAR has more parameters.  Sometimes we want to provide input values
 170 to a step without making them as workflow-level inputs.  We can do
 171 this with `{default: N}`
 172
 173
 174 ```
 175   STAR:
 176     requirements:
 177       ResourceRequirement:
 178         ramMin: 6000
 179     run: bio-cwl-tools/STAR/STAR-Align.cwl
 180     in:
 181       RunThreadN: {default: 4}
 182       GenomeDir: genome
 183       ForwardReads: fq
 184       OutSAMtype: {default: BAM}
 185       OutSAMunmapped: {default: Within}
 186     out: [alignment]
 187 ```
 188
 189 ### 5. Running samtools
 190
 191 The third step is to generate an index for the aligned BAM.
 192
 193 For this step, we need to use the output of a previous step as input
 194 to this step.  We refer the output of a step by with name of the step
 195 (STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment`
 196
 197 This creates a dependency between steps.  This means the `samtools`
 198 step will not run until the `STAR` step has completed successfully.
 199
 200 ```
 201   samtools:
 202     run: bio-cwl-tools/samtools/samtools_index.cwl
 203     in:
 204       bam_sorted: STAR/alignment
 205     out: [bam_sorted_indexed]
 206 ```
 207
 208 ### 6. featureCounts
 209
 210 As of this writing, the `subread` package that provides
 211 `featureCounts` is not available in bio-cwl-tools (and if it has been
 212 added since writing this, let's pretend that it isn't there.)  We will
 213 dive into how to write a CWL wrapper for a command line tool in
 214 lesson 2.  For now, we will leave off the final step.
 215
 216 ### 7. Workflow Outputs
 217
 218 The last thing to do is declare the workflow outputs in the `outputs` section.
 219
 220 For each output, we need to declare the type of output, and what
 221 parameter has the output value.
 222
 223 Output types are the same as input types, valid types include `File`,
 224 `Directory`, `string`, `boolean`, `int`, and `float`.
 225
 226 The `outputSource` field refers the a step output in the same way that
 227 the `in` block does, the name of the step, a slash, and the name of
 228 the output parameter.
 229
 230 For our final outputs, we want the results from fastqc and the
 231 aligned, sorted and indexed BAM file.
 232
 233 ```
 234 outputs:
 235   qc_html:
 236     type: File
 237     outputSource: fastqc/html_file
 238   bam_sorted_indexed:
 239     type: File
 240     outputSource: samtools/bam_sorted_indexed
 241 ```