_episodes/02-workflow.md

   1 ---
   2 title: "Create a Workflow by Composing Tools"
   3 teaching: 30
   4 exercises: 10
   5 questions:
   6 - "What is the syntax of CWL?"
   7 - "What are the key components of a workflow?"
   8 objectives:
   9 - "Write a workflow based on the source shell script, making use of existing tool wrappers."
  10 keypoints:
  11 - "CWL documents are written using a syntax called YAML."
  12 - "The key components of the workflow are: the header, the inputs, the steps, and the outputs."
  13 ---
  14
  15 # Source shell script
  16
  17 In this lesson, we will develop an initial workflow inspired by the
  18 following shell script.
  19
  20 rnaseq_analysis_on_input_file.sh
  21
  22 ```
  23 #!/bin/bash
  24
  25 # Based on
  26 # https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html
  27 #
  28
  29 # This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
  30 # USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
  31
  32 set -e
  33
  34 # initialize a variable with an intuitive name to store the name of the input fastq file
  35 fq=$1
  36
  37 # grab base of filename for naming outputs
  38 base=`basename $fq .subset.fq`
  39 echo "Sample name is $base"
  40
  41 # specify the number of cores to use
  42 cores=4
  43
  44 # directory with genome reference FASTA and index files + name of the gene annotation file
  45 genome=rnaseq/reference_data
  46 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
  47
  48 # make all of the output directories
  49 # The -p option means mkdir will create the whole path if it
  50 # does not exist and refrain from complaining if it does exist
  51 mkdir -p rnaseq/results/fastqc
  52 mkdir -p rnaseq/results/STAR
  53 mkdir -p rnaseq/results/counts
  54
  55 # set up output filenames and locations
  56 fastqc_out=rnaseq/results/fastqc
  57 align_out=rnaseq/results/STAR/${base}_
  58 counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
  59 counts=rnaseq/results/counts/${base}_featurecounts.txt
  60
  61 echo "Processing file $fq"
  62
  63 # Run FastQC and move output to the appropriate folder
  64 fastqc $fq
  65
  66 # Run STAR
  67 STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
  68
  69 # Create BAM index
  70 samtools index $counts_input_bam
  71
  72 # Count mapped reads
  73 featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
  74 ```
  75 {: .language-bash }
  76
  77 # CWL Syntax
  78
  79 CWL documents are written using a format called "YAML".  Here is a crash-course in YAML:
  80
  81 Data fields are written with the name, followed by a colon `:`, a space,
  82 and then the value.
  83
  84 ```
  85 fieldName: value
  86 ```
  87 {: .language-yaml }
  88
  89 The value is the remaining text to the end of the line.
  90
  91 Special characters in YAML include `:`, `{`, `}` `[`, `]`, `#`, `!`
  92 and `%`.  If your text begins with any of these characters, you must
  93 surround the string in single or double quotes.
  94
  95 ```
  96 fieldName: "#quoted-value"
  97 ```
  98 {: .language-yaml }
  99
 100 You can write multi-line text by putting `|-` and writing an indented
 101 block.  The leading whitespace will be removed from the actual value.
 102
 103 ```
 104 fieldName: |-
 105   This is a multi-
 106   line string.
 107   Horray!
 108 ```
 109 {: .language-yaml }
 110
 111 Nested sections are indented:
 112
 113 ```
 114 section1:
 115   field1: value1
 116   field2: value2
 117 ```
 118 {: .language-yaml }
 119
 120 Nested sections can _also_ be wrapped in curly brackets.  In this case, fields must be comma-separated.
 121
 122 ```
 123 section1: {field1: value1, field2, value2}
 124 ```
 125 {: .language-yaml }
 126
 127 When each item is on its own line starting with a dash `-`, it is a list.
 128
 129 ```
 130 section2:
 131   - value1
 132   - value2
 133 ```
 134 {: .language-yaml }
 135
 136 List can _also_ be wrapped in square brackets.  In this case, values must be comma-separated.
 137
 138 ```
 139 section2: [value1, value2]
 140 ```
 141 {: .language-yaml }
 142
 143 Comments start with `#`.
 144
 145 ```
 146 # This is a comment about field3
 147 field3: stuff
 148
 149 field4: stuff # This is a comment about field4
 150 ```
 151 {: .language-yaml }
 152
 153 Finally, YAML is a superset of JSON.  Valid JSON is also valid YAML,
 154 so you may sometimes see JSON format being used instead of YAML format
 155 for CWL documents.
 156
 157 # Workflow header
 158
 159 Create a new file "main.cwl"
 160
 161 Let's start with the header.
 162
 163 ```
 164 cwlVersion: v1.2
 165 class: Workflow
 166 label: RNAseq CWL practice workflow
 167 ```
 168 {: .language-yaml }
 169
 170 * cwlVersion - Every file must include this.  It declares the version of CWL in use.
 171 * class - This is the type of CWL document.  We will see other types in future lessons.
 172 * label - Optional title of your workflow.
 173
 174 # Workflow Inputs
 175
 176 The purpose of a workflow is to consume some input parameters, run a
 177 series of steps, and produce output values.
 178
 179 For this analysis, the input parameters are the fastq file and the reference data required by STAR.
 180
 181 In the source shell script, the following variables are declared:
 182
 183 ```
 184 # initialize a variable with an intuitive name to store the name of the input fastq file
 185 fq=$1
 186
 187 # directory with genome reference FASTA and index files + name of the gene annotation file
 188 genome=rnaseq/reference_data
 189 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
 190 ```
 191 {: .language-bash }
 192
 193 In CWL, we will declare these variables in the `inputs` section.
 194
 195 The inputs section lists each input parameter and its type.  Valid
 196 types include `File`, `Directory`, `string`, `boolean`, `int`, and
 197 `float`.
 198
 199 In this case, the fastq and gene annotation file are individual files.  The STAR index is a directory.  We can describe these inputs in CWL like this:
 200
 201 ```
 202 inputs:
 203   fq: File
 204   genome: Directory
 205   gtf: File
 206 ```
 207 {: .language-yaml }
 208
 209 # Workflow Steps
 210
 211 A workflow consists of one or more steps.  This is the `steps` section.
 212
 213 Now we need to describe the first step of the workflow.  In the source
 214 script, the first step is to run `fastqc`.
 215
 216 ```
 217 # Run FastQC and move output to the appropriate folder
 218 fastqc $fq
 219 ```
 220 {: .language-bash }
 221
 222 A workflow step consists of the name of the step, the tool to `run`,
 223 the input parameters to be passed to the tool in `in`, and the output
 224 parameters expected from the tool in `out`.
 225
 226 The value of `run` references the tool file.  The tool file describes
 227 how to run the tool (we will discuss how to write tool files in lesson
 228 4).  If we look in `bio-cwl-tools` (which you should have imported
 229 when setting up a practice repository in the initial setup
 230 instructions) we find `bio-cwl-tools/fastqc/fastqc_2.cwl`.
 231
 232 Next, the `in` block is mapping of input parameters to the tool and
 233 the workflow parameters that will be assigned to those inputs.  We
 234 need to know what input parameters the tool accepts.
 235
 236 Let's open up the tool file and take a look:
 237
 238 Find the `inputs` section of `bio-cwl-tools/fastqc/fastqc_2.cwl`:
 239
 240 ```
 241 inputs:
 242
 243   reads_file:
 244     type:
 245       - File
 246     inputBinding:
 247       position: 50
 248     doc: |
 249       Input bam,sam,bam_mapped,sam_mapped or fastq file
 250 ```
 251 {: .language-yaml }
 252
 253 Now we know we need to provide an input parameter called `reads_file`.
 254
 255 Next, the `out` section is a list of output parameters from the tool
 256 that will be used later in the workflow, or as workflow output.  We
 257 need to know what output parameters the tool produces.  Find the
 258 `outputs` section of `bio-cwl-tools/fastqc/fastqc_2.cwl`:
 259
 260 ```
 261 outputs:
 262
 263   zipped_file:
 264     type:
 265       - File
 266     outputBinding:
 267       glob: '*.zip'
 268   html_file:
 269     type:
 270       - File
 271     outputBinding:
 272       glob: '*.html'
 273   summary_file:
 274     type:
 275       - File
 276     outputBinding:
 277       glob: |
 278         ${
 279           return "*/summary.txt";
 280         }
 281 ```
 282 {: .language-yaml }
 283
 284 Now we know to expect an output parameter called `html_file`.
 285
 286 Putting this all together, the `fastq` step consists of a `run`, `in`
 287 and `out` subsections, and looks like this:
 288
 289 ```
 290 steps:
 291   fastqc:
 292     run: bio-cwl-tools/fastqc/fastqc_2.cwl
 293     in:
 294       reads_file: fq
 295     out: [html_file]
 296 ```
 297 {: .language-yaml }
 298
 299 # Running alignment with STAR
 300
 301 The next step is to run the STAR aligner.
 302
 303 ```
 304 # Run STAR
 305 STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
 306 ```
 307 {: .language-bash }
 308
 309 We will go through the same process as the first section.  We find
 310 there is `bio-cwl-tools/STAR/STAR-Align.cwl`.  We will open the file
 311 and look at the `inputs` section to determine what input parameters
 312 correspond to the command line parmeters from our source script.
 313
 314 > ## Exercise
 315 >
 316 > Look at `STAR-Align.cwl` and identify the input parameters that
 317 > correspond to the command line arguments used in the source script:
 318 > `--runThreadN`, `--genomeDir`, `--outSAMtype`, and
 319 > `--outSAMunmapped`.  Also identify the name of the output parameter.
 320 >
 321 > > ## Solution
 322 > >
 323 > > * `--runThreadN` &rarr; `RunThreadN`
 324 > > * `--genomeDir` &rarr; `GenomeDir`
 325 > > * `--readFilesIn` &rarr; `ForwardReads`
 326 > > * `--outSAMtype` &rarr; `OutSAMtype`, `SortedByCoordinate`
 327 > > * `--outSAMunmapped` &rarr; `OutSAMunmapped`
 328 > >
 329 > > output parameter name: `alignment`
 330 > {: .solution}
 331 {: .challenge}
 332
 333 > ## Command line flags
 334 >
 335 > Command line flags generally appear appear in either the `arguments`
 336 > field, or the `prefix` field of the `inputBinding` section of an
 337 > input parameter declaration.  For example, this section of
 338 > `STAR-Align.cwl` tells us that the `GenomeDir` input parameter
 339 > corresponds to the `--genomeDir` command line parameter.
 340 >
 341 > ```
 342 >   GenomeDir:
 343 >     type: Directory
 344 >     inputBinding:
 345 >       prefix: "--genomeDir"
 346 > ```
 347 > {: .language-yaml }
 348 {: .callout}
 349
 350 > ## Default values
 351 >
 352 > Sometimes we want to provide input values to a step without making
 353 > them as workflow-level inputs.  We can do this with `{default: N}`.
 354 > For example:
 355 >
 356 > ```
 357 >    in:
 358 >      RunThreadN: {default: 4}
 359 > ```
 360 > {: .language-yaml }
 361 {: .callout}
 362
 363 > ## Exercise
 364 >
 365 > Using the input and output parameters identified in the previous
 366 > exercise, write the `run`, `in` and `out` sections of the STAR step.
 367 >
 368 > > ## Solution
 369 > >
 370 > > ```
 371 > >   STAR:
 372 > >     run: bio-cwl-tools/STAR/STAR-Align.cwl
 373 > >     in:
 374 > >       RunThreadN: {default: 4}
 375 > >       GenomeDir: genome
 376 > >       ForwardReads: fq
 377 > >       OutSAMtype: {default: BAM}
 378 > >       SortedByCoordinate: {default: true}
 379 > >       OutSAMunmapped: {default: Within}
 380 > >     out: [alignment]
 381 > > ```
 382 > > {: .language-yaml }
 383 > {: .solution}
 384 {: .challenge}
 385
 386
 387 # Running samtools
 388
 389 The third step is to generate an index for the aligned BAM.
 390
 391 ```
 392 # Create BAM index
 393 samtools index $counts_input_bam
 394 ```
 395 {: .language-bash }
 396
 397 For this step, we need to use the output of a previous step as input
 398 to this step.  We refer the output of a step by with name of the step
 399 (STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment`
 400
 401 This creates a dependency between steps.  This means the `samtools`
 402 step will not run until the `STAR` step has completed successfully.
 403
 404 ```
 405   samtools:
 406     run: bio-cwl-tools/samtools/samtools_index.cwl
 407     in:
 408       bam_sorted: STAR/alignment
 409     out: [bam_sorted_indexed]
 410 ```
 411 {: .language-yaml }
 412
 413 # featureCounts
 414
 415 ```
 416 # Count mapped reads
 417 featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
 418 ```
 419 {: .language-bash }
 420
 421 As of this writing, the `subread` package that provides
 422 `featureCounts` is not available in `bio-cwl-tools` (and if it has been
 423 added since then, let's pretend that it isn't there.)  We will go over
 424 how to write a CWL wrapper for a command line tool in lesson 4.  For
 425 now, we will leave off the final step.
 426
 427 # Workflow Outputs
 428
 429 The last thing to do is declare the workflow outputs in the `outputs` section.
 430
 431 For each output, we need to declare the type of output, and what
 432 parameter has the output value.
 433
 434 Output types are the same as input types, valid types include `File`,
 435 `Directory`, `string`, `boolean`, `int`, and `float`.
 436
 437 The `outputSource` field refers the a step output in the same way that
 438 the `in` block does, the name of the step, a slash, and the name of
 439 the output parameter.
 440
 441 For our final outputs, we want the results from fastqc and the
 442 aligned, sorted and indexed BAM file.
 443
 444 ```
 445 outputs:
 446   qc_html:
 447     type: File
 448     outputSource: fastqc/html_file
 449   bam_sorted_indexed:
 450     type: File
 451     outputSource: samtools/bam_sorted_indexed
 452 ```
 453 {: .language-yaml }
 454
 455 > ## Episode solution
 456 > * <a href="../assets/answers/ep2/main.cwl">main.cwl</a>
 457 {: .solution}