2 title: "Create a Workflow by Composing Tools"
6 - "What is the syntax of CWL?"
7 - "What are the key components of a workflow?"
9 - "Write a workflow based on the source shell script, making use of existing tool wrappers."
11 - "CWL documents are written using a syntax called YAML."
12 - "The key components of the workflow are: the header, the inputs, the steps, and the outputs."
17 In this lesson, we will develop an initial workflow inspired by the
18 following shell script.
20 rnaseq_analysis_on_input_file.sh
26 # https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html
29 # This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
30 # USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>
34 # initialize a variable with an intuitive name to store the name of the input fastq file
37 # grab base of filename for naming outputs
38 base=`basename $fq .subset.fq`
39 echo "Sample name is $base"
41 # specify the number of cores to use
44 # directory with genome reference FASTA and index files + name of the gene annotation file
45 genome=rnaseq/reference_data
46 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
48 # make all of the output directories
49 # The -p option means mkdir will create the whole path if it
50 # does not exist and refrain from complaining if it does exist
51 mkdir -p rnaseq/results/fastqc
52 mkdir -p rnaseq/results/STAR
53 mkdir -p rnaseq/results/counts
55 # set up output filenames and locations
56 fastqc_out=rnaseq/results/fastqc
57 align_out=rnaseq/results/STAR/${base}_
58 counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
59 counts=rnaseq/results/counts/${base}_featurecounts.txt
61 echo "Processing file $fq"
63 # Run FastQC and move output to the appropriate folder
67 STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
70 samtools index $counts_input_bam
73 featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
79 CWL documents are written using a format called "YAML". Here is a crash-course in YAML:
81 Data fields are written with the name, followed by a colon `:`, a space,
89 The value is the remaining text to the end of the line.
91 Special characters in YAML include `:`, `{`, `}` `[`, `]`, `#`, `!`
92 and `%`. If your text begins with any of these characters, you must
93 surround the string in single or double quotes.
96 fieldName: "#quoted-value"
100 You can write multi-line text by putting `|-` and writing an indented
101 block. The leading whitespace will be removed from the actual value.
111 Nested sections are indented:
120 Nested sections can _also_ be wrapped in curly brackets. In this case, fields must be comma-separated.
123 section1: {field1: value1, field2, value2}
127 When each item is on its own line starting with a dash `-`, it is a list.
136 List can _also_ be wrapped in square brackets. In this case, values must be comma-separated.
139 section2: [value1, value2]
143 Comments start with `#`.
146 # This is a comment about field3
149 field4: stuff # This is a comment about field4
153 Finally, YAML is a superset of JSON. Valid JSON is also valid YAML,
154 so you may sometimes see JSON format being used instead of YAML format
159 Create a new file "main.cwl"
161 Let's start with the header.
166 label: RNAseq CWL practice workflow
170 * cwlVersion - Every file must include this. It declares the version of CWL in use.
171 * class - This is the type of CWL document. We will see other types in future lessons.
172 * label - Optional title of your workflow.
176 The purpose of a workflow is to consume some input parameters, run a
177 series of steps, and produce output values.
179 For this analysis, the input parameters are the fastq file and the reference data required by STAR.
181 In the source shell script, the following variables are declared:
184 # initialize a variable with an intuitive name to store the name of the input fastq file
187 # directory with genome reference FASTA and index files + name of the gene annotation file
188 genome=rnaseq/reference_data
189 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
193 In CWL, we will declare these variables in the `inputs` section.
195 The inputs section lists each input parameter and its type. Valid
196 types include `File`, `Directory`, `string`, `boolean`, `int`, and
199 In this case, the fastq and gene annotation file are individual files. The STAR index is a directory. We can describe these inputs in CWL like this:
211 A workflow consists of one or more steps. This is the `steps` section.
213 Now we need to describe the first step of the workflow. In the source
214 script, the first step is to run `fastqc`.
217 # Run FastQC and move output to the appropriate folder
222 A workflow step consists of the name of the step, the tool to `run`,
223 the input parameters to be passed to the tool in `in`, and the output
224 parameters expected from the tool in `out`.
226 The value of `run` references the tool file. The tool file describes
227 how to run the tool (we will discuss how to write tool files in lesson
228 4). If we look in `bio-cwl-tools` (which you should have imported
229 when setting up a practice repository in the initial setup
230 instructions) we find `bio-cwl-tools/fastqc/fastqc_2.cwl`.
232 Next, the `in` block is mapping of input parameters to the tool and
233 the workflow parameters that will be assigned to those inputs. We
234 need to know what input parameters the tool accepts.
236 Let's open up the tool file and take a look:
238 Find the `inputs` section of `bio-cwl-tools/fastqc/fastqc_2.cwl`:
249 Input bam,sam,bam_mapped,sam_mapped or fastq file
253 Now we know we need to provide an input parameter called `reads_file`.
255 Next, the `out` section is a list of output parameters from the tool
256 that will be used later in the workflow, or as workflow output. We
257 need to know what output parameters the tool produces. Find the
258 `outputs` section of `bio-cwl-tools/fastqc/fastqc_2.cwl`:
279 return "*/summary.txt";
284 Now we know to expect an output parameter called `html_file`.
286 Putting this all together, the `fastq` step consists of a `run`, `in`
287 and `out` subsections, and looks like this:
292 run: bio-cwl-tools/fastqc/fastqc_2.cwl
299 # Running alignment with STAR
301 The next step is to run the STAR aligner.
305 STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within
309 We will go through the same process as the first section. We find
310 there is `bio-cwl-tools/STAR/STAR-Align.cwl`. We will open the file
311 and look at the `inputs` section to determine what input parameters
312 correspond to the command line parmeters from our source script.
316 > Look at `STAR-Align.cwl` and identify the input parameters that
317 > correspond to the command line arguments used in the source script:
318 > `--runThreadN`, `--genomeDir`, `--outSAMtype`, and
319 > `--outSAMunmapped`. Also identify the name of the output parameter.
323 > > * `--runThreadN` → `RunThreadN`
324 > > * `--genomeDir` → `GenomeDir`
325 > > * `--readFilesIn` → `ForwardReads`
326 > > * `--outSAMtype` → `OutSAMtype`, `SortedByCoordinate`
327 > > * `--outSAMunmapped` → `OutSAMunmapped`
329 > > output parameter name: `alignment`
333 > ## Command line flags
335 > Command line flags generally appear appear in either the `arguments`
336 > field, or the `prefix` field of the `inputBinding` section of an
337 > input parameter declaration. For example, this section of
338 > `STAR-Align.cwl` tells us that the `GenomeDir` input parameter
339 > corresponds to the `--genomeDir` command line parameter.
345 > prefix: "--genomeDir"
347 > {: .language-yaml }
352 > Sometimes we want to provide input values to a step without making
353 > them as workflow-level inputs. We can do this with `{default: N}`.
358 > RunThreadN: {default: 4}
360 > {: .language-yaml }
365 > Using the input and output parameters identified in the previous
366 > exercise, write the `run`, `in` and `out` sections of the STAR step.
372 > > run: bio-cwl-tools/STAR/STAR-Align.cwl
374 > > RunThreadN: {default: 4}
375 > > GenomeDir: genome
377 > > OutSAMtype: {default: BAM}
378 > > SortedByCoordinate: {default: true}
379 > > OutSAMunmapped: {default: Within}
382 > > {: .language-yaml }
389 The third step is to generate an index for the aligned BAM.
393 samtools index $counts_input_bam
397 For this step, we need to use the output of a previous step as input
398 to this step. We refer the output of a step by with name of the step
399 (STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment`
401 This creates a dependency between steps. This means the `samtools`
402 step will not run until the `STAR` step has completed successfully.
406 run: bio-cwl-tools/samtools/samtools_index.cwl
408 bam_sorted: STAR/alignment
409 out: [bam_sorted_indexed]
417 featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
421 As of this writing, the `subread` package that provides
422 `featureCounts` is not available in `bio-cwl-tools` (and if it has been
423 added since then, let's pretend that it isn't there.) We will go over
424 how to write a CWL wrapper for a command line tool in lesson 4. For
425 now, we will leave off the final step.
429 The last thing to do is declare the workflow outputs in the `outputs` section.
431 For each output, we need to declare the type of output, and what
432 parameter has the output value.
434 Output types are the same as input types, valid types include `File`,
435 `Directory`, `string`, `boolean`, `int`, and `float`.
437 The `outputSource` field refers the a step output in the same way that
438 the `in` block does, the name of the step, a slash, and the name of
439 the output parameter.
441 For our final outputs, we want the results from fastqc and the
442 aligned, sorted and indexed BAM file.
448 outputSource: fastqc/html_file
451 outputSource: samtools/bam_sorted_indexed
455 > ## Episode solution
456 > * <a href="../assets/answers/ep2/main.cwl">main.cwl</a>