# Turning a shell script into a workflow by composing existing tools

The goal of this training is to walk through the development of a
best-practices Common Workflow Language (CWL) workflow by translating
an existing bioinformatics shell script into CWL. Specific knowledge
of the biology of RNA-seq is *not* a prerequisite for these lessons.

These lessons are based on the "Introduction to RNA-seq using
high-performance computing (HPC)" lessons developed by members of the
teaching team at the Harvard Chan Bioinformatics Core (HBC). The
original training, which includes additional lectures about the
biology of RNA-seq, can be found here:

https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2

RNA-seq is the process of sequencing RNA in a biological sample. From
the sequence reads, we want to measure the relative number of RNA
molecules appearing in the sample that were produced by particular
genes. This analysis is called "differential gene expression".

The entire process looks like this:

![](RNAseqWorkflow.png)

For this training, we are only concerned with the middle analytical
steps (skipping adapter trimming):

* Quality control (FASTQC)
* Alignment to a reference genome (STAR)
* Counting reads associated with genes

## Analysis shell script

This analysis is already available as a Unix shell script, which we
will refer to in order to build the workflow.

Reasons to use CWL rather than a plain shell script include
portability, scalability, and the ability to run on platforms that are
not traditional HPC.

rnaseq_analysis_on_input_file.sh

```bash
# Based on:
# https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html

# This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
# USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>

# initialize a variable with an intuitive name to store the name of the input fastq file
fq=$1

# grab base of filename for naming outputs
base=`basename $fq .subset.fq`
echo "Sample name is $base"

# specify the number of cores to use
cores=4

# directory with genome reference FASTA and index files + name of the gene annotation file
genome=rnaseq/reference_data
gtf=rnaseq/reference_data/chr1-hg19_genes.gtf

# make all of the output directories
# The -p option means mkdir will create the whole path if it
# does not exist and refrain from complaining if it does exist
mkdir -p rnaseq/results/fastqc
mkdir -p rnaseq/results/STAR
mkdir -p rnaseq/results/counts

# set up output filenames and locations
fastqc_out=rnaseq/results/fastqc
align_out=rnaseq/results/STAR/${base}_
counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
counts=rnaseq/results/counts/${base}_featurecounts.txt

echo "Processing file $fq"

# Run FastQC and write its output to the appropriate folder
fastqc -o $fastqc_out $fq

# Run STAR to align the reads and write a coordinate-sorted BAM
STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard

# Create an index for the aligned BAM
samtools index $counts_input_bam

# Count the reads associated with each gene
featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam
```

We will create a new git repository and import a library of existing
tool definitions that will help us build our workflow.

Create a new git repository to hold our workflow with this command:

```bash
git init rnaseq-cwl-training-exercises
```

Alternatively, to start from the Arvados VS Code CWL template, clone
it into the same directory name instead:

```bash
git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
```

Next, from inside the new repository, import bio-cwl-tools with this
command:

```bash
git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
```

## Writing the workflow

### 1. File header

Create a new file named `main.cwl` and start it with this header:

```yaml
label: RNAseq CWL practice workflow
```

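
A CWL document also has to declare `cwlVersion` and `class` at the top
level. A minimal sketch of a complete header, assuming CWL v1.2 (the
version is an assumption; use whichever version your runner supports):

```yaml
# Assumption: v1.2; substitute the CWL version your runner supports.
cwlVersion: v1.2
# `class: Workflow` marks this document as a workflow rather than a single tool.
class: Workflow
label: RNAseq CWL practice workflow
```
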
### 2. Workflow Inputs

The purpose of a workflow is to consume some input parameters, run a
series of steps, and produce output values.

For this analysis, the input parameters are the fastq file and the reference data required by STAR.

In the original shell script, the following variables are declared:

```bash
# initialize a variable with an intuitive name to store the name of the input fastq file
fq=$1

# directory with genome reference FASTA and index files + name of the gene annotation file
genome=rnaseq/reference_data
gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
```

In CWL, we will declare these variables in the `inputs` section.

The `inputs` section lists each input parameter and its type. Valid
types include `File`, `Directory`, `string`, `boolean`, `int`, and
`float`.

In this case, the fastq and gene annotation file are individual files. The STAR index is a directory. We can describe these inputs in CWL like this:

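
A minimal sketch of that `inputs` section, assuming the parameter
names `fq`, `genome`, and `gtf` (borrowed from the shell script's
variable names; any other names would work as long as the steps refer
to them consistently):

```yaml
inputs:
  fq: File           # fastq file of reads (name assumed from the shell variable)
  genome: Directory  # directory holding the STAR genome index (name assumed)
  gtf: File          # gene annotation file (name assumed)
```

The short form `name: Type` is enough here because none of the inputs
needs a label, a default, or any other extra field.
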
### 3. Workflow Steps

A workflow consists of one or more steps, listed in the `steps`
section.

The first step of the workflow runs `fastqc`.

A workflow step consists of the name of the step, the tool to `run`,
the input parameters to be passed to the tool in `in`, and the output
parameters expected from the tool in `out`.

The value of `run` references the tool file. Tip: while typing the
file name, you can get suggestions and auto-completion on a partial
name using control+space.

The `in` block lists input parameters to the tool and the workflow
parameters that will be assigned to those inputs.

The `out` block lists the output parameters of the tool that are used
later in the workflow.

You need to know which input and output parameters are available for
each tool. In VS Code, click on the value of `run` and select "Go to
definition" to open the tool file. Look for the `inputs` and
`outputs` sections of the tool file to find out what parameters are
available.

```yaml
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
```

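
Put together, the whole step might look like the sketch below. The
step name `fastqc` and the output `html_file` match how the step is
referenced in the workflow outputs later in this lesson. The tool
input name `reads_file` and the workflow input `fq` are assumptions,
so check them against the `inputs` section of `fastqc_2.cwl` and your
own `inputs` section.

```yaml
steps:
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      # tool input (assumed name): workflow input (assumed name)
      reads_file: fq
    # only the outputs listed here can be referenced later in the workflow
    out: [html_file]
```
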
### 4. Running alignment with STAR

STAR has more parameters. Sometimes we want to provide input values
to a step without making them workflow-level inputs. We can do
this with `{default: N}`, where `N` is the value to pass.

```yaml
  STAR:
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: {default: 4}
      OutSAMtype: {default: BAM}
      OutSAMunmapped: {default: Within}
```

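
The defaults above cover only the constant options. To mirror the
`STAR` command in the shell script, the step also has to receive the
fastq file and the genome index, and expose the output that later
steps refer to as `STAR/alignment`. A sketch of the complete step,
assuming the tool inputs are named `GenomeDir`, `ForwardReads`, and
`SortedByCoordinate` and the workflow inputs are named `fq` and
`genome` (verify all of these in `STAR-Align.cwl` and in your own
`inputs` section):

```yaml
  STAR:
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      # constant value, corresponds to --runThreadN $cores
      RunThreadN: {default: 4}
      # assumed tool input names, wired to assumed workflow inputs
      GenomeDir: genome
      ForwardReads: fq
      # together these stand in for --outSAMtype BAM SortedByCoordinate
      OutSAMtype: {default: BAM}
      SortedByCoordinate: {default: true}  # assumed input name
      # corresponds to --outSAMunmapped Within
      OutSAMunmapped: {default: Within}
    out: [alignment]
```

Entries written as `name: workflow_input` pass a workflow-level value
through, while `name: {default: value}` fixes the value inside the
step.
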
### 5. Running samtools

The third step is to generate an index for the aligned BAM.

For this step, we need to use the output of a previous step as input
to this step. We refer to the output of a step by the name of the step
(`STAR`), a slash, and the name of the output parameter (`alignment`), e.g. `STAR/alignment`.

This creates a dependency between steps: the `samtools` step will not
run until the `STAR` step has completed successfully.

```yaml
  samtools:
    run: bio-cwl-tools/samtools/samtools_index.cwl
    in:
      bam_sorted: STAR/alignment
    out: [bam_sorted_indexed]
```

### 6. featureCounts

As of this writing, the `subread` package that provides
`featureCounts` is not available in bio-cwl-tools (and if it has been
added since writing this, let's pretend that it isn't there). We will
go over how to write a CWL wrapper for a command line tool in
lesson 3. For now, we will leave off the final step.

### 7. Workflow Outputs

The last thing to do is declare the workflow outputs in the `outputs` section.

For each output, we need to declare its type and which step output
parameter supplies its value.

Output types are the same as input types: valid types include `File`,
`Directory`, `string`, `boolean`, `int`, and `float`.

The `outputSource` field refers to a step output in the same way that
the `in` block does: the name of the step, a slash, and the name of
the output parameter.

For our final outputs, we want the results from fastqc and the
aligned, sorted and indexed BAM file.

```yaml
    outputSource: fastqc/html_file

    outputSource: samtools/bam_sorted_indexed
```

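
A sketch of a full `outputs` section built around those two sources.
The output names `qc_html` and `bam_sorted_indexed` are illustrative
assumptions (any names can be chosen), and the sketch assumes each
tool output is a single `File`; check the `outputs` sections of the
tool files if validation reports a type mismatch.

```yaml
outputs:
  qc_html:                # illustrative output name
    type: File
    outputSource: fastqc/html_file
  bam_sorted_indexed:     # illustrative output name
    type: File
    outputSource: samtools/bam_sorted_indexed
```
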