# Turning a shell script into a workflow using existing tools

In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow.

We will create a new git repository and import a library of existing
tool definitions that will help us build our workflow.
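
Create a new git repository to hold our workflow with this command:

    git init rnaseq-cwl-training-exercises

Alternatively, to start from the `arvados-vscode-cwl-template` repository, clone it into a directory of the same name instead:

    git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises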

Next, import bio-cwl-tools with this command:

    git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git

For reference, this is the shell script we will be converting:

    # https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html

    # This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it.
    # USAGE: sh rnaseq_analysis_on_input_file.sh <name of fastq file>

    # initialize a variable with an intuitive name to store the name of the input fastq file

    # grab base of filename for naming outputs
    base=`basename $fq .subset.fq`
    echo "Sample name is $base"

    # specify the number of cores to use

    # directory with genome reference FASTA and index files + name of the gene annotation file
    genome=rnaseq/reference_data
    gtf=rnaseq/reference_data/chr1-hg19_genes.gtf

    # make all of the output directories
    # The -p option means mkdir will create the whole path if it
    # does not exist and refrain from complaining if it does exist
    mkdir -p rnaseq/results/fastqc
    mkdir -p rnaseq/results/STAR
    mkdir -p rnaseq/results/counts

    # set up output filenames and locations
    fastqc_out=rnaseq/results/fastqc
    align_out=rnaseq/results/STAR/${base}_
    counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam
    counts=rnaseq/results/counts/${base}_featurecounts.txt

    echo "Processing file $fq"

    # Run FastQC and move output to the appropriate folder

    STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard

    samtools index $counts_input_bam

    featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam

## Writing the workflow

Create a new file named `main.cwl`.

Start with this header.

    label: RNAseq CWL practice workflow
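
Besides the label, a CWL workflow document also has to state the CWL version it targets and its document class. A minimal sketch of a complete header, assuming CWL v1.2 (the version number is an assumption; use whichever version your runner and tool definitions support):

```yaml
# Version of the CWL standard this document targets (v1.2 is assumed here)
cwlVersion: v1.2
# This document describes a workflow, not a single command line tool
class: Workflow
# Human-readable description of the workflow
label: RNAseq CWL practice workflow
```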

### 2. Workflow Inputs

The purpose of a workflow is to consume some input parameters, run a
series of steps, and produce output values.

For this analysis, the input parameters are the fastq file and the reference data required by STAR.

In the original shell script, the following variables are declared:

    # initialize a variable with an intuitive name to store the name of the input fastq file

    # directory with genome reference FASTA and index files + name of the gene annotation file
    genome=rnaseq/reference_data
    gtf=rnaseq/reference_data/chr1-hg19_genes.gtf

In CWL, we will declare these variables in the `inputs` section.

The inputs section lists each input parameter and its type. Valid
types include `File`, `Directory`, `string`, `boolean`, `int`, and
`float`.

In this case, the fastq file and the gene annotation file are individual files. The STAR index is a directory. We can describe these inputs in CWL like this:
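
As a sketch of the description above, the `inputs` section could be declared as follows. The parameter names `fq`, `genome`, and `gtf` are assumptions that mirror the shell variables; any names work as long as the workflow steps refer to them consistently.

```yaml
inputs:
  # the fastq file to analyze (name assumed, mirrors $fq in the script)
  fq: File
  # directory holding the STAR genome index (name assumed, mirrors $genome)
  genome: Directory
  # gene annotation file (name assumed, mirrors $gtf)
  gtf: File
```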

### 3. Workflow Steps

A workflow consists of one or more steps, which are listed in the `steps` section.

Now we need to describe the first step of the workflow. This step is to run `fastqc`.

A workflow step consists of the name of the step, the tool to `run`,
the input parameters to be passed to the tool in `in`, and the output
parameters expected from the tool in `out`.

The value of `run` references the tool file. Tip: while typing the
file name, you can get suggestions and auto-completion on a partial
name using control+space.

The `in` block lists input parameters to the tool and the workflow
parameters that will be assigned to those inputs.

The `out` block lists output parameters of the tool that are used
later in the workflow.

You need to know which input and output parameters are available for
each tool. In VS Code, click on the value of `run` and select "Go to
definition" to open the tool file. Look for the `inputs` and
`outputs` sections of the tool file to find out what parameters are
available.

    run: bio-cwl-tools/fastqc/fastqc_2.cwl
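
Putting the step name, `run`, `in`, and `out` together, the first step might look like the sketch below. The tool input name `reads_file` and the workflow input name `fq` are assumptions; check the `inputs` section of `fastqc_2.cwl` for the actual parameter name. The output `html_file` is the one the workflow outputs refer to as `fastqc/html_file`.

```yaml
steps:
  # name of the step
  fastqc:
    # tool definition to run
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      # tool input (assumed name): workflow input (assumed name)
      reads_file: fq
    # tool outputs that are used later in the workflow
    out: [html_file]
```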

### 4. Running alignment with STAR

STAR has more parameters. Sometimes we want to provide input values
to a step without making them workflow-level inputs. We can do
this with `{default: N}`, where `N` is the value to use.

    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: {default: 4}
      OutSAMtype: {default: BAM}
      OutSAMunmapped: {default: Within}
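
A fuller sketch of this step combines the fixed defaults with values taken from the workflow inputs. The tool input names `GenomeDir` and `ForwardReads` and the workflow input names `genome` and `fq` are assumptions; confirm them against the `inputs` section of `STAR-Align.cwl`. The `alignment` output is the one the next step consumes as `STAR/alignment`. The fragment is indented because it belongs under `steps:`.

```yaml
  STAR:
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      # fixed values supplied directly to the step
      RunThreadN: {default: 4}
      OutSAMtype: {default: BAM}
      OutSAMunmapped: {default: Within}
      # values taken from workflow inputs (all four names are assumptions)
      GenomeDir: genome
      ForwardReads: fq
    # aligned BAM, used by the following step
    out: [alignment]
```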

### 5. Running samtools

The third step is to generate an index for the aligned BAM.

For this step, we need to use the output of a previous step as input
to this step. We refer to the output of a step by the name of the step
(STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment`.

This creates a dependency between steps. This means the `samtools`
step will not run until the `STAR` step has completed successfully.

    samtools:
      run: bio-cwl-tools/samtools/samtools_index.cwl
      in:
        bam_sorted: STAR/alignment
      out: [bam_sorted_indexed]

As of this writing, the `subread` package that provides
`featureCounts` is not available in bio-cwl-tools (and if it has been
added since writing this, let's pretend that it isn't there). We will
dive into how to write a CWL wrapper for a command line tool in
lesson 2. For now, we will leave off the final step.

### 7. Workflow Outputs

The last thing to do is declare the workflow outputs in the `outputs` section.

For each output, we need to declare the type of the output and which
parameter supplies the output value.

Output types are the same as input types; valid types include `File`,
`Directory`, `string`, `boolean`, `int`, and `float`.

The `outputSource` field refers to a step output in the same way that
the `in` block does: the name of the step, a slash, and the name of
the output parameter.

For our final outputs, we want the results from fastqc and the
aligned, sorted and indexed BAM file.

    outputSource: fastqc/html_file

    outputSource: samtools/bam_sorted_indexed
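
Assembled into a full section, the outputs might look like the sketch below. The output names `qc_html` and `bam_sorted_indexed` are assumptions chosen for illustration; the `outputSource` values are the ones given above.

```yaml
outputs:
  # FastQC report (output name is an assumption)
  qc_html:
    type: File
    outputSource: fastqc/html_file
  # aligned, sorted and indexed BAM (output name is an assumption)
  bam_sorted_indexed:
    type: File
    outputSource: samtools/bam_sorted_indexed
```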