# Turning a shell script into a workflow using existing tools

In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow.

We will create a new git repository and import a library of existing
tool definitions that will help us build our workflow.
1. Select "Terminal->New terminal"

2. Create a new git repository to hold our workflow with one of these
   commands. To start from the Arvados VS Code CWL template:

       git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises

   Or, to start from an empty repository:

       git init rnaseq-cwl-training-exercises

3. Go to File->Open Folder and select rnaseq-cwl-training-exercises

4. Go to the terminal window

5. Import bio-cwl-tools with this command:

       git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
# Writing the workflow

1. Create a new file "main.cwl"

2. Start with this header.

       label: RNAseq CWL practice workflow
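A CWL file also has to state its `cwlVersion` and `class`. A minimal sketch of a complete header, assuming CWL v1.2 (use whichever version your workflow runner supports):

```yaml
cwlVersion: v1.2   # assumption: any version supported by your runner works
class: Workflow    # this file describes a workflow, not a single tool
label: RNAseq CWL practice workflow
```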
The purpose of a workflow is to consume some input parameters, run a
series of steps, and produce output values.

For this analysis, the input parameters are the fastq file and the reference data required by STAR.

In the original shell script, the following variables are declared:

    # initialize a variable with an intuitive name to store the name of the input fastq file

    # directory with genome reference FASTA and index files + name of the gene annotation file
    genome=rnaseq/reference_data
    gtf=rnaseq/reference_data/chr1-hg19_genes.gtf

In CWL, we will declare these variables in the `inputs` section.

The inputs section lists each input parameter and its type. Valid
types include `File`, `Directory`, `string`, `boolean`, `int`, and
`float`.

In this case, the fastq and gene annotation file are individual files, so their type is `File`. The STAR index is a directory, so its type is `Directory`.
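A minimal sketch of the corresponding `inputs` section. The parameter names `fq`, `genome`, and `gtf` are assumptions chosen here to mirror the shell variables; any names work as long as the steps refer to them consistently.

```yaml
inputs:
  fq: File           # the input fastq file (name assumed)
  genome: Directory  # the STAR index directory (mirrors `genome=`)
  gtf: File          # the gene annotation file (mirrors `gtf=`)
```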
A workflow consists of one or more steps. These are listed in the `steps` section.

Now we need to describe the first step of the workflow. This step is to run `fastqc`.

A workflow step consists of the name of the step, the tool to `run`,
the input parameters to be passed to the tool in `in`, and the output
parameters expected from the tool in `out`.

The value of `run` references the tool file. Tip: while typing the
file name, you can get suggestions and auto-completion on a partial
name using Ctrl+Space.

The `in` block lists input parameters of the tool and the workflow
parameters that will be assigned to those inputs.

The `out` block lists output parameters of the tool that are used
later in the workflow.

You need to know which input and output parameters are available for
each tool. In VS Code, click on the value of `run` and select "Go to
definition" to open the tool file. Look for the `inputs` and
`outputs` sections of the tool file to find out what parameters are
available.

    fastqc:
      run: bio-cwl-tools/fastqc/fastqc_2.cwl
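Putting these pieces together, a sketch of the complete first step. It assumes the tool's input parameter is named `reads_file` and that the fastq file is a workflow input named `fq`; check both against the tool file and your own `inputs` section. The `html_file` output is the one referenced later as `fastqc/html_file`.

```yaml
steps:
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: fq     # tool input name and workflow input name are assumptions
    out: [html_file]     # output parameter used later in the workflow
```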
5. Running alignment with STAR

STAR has more parameters. Sometimes we want to provide input values
to a step without making them workflow-level inputs. We can do
this with `{default: N}`, where `N` is the value to pass.

    STAR:
      run: bio-cwl-tools/STAR/STAR-Align.cwl
      in:
        RunThreadN: {default: 4}
        OutSAMtype: {default: BAM}
        OutSAMunmapped: {default: Within}
      out: [alignment]
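STAR also needs the reads and the genome index, which come from workflow-level inputs rather than constants. A sketch of an `in` block that mixes both kinds of value; the tool input names `GenomeDir` and `ForwardReads` and the workflow input names `genome` and `fq` are assumptions to verify against `STAR-Align.cwl` and your `inputs` section.

```yaml
# Excerpt: the `in` block of the STAR step
in:
  RunThreadN: {default: 4}           # constant set in the workflow
  GenomeDir: genome                  # assumed names: tool input <- workflow input
  ForwardReads: fq                   # assumed names: tool input <- workflow input
  OutSAMtype: {default: BAM}         # constant
  OutSAMunmapped: {default: Within}  # constant
```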
The third step is to generate an index for the aligned BAM.

For this step, we need to use the output of a previous step as input
to this step. We refer to the output of a step by the name of the step
(`STAR`), a slash, and the name of the output parameter (`alignment`), e.g. `STAR/alignment`.

This creates a dependency between steps. This means the `samtools`
step will not run until the `STAR` step has completed successfully.

    samtools:
      run: bio-cwl-tools/samtools/samtools_index.cwl
      in:
        bam_sorted: STAR/alignment
      out: [bam_sorted_indexed]
As of this writing, the `subread` package that provides
`featureCounts` is not available in bio-cwl-tools (and if it has been
added since writing this, let's pretend that it isn't there). We will
dive into how to write a CWL wrapper for a command line tool in
lesson 2. For now, we will leave off the final step.
The last thing to do is declare the workflow outputs in the `outputs` section.

For each output, we need to declare the type of the output and which
parameter supplies the output value.

Output types are the same as input types: valid types include `File`,
`Directory`, `string`, `boolean`, `int`, and `float`.

The `outputSource` field refers to a step output in the same way that
the `in` block does: the name of the step, a slash, and the name of
the output parameter.

For our final outputs, we want the results from fastqc and the
aligned, sorted and indexed BAM file.

    outputSource: fastqc/html_file

    outputSource: samtools/bam_sorted_indexed
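A sketch of a complete `outputs` section built from these two sources. The output names `qc_html` and `bam_sorted_indexed` are assumptions chosen for illustration, and both outputs are assumed to be of type `File`.

```yaml
outputs:
  qc_html:                 # hypothetical output name
    type: File
    outputSource: fastqc/html_file
  bam_sorted_indexed:      # hypothetical output name
    type: File
    outputSource: samtools/bam_sorted_indexed
```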