-# Turning a bash script into a workflow using existing tools
+# Turning a shell script into a workflow using existing tools
In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow.
-# Setting up
+## Setting up
We will create a new git repository and import a library of existing
tool definitions that will help us build our workflow.
2. Create a new git repository to hold our workflow with this command:
-## Arvados
-
```
-git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
+git init rnaseq-cwl-training-exercises
```
-## Generic
+On Arvados, use this instead:
```
-git init rnaseq-cwl-training-exercises
+git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
```
-
3. Go to File->Open Folder and select rnaseq-cwl-training-exercises
4. Go to the terminal window
git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
```
-# Writing the workflow
+## Writing the workflow
+
+### 1. File header
-1. Create a new file "main.cwl"
+Create a new file named `main.cwl`.
-2. Start with this header.
+Start with this header.
```
label: RNAseq CWL practice workflow
```
-3. Workflow Inputs
+### 2. Workflow Inputs
The purpose of a workflow is to consume some input parameters, run a
series of steps, and produce output values.
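These three parts correspond to three top-level sections of a CWL workflow document. As a rough sketch only: the empty `{}` values are placeholders to be filled in by the sections that follow, and `v1.2` is an assumed version, so use whichever version your header declares.

```
cwlVersion: v1.2   # assumed version
class: Workflow

inputs: {}    # parameters the workflow consumes
steps: {}     # tools to run, and how data flows between them
outputs: {}   # values the workflow produces
```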
For this analysis, the input parameters are the fastq file and the reference data required by STAR.
-In CWL, these are declared in the `inputs` section.
+In the original shell script, the following variables are declared:
+
+```
+# initialize a variable with an intuitive name to store the name of the input fastq file
+fq=$1
+
+# directory with genome reference FASTA and index files + name of the gene annotation file
+genome=rnaseq/reference_data
+gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
+```
+
+In CWL, we will declare these variables in the `inputs` section.
The inputs section lists each input parameter and its type. Valid
types include `File`, `Directory`, `string`, `boolean`, `int`, and
gtf: File
```
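Values for these inputs are usually supplied in a separate input parameter file when the workflow is run. The following is a minimal sketch of such a file, assuming `fq` and `gtf` are declared as `File` and `genome` as `Directory`. The file name `main-input.yaml` and the fastq path are hypothetical placeholders; the reference paths are taken from the shell script shown earlier.

```
# main-input.yaml (hypothetical file name)
fq:
  class: File
  location: path/to/sample.fastq   # placeholder, replace with your fastq file
genome:
  class: Directory
  location: rnaseq/reference_data
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

A CWL runner such as `cwltool` takes the workflow and the input file as arguments, for example `cwltool main.cwl main-input.yaml`.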
-4. Workflow Steps
+### 3. Workflow Steps
A workflow consists of one or more steps, which are declared in the `steps` section.
run: bio-cwl-tools/fastqc/fastqc_2.cwl
in:
reads_file: fq
- out: [html_file, summary_file]
+ out: [html_file]
```
-5. Running alignment with STAR
+### 4. Running alignment with STAR
STAR has more parameters. Sometimes we want to provide input values
to a step without making them workflow-level inputs. We can do
out: [alignment]
```
-6. Running samtools
+### 5. Running samtools
The third step is to generate an index for the aligned BAM.
out: [bam_sorted_indexed]
```
-7. featureCounts
+### 6. featureCounts
As of this writing, the `subread` package that provides
`featureCounts` is not available in bio-cwl-tools (and if it has been
dive into how to write a CWL wrapper for a command line tool in
lesson 2. For now, we will leave off the final step.
-8. Workflow Outputs
+### 7. Workflow Outputs
The last thing to do is declare the workflow outputs in the `outputs` section.
qc_html:
type: File
outputSource: fastqc/html_file
- qc_summary:
- type: File
- outputSource: fastqc/summary_file
bam_sorted_indexed:
type: File
outputSource: samtools/bam_sorted_indexed