lesson1/lesson1.md

# Turning a shell script into a workflow using existing tools

In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow.

## Setting up

We will create a new git repository and import a library of existing
tool definitions that will help us build our workflow.

1. Select "Terminal->New terminal"

2. Create a new git repository to hold our workflow with this command:

```
git init rnaseq-cwl-training-exercises
```

On Arvados, use this instead:

```
git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
```

3. Go to File->Open Folder and select `rnaseq-cwl-training-exercises`

4. Go to the terminal window

5. Import bio-cwl-tools with this command:

```
git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
```

## Writing the workflow

### 1. Workflow header

Create a new file named `main.cwl` and start with this header:

```
cwlVersion: v1.2
class: Workflow
label: RNAseq CWL practice workflow
```

### 2. Workflow Inputs

The purpose of a workflow is to consume some input parameters, run a
series of steps, and produce output values.

For this analysis, the input parameters are the fastq file and the reference data required by STAR.

In the original shell script, the following variables are declared:

```
# initialize a variable with an intuitive name to store the name of the input fastq file
fq=$1

# directory with genome reference FASTA and index files + name of the gene annotation file
genome=rnaseq/reference_data
gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
```

In CWL, we will declare these variables in the `inputs` section.

The `inputs` section lists each input parameter and its type.  Valid
types include `File`, `Directory`, `string`, `boolean`, `int`, and
`float`.

In this case, the fastq file and the gene annotation file are individual files.  The STAR index is a directory.  We can describe these inputs in CWL like this:

```
inputs:
  fq: File
  genome: Directory
  gtf: File
```

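When the workflow is run, values for these inputs are usually supplied in a separate YAML input file. The following is a minimal sketch, not part of the original lesson: the file name `main-input.yaml` and the fastq path are hypothetical, and the reference paths are assumed to be the ones used in the shell script above. Adjust them to match where your data actually lives.

```yaml
# main-input.yaml (hypothetical file name)
# Each key matches a parameter declared under `inputs` in main.cwl.
fq:
  class: File
  location: rnaseq/raw_fastq/sample.fq   # hypothetical path to the input fastq file
genome:
  class: Directory
  location: rnaseq/reference_data        # directory taken from the shell script
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

The `class` of each value has to agree with the type declared for that parameter, so `fq` and `gtf` are `File` objects and `genome` is a `Directory` object.
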
### 3. Workflow Steps

A workflow consists of one or more steps, which are listed in the `steps` section.

Now we need to describe the first step of the workflow.  This step runs `fastqc`.

A workflow step consists of the name of the step, the tool to `run`,
the input parameters to be passed to the tool in `in`, and the output
parameters expected from the tool in `out`.

The value of `run` references the tool file.  Tip: while typing the
file name, you can get suggestions and auto-completion on a partial
name using control+space.

The `in` block lists the input parameters of the tool and the workflow
parameters that will be assigned to those inputs.

The `out` block lists the output parameters of the tool that are used
later in the workflow.

You need to know which input and output parameters are available for
each tool.  In VS Code, click on the value of `run` and select "Go to
definition" to open the tool file.  Look for the `inputs` and
`outputs` sections of the tool file to find out what parameters are
defined.

```
steps:
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: fq
    out: [html_file]
```

### 4. Running alignment with STAR

STAR has more parameters.  Sometimes we want to provide input values
to a step without making them workflow-level inputs.  We can do
this with `{default: N}`, where `N` is the value to pass.  The
`requirements` block in this step asks for at least 6000 MiB of RAM
via `ramMin`.

```
  STAR:
    requirements:
      ResourceRequirement:
        ramMin: 6000
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: {default: 4}
      GenomeDir: genome
      ForwardReads: fq
      OutSAMtype: {default: BAM}
      OutSAMunmapped: {default: Within}
    out: [alignment]
```

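The entries in the `in` block of the STAR step are written in a compact form. As an illustration only, the fragment below writes two of them in the longer form that CWL also accepts, where `source` names the workflow parameter that feeds the step input and `default` supplies a fixed value. It is a fragment of a step, not a complete workflow.

```yaml
# Fragment of the STAR step's `in` block, written in long form.
in:
  GenomeDir:
    source: genome   # equivalent to `GenomeDir: genome`
  RunThreadN:
    default: 4       # equivalent to `RunThreadN: {default: 4}`
```

The compact form is usually easier to read; the long form is handy when an input needs more than one field.
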
### 5. Running samtools

The third step is to generate an index for the aligned BAM.

For this step, we need to use the output of a previous step as input
to this step.  We refer to the output of a step using the name of the step
(STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment`.

This creates a dependency between steps.  This means the `samtools`
step will not run until the `STAR` step has completed successfully.

```
  samtools:
    run: bio-cwl-tools/samtools/samtools_index.cwl
    in:
      bam_sorted: STAR/alignment
    out: [bam_sorted_indexed]
```

### 6. featureCounts

As of this writing, the `subread` package that provides
`featureCounts` is not available in bio-cwl-tools (and if it has been
added since, let's pretend that it isn't there).  We will
dive into how to write a CWL wrapper for a command line tool in
lesson 2.  For now, we will leave off the final step.

### 7. Workflow Outputs

The last thing to do is declare the workflow outputs in the `outputs` section.

For each output, we need to declare the type of the output and which
parameter supplies the output value.

Output types are the same as input types; valid types include `File`,
`Directory`, `string`, `boolean`, `int`, and `float`.

The `outputSource` field refers to a step output in the same way that
the `in` block does: the name of the step, a slash, and the name of
the output parameter.

For our final outputs, we want the results from fastqc and the
aligned, sorted and indexed BAM file.

```
outputs:
  qc_html:
    type: File
    outputSource: fastqc/html_file
  bam_sorted_indexed:
    type: File
    outputSource: samtools/bam_sorted_indexed
```
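
Putting the fragments from this lesson together, `main.cwl` would look roughly like the sketch below. It only assembles the snippets shown above and adds comments. The tool paths assume that bio-cwl-tools was added as a submodule at the top of the repository, as in the setup section.

```yaml
cwlVersion: v1.2
class: Workflow
label: RNAseq CWL practice workflow

# Workflow-level input parameters and their types
inputs:
  fq: File
  genome: Directory
  gtf: File

steps:
  # Step 1: quality control on the raw reads
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: fq
    out: [html_file]

  # Step 2: align the reads with STAR
  STAR:
    requirements:
      ResourceRequirement:
        ramMin: 6000
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: {default: 4}
      GenomeDir: genome
      ForwardReads: fq
      OutSAMtype: {default: BAM}
      OutSAMunmapped: {default: Within}
    out: [alignment]

  # Step 3: index the BAM produced by STAR (runs after STAR completes)
  samtools:
    run: bio-cwl-tools/samtools/samtools_index.cwl
    in:
      bam_sorted: STAR/alignment
    out: [bam_sorted_indexed]

# Values returned by the workflow
outputs:
  qc_html:
    type: File
    outputSource: fastqc/html_file
  bam_sorted_indexed:
    type: File
    outputSource: samtools/bam_sorted_indexed
```

The `gtf` input is declared but not yet used by any step, because the featureCounts step is left for lesson 2. If `cwltool` happens to be installed, `cwltool --validate main.cwl` is one way to check the file for errors before running it.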