_episodes/02-workflow.md

   1 ---
   2 title: "Turning a shell script into a workflow by composing existing tools"
   3 teaching: 0
   4 exercises: 0
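   5 questions:
   6 - "How do I turn a shell script into a CWL workflow by composing existing tools?"
   7 objectives:
   8 - "Write a workflow that declares inputs, steps, and outputs using tool definitions from bio-cwl-tools."
   9 keypoints:
  10 - "A workflow declares its inputs, a series of steps, and its outputs; steps are connected by referring to the output of an earlier step."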
  11 ---
  12
  13 # Setting up
  14
  15 We will create a new git repository and import a library of existing
  16 tool definitions that will help us build our workflow.
  17
  18 Create a new git repository to hold our workflow with this command:
  19
  20 ```
  21 git init rnaseq-cwl-training-exercises
  22 ```
  23
  24 On Arvados, clone the template repository instead:
  25
  26 ```
  27 git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
  28 ```
  29
  30 Next, from inside the `rnaseq-cwl-training-exercises` directory, import bio-cwl-tools as a git submodule:
  31
  32 ```
  33 git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
  34 ```
  35
  36 # Writing the workflow
  37
  38 ## 1. File header
  39
  40 Create a new file named "main.cwl".
  41
  42 Start with this header:
  43
  44
  45 ```
  46 cwlVersion: v1.2
  47 class: Workflow
  48 label: RNAseq CWL practice workflow
  49 ```
  50
  51 ## 2. Workflow Inputs
  52
  53 The purpose of a workflow is to consume some input parameters, run a
  54 series of steps, and produce output values.
  55
  56 For this analysis, the input parameters are the fastq file and the reference data required by STAR.
  57
  58 In the original shell script, the following variables are declared:
  59
  60 ```
  61 # initialize a variable with an intuitive name to store the name of the input fastq file
  62 fq=$1
  63
  64 # directory with genome reference FASTA and index files + name of the gene annotation file
  65 genome=rnaseq/reference_data
  66 gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
  67 ```
  68
  69 In CWL, we will declare these variables in the `inputs` section.
  70
  71 The inputs section lists each input parameter and its type.  Valid
  72 types include `File`, `Directory`, `string`, `boolean`, `int`, and
  73 `float`.
  74
  75 In this case, the fastq file and the gene annotation file are individual files.  The STAR index is a directory.  We can describe these inputs in CWL like this:
  76
  77 ```
  78 inputs:
  79   fq: File
  80   genome: Directory
  81   gtf: File
  82 ```
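
When the workflow is run, each declared input needs a value.  As a rough sketch, the values could be supplied in a separate YAML input file like the one below.  The file name `main-input.yaml` and the fastq path are assumptions made for illustration; the reference paths are taken from the shell script above.

```
# main-input.yaml (hypothetical file name)
fq:
  class: File
  location: rnaseq/raw_fastq/sample.fq   # hypothetical path; substitute your own fastq file
genome:
  class: Directory
  location: rnaseq/reference_data
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

With a CWL runner such as cwltool, this file would be passed after the workflow, e.g. `cwltool main.cwl main-input.yaml`.  Relative paths are resolved relative to the input file.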
  83
  84 ## 3. Workflow Steps
  85
  86 A workflow consists of one or more steps, which are listed in the `steps` section.
  87
  88 The first step of our workflow runs `fastqc`.
  89
  90 A workflow step consists of the name of the step, the tool to `run`,
  91 the input parameters to be passed to the tool in `in`, and the output
  92 parameters expected from the tool in `out`.
  93
  94 The value of `run` references the tool file.  Tip: in vscode, while
  95 typing the file name, you can get suggestions and auto-completion on
  96 a partial name by pressing Ctrl+Space.
  97
  98 The `in` block lists input parameters to the tool and the workflow
  99 parameters that will be assigned to those inputs.
 100
 101 The `out` block lists the output parameters of the tool that are used
 102 later in the workflow.
 103
 104 You need to know which input and output parameters are available for
 105 each tool.  In vscode, click on the value of `run` and select "Go to
 106 definition" to open the tool file.  Look for the `inputs` and
 107 `outputs` sections of the tool file to find out what parameters are
 108 defined.
 109
 110 ```
 111 steps:
 112   fastqc:
 113     run: bio-cwl-tools/fastqc/fastqc_2.cwl
 114     in:
 115       reads_file: fq
 116     out: [html_file]
 117 ```
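
The entry `reads_file: fq` is a short form.  For comparison, here is a sketch of the same step with the `in` entry written out in full, where `source` names the workflow input that feeds the tool input.  Both forms should behave the same; the short form is simply less to type.

```
steps:
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file:
        source: fq   # workflow input assigned to the tool's reads_file parameter
    out: [html_file]
```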
 118
 119 ## 4. Running alignment with STAR
 120
 121 STAR has more parameters.  Sometimes we want to provide input values
 122 to a step without making them workflow-level inputs.  We can do this
 123 with `{default: N}`, where `N` is the value passed to the tool.
 124
 125
 126 ```
 127   STAR:
 128     requirements:
 129       ResourceRequirement:
 130         ramMin: 6000
 131     run: bio-cwl-tools/STAR/STAR-Align.cwl
 132     in:
 133       RunThreadN: {default: 4}
 134       GenomeDir: genome
 135       ForwardReads: fq
 136       OutSAMtype: {default: BAM}
 137       OutSAMunmapped: {default: Within}
 138     out: [alignment]
 139 ```
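
For contrast, the fragment below sketches the alternative that `{default: 4}` avoids: exposing the thread count as a workflow-level input.  The input name `threads` is hypothetical and not part of this lesson's workflow, and the sketch assumes that `RunThreadN` accepts an `int`.  This is a fragment rather than a complete file, so the `genome` and `fq` inputs are omitted.

```
inputs:
  threads:               # hypothetical workflow-level input
    type: int
    default: 4           # used when the caller does not supply a value

steps:
  STAR:
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: threads   # value now comes from the workflow input
      GenomeDir: genome
      ForwardReads: fq
    out: [alignment]
```

Hard-coding the value in the step keeps the workflow's interface small; promoting it to an input makes sense only if users are expected to change it.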
 140
 141 ## 5. Running samtools
 142
 143 The third step is to generate an index for the aligned BAM.
 144
 145 For this step, we need to use the output of a previous step as input
 146 to this step.  We refer to the output of a step using the name of the step
 147 (`STAR`), a slash, and the name of the output parameter (`alignment`), e.g. `STAR/alignment`.
 148
 149 This creates a dependency between steps.  This means the `samtools`
 150 step will not run until the `STAR` step has completed successfully.
 151
 152 ```
 153   samtools:
 154     run: bio-cwl-tools/samtools/samtools_index.cwl
 155     in:
 156       bam_sorted: STAR/alignment
 157     out: [bam_sorted_indexed]
 158 ```
 159
 160 ## 6. featureCounts
 161
 162 As of this writing, the `subread` package that provides
 163 `featureCounts` is not available in bio-cwl-tools (and if it has been
 164 added since writing this, let's pretend that it isn't there).  We will
 165 go over how to write a CWL wrapper for a command line tool in
 166 lesson 3.  For now, we will leave off the final step, so the `gtf` input is not used yet.
 167
 168 ## 7. Workflow Outputs
 169
 170 The last thing to do is declare the workflow outputs in the `outputs` section.
 171
 172 For each output, we need to declare the type of the output and which
 173 step parameter provides the output value.
 174
 175 Output types are the same as input types: valid types include `File`,
 176 `Directory`, `string`, `boolean`, `int`, and `float`.
 177
 178 The `outputSource` field refers to a step output in the same way that
 179 the `in` block does: the name of the step, a slash, and the name of
 180 the output parameter.
 181
 182 For our final outputs, we want the results from fastqc and the
 183 aligned, sorted and indexed BAM file.
 184
 185 ```
 186 outputs:
 187   qc_html:
 188     type: File
 189     outputSource: fastqc/html_file
 190   bam_sorted_indexed:
 191     type: File
 192     outputSource: samtools/bam_sorted_indexed
 193 ```
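
Putting the pieces from sections 1 to 7 together, "main.cwl" should look roughly like the following.  This is only the fragments above assembled in order; the comments were added here for orientation.

```
cwlVersion: v1.2
class: Workflow
label: RNAseq CWL practice workflow

inputs:
  fq: File
  genome: Directory
  gtf: File          # not used by the three steps below; needed once featureCounts is added

steps:
  # Quality control on the raw reads
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: fq
    out: [html_file]

  # Alignment against the STAR index
  STAR:
    requirements:
      ResourceRequirement:
        ramMin: 6000
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: {default: 4}
      GenomeDir: genome
      ForwardReads: fq
      OutSAMtype: {default: BAM}
      OutSAMunmapped: {default: Within}
    out: [alignment]

  # Index the BAM produced by STAR; runs only after STAR succeeds
  samtools:
    run: bio-cwl-tools/samtools/samtools_index.cwl
    in:
      bam_sorted: STAR/alignment
    out: [bam_sorted_indexed]

outputs:
  qc_html:
    type: File
    outputSource: fastqc/html_file
  bam_sorted_indexed:
    type: File
    outputSource: samtools/bam_sorted_indexed
```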