# Turning a bash script into a workflow using existing tools

In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow.

# Setting up

We will create a new git repository and import a library of existing
tool definitions that will help us build our workflow.

1. Select "Terminal->New terminal"

2. Create a new git repository to hold our workflow with one of these commands, depending on your environment:

## Arvados

```
git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
```

## Generic

```
git init rnaseq-cwl-training-exercises
```

3. Go to "File->Open Folder" and select `rnaseq-cwl-training-exercises`

4. Go to the terminal window

5. Import bio-cwl-tools with this command:

```
git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
```

# Writing the workflow

1. Create a new file "main.cwl"

2. Start with this header.

```
cwlVersion: v1.2
class: Workflow
label: RNAseq CWL practice workflow
```

3. Workflow Inputs

The purpose of a workflow is to consume some input parameters, run a
series of steps, and produce output values.

For this analysis, the input parameters are the fastq file and the reference data required by STAR.

In the original bash script, the following variables are declared:

```
# initialize a variable with an intuitive name to store the name of the input fastq file
fq=$1

# directory with genome reference FASTA and index files + name of the gene annotation file
genome=rnaseq/reference_data
gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
```

In CWL, we will declare these variables in the `inputs` section.

The `inputs` section lists each input parameter and its type.  Valid
types include `File`, `Directory`, `string`, `boolean`, `int`, and
`float`.

In this case, the fastq file and the gene annotation file are individual files.  The STAR index is a directory.  We can describe these inputs in CWL like this:

```
inputs:
  fq: File
  genome: Directory
  gtf: File
```
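
The short form maps each parameter name directly to its type.  CWL also
accepts an expanded form in which each parameter is a mapping with a
`type` field, which leaves room for optional fields such as `label`.
The following is a minimal sketch of the same three inputs in expanded
form.  It is an alternative way to write the `inputs` section, not an
addition to it, and the `label` text is illustrative wording rather
than something taken from the original script.

```
inputs:
  fq:
    type: File
    label: Input fastq file               # illustrative wording
  genome:
    type: Directory
    label: Directory with the STAR index  # illustrative wording
  gtf:
    type: File
    label: Gene annotation file           # illustrative wording
```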

4. Workflow Steps

A workflow consists of one or more steps, which are listed in the `steps` section.

Now we need to describe the first step of the workflow.  This step is to run `fastqc`.

A workflow step consists of the name of the step, the tool to `run`,
the input parameters to be passed to the tool in `in`, and the output
parameters expected from the tool in `out`.

The value of `run` references the tool file.  Tip: while typing the
file name, you can get suggestions and auto-completion on a partial
name using Ctrl+Space.

The `in` block lists the input parameters of the tool and the workflow
parameters that will be assigned to those inputs.

The `out` block lists the output parameters of the tool that are used
later in the workflow.

You need to know which input and output parameters are available for
each tool.  In VS Code, click on the value of `run` and select "Go to
definition" to open the tool file.  Look for the `inputs` and
`outputs` sections of the tool file to find out what parameters are
defined.

```
steps:
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: fq
    out: [html_file, summary_file]
```
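
Writing `reads_file: fq` is shorthand.  As a sketch, the same step can
be written with the expanded form of the `in` entry, where the standard
CWL `source` field names the workflow parameter that feeds the tool
input.  This is an alternative spelling of the step above, not a second
step, and it introduces no new names.

```
steps:
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file:
        source: fq    # same meaning as the shorthand "reads_file: fq"
    out: [html_file, summary_file]
```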

5. Running alignment with STAR

STAR has more parameters.  Sometimes we want to provide input values
to a step without making them workflow-level inputs.  We can do
this with `{default: N}`, where `N` is the value to pass.

```
  STAR:
    requirements:
      ResourceRequirement:
        ramMin: 6000
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: {default: 4}
      GenomeDir: genome
      ForwardReads: fq
      OutSAMtype: {default: BAM}
      OutSAMunmapped: {default: Within}
    out: [alignment]
```
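
`{default: 4}` is YAML flow-mapping syntax for a step input that has a
`default` field and no `source`.  If a value should be adjustable at
run time, one option is to promote it to a workflow-level input that
carries its own default.  The fragment below is a sketch of that
alternative, showing only the lines that would change.  The input name
`threads` is hypothetical, and the sketch assumes `RunThreadN` accepts
an integer, as the default of 4 suggests.

```
inputs:
  threads:              # hypothetical name, not part of the lesson's workflow
    type: int
    default: 4          # used when the caller does not supply a value

steps:
  STAR:
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: threads    # value now comes from the workflow input
      OutSAMtype:
        default: BAM         # block form, same meaning as {default: BAM}
      # remaining inputs, requirements, and "out" stay as in the step above
```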

6. Running samtools

The third step is to generate an index for the aligned BAM.

For this step, we need to use the output of a previous step as input
to this step.  We refer to the output of a step by the name of the step
(STAR), a slash, and the name of the output parameter (alignment), e.g. `STAR/alignment`.

This creates a dependency between steps.  This means the `samtools`
step will not run until the `STAR` step has completed successfully.

```
  samtools:
    run: bio-cwl-tools/samtools/samtools_index.cwl
    in:
      bam_sorted: STAR/alignment
    out: [bam_sorted_indexed]
```

7. featureCounts

As of this writing, the `subread` package that provides
`featureCounts` is not available in bio-cwl-tools (and if it has been
added since writing this, let's pretend that it isn't there).  We will
dive into how to write a CWL wrapper for a command line tool in
lesson 2.  For now, we will leave off the final step.

8. Workflow Outputs

The last thing to do is declare the workflow outputs in the `outputs` section.

For each output, we need to declare the type of the output and which
parameter supplies the output value.

Output types are the same as input types: valid types include `File`,
`Directory`, `string`, `boolean`, `int`, and `float`.

The `outputSource` field refers to a step output in the same way that
the `in` block does: the name of the step, a slash, and the name of
the output parameter.

For our final outputs, we want the results from fastqc and the
aligned, sorted and indexed BAM file.

```
outputs:
  qc_html:
    type: File
    outputSource: fastqc/html_file
  qc_summary:
    type: File
    outputSource: fastqc/summary_file
  bam_sorted_indexed:
    type: File
    outputSource: samtools/bam_sorted_indexed
```
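
To run the workflow, a CWL runner needs a value for each declared
input.  One common way to supply them is a YAML input file whose keys
match the names in the `inputs` section; such a file is typically
passed after the workflow file, for example
`cwl-runner main.cwl main-input.yaml`.  The following is a minimal
sketch, assuming the reference paths from the original bash script.
The file name `main-input.yaml` is hypothetical, and the fastq path is
a placeholder because the script receives it as `$1`.

```
# main-input.yaml (hypothetical file name)
fq:
  class: File
  location: path/to/sample.fq        # placeholder, the script takes this as $1
genome:
  class: Directory
  location: rnaseq/reference_data    # from the script's "genome" variable
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```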