lesson1/lesson1.md

# Turning a shell script into a workflow using existing tools

In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow.

## Setting up

We will create a new git repository and import a library of existing
tool definitions that will help us build our workflow.

Create a new git repository to hold our workflow with this command:

```
git init rnaseq-cwl-training-exercises
```

On Arvados, use this instead:

```
git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises
```

Next, change into the new repository and import bio-cwl-tools with these commands:

```
cd rnaseq-cwl-training-exercises
git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git
```

## Writing the workflow

### 1. File header

Create a new file named `main.cwl` and start with this header:

```
cwlVersion: v1.2
class: Workflow
label: RNAseq CWL practice workflow
```

### 2. Workflow Inputs

The purpose of a workflow is to consume some input parameters, run a
series of steps, and produce output values.

For this analysis, the input parameters are the fastq file and the reference data required by STAR.

In the original shell script, the following variables are declared:

```
# initialize a variable with an intuitive name to store the name of the input fastq file
fq=$1

# directory with genome reference FASTA and index files + name of the gene annotation file
genome=rnaseq/reference_data
gtf=rnaseq/reference_data/chr1-hg19_genes.gtf
```

In CWL, we will declare these variables in the `inputs` section.

The inputs section lists each input parameter and its type.  Valid
types include `File`, `Directory`, `string`, `boolean`, `int`, and
`float`.

In this case, the fastq and gene annotation file are individual files.  The STAR index is a directory.  We can describe these inputs in CWL like this:

```
inputs:
  fq: File
  genome: Directory
  gtf: File
```
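
The `name: Type` entries above are a shorthand.  As a minimal sketch, the same three inputs can be written in an expanded form, which leaves room for the optional `label` and `doc` fields.  The label and doc text below is illustrative wording, not something taken from the original script.

```yaml
inputs:
  fq:
    type: File
    label: Input fastq file                           # illustrative label text
  genome:
    type: Directory
    doc: Directory containing the STAR genome index   # illustrative doc text
  gtf:
    type: File
    doc: Gene annotation file in GTF format           # illustrative doc text
```

Both forms declare the same parameters.  A workflow has a single `inputs` section, so use one form or the other for each parameter, not both.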

### 3. Workflow Steps

A workflow consists of one or more steps, which are listed in the `steps` section.

Now we need to describe the first step of the workflow.  This step is to run `fastqc`.

A workflow step consists of the name of the step, the tool to `run`,
the input parameters to be passed to the tool in `in`, and the output
parameters expected from the tool in `out`.

The value of `run` references the tool file.  Tip: while typing the
file name, you can get suggestions and auto-completion on a partial
name using control+space.

The `in` block lists the input parameters of the tool and the workflow
parameters that will be assigned to those inputs.

The `out` block lists the output parameters of the tool that are used
later in the workflow.

You need to know which input and output parameters are available for
each tool.  In VS Code, click on the value of `run` and select "Go to
definition" to open the tool file.  Look for the `inputs` and
`outputs` sections of the tool file to find out what parameters are
defined.

```
steps:
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file: fq
    out: [html_file]
```
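
Each entry under `in` is a shorthand.  As a sketch, `reads_file: fq` can be spelled out with an explicit `source` field, which makes it easier to see that the tool input `reads_file` takes its value from the workflow input `fq`:

```yaml
steps:
  fastqc:
    run: bio-cwl-tools/fastqc/fastqc_2.cwl
    in:
      reads_file:
        source: fq        # workflow input that supplies the value
    out: [html_file]      # tool output made available to the rest of the workflow
```

The short and expanded forms mean the same thing; the expanded form is mostly useful when an input needs extra fields besides `source`.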

### 4. Running alignment with STAR

STAR has more parameters.  Sometimes we want to provide input values
to a step without making them workflow-level inputs.  We can do
this with `{default: N}`, where `N` is the value to pass.  The
`ResourceRequirement` asks for at least 6000 MiB of RAM for this step.

Add this step under `steps`:

```
  STAR:
    requirements:
      ResourceRequirement:
        ramMin: 6000
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: {default: 4}
      GenomeDir: genome
      ForwardReads: fq
      OutSAMtype: {default: BAM}
      OutSAMunmapped: {default: Within}
    out: [alignment]
```
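
For comparison, here is a sketch of the alternative that `{default: N}` avoids: promoting the thread count to a workflow-level input that has its own default.  The input name `threads` is hypothetical and is not part of this lesson's workflow; everything else is copied from the step above.

```yaml
inputs:
  fq: File
  genome: Directory
  gtf: File
  threads:            # hypothetical input name, not part of the lesson's workflow
    type: int
    default: 4        # used when the input object does not set `threads`

steps:
  STAR:
    requirements:
      ResourceRequirement:
        ramMin: 6000
    run: bio-cwl-tools/STAR/STAR-Align.cwl
    in:
      RunThreadN: threads          # now taken from the workflow input
      GenomeDir: genome
      ForwardReads: fq
      OutSAMtype: {default: BAM}
      OutSAMunmapped: {default: Within}
    out: [alignment]
```

The trade-off is that a workflow-level input can be changed per run, while a step-level default keeps the workflow's interface smaller.  The rest of this lesson keeps the step-level `{default: 4}`.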

### 5. Running samtools

The third step is to generate an index for the aligned BAM.

For this step, we need to use the output of a previous step as input
to this step.  We refer to the output of a step using the name of the step
(`STAR`), a slash, and the name of the output parameter (`alignment`), e.g. `STAR/alignment`.

This creates a dependency between steps: the `samtools`
step will not run until the `STAR` step has completed successfully.

```
  samtools:
    run: bio-cwl-tools/samtools/samtools_index.cwl
    in:
      bam_sorted: STAR/alignment
    out: [bam_sorted_indexed]
```

### 6. featureCounts

As of this writing, the `subread` package that provides
`featureCounts` is not available in bio-cwl-tools (and if it has been
added since, let's pretend that it isn't there).  We will
dive into how to write a CWL wrapper for a command line tool in
lesson 2.  For now, we will leave off the final step.

### 7. Workflow Outputs

The last thing to do is declare the workflow outputs in the `outputs` section.

For each output, we need to declare the type of the output and which
parameter holds the output value.

Output types are the same as input types: valid types include `File`,
`Directory`, `string`, `boolean`, `int`, and `float`.

The `outputSource` field refers to a step output in the same way that
the `in` block does: the name of the step, a slash, and the name of
the output parameter.

For our final outputs, we want the results from fastqc and the
aligned, sorted and indexed BAM file.

```
outputs:
  qc_html:
    type: File
    outputSource: fastqc/html_file
  bam_sorted_indexed:
    type: File
    outputSource: samtools/bam_sorted_indexed
```
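
To run the workflow, a CWL runner needs an input object that gives a value for each entry in `inputs`.  The following is a minimal sketch of such a file.  The file name `main-input.yaml` and the `fq` and `genome` locations are placeholders (assumptions), while the `gtf` path is the one used in the original shell script.

```yaml
# main-input.yaml (hypothetical file name)
fq:
  class: File
  location: path/to/sample.fastq          # placeholder: your input fastq file
genome:
  class: Directory
  location: path/to/star-index            # placeholder: directory holding the STAR index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

Although no step uses `gtf` yet, it is declared as a required workflow input, so a value still has to be supplied.  With a runner such as cwltool, the workflow could then be started with something like `cwltool main.cwl main-input.yaml`.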