_episodes/05-scatter.md

---
title: "Analyzing Multiple Samples"
teaching: 30
exercises: 30
questions:
- "How can you run the same workflow over multiple samples?"
objectives:
- "Modify the workflow to process multiple samples, then perform a joint analysis."
keypoints:
- "Separate the part of the workflow that you want to run multiple times into a subworkflow."
- "Use a scatter step to run the subworkflow over a list of inputs."
- "The result of a scatter is an array, which can be used in a combine step to get a single result."
---

In the previous lesson, we finished converting the original shell
script into CWL.  This lesson expands the scope by showing how to
change the workflow so that it can analyze multiple samples in
parallel.

# Subworkflows

In addition to running command line tools, a workflow step can also
execute another workflow.

First, copy `main.cwl` to `alignment.cwl`.

Next, open `main.cwl` for editing.  We are going to replace the `steps` and `outputs` sections.

Remove all the steps and replace them with a single `alignment` step
which invokes the `alignment.cwl` we just copied.

```
steps:
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```
{: .language-yaml }
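
In the `in` block, the name on the left is an input parameter of
`alignment.cwl`, and the value on the right is its source in the outer
workflow.  Here the two happen to share the same names.  As a minimal
sketch, assuming a hypothetical outer workflow whose fastq input was
named `sample_fq` instead, the mapping would read:

```
steps:
  alignment:
    run: alignment.cwl
    in:
      # left: input of alignment.cwl; right: source in the outer workflow
      fq: sample_fq      # `sample_fq` is a hypothetical name, not used in this lesson
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```
{: .language-yaml }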

In the `outputs` section, all the output sources are from the alignment step:

```
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```
{: .language-yaml }

We also need to add `SubworkflowFeatureRequirement` to tell the
workflow runner that we are using subworkflows:

```
requirements:
  SubworkflowFeatureRequirement: {}
```
{: .language-yaml }

> ## Running the workflow
>
> Run this workflow.  You should get exactly the same results as
> before, as all we have done so far is to wrap the inner workflow with
> an outer workflow.
>
{: .challenge }

# Scattering

The "wrapper" step lets us do something useful.  We can modify the
outer workflow to accept a list of files, and then invoke the inner
workflow step for every one of those files.  We will need to modify
the `inputs`, `steps`, `outputs`, and `requirements` sections.

First we change the `fq` parameter to expect a list of files:

```
inputs:
  fq: File[]
  genome: Directory
  gtf: File
```
{: .language-yaml }

Next, we add `scatter` to the alignment step.  This means we want to
run `alignment.cwl` once for each value in the list in the `fq`
parameter.

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```
{: .language-yaml }

Because the scatter produces multiple outputs, each output parameter
becomes a list as well:

```
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```
{: .language-yaml }

We also need to add `ScatterFeatureRequirement` to tell the workflow
runner that we are using scatter:

```
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
```
{: .language-yaml }
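
Putting the four modified sections together, the outer workflow might
now look like the sketch below.  The `cwlVersion` and `class` header
lines are assumptions; keep whatever header your `main.cwl` already
has.

```
cwlVersion: v1.2   # assumption: use the version your main.cwl already declares
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}

inputs:
  fq: File[]             # a list of fastq files
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl
    scatter: fq          # run alignment.cwl once per item in fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]

outputs:
  qc_html:
    type: File[]         # one file per scattered run
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```
{: .language-yaml }

Note that `alignment.cwl` itself does not change here: inside the
subworkflow, `fq` is still a single `File`.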

# Input parameter lists

The `fq` parameter needs to be a list.  You write a list in YAML by
starting each list item with a dash.  Example `main-input.yaml`:

```
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```
{: .language-yaml }
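
Because `fq` is now declared as `File[]`, even a single sample has to
be written as a one-item list, with the dash.  A minimal sketch of
such an input file, which can also serve as a quick test run (the file
name `single-input.yaml` is hypothetical):

```
# single-input.yaml (hypothetical file name)
fq:
  - class: File          # the dash makes this a list with one item
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```
{: .language-yaml }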

> ## Running the workflow
>
> Run this workflow.  You will now get results for each one of the
> input fastq files.
>
{: .challenge }

# Combining results

Each instance of the alignment workflow produces its own
`featurecounts.tsv` file.  However, to be able to compare results
easily, we would like a single file with all the results.

We can modify the workflow to run `featureCounts` once at the end of
the workflow, with all the bam files listed on the command line.

We will need to change a few things.

First, we need to modify `featureCounts.cwl` to accept either a
single bam file or a list of bam files.

```
inputs:
  gtf: File
  counts_input_bam:
    - File
    - File[]
```
{: .language-yaml }
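
How the list reaches the command line depends on how
`featureCounts.cwl` builds its arguments, which this section does not
show.  As a hedged sketch, if the input were bound with an
`inputBinding` (an assumption; your tool file may reference the input
in `arguments` instead), each file in the list would be added to the
command line as a separate argument:

```
inputs:
  gtf: File
  counts_input_bam:
    type:                # the long form is needed once other fields are added
      - File
      - File[]
    inputBinding:
      position: 1        # assumption: place the bam files after the other arguments
```
{: .language-yaml }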

Second, in `alignment.cwl` we need to remove the `featureCounts` step,
as well as the `featurecounts` output parameter.

Third, in `main.cwl` we need to remove `featurecounts` from the `alignment` step
outputs, and add a new step:

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```
{: .language-yaml }

Last, we modify the `featurecounts` output parameter.  Instead of a
list of files produced by the `alignment` step, it is now a single
file produced by the new `featureCounts` step.

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```
{: .language-yaml }

> ## Running the workflow
>
> Run this workflow.  You will still have separate results from fastq
> and STAR, but now you will only have a single
> `featurecounts.tsv` file with a column for each bam file.
>
{: .challenge }