---
title: "Analyzing Multiple Samples"
teaching: 30
exercises: 30
questions:
- "How can you run the same workflow over multiple samples?"
objectives:
- "Modify the workflow to process multiple samples, then perform a joint analysis."
keypoints:
- "Separate the part of the workflow that you want to run multiple times into a subworkflow."
- "Use a scatter step to run the subworkflow over a list of inputs."
- "The result of a scatter is an array, which can be used in a combine step to get a single result."
---

In the previous lesson, we finished converting the original shell
script into CWL.  This lesson extends the workflow so that it can
analyze multiple samples in parallel.

# Subworkflows

In addition to running command line tools, a workflow step can also
execute another workflow.

First, copy `main.cwl` to `alignment.cwl`.

Next, open `main.cwl` for editing.  We are going to replace the `steps` and `outputs` sections.

Remove all the steps and replace them with a single `alignment` step
which invokes the `alignment.cwl` we just copied.

```
steps:
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```
{: .language-yaml }

In the `outputs` section, all the output sources are from the alignment step:

```
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```
{: .language-yaml }

We also need to add `SubworkflowFeatureRequirement` to tell the workflow
runner that we are using subworkflows:

```
requirements:
  SubworkflowFeatureRequirement: {}
```
{: .language-yaml }
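
Putting these three edits together, the outer `main.cwl` might now look
roughly like the sketch below.  The `cwlVersion` value and the `inputs`
section are assumptions carried over from the previous lesson (a single
`fq` file, a `genome` directory, and a `gtf` file); keep whatever your
own file already declares.

```
cwlVersion: v1.2          # assumed; keep the version your file already uses
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}

inputs:                   # assumed to be unchanged from the previous lesson
  fq: File
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl    # the inner workflow we copied from main.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]

outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```
{: .language-yaml }

Each name on the left-hand side under `in` must match an input
parameter declared in `alignment.cwl`, and each name listed in `out`
must match one of its output parameters.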

> ## Running the workflow
>
> Run this workflow.  You should get exactly the same results as
> before, as all we have done so far is to wrap the inner workflow with
> an outer workflow.
>
{: .challenge }

> ## Part 1 solution
> * <a href="../assets/answers/ep5/part1/main.cwl">main.cwl</a>
> * <a href="../assets/answers/ep5/part1/alignment.cwl">alignment.cwl</a>
> * <a href="../assets/answers/ep5/part1/featureCounts.cwl">featureCounts.cwl</a>
{: .solution}

# Scattering

The "wrapper" step lets us do something useful.  We can modify the
outer workflow to accept a list of files, and then invoke the inner
workflow step for every one of those files.  We will need to modify
the `inputs`, `steps`, `outputs`, and `requirements` sections.

First we change the `fq` parameter to expect a list of files:

```
inputs:
  fq: File[]
  genome: Directory
  gtf: File
```
{: .language-yaml }

Next, we add `scatter` to the alignment step.  This means we want to
run `alignment.cwl` once for each value in the list passed to the `fq`
parameter.

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```
{: .language-yaml }

Because the scatter produces multiple outputs, each output parameter
becomes a list as well:

```
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```
{: .language-yaml }

We also need to add `ScatterFeatureRequirement` to tell the workflow
runner that we are using scatter:

```
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
```
{: .language-yaml }
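
To see scatter in isolation, here is a minimal self-contained sketch
that is not part of the lesson's pipeline.  Everything in it is
hypothetical: the `messages` input, the `echo_one` step, and the inline
`echo` tool are invented for illustration.  It shows that `scatter`
names a step input (the left-hand side under `in`), that the step runs
once per list item, and that the collected output is an array.

```
cwlVersion: v1.2
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  messages: string[]          # hypothetical list input

steps:
  echo_one:
    scatter: message          # the step input to iterate over
    in:
      message: messages       # each run receives one item of the list
    out: [echoed]
    run:                      # inline tool, so the sketch needs no second file
      class: CommandLineTool
      baseCommand: echo
      inputs:
        message:
          type: string
          inputBinding:
            position: 1
      outputs:
        echoed:
          type: stdout        # capture standard output as a File

outputs:
  echoed:
    type: File[]              # one File per item in messages
    outputSource: echo_one/echoed
```
{: .language-yaml }

With three strings in `messages`, a runner would invoke `echo` three
times and return three files, in the same order as the inputs.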

> ## Part 2 solution
> * <a href="../assets/answers/ep5/part2/main.cwl">main.cwl</a>
> * <a href="../assets/answers/ep5/part2/alignment.cwl">alignment.cwl</a>
> * <a href="../assets/answers/ep5/part2/featureCounts.cwl">featureCounts.cwl</a>
{: .solution}

# Input parameter lists

The `fq` parameter needs to be a list.  You write a list in YAML by
starting each list item with a dash.  Here is an example `main-input.yaml`:

```
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```
{: .language-yaml }
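
Because `fq` is now declared as `File[]`, a single sample still has to
be written as a one-item list rather than as a bare `File`.  A minimal
sketch, reusing the first fastq file and the reference paths from the
example above:

```
fq:
  - class: File           # a list with exactly one item
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```
{: .language-yaml }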

> ## Running the workflow
>
> Run this workflow.  You will now get results for each one of the
> input fastq files.
>
{: .challenge }

# Combining results

Each instance of the alignment workflow produces its own
`featurecounts.tsv` file.  However, to be able to compare results
easily, we would like a single file with all the results.

We can modify the workflow to run `featureCounts` once at the end of
the workflow, passing it all the bam files on a single command line.

We will need to change a few things.

First, we need to modify `featureCounts.cwl` to accept either a
single bam file or a list of bam files.

```
inputs:
  gtf: File
  counts_input_bam:
   - File
   - File[]
```
{: .language-yaml }
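
For context, the sketch below shows one way a complete tool description
could bind such an input.  It is not the lesson's actual
`featureCounts.cwl`: the `cwlVersion`, the `inputBinding` layout, and
the `arguments` entry are assumptions, and any software container
settings your version already has are omitted.  When the value is a
list, each file is added to the command line as a separate argument.

```
cwlVersion: v1.2                # assumed version
class: CommandLineTool
baseCommand: featureCounts

inputs:
  gtf:
    type: File
    inputBinding:
      prefix: -a                # annotation file
      position: 1
  counts_input_bam:
    type:
      - File
      - File[]
    inputBinding:
      position: 3               # bam file(s) go last on the command line

arguments:
  - prefix: -o                  # output file name
    valueFrom: featurecounts.tsv
    position: 2

outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv   # pick up the file named after -o
```
{: .language-yaml }

Under these assumptions, two bam files would produce a command line of
the form `featureCounts -a <gtf> -o featurecounts.tsv <bam1> <bam2>`.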

Second, in `alignment.cwl` we need to remove the `featureCounts` step, as well as the `featurecounts` output parameter.

Third, in `main.cwl` we need to remove `featurecounts` from the `alignment` step
outputs, and add a new `featureCounts` step:

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```
{: .language-yaml }
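
The `requirements` block inside the `featureCounts` step applies only
to that step.  In `ResourceRequirement`, `ramMin` is expressed in
mebibytes, so `500` asks the runner to reserve at least 500 MiB of
memory.  Below is a minimal sketch of the same step with the units
spelled out; the `coresMin` line is a hypothetical addition, shown only
to illustrate another field of the same requirement.

```
steps:
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500       # minimum RAM to reserve, in MiB
        coresMin: 1       # hypothetical: minimum number of CPU cores
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed   # the File[] gathered from the scatter
      gtf: gtf
    out: [featurecounts]
```
{: .language-yaml }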

Last, we modify the `featurecounts` output parameter.  Instead of a
list of files produced by the `alignment` step, it is now a single
file produced by the new `featureCounts` step.

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```
{: .language-yaml }

> ## Running the workflow
>
> Run this workflow.  You will still have separate results from FastQC
> and STAR, but now you will only have a single
> `featurecounts.tsv` file with a column for each bam file.
>
{: .challenge }

> ## Episode solution
> * <a href="../assets/answers/ep5/part4/main.cwl">main.cwl</a>
> * <a href="../assets/answers/ep5/part4/alignment.cwl">alignment.cwl</a>
> * <a href="../assets/answers/ep5/part4/featureCounts.cwl">featureCounts.cwl</a>
{: .solution}