_episodes/05-scatter.md

   1 ---
   2 title: "Analyzing Multiple Samples"
   3 teaching: 20
   4 exercises: 0
   5 questions:
   6 - "How can you run the same workflow over multiple samples?"
   7 objectives:
   8 - "Modify the workflow to process multiple samples, then perform a joint analysis."
   9 keypoints:
  10 - "Separate the part of the workflow that you want to run multiple times into a subworkflow."
  11 - "Use a scatter step to run the subworkflow over a list of inputs."
  12 - "The result of a scatter is an array, which can be used in a combine step to get a single result."
  13 ---
  14
  15 In the previous lesson, we completed converting the function of the
  16 original source shell script into CWL.  This lesson expands the scope
  17 by demonstrating what changes to make to the workflow to be able to
  18 analyze multiple samples in parallel.
  19
  20 # Subworkflows
  21
  22 In addition to running command line tools, a workflow step can also
  23 execute another workflow.
  24
  25 First, copy `main.cwl` to `alignment.cwl`.
  26
  27 Next, open `main.cwl` for editing.  We are going to replace the `steps` and `outputs` sections.
  28
  29 Remove all the steps and replace them with a single `alignment` step
  30 which invokes the `alignment.cwl` we just copied.
  31
  32 ```
  33 steps:
  34   alignment:
  35     run: alignment.cwl
  36     in:
  37       fq: fq
  38       genome: genome
  39       gtf: gtf
  40     out: [qc_html, bam_sorted_indexed, featurecounts]
  41 ```
  42 {: .language-yaml }
  43
  44 In the `outputs` section, all the output sources are from the alignment step:
  45
  46 ```
  47 outputs:
  48   qc_html:
  49     type: File
  50     outputSource: alignment/qc_html
  51   bam_sorted_indexed:
  52     type: File
  53     outputSource: alignment/bam_sorted_indexed
  54   featurecounts:
  55     type: File
  56     outputSource: alignment/featurecounts
  57 ```
  58 {: .language-yaml }
  59
  60 We also need add "SubworkflowFeatureRequirement" to tell the workflow
  61 runner that we are using subworkflows:
  62
  63 ```
  64 requirements:
  65   SubworkflowFeatureRequirement: {}
  66 ```
  67 {: .language-yaml }
  68
  69 If you run this workflow, you will get exactly the same results as
  70 before, as all we have done so far is to wrap the inner workflow with
  71 an outer workflow.
  72
  73 # Scattering
  74
  75 The "wrapper" step lets us do something useful.  We can modify the
  76 outer workflow to accept a list of files, and then invoke the inner
  77 workflow step for every one of those files.  We will need to modify
  78 the `inputs`, `steps`, `outputs`, and `requirements` sections.
  79
  80 First we change the `fq` parameter to expect a list of files:
  81
  82 ```
  83 inputs:
  84   fq: File[]
  85   genome: Directory
  86   gtf: File
  87 ```
  88 {: .language-yaml }
  89
  90 Next, we add `scatter` to the alignment step.  The means we want to
  91 run run `alignment.cwl` for each value in the list in the `fq`
  92 parameter.
  93
  94 ```
  95 steps:
  96   alignment:
  97     run: alignment.cwl
  98     scatter: fq
  99     in:
 100       fq: fq
 101       genome: genome
 102       gtf: gtf
 103     out: [qc_html, bam_sorted_indexed, featurecounts]
 104 ```
 105 {: .language-yaml }
 106
 107 Because the scatter produces multiple outputs, each output parameter
 108 becomes a list as well:
 109
 110 ```
 111 outputs:
 112   qc_html:
 113     type: File[]
 114     outputSource: alignment/qc_html
 115   bam_sorted_indexed:
 116     type: File[]
 117     outputSource: alignment/bam_sorted_indexed
 118   featurecounts:
 119     type: File[]
 120     outputSource: alignment/featurecounts
 121 ```
 122 {: .language-yaml }
 123
 124 We also need add "ScatterFeatureRequirement" to tell the workflow
 125 runner that we are using scatter:
 126
 127 ```
 128 requirements:
 129   SubworkflowFeatureRequirement: {}
 130   ScatterFeatureRequirement: {}
 131 ```
 132 {: .language-yaml }
 133
 134 # Input parameter lists
 135
 136 The `fq` parameter needs to be a list.  You write a list in yaml by
 137 starting each list item with a dash.  Example `main-input.yaml`
 138
 139 ```
 140 fq:
 141   - class: File
 142     location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
 143     format: http://edamontology.org/format_1930
 144   - class: File
 145     location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
 146     format: http://edamontology.org/format_1930
 147   - class: File
 148     location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
 149     format: http://edamontology.org/format_1930
 150   - class: File
 151     location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
 152     format: http://edamontology.org/format_1930
 153   - class: File
 154     location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
 155     format: http://edamontology.org/format_1930
 156   - class: File
 157     location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
 158     format: http://edamontology.org/format_1930
 159 genome:
 160   class: Directory
 161   location: hg19-chr1-STAR-index
 162 gtf:
 163   class: File
 164   location: rnaseq/reference_data/chr1-hg19_genes.gtf
 165 ```
 166 {: .language-yaml }
 167
 168 If you run the workflow, you will get results for each one of the
 169 input fastq files.
 170
 171 # Combining results
 172
 173 Each instance of the alignment workflow produces its own
 174 `featurecounts.tsv` file.  However, to be able to compare results
 175 easily, we would like single file with all the results.
 176
 177 We can modify the workflow to run `featureCounts` once at the end of
 178 the workflow, taking all the bam files listed on the command line.
 179
 180 We will need to change a few things.
 181
 182 First, in `featureCounts.cwl` we need to modify it to accept either a
 183 single bam file or list of bam files.
 184
 185 ```
 186 inputs:
 187   gtf: File
 188   counts_input_bam:
 189    - File
 190    - File[]
 191 ```
 192 {: .language-yaml }
 193
 194 Second, in `alignment.cwl` we need to remove the `featureCounts` step from alignment.cwl, as well as the `featurecounts` output parameter.
 195
 196 Third, in `main.cwl` we need to remove `featurecounts` from the `alignment` step
 197 outputs, and add a new step:
 198
 199 ```
 200 steps:
 201   alignment:
 202     run: alignment.cwl
 203     scatter: fq
 204     in:
 205       fq: fq
 206       genome: genome
 207       gtf: gtf
 208     out: [qc_html, bam_sorted_indexed]
 209   featureCounts:
 210     requirements:
 211       ResourceRequirement:
 212         ramMin: 500
 213     run: featureCounts.cwl
 214     in:
 215       counts_input_bam: alignment/bam_sorted_indexed
 216       gtf: gtf
 217     out: [featurecounts]
 218 ```
 219 {: .language-yaml }
 220
 221 Last, we modify the `featurecounts` output parameter.  Instead of a
 222 list of files produced by the `alignment` step, it is now a single
 223 file produced by the new `featureCounts` step.
 224
 225 ```
 226 outputs:
 227   ...
 228   featurecounts:
 229     type: File
 230     outputSource: featureCounts/featurecounts
 231 ```
 232 {: .language-yaml }
 233
 234 Run this workflow to get a single `featurecounts.tsv` file with a
 235 column for each bam file.