_episodes/05-scatter.md

---
title: "Analyzing multiple samples"
teaching: 0
exercises: 0
questions:
- "Key question (FIXME)"
objectives:
- "First learning objective. (FIXME)"
keypoints:
- "First key point. Brief Answer to questions. (FIXME)"
---

# Analyzing multiple samples

Analyzing a single sample is great, but in the real world you probably
have a batch of samples that you need to analyze and then compare.

### 1. Subworkflows

In addition to running command line tools, a workflow step can also
execute another workflow.

Let's copy `main.cwl` to `alignment.cwl`.

Now open `main.cwl` for editing. We are going to replace the `steps` and `outputs` sections.

```
steps:
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

In the `outputs` section, every `outputSource` now refers to the `alignment` step:

```
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```

We also need a little boilerplate to tell the workflow runner that we want to use subworkflows:

```
requirements:
  SubworkflowFeatureRequirement: {}
```

If you run this workflow, you will get exactly the same results as
before; we have just wrapped the inner workflow in an outer workflow.
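
Putting the pieces together, the complete outer workflow might look like the sketch below. The `cwlVersion` line and the `inputs` section are assumptions carried over from a single-sample `main.cwl`; keep whatever your own file already declares.

```yaml
# main.cwl: outer workflow that wraps alignment.cwl (sketch)
cwlVersion: v1.2          # assumption: use the version your existing main.cwl declares
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}   # needed because a step runs another workflow

inputs:                   # assumption: unchanged from the single-sample workflow
  fq: File
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl    # the inner workflow copied from the old main.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]

outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```

The order of the top-level sections does not matter; `requirements` only has to sit at the same level as `inputs`, `steps`, and `outputs`.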

### 2. Scattering

The wrapper lets us do something useful: we can modify the outer
workflow to accept a list of files, and then invoke the inner workflow
step once for every one of those files. We will need to modify the
`inputs`, `steps`, `outputs`, and `requirements` sections.

First, we change the `fq` parameter to expect a list of files:

```
inputs:
  fq: File[]
  genome: Directory
  gtf: File
```

Next, we add `scatter` to the `alignment` step. This means the workflow
runner will run `alignment.cwl` once for each value in the `fq` list.

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

Because the scattered step runs once per input file, each of its output
parameters becomes a list as well:

```
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```

Finally, we need a little more boilerplate to tell the workflow runner
that we want to use scatter:

```
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
```

### 3. Running with list inputs

The `fq` parameter now needs to be a list. In YAML, you write a list by
starting each list item with a dash. Here is an example `main-input.yaml`:

```
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

Now you can run the workflow the same way as in Lesson 2.

### 4. Combining results

Each instance of the alignment workflow produces its own featureCounts
file. However, to compare results easily, we need them in a single
file with all the results.

The easiest way to do this is to run `featureCounts` just once at the
end of the workflow, with all the bam files listed on the command
line.

We'll need to modify a few things.

First, modify `featureCounts.cwl` so that it accepts either a
single bam file or a list of bam files:

```
inputs:
  gtf: File
  counts_input_bam:
    - File
    - File[]
```
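
To show how a list of bam files can end up on the command line, here is a minimal sketch of what a tool definition with this input type could look like. It is not the `featureCounts.cwl` from the earlier lesson: the `cwlVersion`, the `baseCommand`, and the `inputBinding` and `arguments` layout are assumptions for illustration, so in your own file change only the type of `counts_input_bam`. When an array input has an `inputBinding` without a prefix or separator, each file in the list is added as its own command line argument.

```yaml
# Hypothetical featureCounts.cwl accepting one bam file or a list (sketch)
cwlVersion: v1.2              # assumption
class: CommandLineTool
baseCommand: featureCounts    # assumption: tool is on the PATH or provided by a container

inputs:
  gtf:
    type: File
    inputBinding:
      prefix: -a              # annotation file
      position: 1
  counts_input_bam:
    type:
      - File                  # a single bam file still validates
      - File[]                # a list of bam files, one argument each
    inputBinding:
      position: 2

# Output file name; matches the featurecounts.tsv produced at the end of the lesson
arguments: [-o, featurecounts.tsv]

outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
```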

Second, in `alignment.cwl`, remove the `featureCounts` step as well as the `featurecounts` output parameter.

Third, in `main.cwl`, remove `featurecounts` from the `alignment` step
outputs and add a new `featureCounts` step:

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```

Last, we modify the `featurecounts` output parameter. Instead of a
list of files produced by the `alignment` step, it is now a single
file produced by the new `featureCounts` step.

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```

Run this workflow to get a single `featurecounts.tsv` file with a column for each bam file.
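
For reference, the changes from all four sections combine into a `main.cwl` along the lines of the sketch below. The `cwlVersion` value is an assumption; everything else is assembled from the fragments shown in this lesson, with the `qc_html` and `bam_sorted_indexed` outputs unchanged from the scattering section.

```yaml
# main.cwl: scatter alignment over all samples, then count once (sketch)
cwlVersion: v1.2          # assumption: use the version your existing main.cwl declares
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}

inputs:
  fq: File[]
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl
    scatter: fq           # one run of alignment.cwl per fastq file
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed   # the whole list of bam files
      gtf: gtf
    out: [featurecounts]

outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```

The `featureCounts` step is not scattered: it receives the whole list from `alignment/bam_sorted_indexed` and runs once.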