_episodes/05-scatter.md

---
title: "Analyzing multiple samples"
teaching: 0
exercises: 0
questions:
- "Key question (FIXME)"
objectives:
- "First learning objective. (FIXME)"
keypoints:
- "First key point. Brief Answer to questions. (FIXME)"
---

Analyzing a single sample is great, but in the real world you probably
have a batch of samples that you need to analyze and then compare.

# 1. Subworkflows

In addition to running command line tools, a workflow step can also
execute another workflow.

Let's copy `main.cwl` to `alignment.cwl`.

Now open `main.cwl` for editing.  We are going to replace the `steps` and `outputs` sections.

```
steps:
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

In the outputs section, all the output sources are from the alignment step:

```
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```

We also need a little boilerplate to tell the workflow runner that we want to use subworkflows:

```
requirements:
  SubworkflowFeatureRequirement: {}
```
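
Putting the three fragments together, the outer `main.cwl` might look like the sketch below. The `cwlVersion` line and the `inputs` section are assumptions carried over from the earlier lessons (this episode does not show them at this stage), so keep whatever your existing `main.cwl` already declares there.

```yaml
# main.cwl: outer workflow that wraps alignment.cwl (sketch)
cwlVersion: v1.2          # assumption: use the version your file already declares
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}   # allows a step to run another workflow

inputs:                   # assumption: unchanged from the previous lesson
  fq: File
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl    # the copy of the original workflow
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]

outputs:                  # every output is forwarded from the alignment step
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```

If you use `cwltool`, running `cwltool --validate main.cwl` checks the file without executing any steps.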

If you run this workflow, you will get exactly the same results as
before.  We have only wrapped the inner workflow in an outer workflow.

# 2. Scattering

The wrapper lets us do something useful.  We can modify the outer
workflow to accept a list of files, and then invoke the inner workflow
step once for each of those files.  We will need to modify the
`inputs`, `steps`, `outputs`, and `requirements` sections.

First we change the `fq` parameter to expect a list of files:

```
inputs:
  fq: File[]
  genome: Directory
  gtf: File
```

Next, we add `scatter` to the alignment step.  This means the workflow runner will
run `alignment.cwl` once for each value in the `fq` list.  The `genome` and `gtf`
inputs are not scattered, so every run receives the same values for them.

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

Because the scattered step runs once per input file, each of its output
parameters becomes a list as well:

```
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```

Finally, we need a little more boilerplate to tell the workflow runner
that we want to use scatter:

```
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
```
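
To see scatter in isolation, without the RNA-seq data, here is a minimal toy workflow. It is not part of the lesson files: the `messages` input, the `echo` step, and the inline tool are all made up for illustration.

```yaml
# scatter-demo.cwl (hypothetical file name): runs `echo` once per list item
cwlVersion: v1.2
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  messages: string[]        # the list to scatter over

steps:
  echo:
    scatter: message        # names the step input that receives one item per run
    in:
      message: messages     # the whole list is connected here
    out: [echoed]
    run:                    # inline tool, so no separate file is needed
      class: CommandLineTool
      baseCommand: echo
      inputs:
        message:
          type: string      # each run sees a single string, not the list
          inputBinding:
            position: 1
      outputs:
        echoed:
          type: stdout      # capture standard output as a File

outputs:
  echoed:
    type: File[]            # one captured file per list item
    outputSource: echo/echoed
```

Given an input object such as `messages: [one, two, three]`, the step runs three times and the `echoed` output is a list of three files. Note that `scatter` refers to the step's own input name (`message`), which here differs from the workflow input (`messages`); in `main.cwl` both happen to be called `fq`.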

# 3. Running with list inputs

The `fq` parameter needs to be a list.  In YAML, you write a list by
starting each list item with a dash.  Here is an example `main-input.yaml`:

```
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

Now you can run the workflow the same way as in Lesson 2.

# 4. Combining results

Each instance of the alignment workflow produces its own featureCounts
file.  However, to compare results easily, we need a
single file with all the results.

The easiest way to do this is to run `featureCounts` just once at the
end of the workflow, with all the BAM files listed on the command
line.

We'll need to modify a few things.

First, modify `featureCounts.cwl` to accept either a
single BAM file or a list of BAM files:

```
inputs:
  gtf: File
  counts_input_bam:
    - File
    - File[]
```
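
How the list reaches the command line depends on how `featureCounts.cwl` binds this input, which this episode does not show. As a sketch, assuming the tool uses an `inputBinding` on `counts_input_bam` (your file from the earlier lesson may build the command line with `arguments` instead), a union type with a positional binding handles both cases:

```yaml
# Sketch of the inputs section of featureCounts.cwl.
# Assumption: the bindings are illustrative; only the types come from this lesson.
inputs:
  gtf:
    type: File
    inputBinding:
      prefix: -a          # featureCounts option for the annotation file
  counts_input_bam:
    type:
      - File              # a single BAM file
      - File[]            # or a list of BAM files
    inputBinding:
      position: 1         # sorted after the default-position inputs
```

With no `itemSeparator` in the binding, each file in a list becomes its own command line argument, so a list of three BAM files adds three paths to the command.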

Second, in `alignment.cwl`, remove the `featureCounts` step as well as the `featurecounts` output parameter.

Third, in `main.cwl`, remove `featurecounts` from the `alignment` step
outputs and add a new `featureCounts` step:

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```

Last, we modify the `featurecounts` output parameter.  Instead of a
list of files produced by the `alignment` step, it is now a single
file produced by the new `featureCounts` step.

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```

Run this workflow to get a single `featurecounts.tsv` file with a column for each BAM file.
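
For reference, here is a sketch of how the finished `main.cwl` could look once every change from this episode is applied. The `cwlVersion` line is an assumption (keep the one your file already declares); everything else is assembled from the fragments shown earlier in this episode.

```yaml
# main.cwl: scatter alignment over all samples, then count once (sketch)
cwlVersion: v1.2          # assumption: use the version your file already declares
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}

inputs:
  fq: File[]              # one FASTQ file per sample
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl
    scatter: fq           # run the subworkflow once per FASTQ file
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed   # the full list of BAM files
      gtf: gtf
    out: [featurecounts]

outputs:
  qc_html:
    type: File[]          # one report per sample
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]          # one BAM file per sample
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File            # a single combined table
    outputSource: featureCounts/featurecounts
```

The `featureCounts` step is not scattered, so it starts only after every `alignment` run has finished and receives all the BAM files at once.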