lesson4/lesson4.md

   1 # Analyzing multiple samples
   2
   3 Analyzing a single sample is great, but in the real world you probably
   4 have a batch of samples that you need to analyze and then compare.
   5
   6 1. Subworkflows
   7
   8 In addition to running command line tools, a workflow step can also
   9 execute another workflow.
  10
  11 Let's copy "main.cwl" to "alignment.cwl".
  12
  13 Now, edit open "main.cwl" for editing.  We are going to replace the `steps` and `outputs` sections.
  14
  15 ```
  16 steps:
  17   alignment:
  18     run: alignment.cwl
  19     in:
  20           fq: fq
  21           genome: genome
  22           gtf: gtf
  23         out: [qc_html, bam_sorted_indexed, featurecounts]
  24 ```
  25
  26 In the outputs section, all the output sources are from the alignment step:
  27
  28 ```
  29 outputs:
  30   qc_html:
  31     type: File
  32     outputSource: alignment/qc_html
  33   bam_sorted_indexed:
  34     type: File
  35     outputSource: alignment/bam_sorted_indexed
  36   featurecounts:
  37     type: File
  38     outputSource: alignment/featurecounts
  39 ```
  40
  41 We also need a little boilerplate to tell the workflow runner that we want to use subworkflows:
  42
  43 ```
  44 requirements:
  45   SubworkflowFeatureRequirement: {}
  46 ```
  47
  48 If you run this workflow, you will get exactly the same results as
  49 before, we've just wrapped the inner workflow with an outer workflow.
  50
  51 2. Scattering
  52
  53 The wrapper lets us do something useful.  We can modify the outer
  54 workflow to accept a list of files, and then invoke the inner workflow
  55 step for every one of those files.  We will need to modify the
  56 `inputs`, `steps`, `outputs`, and `requirements` sections.
  57
  58 First we change the `fq` parameter to expect a list of files:
  59
  60 ```
  61 inputs:
  62   fq: File[]
  63   genome: Directory
  64   gtf: File
  65 ```
  66
  67 Next, we add `scatter` to the alignment step.  The means it will
  68 run `alignment.cwl` for each value in the list in the `fq` parameter.
  69
  70 ```
  71 steps:
  72   alignment:
  73     run: alignment.cwl
  74         scatter: fq
  75     in:
  76           fq: fq
  77           genome: genome
  78           gtf: gtf
  79         out: [qc_html, bam_sorted_indexed, featurecounts]
  80 ```
  81
  82 Because the scatter produces multiple outputs, each output parameter
  83 becomes a list as well:
  84
  85 ```
  86 outputs:
  87   qc_html:
  88     type: File[]
  89     outputSource: alignment/qc_html
  90   bam_sorted_indexed:
  91     type: File[]
  92     outputSource: alignment/bam_sorted_indexed
  93   featurecounts:
  94     type: File[]
  95     outputSource: alignment/featurecounts
  96 ```
  97
  98 Finally, we need a little more boilerplate to tell the workflow runner
  99 that we want to use scatter:
 100
 101 ```
 102 requirements:
 103   SubworkflowFeatureRequirement: {}
 104   ScatterFeatureRequirement: {}
 105 ```
 106
 107 3. Running with list inputs
 108
 109 The `fq` parameter needs to be a list.  You write a list in yaml by
 110 starting each list item with a dash.  Example `main-input.yaml`
 111
 112 ```
 113 fq:
 114   - class: File
 115     location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
 116     format: http://edamontology.org/format_1930
 117   - class: File
 118     location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
 119     format: http://edamontology.org/format_1930
 120   - class: File
 121     location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
 122     format: http://edamontology.org/format_1930
 123   - class: File
 124     location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
 125     format: http://edamontology.org/format_1930
 126   - class: File
 127     location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
 128     format: http://edamontology.org/format_1930
 129   - class: File
 130     location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
 131     format: http://edamontology.org/format_1930
 132 genome:
 133   class: Directory
 134   location: hg19-chr1-STAR-index
 135 gtf:
 136   class: File
 137   location: rnaseq/reference_data/chr1-hg19_genes.gtf
 138 ```
 139
 140 Now you can run the workflow the same way as in Lesson 2.
 141
 142 4. Combining results
 143
 144 Each instance of the alignment workflow produces its own featureCounts
 145 file.  However, to be able to compare results easily, we need them a
 146 single file with all the results.
 147
 148 The easiest way to do this is to run `featureCounts` just once at the
 149 end of the workflow, with all the bam files listed on the command
 150 line.
 151
 152 We'll need to modify a few things.
 153
 154 First, in `featureCounts.cwl` we need to modify it to accept either a
 155 single bam file or list of bam files.
 156
 157 ```
 158 inputs:
 159   gtf: File
 160   counts_input_bam:
 161    - File
 162    - File[]
 163 ```
 164
 165 Second, in `alignment.cwl` we need to remove the `featureCounts` step from alignment.cwl, as well as the `featurecounts` output parameter.
 166
 167 Third, in `main.cwl` we need to remove `featurecounts` from the `alignment` step
 168 outputs, and add a new step:
 169
 170 ```
 171 steps:
 172   alignment:
 173     run: alignment.cwl
 174     scatter: fq
 175     in:
 176       fq: fq
 177       genome: genome
 178       gtf: gtf
 179     out: [qc_html, bam_sorted_indexed]
 180   featureCounts:
 181     requirements:
 182       ResourceRequirement:
 183         ramMin: 500
 184     run: featureCounts.cwl
 185     in:
 186       counts_input_bam: alignment/bam_sorted_indexed
 187       gtf: gtf
 188     out: [featurecounts]
 189 ```
 190
 191 Last, we modify the `featurecounts` output parameter.  Instead of a
 192 list of files produced by the `alignment` step, it is now a single
 193 file produced by the new `featureCounts` step.
 194
 195 ```
 196 outputs:
 197   ...
 198   featurecounts:
 199     type: File
 200     outputSource: featureCounts/featurecounts
 201 ```
 202
 203 Run this workflow to get a single `featurecounts.tsv` file with a column for each bam file.