lesson4/lesson4.md

# Analyzing multiple samples

Analyzing a single sample is great, but in the real world you probably
have a batch of samples that you need to analyze and then compare.

### 1. Subworkflows

In addition to running command line tools, a workflow step can also
execute another workflow.

Let's copy "main.cwl" to "alignment.cwl".

Now open "main.cwl" for editing.  We are going to replace the `steps` and `outputs` sections.

```
steps:
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

In the outputs section, all the output sources are from the alignment step:

```
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```

We also need a little boilerplate to tell the workflow runner that we want to use subworkflows:

```
requirements:
  SubworkflowFeatureRequirement: {}
```
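
Putting the three fragments together, the outer `main.cwl` might look roughly like the sketch below.  The `cwlVersion` line and the `inputs` block are not shown in this lesson; they are assumptions carried over from the single-sample workflow, so keep whatever your copy already has.

```
cwlVersion: v1.2   # assumption: use the same version as alignment.cwl
class: Workflow

# Assumed to be unchanged from the single-sample workflow
inputs:
  fq: File
  genome: Directory
  gtf: File

requirements:
  SubworkflowFeatureRequirement: {}

steps:
  alignment:
    run: alignment.cwl   # the inner workflow does all the real work
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]

# Every output is simply passed through from the alignment step
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```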

If you run this workflow, you will get exactly the same results as
before; we have just wrapped the inner workflow in an outer workflow.

### 2. Scattering

The wrapper lets us do something useful.  We can modify the outer
workflow to accept a list of files, and then invoke the inner workflow
step for every one of those files.  We will need to modify the
`inputs`, `steps`, `outputs`, and `requirements` sections.

First we change the `fq` parameter to expect a list of files:

```
inputs:
  fq: File[]
  genome: Directory
  gtf: File
```

Next, we add `scatter` to the alignment step.  This means it will
run `alignment.cwl` once for each value in the list in the `fq` parameter.

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

Because the scatter produces multiple outputs, each output parameter
becomes a list as well:

```
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```

Finally, we need a little more boilerplate to tell the workflow runner
that we want to use scatter:

```
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
```

### 3. Running with list inputs

The `fq` parameter needs to be a list.  You write a list in YAML by
starting each list item with a dash.  Example `main-input.yaml`:

```
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```
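
Because `fq` is declared as `File[]`, the input file should contain a list even when there is only one sample.  A minimal sketch, reusing the first sample and the reference paths from the example above:

```
fq:
  # A one-item list.  A bare File object (without the dash) would not be
  # expected to validate against the File[] type.
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```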

Now you can run the workflow the same way as in Lesson 2.

### 4. Combining results

Each instance of the alignment workflow produces its own featureCounts
file.  However, to be able to compare results easily, we need them in a
single file with all the results.

The easiest way to do this is to run `featureCounts` just once at the
end of the workflow, with all the bam files listed on the command
line.

We'll need to modify a few things.

First, we modify `featureCounts.cwl` to accept either a
single bam file or a list of bam files.

```
inputs:
  gtf: File
  counts_input_bam:
    - File
    - File[]
```
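
To see why the union type is enough, here is a hedged sketch of what the whole `featureCounts.cwl` could look like.  Only the `inputs` block comes from this lesson.  The `arguments` list and the `-T`, `-a` and `-o` options are assumptions about how the tool was wrapped in the earlier lesson, and any Docker or software requirement is left out.  When an expression in `arguments` evaluates to a list of files, the runner should add one command line argument per file, so the same line serves both the single-file and the multi-file case.

```
cwlVersion: v1.2   # assumption: match the version used by your other files
class: CommandLineTool

inputs:
  gtf: File
  counts_input_bam:
    - File      # a single bam file
    - File[]    # or a list of bam files

baseCommand: featureCounts

# Assumed command line layout; adjust to match your existing file
arguments:
  - "-T"
  - $(runtime.cores)
  - "-a"
  - $(inputs.gtf)
  - "-o"
  - featurecounts.tsv
  - $(inputs.counts_input_bam)   # one file, or one argument per file in the list

outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv
```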

Second, we remove the `featureCounts` step from `alignment.cwl`, as well as its `featurecounts` output parameter.

Third, in `main.cwl` we remove `featurecounts` from the `alignment` step
outputs and add a new step.  Because `alignment` is scattered,
`alignment/bam_sorted_indexed` is a list of bam files, which is why
`counts_input_bam` now has to accept `File[]`.  The new step also requests
at least 500 MiB of RAM through a `ResourceRequirement`:

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```

Last, we modify the `featurecounts` output parameter.  Instead of a
list of files produced by the `alignment` step, it is now a single
file produced by the new `featureCounts` step.

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```
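
For reference, the finished `main.cwl` might look roughly like the sketch below once all the changes in this lesson are applied.  The `cwlVersion` line is an assumption, and the `qc_html` and `bam_sorted_indexed` outputs (elided with `...` above) are assumed to be unchanged from the scattering section.

```
cwlVersion: v1.2   # assumption: use the same version as alignment.cwl
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}

inputs:
  fq: File[]
  genome: Directory
  gtf: File

steps:
  # Runs once per fastq file in the list
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  # Runs once, on the list of bam files gathered from the scatter
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]

outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```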

Run this workflow to get a single `featurecounts.tsv` file with a column for each bam file.