---
title: "Analyzing Multiple Samples"
teaching: 30
exercises: 30
questions:
- "How can you run the same workflow over multiple samples?"
objectives:
- "Modify the workflow to process multiple samples, then perform a joint analysis."
keypoints:
- "Separate the part of the workflow that you want to run multiple times into a subworkflow."
- "Use a scatter step to run the subworkflow over a list of inputs."
- "The result of a scatter is an array, which can be used in a combine step to get a single result."
---

In the previous lesson, we finished converting the original shell
script into CWL.  This lesson shows how to change the workflow so
that it can analyze multiple samples in parallel.

# Subworkflows

In addition to running command line tools, a workflow step can also
execute another workflow.

First, copy `main.cwl` to `alignment.cwl`.

Next, open `main.cwl` for editing.  We are going to replace the `steps` and `outputs` sections.

Remove all the steps and replace them with a single `alignment` step
which invokes the `alignment.cwl` we just copied.

```
steps:
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```
{: .language-yaml }

In the `outputs` section, all the output sources are from the alignment step:

```
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```
{: .language-yaml }

We also need to add `SubworkflowFeatureRequirement` to tell the workflow
runner that we are using subworkflows:

```
requirements:
  SubworkflowFeatureRequirement: {}
```
{: .language-yaml }
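
Assembled, the outer `main.cwl` at this point might look like the
sketch below.  The `steps`, `outputs`, and `requirements` sections come
from the text above; the `cwlVersion` value and the single-file
`inputs` are assumptions carried over from the earlier lessons, so keep
whatever your own file already has.

```
cwlVersion: v1.2            # assumption: keep the version your file already declares
class: Workflow

inputs:                     # assumption: unchanged from the previous lesson
  fq: File
  genome: Directory
  gtf: File

requirements:
  SubworkflowFeatureRequirement: {}   # needed because a step runs a workflow

steps:
  alignment:
    run: alignment.cwl      # the copy of the original workflow
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]

outputs:                    # every output is forwarded from the inner workflow
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```
{: .language-yaml }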

> ## Running the workflow
>
> Run this workflow.  You should get exactly the same results as
> before, as all we have done so far is to wrap the inner workflow with
> an outer workflow.
>
{: .challenge }

> ## Part 1 solution
> * <a href="../assets/answers/ep5/part1/main.cwl">main.cwl</a>
> * <a href="../assets/answers/ep5/part1/alignment.cwl">alignment.cwl</a>
> * <a href="../assets/answers/ep5/part1/featureCounts.cwl">featureCounts.cwl</a>
{: .solution}

# Scattering

The "wrapper" step lets us do something useful.  We can modify the
outer workflow to accept a list of files, and then invoke the inner
workflow step for every one of those files.  We will need to modify
the `inputs`, `steps`, `outputs`, and `requirements` sections.

First we change the `fq` parameter to expect a list of files:

```
inputs:
  fq: File[]
  genome: Directory
  gtf: File
```
{: .language-yaml }

Next, we add `scatter` to the alignment step.  This means we want to
run `alignment.cwl` once for each value in the list in the `fq`
parameter.

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```
{: .language-yaml }
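
One detail that is easy to miss: the name given to `scatter` refers to
the step's own input (the left-hand side of an entry under `in`), not
to the workflow-level input.  The fragment below is a sketch that makes
the difference visible by using a hypothetical workflow input called
`fastq_files`; that name is for illustration only and is not part of
the lesson's files.  `alignment.cwl` itself still declares `fq` as a
single `File`, and the runner supplies one element of the list per
invocation.

```
inputs:
  fastq_files: File[]      # hypothetical name, for illustration only
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl
    scatter: fq            # refers to the step input `fq` declared below
    in:
      fq: fastq_files      # each element of the list goes to one run
      genome: genome       # not scattered: every run gets the same value
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```
{: .language-yaml }

The output lists come back in the same order as the input list, so the
first `bam_sorted_indexed` entry corresponds to the first fastq file.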

Because the scatter produces multiple outputs, each output parameter
becomes a list as well:

```
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```
{: .language-yaml }

We also need to add `ScatterFeatureRequirement` to tell the workflow
runner that we are using scatter:

```
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
```
{: .language-yaml }

> ## Part 2 solution
> * <a href="../assets/answers/ep5/part2/main.cwl">main.cwl</a>
> * <a href="../assets/answers/ep5/part2/alignment.cwl">alignment.cwl</a>
> * <a href="../assets/answers/ep5/part2/featureCounts.cwl">featureCounts.cwl</a>
{: .solution}

# Input parameter lists

The `fq` parameter needs to be a list.  You write a list in YAML by
starting each list item with a dash.  For example, `main-input.yaml`:

```
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```
{: .language-yaml }

> ## Running the workflow
>
> Run this workflow.  You will now get results for each one of the
> input fastq files.
>
{: .challenge }

# Combining results

Each instance of the alignment workflow produces its own
`featurecounts.tsv` file.  However, to be able to compare results
easily, we would like a single file with all the results.

We can modify the workflow to run `featureCounts` once at the end of
the workflow, passing it all the bam files on one command line.

We will need to change a few things.

First, modify `featureCounts.cwl` to accept either a single bam file
or a list of bam files.

```
inputs:
  gtf: File
  counts_input_bam:
   - File
   - File[]
```
{: .language-yaml }
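
For orientation, here is a sketch of how the complete
`featureCounts.cwl` might look after this change.  Only the `inputs`
section is taken from this lesson; the `cwlVersion`, the `arguments`
line, and the output `glob` are assumptions about what the tool
definition from the earlier lesson contains, so treat them as
illustrative rather than as the lesson's actual file.  The point is
that when `counts_input_bam` is a list, the runner should expand it
into one command-line argument per file, so the same tool definition
serves both the single-file and the multi-file case.

```
cwlVersion: v1.2          # assumption: keep the version your file already declares
class: CommandLineTool
baseCommand: featureCounts

inputs:
  gtf: File
  counts_input_bam:       # either one bam file or a list of them
   - File
   - File[]

# Assumed command line: annotation file via -a, result file via -o,
# followed by the bam file or files to count.
arguments: [-a, $(inputs.gtf),
            -o, featurecounts.tsv,
            $(inputs.counts_input_bam)]

outputs:
  featurecounts:
    type: File
    outputBinding:
      glob: featurecounts.tsv   # assumed to match the -o value above
```
{: .language-yaml }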

Second, remove the `featureCounts` step from `alignment.cwl`, along with its `featurecounts` output parameter.

Third, in `main.cwl`, remove `featurecounts` from the `alignment` step
outputs and add a new `featureCounts` step.  Its `ResourceRequirement`
asks the runner to reserve at least 500 MiB of RAM for that step:

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```
{: .language-yaml }

Last, we modify the `featurecounts` output parameter.  Instead of a
list of files produced by the `alignment` step, it is now a single
file produced by the new `featureCounts` step.

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```
{: .language-yaml }
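
Spelled out in full, and assuming the `qc_html` and
`bam_sorted_indexed` outputs are left exactly as they were after the
scatter change, the final `outputs` section would read roughly as
follows:

```
outputs:
  qc_html:
    type: File[]                      # still one report per input sample
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]                      # still one bam file per input sample
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File                        # a single combined table
    outputSource: featureCounts/featurecounts
```
{: .language-yaml }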

> ## Running the workflow
>
> Run this workflow.  You will still have separate results from FastQC
> and STAR, but now you will only have a single
> `featurecounts.tsv` file with a column for each bam file.
>
{: .challenge }

> ## Episode solution
> * <a href="../assets/answers/ep5/part4/main.cwl">main.cwl</a>
> * <a href="../assets/answers/ep5/part4/alignment.cwl">alignment.cwl</a>
> * <a href="../assets/answers/ep5/part4/featureCounts.cwl">featureCounts.cwl</a>
{: .solution}