2 title: "Analyzing Multiple Samples"
6 - "How can you run the same workflow over multiple samples?"
8 - "Modify the workflow to process multiple samples, then perform a joint analysis."
10 - "Separate the part of the workflow that you want to run multiple times into a subworkflow."
11 - "Use a scatter step to run the subworkflow over a list of inputs."
12 - "The result of a scatter is an array, which can be used in a combine step to get a single result."
15 In the previous lesson, we finished converting the original shell
16 script into CWL. This lesson expands the scope by showing which
17 changes to make to the workflow so that it can analyze multiple
18 samples in parallel.
22 In addition to running command line tools, a workflow step can also
23 execute another workflow.
25 First, copy `main.cwl` to `alignment.cwl`.
27 Next, open `main.cwl` for editing. We are going to replace the `steps` and `outputs` sections.
29 Remove all the steps and replace them with a single `alignment` step
30 which invokes the `alignment.cwl` we just copied.
40 out: [qc_html, bam_sorted_indexed, featurecounts]
44 In the `outputs` section, all the output sources are from the alignment step:
50 outputSource: alignment/qc_html
53 outputSource: alignment/bam_sorted_indexed
56 outputSource: alignment/featurecounts
60 We also need to add "SubworkflowFeatureRequirement" to tell the workflow
61 runner that we are using subworkflows:
65 SubworkflowFeatureRequirement: {}
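Putting the pieces of this section together, the wrapper `main.cwl` might look roughly like the sketch below. The `cwlVersion`, the `genome` and `gtf` input names, and all input and output types are assumptions made for illustration; keep whatever your own `main.cwl` already declares.

```yaml
cwlVersion: v1.2          # assumed version; use the one from your existing file
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}

inputs:
  fq: File
  genome: Directory       # assumed name and type
  gtf: File               # assumed name and type

steps:
  # A single step that runs the copied workflow as a subworkflow.
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]

outputs:
  # Every output is passed through from the alignment step.
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```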
69 > ## Running the workflow
71 > Run this workflow. You should get exactly the same results as
72 > before, as all we have done so far is to wrap the inner workflow with an outer workflow.
78 > * <a href="../assets/answers/ep5/part1/main.cwl">main.cwl</a>
79 > * <a href="../assets/answers/ep5/part1/alignment.cwl">alignment.cwl</a>
80 > * <a href="../assets/answers/ep5/part1/featureCounts.cwl">featureCounts.cwl</a>
85 The "wrapper" step lets us do something useful. We can modify the
86 outer workflow to accept a list of files, and then invoke the inner
87 workflow step for every one of those files. We will need to modify
88 the `inputs`, `steps`, `outputs`, and `requirements` sections.
90 First we change the `fq` parameter to expect a list of files:
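A minimal sketch of the changed `inputs` section follows. Only `fq` changes; the `genome` and `gtf` entries and their types are assumed here and should match what your workflow already declares.

```yaml
inputs:
  fq: File[]          # was a single File, now an array of Files
  genome: Directory   # assumed name and type, unchanged
  gtf: File           # assumed name and type, unchanged
```

`File[]` is CWL shorthand for an array whose items are of type `File`.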
100 Next, we add `scatter` to the alignment step. This means we want to
101 run `alignment.cwl` for each value in the list in the `fq` parameter.
113 out: [qc_html, bam_sorted_indexed, featurecounts]
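As a sketch, the scattered step could look like this. The `genome` and `gtf` inputs are assumed names; they are not scattered, so the same value is passed to every run of the subworkflow.

```yaml
steps:
  alignment:
    run: alignment.cwl
    scatter: fq        # run the subworkflow once per item in the fq list
    in:
      fq: fq
      genome: genome   # assumed input name, shared by every run
      gtf: gtf         # assumed input name, shared by every run
    out: [qc_html, bam_sorted_indexed, featurecounts]
```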
117 Because the scatter produces multiple outputs, each output parameter
118 becomes a list as well:
124 outputSource: alignment/qc_html
127 outputSource: alignment/bam_sorted_indexed
130 outputSource: alignment/featurecounts
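A sketch of the resulting `outputs` section, assuming each output was previously declared as a single `File`:

```yaml
outputs:
  qc_html:
    type: File[]       # one report per scattered run
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]       # one bam file per scattered run
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]       # one counts file per scattered run
    outputSource: alignment/featurecounts
```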
134 We also need to add "ScatterFeatureRequirement" to tell the workflow
135 runner that we are using scatter:
139 SubworkflowFeatureRequirement: {}
140 ScatterFeatureRequirement: {}
145 > * <a href="../assets/answers/ep5/part2/main.cwl">main.cwl</a>
146 > * <a href="../assets/answers/ep5/part2/alignment.cwl">alignment.cwl</a>
147 > * <a href="../assets/answers/ep5/part2/featureCounts.cwl">featureCounts.cwl</a>
150 # Input parameter lists
152 The `fq` parameter needs to be a list. You write a list in YAML by
153 starting each list item with a dash. Example `main-input.yaml`:
158 location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
159 format: http://edamontology.org/format_1930
161 location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
162 format: http://edamontology.org/format_1930
164 location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
165 format: http://edamontology.org/format_1930
167 location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
168 format: http://edamontology.org/format_1930
170 location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
171 format: http://edamontology.org/format_1930
173 location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
174 format: http://edamontology.org/format_1930
177 location: hg19-chr1-STAR-index
180 location: rnaseq/reference_data/chr1-hg19_genes.gtf
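To make the list syntax concrete, here is a shortened sketch of such an input file with only two of the six samples. The `genome` and `gtf` keys and the `class` values are assumptions; they must match the input names and types declared in your `main.cwl`.

```yaml
fq:
  # Each dash starts one item of the list; here every item is a File object.
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
genome:                  # assumed input name
  class: Directory
  location: hg19-chr1-STAR-index
gtf:                     # assumed input name
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```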
184 > ## Running the workflow
186 > Run this workflow. You will now get results for each one of the input files.
193 Each instance of the alignment workflow produces its own
194 `featurecounts.tsv` file. However, to be able to compare results
195 easily, we would like a single file with all the results.
197 We can modify the workflow to run `featureCounts` once at the end of
198 the workflow, taking all the bam files listed on the command line.
200 We will need to change a few things.
202 First, we need to modify `featureCounts.cwl` to accept either a
203 single bam file or a list of bam files.
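One way to express "a single file or a list of files" in CWL is to give the parameter a list of alternative types. The fragment below is a sketch that shows only the type of `counts_input_bam`; any other inputs and any `inputBinding` in your `featureCounts.cwl` are assumed to stay as they are.

```yaml
inputs:
  counts_input_bam:
    type:
      - File      # a single bam file is still accepted
      - File[]    # a list of bam files is accepted as well
```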
214 Second, we need to remove the `featureCounts` step from `alignment.cwl`, as well as the `featurecounts` output parameter.
216 Third, in `main.cwl` we need to remove `featurecounts` from the `alignment` step
217 outputs, and add a new step:
228 out: [qc_html, bam_sorted_indexed]
233 run: featureCounts.cwl
235 counts_input_bam: alignment/bam_sorted_indexed
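Taken together, the `steps` section of `main.cwl` might then look like the sketch below. The `genome` and `gtf` inputs and the `out` list of the new step are assumptions for illustration. Note that the `featureCounts` step is not scattered: it runs once and receives the whole array of bam files produced by the scattered `alignment` step.

```yaml
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome   # assumed input name
      gtf: gtf         # assumed input name
    out: [qc_html, bam_sorted_indexed]

  featureCounts:
    run: featureCounts.cwl
    in:
      # The scattered step yields an array, which is passed on as one value.
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf         # assumed input name
    out: [featurecounts]
```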
241 Last, we modify the `featurecounts` output parameter. Instead of a
242 list of files produced by the `alignment` step, it is now a single
243 file produced by the new `featureCounts` step.
250 outputSource: featureCounts/featurecounts
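A sketch of the final `outputs` section, assuming the other two outputs are left unchanged from the scatter version:

```yaml
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File         # a single combined file, no longer an array
    outputSource: featureCounts/featurecounts
```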
254 > ## Running the workflow
256 > Run this workflow. You will still have separate results from fastq
257 > and STAR, but now you will only have a single
258 > `featurecounts.tsv` file with a column for each bam file.
262 > ## Episode solution
263 > * <a href="../assets/answers/ep5/part4/main.cwl">main.cwl</a>
264 > * <a href="../assets/answers/ep5/part4/alignment.cwl">alignment.cwl</a>
265 > * <a href="../assets/answers/ep5/part4/featureCounts.cwl">featureCounts.cwl</a>