# Analyzing multiple samples

Analyzing a single sample is great, but in the real world you probably have a batch of samples that you need to analyze and then compare.

1. Subworkflows

In addition to running command line tools, a workflow step can also execute another workflow.

Let's copy "main.cwl" to "alignment.cwl".

Now open "main.cwl" for editing. We are going to replace the `steps` and `outputs` sections.

```
steps:
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

In the outputs section, all the output sources are from the alignment step:

```
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```

We also need a little boilerplate to tell the workflow runner that we want to use subworkflows:

```
requirements:
  SubworkflowFeatureRequirement: {}
```

If you run this workflow, you will get exactly the same results as before; we've just wrapped the inner workflow in an outer workflow.

2. Scattering

The wrapper lets us do something useful. We can modify the outer workflow to accept a list of files, and then invoke the inner workflow step once for each of those files. We will need to modify the `inputs`, `steps`, `outputs`, and `requirements` sections.

First we change the `fq` parameter to expect a list of files:

```
inputs:
  fq: File[]
  genome: Directory
  gtf: File
```

Next, we add `scatter` to the alignment step. This means it will run `alignment.cwl` once for each value in the list passed to the `fq` parameter.

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

Because the scatter produces multiple outputs, each output parameter becomes a list as well:

```
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```

Finally, we need a little more boilerplate to tell the workflow runner that we want to use scatter:

```
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
```

3. Running with list inputs

The `fq` parameter needs to be a list. You write a list in YAML by starting each list item with a dash. Example `main-input.yaml`:

```
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

Now you can run the workflow the same way as in Lesson 2.

4. Combining results

Each instance of the alignment workflow produces its own featureCounts file. However, to be able to compare results easily, we need them in a single file with all the results.

The easiest way to do this is to run `featureCounts` just once at the end of the workflow, with all the bam files listed on the command line.

We'll need to modify a few things.

First, modify `featureCounts.cwl` so that it accepts either a single bam file or a list of bam files.

```
inputs:
  gtf: File
  counts_input_bam:
    - File
    - File[]
```

Second, in `alignment.cwl` we need to remove the `featureCounts` step, as well as the `featurecounts` output parameter.

Third, in `main.cwl` we need to remove `featurecounts` from the `alignment` step outputs, and add a new step:

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```

Last, we modify the `featurecounts` output parameter. Instead of a list of files produced by the `alignment` step, it is now a single file produced by the new `featureCounts` step.

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```

Run this workflow to get a single `featurecounts.tsv` file with a column for each bam file.
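
For reference, here is a rough sketch of how `main.cwl` might look once every change from this lesson has been applied. It only assembles the fragments shown above. The `cwlVersion` value is an assumption, and your file may contain other fields (such as a label) that are not shown here, so treat this as a guide and not as a drop-in replacement.

```yaml
# Sketch of main.cwl after the subworkflow, scatter, and featureCounts edits.
# Assumption: v1.2 is illustrative; keep whatever cwlVersion your file already declares.
cwlVersion: v1.2
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}   # allows a step to run alignment.cwl
  ScatterFeatureRequirement: {}       # allows `scatter` on a step

inputs:
  fq: File[]          # one FASTQ file per sample
  genome: Directory
  gtf: File

steps:
  alignment:
    run: alignment.cwl
    scatter: fq       # run the inner workflow once per file in fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed   # list of bam files from the scatter
      gtf: gtf
    out: [featurecounts]

outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File        # a single combined counts file
    outputSource: featureCounts/featurecounts
```

Note that `qc_html` and `bam_sorted_indexed` remain lists because they still come from the scattered `alignment` step, while `featurecounts` is a single file again because the `featureCounts` step runs only once.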