# Analyzing multiple samples

Analyzing a single sample is great, but in the real world you probably
have a batch of samples that you need to analyze and then compare.

1. Subworkflows

In addition to running command line tools, a workflow step can also
execute another workflow.

Let's copy "main.cwl" to "alignment.cwl".

Now, edit open "main.cwl" for editing.  We are going to replace the `steps` and `outputs` sections.

```
steps:
  alignment:
    run: alignment.cwl
    in:
	  fq: fq
	  genome: genome
	  gtf: gtf
	out: [qc_html, bam_sorted_indexed, featurecounts]
```

In the outputs section, all the output sources are from the alignment step:

```
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```

We also need a little boilerplate to tell the workflow runner that we want to use subworkflows:

```
requirements:
  SubworkflowFeatureRequirement: {}
```

If you run this workflow, you will get exactly the same results as
before, we've just wrapped the inner workflow with an outer workflow.

2. Scattering

The wrapper lets us do something useful.  We can modify the outer
workflow to accept a list of files, and then invoke the inner workflow
step for every one of those files.  We will need to modify the
`inputs`, `steps`, `outputs`, and `requirements` sections.

First we change the `fq` parameter to expect a list of files:

```
inputs:
  fq: File[]
  genome: Directory
  gtf: File
```

Next, we add `scatter` to the alignment step.  The means it will
run `alignment.cwl` for each value in the list in the `fq` parameter.

```
steps:
  alignment:
    run: alignment.cwl
	scatter: fq
    in:
	  fq: fq
	  genome: genome
	  gtf: gtf
	out: [qc_html, bam_sorted_indexed, featurecounts]
```

Because the scatter produces multiple outputs, each output parameter
becomes a list as well:

```
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```

Finally, we need a little more boilerplate to tell the workflow runner
that we want to use scatter:

```
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
```

3. Running with list inputs

The `fq` parameter needs to be a list.  You write a list in yaml by
starting each list item with a dash.  Example `main-input.yaml`

```
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

Now you can run the workflow the same way as in Lesson 2.

4. Combining results

Each instance of the alignment workflow produces its own featureCounts
file.  However, to be able to compare results easily, we need them a
single file with all the results.

The easiest way to do this is to run `featureCounts` just once at the
end of the workflow, with all the bam files listed on the command
line.

We'll need to modify a few things.

First, in `featureCounts.cwl` we need to modify it to accept either a
single bam file or list of bam files.

```
inputs:
  gtf: File
  counts_input_bam:
   - File
   - File[]
```

Second, in `alignment.cwl` we need to remove the `featureCounts` step from alignment.cwl, as well as the `featurecounts` output parameter.

Third, in `main.cwl` we need to remove `featurecounts` from the `alignment` step
outputs, and add a new step:

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```

Last, we modify the `featurecounts` output parameter.  Instead of a
list of files produced by the `alignment` step, it is now a single
file produced by the new `featureCounts` step.

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```

Run this workflow to get a single `featurecounts.tsv` file with a column for each bam file.