---
title: "Analyzing multiple samples"
teaching: 0
exercises: 0
questions:
- "Key question (FIXME)"
objectives:
- "First learning objective. (FIXME)"
keypoints:
- "First key point. Brief Answer to questions. (FIXME)"
---

# Analyzing multiple samples

Analyzing a single sample is great, but in the real world you probably have a batch of samples that you need to analyze and then compare.

### 1. Subworkflows

In addition to running command line tools, a workflow step can also execute another workflow.

Let's copy "main.cwl" to "alignment.cwl".

Now, open "main.cwl" for editing. We are going to replace the `steps` and `outputs` sections.

```
steps:
  alignment:
    run: alignment.cwl
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

In the `outputs` section, all the output sources are from the `alignment` step:

```
outputs:
  qc_html:
    type: File
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File
    outputSource: alignment/featurecounts
```

We also need a little boilerplate to tell the workflow runner that we want to use subworkflows:

```
requirements:
  SubworkflowFeatureRequirement: {}
```

If you run this workflow, you will get exactly the same results as before: we have just wrapped the inner workflow in an outer workflow.

### 2. Scattering

The wrapper lets us do something useful. We can modify the outer workflow to accept a list of files, and then invoke the inner workflow step for every one of those files.

We will need to modify the `inputs`, `steps`, `outputs`, and `requirements` sections.

First we change the `fq` parameter to expect a list of files:

```
inputs:
  fq: File[]
  genome: Directory
  gtf: File
```

Next, we add `scatter` to the `alignment` step. This means it will run `alignment.cwl` once for each value in the list given in the `fq` parameter.

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed, featurecounts]
```

Because the scatter produces multiple outputs, each output parameter becomes a list as well:

```
outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File[]
    outputSource: alignment/featurecounts
```

Finally, we need a little more boilerplate to tell the workflow runner that we want to use scatter:

```
requirements:
  SubworkflowFeatureRequirement: {}
  ScatterFeatureRequirement: {}
```

### 3. Running with list inputs

The `fq` parameter needs to be a list. In YAML, you write a list by starting each list item with a dash.

Example `main-input.yaml`:

```
fq:
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Mov10_oe_3.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_1.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_2.subset.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/Irrel_kd_3.subset.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
```

Now you can run the workflow the same way as in Lesson 2.

### 4. Combining results

Each instance of the alignment workflow produces its own featureCounts file. However, to be able to compare results easily, we need them in a single file with all the results.

The easiest way to do this is to run `featureCounts` just once at the end of the workflow, with all the bam files listed on the command line.

We'll need to modify a few things.

First, we need to modify `featureCounts.cwl` to accept either a single bam file or a list of bam files.

```
inputs:
  gtf: File
  counts_input_bam:
    - File
    - File[]
```

Second, in `alignment.cwl` we need to remove the `featureCounts` step, as well as the `featurecounts` output parameter.

Third, in `main.cwl` we need to remove `featurecounts` from the `alignment` step outputs, and add a new step:

```
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]

  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]
```

Last, we modify the `featurecounts` output parameter. Instead of a list of files produced by the `alignment` step, it is now a single file produced by the new `featureCounts` step.

```
outputs:
  ...
  featurecounts:
    type: File
    outputSource: featureCounts/featurecounts
```

Run this workflow to get a single `featurecounts.tsv` file with a column for each bam file.
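
For reference, the fragments from sections 2 and 4 can be assembled into one `main.cwl`. This is a sketch rather than a definitive file. The `cwlVersion` value is an assumption (keep whatever your existing `main.cwl` declares). It also assumes that `alignment.cwl` and `featureCounts.cwl` have already been edited as described above, and that `qc_html` and `bam_sorted_indexed` keep their list-typed definitions from section 2.

```yaml
# Sketch of main.cwl after scattering and combining results.
cwlVersion: v1.2   # assumption: match the version already used in your main.cwl
class: Workflow

requirements:
  SubworkflowFeatureRequirement: {}   # alignment.cwl is itself a workflow
  ScatterFeatureRequirement: {}       # the alignment step uses scatter

inputs:
  fq: File[]          # one FASTQ file per sample
  genome: Directory
  gtf: File

steps:
  # Runs once per entry in fq; each output becomes a list.
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam_sorted_indexed]

  # Runs once, after all alignments, on the whole list of bam files.
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      counts_input_bam: alignment/bam_sorted_indexed
      gtf: gtf
    out: [featurecounts]

outputs:
  qc_html:
    type: File[]
    outputSource: alignment/qc_html
  bam_sorted_indexed:
    type: File[]
    outputSource: alignment/bam_sorted_indexed
  featurecounts:
    type: File        # a single combined counts file
    outputSource: featureCounts/featurecounts
```

Only the `alignment` step is scattered. The `featureCounts` step has no `scatter` field, so it receives the complete list from `alignment/bam_sorted_indexed` in a single invocation.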