X-Git-Url: https://git.arvados.org/rnaseq-cwl-training.git/blobdiff_plain/1dcbce0b09a71abaec4b1e72dd70a2f573e4dc6b..d177113e4a57ff8fe980212e51fe4317cf0fcb5a:/lesson1/lesson1.md diff --git a/lesson1/lesson1.md b/lesson1/lesson1.md index 060597a..5c45265 100644 --- a/lesson1/lesson1.md +++ b/lesson1/lesson1.md @@ -1,44 +1,132 @@ -# Turning a bash script into a workflow using existing tools +# Turning a shell script into a workflow by composing existing tools -In this lesson we will turn `rnaseq_analysis_on_input_file.sh` into a workflow. +## Introduction -# Setting up +The goal of this training is to walk through the development of a +best-practices CWL workflow by translating an existing bioinformatics +shell script into CWL. Specific knowledge of the biology of RNA-seq +is *not* a prerequisite for these lessons. -We will create a new git repository and import a library of existing -tool definitions that will help us build our workflow. +These lessons are based on "Introduction to RNA-seq using +high-performance computing (HPC)" lessons developed by members of the +teaching team at the Harvard Chan Bioinformatics Core (HBC). The +original training, which includes additional lectures about the +biology of RNA-seq can be found here: + +https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2 + +## Background + +RNA-seq is the process of sequencing RNA in a biological sample. From +the sequence reads, we want to measure the relative number of RNA +molecules appearing in the sample that were produced by particular +genes. This analysis is called "differential gene expression". + +The entire process looks like this: + +![](RNAseqWorkflow.png) + +For this training, we are only concerned with the middle analytical +steps (skipping adapter trimming). + +* Quality control (FASTQC) +* Alignment (mapping) +* Counting reads associated with genes + +## Analysis shell script -1. Select "Terminal->New terminal" +This analysis is already available as a Unix shell script, which we +will refer to in order to build the workflow. -2. Create a new git repository to hold our workflow with this command: +Some of the reasons to use CWL over a plain shell script: portability, +scalability, ability to run on platforms that are not traditional HPC. -## Arvados +rnaseq_analysis_on_input_file.sh ``` -git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises +#!/bin/bash + +# Based on +# https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html +# + +# This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it. +# USAGE: sh rnaseq_analysis_on_input_file.sh + +set -e + +# initialize a variable with an intuitive name to store the name of the input fastq file +fq=$1 + +# grab base of filename for naming outputs +base=`basename $fq .subset.fq` +echo "Sample name is $base" + +# specify the number of cores to use +cores=4 + +# directory with genome reference FASTA and index files + name of the gene annotation file +genome=rnaseq/reference_data +gtf=rnaseq/reference_data/chr1-hg19_genes.gtf + +# make all of the output directories +# The -p option means mkdir will create the whole path if it +# does not exist and refrain from complaining if it does exist +mkdir -p rnaseq/results/fastqc +mkdir -p rnaseq/results/STAR +mkdir -p rnaseq/results/counts + +# set up output filenames and locations +fastqc_out=rnaseq/results/fastqc +align_out=rnaseq/results/STAR/${base}_ +counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam +counts=rnaseq/results/counts/${base}_featurecounts.txt + +echo "Processing file $fq" + +# Run FastQC and move output to the appropriate folder +fastqc $fq + +# Run STAR +STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard + +# Create BAM index +samtools index $counts_input_bam + +# Count mapped reads +featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam ``` -## Generic +## Setting up + +We will create a new git repository and import a library of existing +tool definitions that will help us build our workflow. + +Create a new git repository to hold our workflow with this command: ``` git init rnaseq-cwl-training-exercises ``` +On Arvados use this: -3. Go to File->Open Folder and select rnaseq-cwl-training-exercises - -4. Go to the terminal window +``` +git clone https://github.com/arvados/arvados-vscode-cwl-template.git rnaseq-cwl-training-exercises +``` -5. Import bio-cwl-tools with this command: +Next, import bio-cwl-tools with this command: ``` git submodule add https://github.com/common-workflow-library/bio-cwl-tools.git ``` -# Writing the workflow +## Writing the workflow -1. Create a new file "main.cwl" +### 1. File header -2. Start with this header. +Create a new file "main.cwl" + +Start with this header. ``` @@ -47,14 +135,25 @@ class: Workflow label: RNAseq CWL practice workflow ``` -3. Workflow Inputs +### 2. Workflow Inputs The purpose of a workflow is to consume some input parameters, run a series of steps, and produce output values. For this analysis, the input parameters are the fastq file and the reference data required by STAR. -In CWL, these are declared in the `inputs` section. +In the original shell script, the following variables are declared: + +``` +# initialize a variable with an intuitive name to store the name of the input fastq file +fq=$1 + +# directory with genome reference FASTA and index files + name of the gene annotation file +genome=rnaseq/reference_data +gtf=rnaseq/reference_data/chr1-hg19_genes.gtf +``` + +In CWL, we will declare these variables in the `inputs` section. The inputs section lists each input parameter and its type. Valid types include `File`, `Directory`, `string`, `boolean`, `int`, and @@ -69,7 +168,7 @@ inputs: gtf: File ``` -4. Workflow Steps +### 3. Workflow Steps A workflow consists of one or more steps. This is the `steps` section. @@ -100,11 +199,11 @@ steps: fastqc: run: bio-cwl-tools/fastqc/fastqc_2.cwl in: - reads_file: fq - out: [html_file, summary_file] + reads_file: fq + out: [html_file] ``` -5. Running alignment with STAR +### 4. Running alignment with STAR STAR has more parameters. Sometimes we want to provide input values to a step without making them as workflow-level inputs. We can do @@ -126,7 +225,7 @@ this with `{default: N}` out: [alignment] ``` -6. Running samtools +### 5. Running samtools The third step is to generate an index for the aligned BAM. @@ -145,7 +244,7 @@ step will not run until the `STAR` step has completed successfully. out: [bam_sorted_indexed] ``` -7. featureCounts +### 6. featureCounts As of this writing, the `subread` package that provides `featureCounts` is not available in bio-cwl-tools (and if it has been @@ -153,7 +252,7 @@ added since writing this, let's pretend that it isn't there.) We will dive into how to write a CWL wrapper for a command line tool in lesson 2. For now, we will leave off the final step. -8. Workflow Outputs +### 7. Workflow Outputs The last thing to do is declare the workflow outputs in the `outputs` section. @@ -175,9 +274,6 @@ outputs: qc_html: type: File outputSource: fastqc/html_file - qc_summary: - type: File - outputSource: fastqc/summary_file bam_sorted_indexed: type: File outputSource: samtools/bam_sorted_indexed