--- title: "Introduction" teaching: 10 exercises: 0 questions: - "What is CWL?" - "What is the goal of this training?" objectives: - "First learning objective. (FIXME)" keypoints: - "First key point. Brief Answer to questions. (FIXME)" --- # Introduction to Common Worklow Language The Common Workflow Language (CWL) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. CWL is designed to meet the needs of data-intensive science, such as Bioinformatics, Medical Imaging, Astronomy, High Energy Physics, and Machine Learning. # Introduction to this training The goal of this training is to walk the student through the development of a best-practices CWL workflow, starting from an existing shell script that performs a common bioinformatics analysis. Specific knowledge of the biology of RNA-seq is *not* a prerequisite for these lessons. CWL is not domain specific to bioinformatics. We hope that you will find this training useful even if you work in some other field of research. These lessons are based on [Introduction to RNA-seq using high-performance computing (HPC)](https://github.com/hbctraining/Intro-to-rnaseq-hpc-O2) lessons developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). The original training, which includes additional lectures about the biology of RNA-seq, can be found at that link. # Introduction to the example analysis RNA-seq is the process of sequencing RNA present in a biological sample. From the sequence reads, we want to measure the relative numbers of different RNA molecules appearing in the sample that were produced by particular genes. This analysis is called "differential gene expression". The entire process looks like this: ![](/assets/img/RNAseqWorkflow.png){: height="400px"} For this training, we are only concerned with the middle analytical steps (skipping adapter trimming). * Quality control (FASTQC) * Alignment (mapping) * Counting reads associated with genes In this training, we are not attempting to develop the analysis from scratch, instead we we will be starting from an analysis written as a shell script. We will be using the following shell script as a guide to build our workflow. rnaseq_analysis_on_input_file.sh ``` #!/bin/bash # Based on # https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/07_automating_workflow.html # # This script takes a fastq file of RNA-Seq data, runs FastQC and outputs a counts file for it. # USAGE: sh rnaseq_analysis_on_input_file.sh set -e # initialize a variable with an intuitive name to store the name of the input fastq file fq=$1 # grab base of filename for naming outputs base=`basename $fq .subset.fq` echo "Sample name is $base" # specify the number of cores to use cores=4 # directory with genome reference FASTA and index files + name of the gene annotation file genome=rnaseq/reference_data gtf=rnaseq/reference_data/chr1-hg19_genes.gtf # make all of the output directories # The -p option means mkdir will create the whole path if it # does not exist and refrain from complaining if it does exist mkdir -p rnaseq/results/fastqc mkdir -p rnaseq/results/STAR mkdir -p rnaseq/results/counts # set up output filenames and locations fastqc_out=rnaseq/results/fastqc align_out=rnaseq/results/STAR/${base}_ counts_input_bam=rnaseq/results/STAR/${base}_Aligned.sortedByCoord.out.bam counts=rnaseq/results/counts/${base}_featurecounts.txt echo "Processing file $fq" # Run FastQC and move output to the appropriate folder fastqc $fq # Run STAR STAR --runThreadN $cores --genomeDir $genome --readFilesIn $fq --outFileNamePrefix $align_out --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes Standard # Create BAM index samtools index $counts_input_bam # Count mapped reads featureCounts -T $cores -s 2 -a $gtf -o $counts $counts_input_bam ``` {% include links.md %}