lesson2/lesson2.md

# Running and debugging a workflow

### 1. The input parameter file

CWL input values are provided in the form of a YAML or JSON file.
To create one, right-click in the explorer, select "New File", and
create a file called "main-input.yaml".

This file gives the values for parameters declared in the `inputs`
section of our workflow.  Our workflow takes `fq`, `genome` and `gtf`
as input parameters.

When setting inputs, Files and Directories are given as an object with
`class: File` or `class: Directory`.  This distinguishes them from
plain strings that may or may not be file paths.

Note: if you don't have example sequence data or the STAR index files, see the Appendix below.
  18
  19 ```
  20 fq:
  21   class: File
  22   location: rnaseq/raw_fastq/Mov10_oe_1.subset.fq
  23   format: http://edamontology.org/format_1930
  24 genome:
  25   class: Directory
  26   location: hg19-chr1-STAR-index
  27 gtf:
  28   class: File
  29   location: rnaseq/reference_data/chr1-hg19_genes.gtf
  30 ```
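
Before running, it can help to confirm that every local `File` and `Directory` named in the input object actually exists. The following is only a rough sketch and not part of CWL or any runner: `missing_inputs` is a hypothetical helper, and the input object is written out as a Python dict mirroring `main-input.yaml` rather than parsed from the file. It assumes locations are paths relative to a base directory and treats anything containing a colon (such as a `keep:` reference) as non-local.

```python
import os

def missing_inputs(job, base_dir="."):
    """Return the names of File/Directory inputs whose local path does not exist.

    `job` is a dict shaped like main-input.yaml. Locations that carry a URI
    scheme (for example "keep:...") are skipped, since they are not local paths.
    """
    missing = []
    for name, value in job.items():
        # Only objects tagged with class File or Directory refer to paths.
        if not isinstance(value, dict) or value.get("class") not in ("File", "Directory"):
            continue
        location = value.get("location", "")
        if ":" in location:
            continue  # assumption: a colon means a URI scheme, not a local path
        path = os.path.join(base_dir, location)
        if value["class"] == "Directory":
            exists = os.path.isdir(path)
        else:
            exists = os.path.isfile(path)
        if not exists:
            missing.append(name)
    return missing

# The same values as main-input.yaml, written as a Python dict.
job = {
    "fq": {"class": "File", "location": "rnaseq/raw_fastq/Mov10_oe_1.subset.fq",
           "format": "http://edamontology.org/format_1930"},
    "genome": {"class": "Directory", "location": "hg19-chr1-STAR-index"},
    "gtf": {"class": "File", "location": "rnaseq/reference_data/chr1-hg19_genes.gtf"},
}

print(missing_inputs(job))  # names of inputs that still need to be downloaded
```

An empty list means all three inputs were found relative to the current directory.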

On Arvados, reference the same data by its `keep:` locations instead:

```
fq:
  class: File
  location: keep:9178fe1b80a08a422dbe02adfd439764+925/raw_fastq/Mov10_oe_1.subset.fq
  format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: keep:02a12ce9e2707610991bd29d38796b57+2912
gtf:
  class: File
  location: keep:9178fe1b80a08a422dbe02adfd439764+925/reference_data/chr1-hg19_genes.gtf
```

### 2. Running the workflow

Type this into the terminal:

```
cwl-runner main.cwl main-input.yaml
```

On Arvados with vscode, select "main.cwl" and then choose "Terminal -> Run task -> Run CWL workflow on Arvados".

### 3. Debugging the workflow

A workflow can fail for many reasons, including bad input, bugs in the
code, or running out of memory.  In this case, the STAR step might fail
with an out-of-memory error.

To help diagnose these errors, the workflow runner produces logs that
record what happened, either in the terminal or the web interface.

Errors in the logs that indicate an out-of-memory condition include:

```
EXITING: fatal error trying to allocate genome arrays, exception thrown: std::bad_alloc
Possible cause 1: not enough RAM. Check if you have enough RAM 5711762337 bytes
Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 5711762337
```

or

```
Container exited with code: 137
```

(Exit code 137 means the process was killed with signal 9, `SIGKILL`, since 137 = 128 + 9.  This most commonly occurs when a container goes "out of memory" and is terminated by the operating system.)

If this happens, you will need to request more RAM.
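
The relationship between exit code 137 and a forced kill can be checked with a few lines of Python. This is a minimal sketch, assuming a Unix-like system where exit codes above 128 follow the usual "128 + signal number" convention; `describe_exit_code` is a hypothetical helper name used only for illustration.

```python
import signal

def describe_exit_code(code):
    """Explain an exit code, decoding the 128 + signal-number convention."""
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"exit code {code}"
        return f"killed by {name} (signal {code - 128})"
    return f"exit code {code}"

print(describe_exit_code(137))  # killed by SIGKILL (signal 9)
```

`SIGKILL` only says that the process was killed, not why. An out-of-memory kill is the most common cause for a container, but the logs are still the place to confirm it.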

### 4. Setting runtime RAM requirements

By default, a step is allocated 256 MiB of RAM.  From the STAR error message:

> Check if you have enough RAM 5711762337 bytes

We can see that STAR requires quite a bit more RAM than that (about
5.3 GiB).  To request more RAM, add a "requirements" section with
"ResourceRequirement" to the "STAR" step:

```
  STAR:
    requirements:
      ResourceRequirement:
        ramMin: 8000
    run: bio-cwl-tools/STAR/STAR-Align.cwl
```

Resource requirements you can set include:

* coresMin: CPU cores
* ramMin: RAM (in mebibytes)
* tmpdirMin: temporary directory available space (in mebibytes)
* outdirMin: output directory available space (in mebibytes)

After setting the RAM requirements, re-run the workflow.
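
To relate the byte count in the STAR message to a `ramMin` value, convert bytes to mebibytes (1 MiB = 2^20 bytes). The sketch below does that arithmetic; `bytes_to_mib` is a hypothetical helper name. Note that the figure STAR reports covers the genome arrays it failed to allocate, so the total the step needs may well be higher.

```python
MIB = 2 ** 20  # one mebibyte in bytes

def bytes_to_mib(n_bytes):
    """Convert a byte count to whole mebibytes, rounding up."""
    return (n_bytes + MIB - 1) // MIB

star_bytes = 5711762337           # taken from the STAR error message
needed = bytes_to_mib(star_bytes)
print(needed)                     # 5448
print(8000 - needed)              # 2552
```

So `ramMin: 8000` requests roughly 2.5 GiB more than the amount STAR reported.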

### 5. Workflow results

The CWL runner will print a results JSON object to standard output.  It will look something like this (it may include additional fields).

```
{
    "bam_sorted_indexed": {
        "location": "file:///home/username/rnaseq-cwl-training-exercises/Aligned.sortedByCoord.out.bam",
        "basename": "Aligned.sortedByCoord.out.bam",
        "class": "File",
        "size": 25370707,
        "secondaryFiles": [
            {
                "basename": "Aligned.sortedByCoord.out.bam.bai",
                "location": "file:///home/username/rnaseq-cwl-training-exercises/Aligned.sortedByCoord.out.bam.bai",
                "class": "File",
                "size": 176552
            }
        ]
    },
    "qc_html": {
        "location": "file:///home/username/rnaseq-cwl-training-exercises/Mov10_oe_1.subset_fastqc.html",
        "basename": "Mov10_oe_1.subset_fastqc.html",
        "class": "File",
        "size": 383589
    }
}
```

This object has the same structure as `main-input.yaml`.  Each output
parameter is listed, with the `location` field of each `File` object
indicating where the output file can be found.
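
Because the results are plain JSON, the output locations can be pulled out with a short script. The following is a sketch rather than part of any CWL tooling: `output_locations` is a hypothetical helper, and the sample input is a trimmed copy of the results object shown above.

```python
import json

def output_locations(results):
    """Map each output parameter to the locations of its File/Directory objects.

    Secondary files (such as a .bai index) are listed after their primary file.
    """
    def collect(value):
        found = []
        if isinstance(value, dict) and value.get("class") in ("File", "Directory"):
            found.append(value["location"])
            for extra in value.get("secondaryFiles", []):
                found.extend(collect(extra))
        elif isinstance(value, list):
            # Array outputs hold several File objects.
            for item in value:
                found.extend(collect(item))
        return found

    return {name: collect(value) for name, value in results.items()}

# A trimmed copy of the results object shown above.
results = json.loads("""
{
    "qc_html": {
        "location": "file:///home/username/rnaseq-cwl-training-exercises/Mov10_oe_1.subset_fastqc.html",
        "basename": "Mov10_oe_1.subset_fastqc.html",
        "class": "File",
        "size": 383589
    }
}
""")

for name, locations in output_locations(results).items():
    print(name, locations)
```

To use it on a real run, one option is to redirect the runner's standard output to a file and load that file with `json.load` instead of the inline string.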

# Appendix

## Downloading sample and reference data

Start from your rnaseq-cwl-exercises directory.

```
mkdir rnaseq
cd rnaseq
wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=9178fe1b80a08a422dbe02adfd439764+925/
```

## Downloading or generating STAR index

Running STAR requires index files generated from the reference.

The prebuilt index is a rather large download (4 GB).  Depending on your bandwidth, it may be faster to generate it yourself.

### Downloading

Go to the "Terminal" tab in the lower vscode panel.  If necessary, select `bash` from the dropdown list in the upper right corner.  Again, start from your rnaseq-cwl-exercises directory, so that the index directory ends up where `main-input.yaml` expects it.

```
mkdir hg19-chr1-STAR-index
cd hg19-chr1-STAR-index
wget --mirror --no-parent --no-host --cut-dirs=1 https://download.pirca.arvadosapi.com/c=02a12ce9e2707610991bd29d38796b57+2912/
```

### Generating

Create `chr1-star-index.yaml`:

```
InputFiles:
  - class: File
    location: rnaseq/reference_data/chr1.fa
    format: http://edamontology.org/format_1930
IndexName: 'hg19-chr1-STAR-index'
Gtf:
  class: File
  location: rnaseq/reference_data/chr1-hg19_genes.gtf
Overhang: 99
```

Next, go to the "Terminal" tab in the lower vscode panel.  If
necessary, select `bash` from the dropdown list in the upper right
corner.  Generate the index with your local cwl-runner.

```
cwl-runner bio-cwl-tools/STAR/STAR-Index.cwl chr1-star-index.yaml
```
 196 ```