cwl/lightning/README.md

[comment]: # (Copyright (C) The Lightning Authors. All rights reserved.)
[comment]: # ()
[comment]: # (SPDX-License-Identifier: AGPL-3.0)
# Running tiling workflow
===

## Running the actual workflow
---
`arvados-cwl-runner --submit --no-wait --project-uuid <project_uuid> fasta2numpy-wf.cwl <input_yml>`

The main workflow, `fasta2numpy-wf.cwl`, runs the following steps:

1) Tile the input FASTA file
2) Generate PCA values
3) Perform logistic regression
4) Perform chi^2 p-value tests
5) Plot these values
6) Output

For examples of input yml files, see `yml/fasta2numpy-wf-100test.yml` and `yml/fasta2numpy-wf-0831_0315.yml`.

## Input parameters
---
- **fastadirs** - an array of FASTA directories. In our implementation, each directory contains around 100 FASTA pairs.
- **refdir** - directory containing reference FASTAs.

The list of tags is needed to perform tiling:

- **tagset** - List of tags. Found here.

Some parameters determine how many processes run and how much data each process handles at a time:

- **batchsize** - an integer that sets the batch size for the lightning-import step. For example, with a batchsize of 12, lightning-import runs on 12 FASTA directories together as a batch, and the resulting libraries are then merged by lightning-slice.
- **threads** - number of parallel processes to run. Limiting this is necessary to avoid running out of memory.

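To make the batching concrete, here is a minimal Python sketch of how a batchsize of 12 would partition the **fastadirs** array. It only illustrates the grouping described above: the `make_batches` helper and the directory names are hypothetical, consecutive grouping is an assumption, and the actual workflow performs this step in CWL rather than in Python.

```python
def make_batches(fastadirs, batchsize):
    """Split the list of FASTA directories into consecutive batches.

    Hypothetical helper for illustration only; not part of the workflow.
    """
    if batchsize < 1:
        raise ValueError("batchsize must be a positive integer")
    return [fastadirs[i:i + batchsize] for i in range(0, len(fastadirs), batchsize)]


# 30 placeholder names standing in for real FASTA directories.
example_dirs = [f"fasta_dir_{n:02d}" for n in range(30)]
batches = make_batches(example_dirs, 12)
print([len(b) for b in batches])  # prints [12, 12, 6]
```

In this sketch the last batch is simply smaller when the number of directories is not a multiple of the batch size.
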
Some parameters are passed to lightning as command-line flags:

- **mergeoutput** - option to slice numpy. Valid values are `True` and `False`.
- **expandregions** - command-line value needed to run `lightning`. Default value is `0`.

Some parameters determine which portions of the genome the tiling workflow runs on:

- **chrs** - chromosomes to run on.
- **regions** - specific regions of the chromosomes to run on.
- **matchgenome** - a string pattern used to select a subset of the cohort. For example, matchgenome "ADNI|WCAP" runs tiling for all samples with "ADNI" or "WCAP" in their name, and matchgenome "" runs it for the entire cohort.

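The **matchgenome** selection can be sketched in Python as follows. This assumes the pattern is applied as an unanchored regular expression to each sample name, which is consistent with the "ADNI|WCAP" example but is not confirmed here. The sample names and the `select_samples` helper are made up for illustration.

```python
import re


def select_samples(sample_names, matchgenome):
    """Return the sample names that contain a match for the pattern.

    Assumption: matchgenome is an unanchored regular expression.
    An empty pattern matches every name, so the whole cohort is kept.
    """
    pattern = re.compile(matchgenome)
    return [name for name in sample_names if pattern.search(name)]


# Hypothetical sample names, for illustration only.
cohort = ["ADNI_0001", "WCAP_0002", "OTHER_0003"]
print(select_samples(cohort, "ADNI|WCAP"))  # prints ['ADNI_0001', 'WCAP_0002']
print(len(select_samples(cohort, "")))      # prints 3
```
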
Some int/float parameters configure random number generation, the output of statistical tests, and so on:

- **randomseed** - random seed for random number generation.
- **pcacomponents** - top N components to extract from PCA.
- **trainingsetsize** - a float between 0 and 1 that determines the training set size.

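As a rough illustration of how **randomseed** and **trainingsetsize** could interact, the sketch below splits a list of samples into training and test sets. It is not the method `lightning` uses: the `split_samples` helper, the shuffle-then-cut strategy, and the rounding rule are all assumptions made for this example.

```python
import random


def split_samples(samples, trainingsetsize, randomseed):
    """Shuffle samples reproducibly and cut them into train/test sets.

    Hypothetical helper: trainingsetsize is the fraction (0 to 1) of
    samples placed in the training set, rounded down.
    """
    if not 0 <= trainingsetsize <= 1:
        raise ValueError("trainingsetsize must be between 0 and 1")
    rng = random.Random(randomseed)  # private generator, seeded for reproducibility
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * trainingsetsize)
    return shuffled[:cutoff], shuffled[cutoff:]


# Placeholder sample names, for illustration only.
samples = [f"sample_{n}" for n in range(10)]
train, test = split_samples(samples, 0.8, randomseed=42)
print(len(train), len(test))  # prints 8 2
```

Because the generator is seeded, calling `split_samples` again with the same arguments returns the same split.
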
Phenotypes are used as sample metadata for lightning:

- **phenotypesnofamilydir** - phenotype information for samples with *no* family members.
- **phenotypesdir** - phenotype information for samples *with* family members.

Some publicly accessible data is needed to run the workflows:

- **snpeffdatadir** -
- **dbsnp** -
- **gnomaddir** - gnomAD data.