20461 Updating README.md

diff --git a/cwl/lightning/README.md b/cwl/lightning/README.md

index 4ca131faaf5a4287e352d9137dca3d531282bf73..427dede48d416340c14b3fba8b6d8942a3b669fd 100644
--- a/cwl/lightning/README.md
+++ b/cwl/lightning/README.md
@@ -2,10 +2,10 @@
  [comment]: # ()
  [comment]: # (SPDX-License-Identifier: AGPL-3.0)
  # Running tiling workflow
-===
+Tiling is an efficient representation for genomic data that enables fast queries and machine learning. It abstracts a called genome by partitioning it into overlapping, variable-length shorter sequences, known as tiles. This tiling workflow tiles an input file and performs some statistical analysis on it.
  
  ## Running the actual workflow
----
+To run on Arvados:
  `arvados-cwl-runner --submit --no-wait --project-uuid <project_uuid> fasta2numpy-wf.cwl <input_yml>`
  
  The main workflow, `fasta2numpy-wf.cwl`, has the following workflow:
@@ -20,12 +20,14 @@ The main workflow, `fasta2numpy-wf.cwl`, has the following workflow:
  For examples of input yml files, see `yml/fasta2numpy-wf-100test.yml` and `yml/fasta2numpy-wf-0831_0315.yml`
  
  ## Input parameters
----
+
+The tiling workflow has many inputs. Some (like **fastadirs**) vary from run to run, while others (like **dbsnp**) stay largely constant.
+
  - **fastadirs** - an array of fasta directories, in our implementation, each directory consists of around 100 fasta pairs.
-- **refdir** - cirectory containing reference FASTAs.
+- **refdir** - directory containing reference FASTAs.
  
  The list of tags is needed to perform tiling
-- **tagset** - List of tags. Found here.
+- **tagset** - List of tags. Found here: c37923fd267415556962d5c535e9b075+110/tagset.fa.gz
  
  Some parameters are used to determine how many processes, and how much each process is processing at a time:
  
@@ -46,8 +48,8 @@ Some parameters are used to determine which portions of the genome the tiling wo
  Some int/float parameters are needed for setting up random generation, output of statistical tests, etc:
  
  - **randomseed** - Random seed for random number generation.
-- **pcacomponents** - Top N PCA components to extract from PCA
-- **trainingsetsize**: a float between 0 and 1 to determine the training set size..
+- **pcacomponents** - Top N PCA components to extract from PCA. 
+- **trainingsetsize**: a float between 0 and 1 to determine the training set size.
    
  Phenotypes are used as sample metadata for lightning:
  
@@ -56,6 +58,20 @@ Phenotypes are used as sample metadata for lightning:
  
 Some publicly accessible data is needed to run the workflows:
  
-- **snpeffdatadir** - 
-- **dbsnp** - 
-- **gnomaddir** - gnomAD data. 
-\ No newline at end of file
+- **snpeffdatadir** - Directory containing the downloaded SNP data. The current download can be found here: 66c966928931de252274772c76f73025+52054
+- **dbsnp** - SNP database. A single file. Current database can be found here: a088b297d614e4c63cbb23f8ad404438+12313/00-All.vcf.gz_renamed.bcf
+- **gnomaddir** - gnomAD data. Current data can be found here: c6a8fc877e85d73ac5b165e2d7367e26+675135
+
+## Outputs
+
+All outputs will be documented in the README.md that is generated. 
+
+## Building Docker images
+Several Docker images are needed to run the tiling workflow.
+
+All can be found under .../..docker
+
+The images that need to be built are:
+
+1)  `lightning` - this can be built
+2)  `vcfutil` - this can be built
+\ No newline at end of file
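
As a rough illustration of how the input parameters documented in this change might fit together, the sketch below builds a partial input object for `fasta2numpy-wf.cwl` and prints it as JSON, which is also valid YAML. It is not taken from the repository. The helper names (`keep_dir`, `keep_file`, `build_inputs`) are hypothetical, the File/Directory types are assumptions inferred from the README descriptions, and the inclusive 0 to 1 check on **trainingsetsize** is one reading of "a float between 0 and 1". Parameters from the hunks not shown in this diff are omitted, so the result is not a complete input file.

```python
import json


def keep_dir(locator):
    """CWL Directory object pointing at a Keep collection (assumed input type)."""
    return {"class": "Directory", "location": f"keep:{locator}"}


def keep_file(locator_path):
    """CWL File object pointing at a file inside a Keep collection (assumed input type)."""
    return {"class": "File", "location": f"keep:{locator_path}"}


def build_inputs(fastadir_locators, refdir_locator, trainingsetsize,
                 pcacomponents, randomseed):
    """Assemble a partial input object from the parameters named in the README."""
    # Assumption: the README's "between 0 and 1" is treated as inclusive here.
    if not 0.0 <= trainingsetsize <= 1.0:
        raise ValueError("trainingsetsize must be a float between 0 and 1")
    return {
        # Run-specific inputs: the caller supplies these collection locators.
        "fastadirs": [keep_dir(loc) for loc in fastadir_locators],
        "refdir": keep_dir(refdir_locator),
        # Reference data locators quoted from the README in this change.
        "tagset": keep_file("c37923fd267415556962d5c535e9b075+110/tagset.fa.gz"),
        "snpeffdatadir": keep_dir("66c966928931de252274772c76f73025+52054"),
        "dbsnp": keep_file(
            "a088b297d614e4c63cbb23f8ad404438+12313/00-All.vcf.gz_renamed.bcf"
        ),
        "gnomaddir": keep_dir("c6a8fc877e85d73ac5b165e2d7367e26+675135"),
        # Numeric settings; the values passed below are illustrative only.
        "randomseed": randomseed,
        "pcacomponents": pcacomponents,
        "trainingsetsize": trainingsetsize,
    }


if __name__ == "__main__":
    # The angle-bracket locators are placeholders, not real collections.
    inputs = build_inputs(
        fastadir_locators=["<fasta_collection_1>", "<fasta_collection_2>"],
        refdir_locator="<reference_collection>",
        trainingsetsize=0.8,
        pcacomponents=20,
        randomseed=1,
    )
    print(json.dumps(inputs, indent=2))
```

The printed object could be saved to a file and passed as `<input_yml>` once the remaining parameters are filled in. The files under `yml/` referenced in the README remain the authoritative examples.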