[comment]: # ()
[comment]: # (SPDX-License-Identifier: AGPL-3.0)
# Running tiling workflow
-===
+Tiling is an efficient representation of genomic data that enables fast queries and machine learning. It abstracts a called genome by partitioning it into overlapping, variable-length shorter sequences known as tiles. This workflow tiles an input file and performs some statistical analysis on it.
## Running the actual workflow
----
+To run on Arvados:
`arvados-cwl-runner --submit --no-wait --project-uuid <project_uuid> fasta2numpy-wf.cwl <input_yml>`
The main workflow, `fasta2numpy-wf.cwl`, has the following workflow:
For examples of input YAML files, see `yml/fasta2numpy-wf-100test.yml` and `yml/fasta2numpy-wf-0831_0315.yml`.
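
As a rough illustration only, an input file could be shaped like the sketch below. It uses the parameter names documented under "Input parameters" and the collection hashes quoted there. The `File`/`Directory` classes, the numeric values, and the angle-bracket placeholders are assumptions, not values from the repository, and the process-sizing and phenotype parameters are omitted. The example files named above remain the authoritative reference.

```yaml
# Hypothetical input sketch; types and values are assumptions, not taken from the repo.
fastadirs:                      # array of FASTA directories (placeholders, fill in your own)
  - class: Directory
    location: "keep:<fasta_collection_1>"
  - class: Directory
    location: "keep:<fasta_collection_2>"
refdir:                         # directory containing reference FASTAs (placeholder)
  class: Directory
  location: "keep:<reference_collection>"
tagset:                         # list of tags
  class: File
  location: keep:c37923fd267415556962d5c535e9b075+110/tagset.fa.gz
randomseed: 1                   # illustrative value
pcacomponents: 20               # illustrative value
trainingsetsize: 0.8            # must be a float between 0 and 1
snpeffdatadir:
  class: Directory
  location: keep:66c966928931de252274772c76f73025+52054
dbsnp:                          # a single file
  class: File
  location: keep:a088b297d614e4c63cbb23f8ad404438+12313/00-All.vcf.gz_renamed.bcf
gnomaddir:
  class: Directory
  location: keep:c6a8fc877e85d73ac5b165e2d7367e26+675135
```

The file would then be passed as `<input_yml>` in the `arvados-cwl-runner` command shown above.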
## Input parameters
----
+
+The tiling workflow has many inputs. Some (like **fastadirs**) vary from run to run, while others (like **dbsnp**) stay largely constant.
+
- **fastadirs** - An array of FASTA directories. In our implementation, each directory contains around 100 FASTA pairs.
- **refdir** - Directory containing reference FASTAs.
The list of tags is needed to perform tiling:
-- **tagset** - List of tags. Found here.
+- **tagset** - List of tags. Found here: c37923fd267415556962d5c535e9b075+110/tagset.fa.gz
Some parameters determine how many processes run and how much data each process handles at a time:
Some int/float parameters are needed for setting up random generation, output of statistical tests, etc:
- **randomseed** - Random seed for random number generation.
-- **pcacomponents** - Top N PCA components to extract from PCA
-- **trainingsetsize**: a float between 0 and 1 to determine the training set size..
+- **pcacomponents** - Top N PCA components to extract from PCA.
+- **trainingsetsize** - A float between 0 and 1 that determines the training set size.
Phenotypes are used as sample metadata for lightning:
Some publicly accessible data is needed to run the workflows:
-- **snpeffdatadir** -
-- **dbsnp** -
-- **gnomaddir** - gnomAD data.
\ No newline at end of file
+- **snpeffdatadir** - Directory of SNP data download. Current data download can be found here: 66c966928931de252274772c76f73025+52054
+- **dbsnp** - SNP database. A single file. Current database can be found here: a088b297d614e4c63cbb23f8ad404438+12313/00-All.vcf.gz_renamed.bcf
+- **gnomaddir** - gnomAD data. Current data can be found here: c6a8fc877e85d73ac5b165e2d7367e26+675135
\ No newline at end of file