From: Alex Coleman
Date: Wed, 30 Aug 2023 16:53:31 +0000 (-0600)
Subject: 20461: Updating resource requirements
X-Git-Url: https://git.arvados.org/lightning.git/commitdiff_plain/690c5971f3da799a4f2d5b6c75fb3b72c1c233d3

20461: Updating resource requirements

Updating resource requirements to run on jutro, and updating dockerfile and README.

Arvados-DCO-1.1-Signed-off-by: Alex Coleman
---

diff --git a/cwl/lightning/README.md b/cwl/lightning/README.md
index 4ca131faaf..ad484202b8 100644
--- a/cwl/lightning/README.md
+++ b/cwl/lightning/README.md
@@ -2,10 +2,10 @@
 [comment]: # ()
 [comment]: # (SPDX-License-Identifier: AGPL-3.0)
 # Running tiling workflow
-===
+Tiling is an efficient representation for genomic data that enables fast queries and machine learning. It abstracts a called genome by partitioning it into overlapping variable length shorter sequences, known as tiles. This tiling workflow tiles an input file, and peforms some statistical analysis on it.
 ## Running the actual workflow
----
+To run on Arvados: `arvados-cwl-runner --submit --no-wait --project-uuid fasta2numpy-wf.cwl `
 The main workflow, `fasta2numpy-wf.cwl`, has the following workflow:
@@ -20,12 +20,14 @@ The main workflow, `fasta2numpy-wf.cwl`, has the following workflow:
 For examples of input yml files, see `yml/fasta2numpy-wf-100test.yml` and `yml/fasta2numpy-wf-0831_0315.yml`
 ## Input parameters
----
+
+The tiling workflow has many different inputs, some (like **fastadirs**) vary depending on your run, while others remain more constant (like **dbsnp**)
+
 - **fastadirs** - an array of fasta directories, in our implementation, each directory consists of around 100 fasta pairs.
 - **refdir** - cirectory containing reference FASTAs.
 The list of tags is needed to perform tiling
-- **tagset** - List of tags. Found here.
+- **tagset** - List of tags. Found here: c37923fd267415556962d5c535e9b075+110/tagset.fa.gz
 Some parameters are used to determine how many processes, and how much each process is processing at a time:
@@ -46,8 +48,8 @@ Some parameters are used to determine which portions of the genome the tiling wo
 Some int/float parameters are needed for setting up random generation, output of statistical tests, etc:
 - **randomseed** - Random seed for random number generation.
-- **pcacomponents** - Top N PCA components to extract from PCA
-- **trainingsetsize**: a float between 0 and 1 to determine the training set size..
+- **pcacomponents** - Top N PCA components to extract from PCA.
+- **trainingsetsize**: a float between 0 and 1 to determine the training set size.
 Phenotypes are used as sample metadata for lightning:
@@ -56,6 +58,6 @@ Phenotypes are used as sample metadata for lightning:
 Some publicily accessible data is needed to run the workflows:
-- **snpeffdatadir** -
-- **dbsnp** -
-- **gnomaddir** - gnomAD data.
\ No newline at end of file
+- **snpeffdatadir** - Directory of SNP data download. Current data download can be found here: 66c966928931de252274772c76f73025+52054
+- **dbsnp** - SNP database. A single file. Current database can be found here: a088b297d614e4c63cbb23f8ad404438+12313/00-All.vcf.gz_renamed.bcf
+- **gnomaddir** - gnomAD data. Current data can be found here: c6a8fc877e85d73ac5b165e2d7367e26+675135
\ No newline at end of file
diff --git a/cwl/lightning/lightning-anno2vcf.cwl b/cwl/lightning/lightning-anno2vcf.cwl
index d91aa795b9..ae8568fdb2 100644
--- a/cwl/lightning/lightning-anno2vcf.cwl
+++ b/cwl/lightning/lightning-anno2vcf.cwl
@@ -14,7 +14,7 @@ hints:
     dockerPull: lightning
   ResourceRequirement:
     coresMin: 64
-    ramMin: 100000 #500000
+    ramMin: 200000 #500000
   arv:RuntimeConstraints:
     keep_cache: 83000
     outputDirType: keep_output_dir
diff --git a/cwl/lightning/lightning-choose-samples.cwl b/cwl/lightning/lightning-choose-samples.cwl
index f03c585aba..92bef11336 100644
--- a/cwl/lightning/lightning-choose-samples.cwl
+++ b/cwl/lightning/lightning-choose-samples.cwl
@@ -44,7 +44,7 @@ arguments:
   - prefix: "-case-control-file="
     valueFrom: $(inputs.phenotypesdir)
     separate: false
-  - "-case-control-column=AD"
+  - "-case-control-column=DISEASE"
   - prefix: "-training-set-size="
     valueFrom: $(inputs.trainingsetsize)
     separate: false
diff --git a/cwl/lightning/lightning-slice-numpy-onehot.cwl b/cwl/lightning/lightning-slice-numpy-onehot.cwl
index 7bd02101f7..2d58232bdb 100644
--- a/cwl/lightning/lightning-slice-numpy-onehot.cwl
+++ b/cwl/lightning/lightning-slice-numpy-onehot.cwl
@@ -14,7 +14,7 @@ hints:
     dockerPull: lightning
   ResourceRequirement:
     coresMin: 64
-    ramMin: 100000 #660000
+    ramMin: 200000 #660000
   arv:RuntimeConstraints:
     keep_cache: 83000
     outputDirType: keep_output_dir
diff --git a/cwl/lightning/lightning-slice-numpy-pca.cwl b/cwl/lightning/lightning-slice-numpy-pca.cwl
index 689b3b7bc1..b4818eb6ac 100644
--- a/cwl/lightning/lightning-slice-numpy-pca.cwl
+++ b/cwl/lightning/lightning-slice-numpy-pca.cwl
@@ -14,7 +14,7 @@ hints:
     dockerPull: lightning
   ResourceRequirement:
     coresMin: 64
-    ramMin: 100000 #1500000
+    ramMin: 200000 #1500000
   arv:RuntimeConstraints:
     keep_cache: 83000
     outputDirType: keep_output_dir
diff --git a/cwl/lightning/lightning-slice-numpy.cwl b/cwl/lightning/lightning-slice-numpy.cwl
index 8e61d1af1f..6c48ff3804 100644
--- a/cwl/lightning/lightning-slice-numpy.cwl
+++ b/cwl/lightning/lightning-slice-numpy.cwl
@@ -14,7 +14,7 @@ hints:
     dockerPull: lightning
   ResourceRequirement:
     coresMin: 64
-    ramMin: 100000 #660000
+    ramMin: 200000 #660000
   arv:RuntimeConstraints:
     keep_cache: 83000
     outputDirType: keep_output_dir
diff --git a/cwl/lightning/lightning-slice.cwl b/cwl/lightning/lightning-slice.cwl
index 3fed33f970..f416e55d48 100644
--- a/cwl/lightning/lightning-slice.cwl
+++ b/cwl/lightning/lightning-slice.cwl
@@ -14,7 +14,7 @@ hints:
     dockerPull: lightning
   ResourceRequirement:
     coresMin: 64 #96
-    ramMin: 100000 #660000
+    ramMin: 200000 #660000
   arv:RuntimeConstraints:
     keep_cache: 6200
     outputDirType: keep_output_dir
diff --git a/docker/lightning/Dockerfile b/docker/lightning/Dockerfile
index 061d9d45a7..d11a30b43d 100644
--- a/docker/lightning/Dockerfile
+++ b/docker/lightning/Dockerfile
@@ -5,9 +5,7 @@
 # build instruction:
 # docker build -t dockername --file=/path/to/lightning/docker/lightning/Dockerfile /path/to/lightning
-FROM ubuntu:latest
-MAINTAINER Jiayong Li s
-USER root
+FROM python:3.11-buster
 ARG DEBIAN_FRONTEND=noninteractive
 # Install necessary dependencies
@@ -24,7 +22,6 @@ RUN apt-get install -qy --no-install-recommends wget \
     libncursesw5-dev \
     gcc \
     make \
-    python3.8 \
     python3-pip \
     python3-numpy \
     python3-pandas \
@@ -34,6 +31,7 @@ RUN apt-get install -qy --no-install-recommends wget \
 RUN pip3 install sklearn
 RUN pip3 install --upgrade scipy
+RUN pip3 install matplotlib
 # Installing go 1.19
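
Note on the `ramMin` changes in the CWL hunks above: CWL's `ResourceRequirement` expresses `ramMin` in mebibytes, so the bump from 100000 to 200000 raises each step's request from roughly 97.7 GiB to roughly 195.3 GiB. The following is a minimal Python sketch, assuming the patch text is available as a string, that pulls the added `ramMin` values out of a unified diff and converts them. The names `added_ram_min_mib` and `mib_to_gib` are illustrative only and are not part of the lightning repository.

```python
import re

MIB_PER_GIB = 1024

# Matches an added diff line such as "+    ramMin: 200000 #660000".
# [ \t]* keeps the match on a single line (no newline may follow the "+").
RAM_MIN_ADDED = re.compile(r"^\+[ \t]*ramMin:[ \t]*(\d+)", re.MULTILINE)


def added_ram_min_mib(patch_text: str) -> list[int]:
    """Return every ramMin value (in MiB) found on an added ('+') diff line."""
    return [int(value) for value in RAM_MIN_ADDED.findall(patch_text)]


def mib_to_gib(mib: int) -> float:
    """Convert mebibytes to gibibytes."""
    return mib / MIB_PER_GIB


if __name__ == "__main__":
    # Hypothetical two-line excerpt in the shape of the hunks above.
    sample = (
        "-    ramMin: 100000 #660000\n"
        "+    ramMin: 200000 #660000\n"
    )
    for mib in added_ram_min_mib(sample):
        print(f"ramMin {mib} MiB = {mib_to_gib(mib):.1f} GiB")
```

Running the same function over the full patch gives a quick check that every touched step requests the same amount and that the request fits the target cluster's node sizes.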