cwl/lightning/src/genreadme.py

   1 #!/usr/bin/env python
   2 # Copyright (C) The Lightning Authors. All rights reserved.
   3 #
   4 # SPDX-License-Identifier: AGPL-3.0
   5
   6 from __future__ import print_function
   7 import os
   8 import sys
   9
  10 text = """h1. Data release readme
  11
  12 Data release candidate: {}
  13 Description: This dataset contains {} human genomes ({}) encoded using the Lightning tiling system for the AI4AD project. It is published at {}.
  14
  15 Collection contains:
  16 * library_full/ -- Full Tiled Data Set
  17 ** matrix.0000.npy, matrix.0001.npy, matrix.0002.npy, ... -- tile variant# for each (sample, tag)
  18 ** chunk-tag-offset.csv -- tag offset for each matrix.NNNN.npy file
  19 ** samples.csv --  sample ID for each row of matrix.NNNN.npy
  20 * library_filtered/ -- Filtered Tiled Data Set (filtered using chi-square test between tile variants and AD phenotype)
  21 ** onehot.npy -- one-hot representation of tiled data filtered by p-value
  22 ** onehot-columns.npy -- tag, variant, het/hom, p-value for each column of onehot.npy
  23 ** samples.csv -- sample ID for each row of onehot.npy
  24 * GRCh38.86_library_annotation/ -- Annotations for Tiled Data Set
  25 ** GRCh38.86_library_snpeff_dbsnp_gnomad.vcf.gz -- annotations for each genomic variant found in tiled dataset
  26 ** GRCh38.86_library_snpeff_dbsnp_gnomad.vcf.gz.tbi -- index for annotations vcf
  27 ** GRCh38.86_library_summary.txt -- % of variants in each chromosome that were found in gnomad
  28 ** hg38.fa.gz.bed -- position of tile set in reference genome
  29
  30 Tiling Background:
  31
  32 Tiling abstracts a called genome by partitioning it into overlapping variable length shorter sequences, known as tiles. A tile is a genomic sequence that is braced on either side by 24 base (24-mer) "tags".
  33
  34 A tile sequence must be at least 248 base pairs long where each tile is labeled with a "position" according to the number of tiles before it. One tile position can have multiple tile variants, one for each sequence observed at that position. When a variation occurs on a tag, we allow tile variants to span multiple steps where the tags would normally end. These tiles that span multiple steps are known as "spanning tiles"
  35
  36 Our choice of tags ("tag-set") partition the human reference genome into 10,655,006 tiles, composed of 3.1 billion bases (with an average of around 315 bases per tile). The set of all positions and tile variants are stored in is what we call the tile library. An individual's genome can then be easily represented as an array of tag sets that reference tiles in the tile library. Each position in the array corresponds to a tile position and points to the tile variant observed at that position for that individual.
  37
  38 To create the tiled genomes, we use Lightning, a system that allows for efficient access to large scale population genomic data with a focus on clinical and research use. The Lightning system is a combination of a conceptual way to think about genomes (genomic tiling), the internal representation of genomes for efficient access, and the software that manages access to the data.
  39
  40 h2. Read me for library_full
  41
  42 Directory:  library_full/
  43
  44 Files:
  45
  46 * matrix.XXX.npy:  numpy-encoded matrix with one row per genome, and a pair of columns per tag / tile position (one for each allele). Each matrix element is an integer. For easier loading, the numpy matrix is broken into chunks. :
  47 ** -1 indicates a "low quality" tile variant containing no-calls.
  48 ** 0 indicates the tag for this tile was not found, i.e., this part of the genome is covered by a spanning tile in an earlier (leftward) column.
  49 ** Tile variants can span multiple tile positions  if a tag is not found and are known as spanning tiles.
  50 ** 1 indicates the most common high quality variant of this tile in this dataset; 2 indicates the 2nd most common; etc.
  51
  52 * chunk-tag-offset.csv - common separated text file that indicates tag offset for each matrix.NNNN.npy file
  53 ** Columns are file name and offset
  54
  55 * samples.csv: mapping from numpy file (matrix.npy) and row number to input ID for each tiled genome
  56 ** Columns are row number, genome ID (usually taken from tile name of gvcf/vcf, and name of npy output
  57         - Example: 0,"A-WCAP-WC000711-BL-COL-39141BL1","matrix.npy"
  58
  59
  60 h2. Read me for library_filtered
  61
  62 Directory: library_filtered/
  63 Files:
  64 * onehot.npy --
  65 **  The tile variants have been filtered using a chi2 filter between each separate tile variant and the AD phenotype. Only tile positions with 90% coverage are included (i.e. 90% of the tile variants in a tile position do not contain no-calls).
  66 ** Contains the positions of the non-zeros elements of the filtered sparse matrix.: two rows: 1) row position 2) column position
  67 ** This sparse numpy-encoded matrix has one row per genome, and a pair of columns per tile variant. One column represents the heterozygous tile variant (i.e. tile variant found in 1 allele) and one for homozygous tile variant (i.e. tile variant found in 2 alleles). Each matrix element is an integer with a 1 indicating the tile variant is present in that form and a 0 indicating the tile variant is not present in that format.
  68 ** Can create a sparse matrix with the following commands in python:
  69
  70 import numpy as np
  71 from scipy.sparse import csr_matrix
  72
  73 Xrc = np.load('onehot.npy')
  74 data = np.ones(Xrc[0,:].shape)
  75 row_ind = Xrc[0,:]
  76 col_ind = Xrc[1,:]
  77 filtered = csr_matrix((data, (row_ind, col_ind)))
  78
  79 * onehot-columns.npy -
  80 numpy file containing information corresponding to each column of the one-hot matrix representation of the filtered data.
  81 Columns are as follows: tag, tile variant, zygosity with heterozygous = 0 and homozygous = 1, p-value * 1e6 for each column of onehot.npy
  82 * samples.csv -mapping from numpy file (matrix.npy) and row number to input ID for each tiled genome
  83 ** Columns are row number, genome ID (usually taken from tile name of gvcf/vcf, and name of npy output
  84     - Example: 0,"A-WCAP-WC000711-BL-COL-39141BL1","matrix.npy"
  85
  86
  87 h2. Read me for annotations
  88
  89 Directory: GRCh38.86_library_annotation/
  90
  91 Files:
  92 * GRCh38.86_library_snpeff_dbsnp_gnomad.vcf.gz
  93 ** gzipped vcf of each genomic variant found in tile variants containing frequencies and other annotation details (gene, predicted effects, etc) from dbsnp and gnmad.
  94 ** ID contains both HGVS and rsID (if found) and INFO contains tile variant (TV: tileposition-tilevariant) as well as the other annotations. All tiles variants contains that genomic variant are listed in the TV field.
  95 - Example:
  96 - #CHROM        POS     ID      REF     ALT     QUAL    FILTER  INFO
  97 - chr9  45079   chr9:g.45080del;rs55984476      TC      T       .       .       TV=,5649728-1,;ANN=T|intergenic_region|MODIFIER|FAM138C-PGM5P3-AS1|ENSG00000218839-ENSG00000277631|intergenic_region|ENSG00000218839-ENSG00000277631|||n.45080delC||||||;AC=129535;AN=129536;AF=0.999992;AF_afr=0.999966;AF_amr=1;AF_asj=1;AF_eas=1;AF_fin=1;AF_nfe=1;AF_oth=1
  98 ** In this annotation file, for simplicity the name of the chromosome is used instead of the proper HGVS annotation for the reference and chromosome. If you want to search the HGVS annotation you will need to replace it.
  99         - Example: chr3:g.36130213T>A -> NC_000003.12:g.36130213T>A
 100         - Example: chr10:g.13511587G>A -> NC_000010.12:g.13511587G>A
 101
 102 * GRCh38.86_library_snpeff_dbsnp_gnomad.vcf.gz.tbi
 103 ** index file for vcf of annotations
 104
 105 * GRCh38.86_library_summary.txt
 106 ** text file containing % of variants in each chromosome that were found in gnomad
 107 * GRCh38.86_reference_tiles.bed
 108 ** bed file containing tile locations on GRCh38 for reference.
 109 ** The columns are as follows:
 110 ** 1) Chromosome
 111 ** 2) Tile start (including tag)
 112 ** 3) Tile end (including tag)
 113 ** 4) Tag #
 114 ** 5) Coverage (this gives a score 0-1000 of how many times this tile is placed in a set of genomes, 1000 means the tag is found in every genome of the set. 0 indicates the tag is not found in any of the genomes.  Tag may not be placed due to variants or no-calls existing on the tag.
 115 ** 6) Strand (always ., included so that our bed file maintains the bed standard format)
 116 ** 7) Tile start (not including tag)
 117 ** 8) Tile end (not including tag
 118 - Example:
 119 M 0 467 10654109 870 . 0 443
 120 M 443 959 10654110 895 . 467 935
 121 M 935 1394 10654111 985 . 959 1370
 122 """
 123
 124 def count_samples(samplescsv):
 125   count = 0
 126   with open(samplescsv) as f:
 127     for line in f:
 128       if line != "\n":
 129         count += 1
 130   return count
 131
 132 def main():
 133   samplescsv = sys.argv[1]
 134   date = sys.argv[2]
 135   description = sys.argv[3]
 136   projecturl = sys.argv[4]
 137
 138   cohortsize = count_samples(samplescsv)
 139   print(text.format(date, cohortsize, description, projecturl))
 140
 141 if __name__ == '__main__':
 142   main()