NSF DBI1660648

Grant title: A Graph Based Approach for the Genome Wide Prediction of Conditionally Essential Genes

How does one identify and characterize, at the genome scale, the set of genes that is essential for an organism to grow and thrive under particular conditions? Predicting such sets of genes is a fundamental goal in bioinformatics; this project aims to create methods and tools for producing accurate lists of such genes. The approach combines phenotype prediction with knowledge about the functional biological networks in cells to infer new knowledge. The network analysis methods developed here can be transferred to a large variety of datasets to answer a wide range of questions, from inferring gene-phenotype associations to detecting communities in social networks; these extensions are highly relevant to the network science community. Moreover, the project's analysis of temporal gene expression data, which uses state-space models and dimensionality reduction techniques, is applicable to any group of genes (e.g., tissue-specific versus universally expressed genes). In addition to advancing functional genomics knowledge in the study organism, yeast, the tools will have an impact on fields such as personal genomics by providing large-scale, system-level identification and molecular characterization of phenotypes. Finally, this project provides new tools for education in bioinformatics.

In more technical terms, this project's major goal is to develop new mathematical models and methods that, given a set of genes or an entire genome, can infer their phenotypes and suggest whether or not these genes are necessary for the organism's survival. Specifically, information will be integrated on two levels: phenotypic and molecular. At the phenotypic level, the structure of biological networks will be used to assign phenotypic attributes to genes and to identify sets of genes that share similar essential phenotypes. At the molecular level, the resulting phenotype predictions will be refined by identifying groups of essential genes governed by similar activity patterns. Integrating the information on these two levels will yield a comprehensive gene-phenotype characterization and a refined group of conditionally essential genes. The resulting predictions will be validated experimentally in two yeast systems. All the tools and datasets associated with this project will be made freely available through genopheno.gersteinlab.org.

This award has led to several key publications on prioritizing gene importance, understanding the impact of mutations, and analyzing gene networks. It also led to the development of practical machine-learning methods.

(1) We developed a general theoretical framework that uses an extended form of Pearl's do-calculus to incorporate cyclic causal interactions and multilevel causation. To analyze causal information dynamics in our framework, we also developed the information-theoretic notions needed to introduce a causal generalization of the Partial Information Decomposition framework. Our causal framework helps to clarify conceptual issues in complex trait analysis and to assign variation in an observed trait to genetic, epigenetic, and environmental factors, including mutations. This work has been published in Biology & Philosophy (2020).

(2) We developed a machine learning model to predict the biological effects of genetic variants using deep 3D convolutional neural networks. This work has been published in PLoS Computational Biology (2020).

(3) In addition to predicting the impact of genetic variants in the human genome, we developed a pipeline that integrates dimensionality reduction and Latent Dirichlet Allocation (LDA) with single-cell RNA-seq data to predict potential interactions between microbes and human genes in patients. This work has been published in Genome Biology (2020).

(4) We presented a machine learning framework to learn the latent signatures of small molecules and their deleterious effects. We demonstrated that our model is informative in relating the molecule signatures to distinct anatomical categories. This work has been published in Nature Communications (2020).

(5) Finally, at the network level, we compared the "patterns of mutation" in biological and technological networks and, based on network propagation methods, developed computational models that can predict genes associated with a disease. Integrating network information increases the statistical power of our models, which can therefore identify many previously overlooked causal genes. These studies have been published in Cell Systems (2020) and Genome Biology (2021).
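
As a rough illustration of what "network propagation" in item (5) refers to, the following is a minimal sketch of one common generic form, a random walk with restart on an undirected, unweighted graph. It is not the method of the publications listed here; the function name `propagate_scores`, its parameters, and the toy node labels are hypothetical and chosen only for this example.

```python
def propagate_scores(edges, seed_scores, restart=0.5, tol=1e-9, max_iter=1000):
    """Random walk with restart (illustrative sketch, not the published method).

    edges       : iterable of (node, node) pairs, treated as undirected
    seed_scores : dict mapping seed nodes to non-negative prior scores
    restart     : probability of jumping back to the seed distribution
    Returns a dict mapping every node to its propagated score.
    """
    # Build undirected adjacency sets.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    for node in seed_scores:
        neighbors.setdefault(node, set())

    total = sum(seed_scores.values())
    if total <= 0:
        raise ValueError("seed_scores must contain positive total weight")

    # Normalised seed distribution; non-seed nodes start at zero.
    p0 = {n: seed_scores.get(n, 0.0) / total for n in neighbors}
    p = dict(p0)

    for _ in range(max_iter):
        nxt = {}
        for n in neighbors:
            if neighbors[n]:
                # Each neighbour m passes on an equal share of its score.
                walked = sum(p[m] / len(neighbors[m]) for m in neighbors[n])
            else:
                # Isolated node: keep its own score (implicit self-loop).
                walked = p[n]
            nxt[n] = (1.0 - restart) * walked + restart * p0[n]
        change = sum(abs(nxt[n] - p[n]) for n in neighbors)
        p = nxt
        if change < tol:
            break
    return p


# Toy usage: a four-node chain with a single seed node "A" (hypothetical labels).
toy_edges = [("A", "B"), ("B", "C"), ("C", "D")]
toy_scores = propagate_scores(toy_edges, {"A": 1.0})
ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

In this sketch, nodes close to the seeds in the network receive higher scores than distant ones, which is the intuition behind using network information to surface candidates that would not stand out on their own.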

Final report

Latent Evolutionary Signatures: A General Framework for Analyzing Music and Cultural Evolution.
J Warrell, L Salichos, M Gerstein (2020). bioRxiv.

Network propagation-based prioritization of long tail genes in 17 cancer types.
H Mohsen, V Gunasekharan, T Qing, M Seay, Y Surovtseva, S Negahban, Z Szallasi, L Pusztai, MB Gerstein (2021). Genome Biol 22: 287.

Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks.
B Li, YT Yang, JA Capra, MB Gerstein (2020). PLoS Comput Biol 16: e1008291.

Predicting the frequencies of drug side effects.
D Galeano, S Li, M Gerstein, A Paccanaro (2020). Nat Commun 11: 4575.

Cyclic and Multilevel Causation in Evolutionary Processes.
J Warrell, M Gerstein (2020). Biology & Philosophy 35(5): 1-36.

Approaches for integrating heterogeneous RNA-seq data reveal cross-talk between microbes and genes in asthmatic patients.
D Spakowicz, S Lou, B Barron, JL Gomez, T Li, Q Liu, N Grant, X Yan, R Hoyd, G Weinstock, GL Chupp, M Gerstein (2020). Genome Biol 21: 150.

Comparing Technological Development and Biological Evolution from a Network Perspective.
KK Yan, D Wang, K Xiong, M Gerstein (2020). Cell Syst 10: 219-222.
