2024-summary

In 2024, we contributed to neurogenomics, single-cell transcriptomics, regulatory networks, structural biology, and bioinformatics, developing software tools and publishing in journals including PNAS, Science, NAR, and Bioinformatics. Many projects were part of large consortia, including PsychENCODE, GENCODE, IGVF (Impact of Genomic Variation on Function), HGSVC (Human Genome Structural Variation Consortium), modERN (Model Organism Encyclopedia of Regulatory Networks), and SCORCH (Single-Cell Opioid Responses in the Context of HIV).

The neurogenomics work is the most significant highlight (Emani et al., Science, 2024). Within the PsychENCODE consortium, we analyzed single-nucleus multi-omics data from 388 brains, uncovering 1.4 million expression quantitative trait loci (eQTLs) and cell-type-specific regulatory networks. This work revealed cellular changes in aging and neuropsychiatric disorders and enabled the construction of deep-learning models that prioritize disease-risk genes and drug targets.

The application of large language models (LLMs) to biomedical challenges was a major focus of our research. We surveyed LLMs and generative AI in drug design (Tang et al., Brief Bioinform, 2024). We also introduced MolLM, a unified language model that integrates 2D and 3D molecular structures with biomedical text for tasks such as molecule-text matching, property prediction, and text-prompted molecular editing. This model underscores the importance of explicit 3D molecular representations for cross-modal capabilities (Tang et al., Bioinformatics, 2024).

BioCoder, a benchmark for bioinformatics code generation, demonstrates how LLMs can help automate repetitive coding tasks, though challenges remain in handling complex bioinformatics pipelines (Tang et al., Bioinformatics, 2024). We also fine-tuned an LLM to predict protein phase transitions (Frank et al., Proc Natl Acad Sci U S A, 2024). The phase-transition work showed that more extensive aggregation is associated with reduced gene expression in Alzheimer's disease, suggesting a natural defense mechanism.

For the phase-transition problem, we also showed how a graph neural network can help define the disordered regions of proteins more precisely, a key biophysical feature in predicting transitions (Wang et al., Cell Rep Phys Sci, 2024). Additional LLM work includes our contributions to FAVOR-GPT, a generative interface for interpreting genomic variant annotations, and the Dense Homolog Retriever, an LLM that enhances homolog detection (Li et al., Bioinform Adv, 2024; Hong et al., Nat Biotechnol, 2024).

Much of the rest of our research was focused on tool development in biomedical data science.

In particular, we developed REPIC, a consensus-based method for cryo-EM particle picking that helps integrate the outputs of multiple algorithms into high-quality consensus reconstructions, reducing user burden and enhancing accuracy (Cameron et al., Commun Biol, 2024).

We also developed an ensemble framework that combines empirical docking with deep-learning models, significantly improving affinity predictions. With optimized meta-modeling approaches, the framework outperforms the individual base models (Lee et al., J Chem Inf Model, 2024).

We explored deep-learning methods for the early detection of Alzheimer's disease from medical imaging, demonstrating their predictive potential and highlighting the complexities inherent in integrating multimodal medical datasets (Zhou et al., PLoS One, 2024).

We introduced LatentDAG, a Bayesian network that simplifies gene expression relationships. In combination with a graph neural network, LatentDAG improves tasks such as gene conservation prediction and gene clustering (Gao et al., Bioinformatics, 2024).

We introduced the concept of TF signal 'crowdedness' to address interference from non-target motifs in ChIP-seq data, which allowed us to improve motif inference accuracy (Xu et al., Nucleic Acids Res, 2024).

Finally, we analyzed music and cultural evolution, revealing patterns that resemble biological evolution (Warrell et al., J R Soc Interface, 2024).

Fast, sensitive detection of protein homologs using deep dense retrieval.

L Hong, Z Hu, S Sun, X Tang, J Wang, Q Tan, L Zheng, S Wang, S Xu, I King, M Gerstein, Y Li (2025). Nat Biotechnol 43: 983-995.

Deciphering the impact of genomic variation on function

IGVF Consortium (2024). Nature.

Deep learning analysis of fMRI data for predicting Alzheimer's Disease: A focus on convolutional neural networks and model interpretability

X Zhou, S Kedia, R Meng, M Gerstein (2024). PLoS One 19: e0312848.

A Variational Graph Partitioning Approach to Modeling Protein Liquid-liquid Phase Separation

G Wang, J Warrell, S Zheng, M Gerstein (2024). Cell Rep Phys Sci 5.

Improved Prediction of Ligand-Protein Binding Affinities by Meta-modeling

HJ Lee, PS Emani, MB Gerstein (2024). J Chem Inf Model 64: 8684-8704.

REliable PIcking by Consensus (REPIC): a consensus methodology for harnessing multiple cryo-EM particle pickers

CJF Cameron, SJH Seager, FJ Sigworth, HD Tagare, MB Gerstein (2024). Commun Biol 7: 1421.

FAVOR-GPT: a generative natural language interface to whole genome variant functional annotations

TC Li, H Zhou, V Verma, X Tang, Y Shao, E Van Buren, Z Weng, M Gerstein, B Neale, SR Sunyaev, X Lin (2024). Bioinform Adv 4: vbae143.

Representing core gene expression activity relationships using the latent structure implicit in Bayesian networks

J Gao, M Gerstein (2024). Bioinformatics 40.

Leveraging a large language model to predict protein phase transition: a physical, multiscale and interpretable approach

M Frank, P Ni, M Jensen, MB Gerstein (2024). Proc Natl Acad Sci U S A 121: e2320510121.

A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation

X Tang, H Dai, E Knight, F Wu, Y Li, T Li, M Gerstein (2024). Brief Bioinform 25.

MolLM: A Unified Language Model to Integrate Biomedical Text with 2D and 3D Molecular Representations

X Tang, A Tran, J Tan, MB Gerstein (2024). Bioinformatics 40: i357-i368.

BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge

X Tang, B Qian, R Gao, J Chen, X Chen, MB Gerstein (2024). Bioinformatics 40: i266-i276.

Single-cell genomics and regulatory networks for 388 human brains

PS Emani, JJ Liu, D Clarke, M Jensen, J Warrell, C Gupta, R Meng, CY Lee, S Xu, C Dursun, S Lou, Y Chen, Z Chu, T Galeev, A Hwang, Y Li, P Ni, X Zhou, PsychENCODE Consortium, TE Bakken, J Bendl, L Bicks, T Chatterjee, L Cheng, Y Cheng, Y Dai, Z Duan, M Flaherty, JF Fullard, M Gancz, D Garrido-Martin, S Gaynor-Gillett, J Grundman, N Hawken, E Henry, GE Hoffman, A Huang, Y Jiang, T Jin, NL Jorstad, R Kawaguchi, S Khullar, J Liu, J Liu, S Liu, S Ma, M Margolis, S Mazariegos, J Moore, JR Moran, E Nguyen, N Phalke, M Pjanic, H Pratt, D Quintero, AS Rajagopalan, TR Riesenmy, N Shedd, M Shi, M Spector, R Terwilliger, KJ Travaglini, B Wamsley, G Wang, Y Xia, S Xiao, AC Yang, S Zheng, MJ Gandal, D Lee, ES Lein, P Roussos, N Sestan, Z Weng, KP White, H Won, MJ Girgenti, J Zhang, D Wang, D Geschwind, M Gerstein (2024). Science 384: eadi5199.

Latent Evolutionary Signatures: A General Framework for Analyzing Music and Cultural Evolution

J Warrell, L Salichos, M Gancz, MB Gerstein (2024). J R Soc Interface 21: 20230647.

Less-is-more: selecting transcription factor binding regions informative for motif inference

J Xu, J Gao, P Ni, M Gerstein (2024). Nucleic Acids Res 52: e20.
