hivbigdat yr5 report

Aim 3: Develop novel machine learning models to uncover how key transcriptional, epigenetic, and network changes in CD4+ T cells upon HIV infection and/or SUD can lead to immune dysfunction.

To better integrate molecular and contextual features in models that aim to uncover key factors in HIV-induced immune dysfunction, we introduced MolLM, a unified pre-trained language model and the first molecular multimodal language model to combine 2D and 3D molecular representations with biomedical text. We adapted MolLM to HIV datasets to predict the anti-HIV activity of molecular compounds. This approach yields richer embedding spaces that can support downstream tasks such as cross-modal retrieval, molecule captioning, property prediction, and molecule editing in the context of SUD and HIV (Tang et al., Gerstein 2024 Bioinformatics).
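
As an illustration of the cross-modal retrieval task mentioned above, the following minimal sketch ranks molecules against a text query by cosine similarity in a shared embedding space. It is not the MolLM implementation: the embedding vectors, the molecule names (mol_A, mol_B, mol_C), and the functions cosine_similarity and retrieve are hypothetical placeholders, and in practice the vectors would come from the trained molecule and text encoders.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_u * norm_v)


def retrieve(
    query_embedding: Sequence[float],
    molecule_embeddings: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k molecules most similar to the text query."""
    scored = [
        (name, cosine_similarity(query_embedding, emb))
        for name, emb in molecule_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical 3-dimensional embeddings; real embeddings would be much larger
# and produced by the model's molecule and text encoders.
molecule_embeddings = {
    "mol_A": [0.9, 0.1, 0.0],
    "mol_B": [0.1, 0.8, 0.3],
    "mol_C": [0.7, 0.3, 0.1],
}
text_query = [1.0, 0.2, 0.0]  # hypothetical embedding of a text description

hits = retrieve(text_query, molecule_embeddings, top_k=2)
for name, score in hits:
    print(f"{name}: {score:.3f}")
```

The same ranking routine works in the other direction (molecule query, text candidates), assuming both modalities are embedded in one space.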

We further extended our modeling pipeline by developing a meta-modeling framework inspired by ensemble-based prediction of ligand-protein binding affinities. The framework integrates force-field-based empirical docking models with sequence-based deep learning models. Our meta-models generalized better and were more robust than the base models. This approach can be applied to HIV and SUD drug discovery and development, where it enables accurate prediction of binding affinities between candidate ligands and their molecular targets (Lee, Emani, and Gerstein 2024 J Chem Inf Model).
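
The idea of a meta-model that combines base-model outputs can be sketched as a simple stacking step: a linear model is fit on the predictions of two base models (here, a docking score and a deep-learning affinity estimate). This is a minimal sketch under stated assumptions, not the published framework. The choice of a linear least-squares meta-learner, the toy numbers, and the function names fit_meta_model and predict_meta are all illustrative.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_meta_model(
    base_preds: Sequence[Sequence[float]], targets: Sequence[float]
) -> List[float]:
    """Fit [intercept, w_1, ..., w_k] by ordinary least squares.

    base_preds[i] holds the base-model predictions for complex i. In a real
    stacking setup these should be held-out (out-of-fold) predictions.
    """
    rows = [[1.0] + list(p) for p in base_preds]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_meta(weights: Sequence[float], base_pred: Sequence[float]) -> float:
    """Combine one complex's base-model predictions into a meta-prediction."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], base_pred))


# Toy data (illustrative only): (docking score, deep-learning prediction).
base_preds = [(-7.0, 6.0), (-9.0, 7.5), (-6.0, 5.0), (-8.0, 8.0)]
# Synthetic targets generated from a known linear rule so the fit is checkable.
targets = [1.0 - 0.3 * d + 0.6 * s for d, s in base_preds]

weights = fit_meta_model(base_preds, targets)
print(predict_meta(weights, (-7.5, 6.5)))
```

A linear meta-learner is used here only because it is easy to inspect. Any regressor that takes base-model predictions as input features would fill the same role.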