Aim 1: Integrate high-dimensional data to build the comprehensive cell atlas, dissect the cis-regulatory landscape, and construct multi-modal gene regulatory networks in CD4+ T cells from healthy individuals.
This project will ultimately reduce the complex gene expression patterns that result from HIV and substance use disorders (SUD) to simplified networks that help us understand the impact of these conditions on immune function. To initiate progress on this large undertaking, we have started by developing several novel computational methods for reducing complex gene expression to simpler units.
Specifically, we have studied example disease states (asthma, cancer) that involve extensive and complex immune dysfunction but have well-understood clustering of gene expression subtypes and phenotypes. Thus, we can verify that a) our methods can simplify high-dimensional data, and b) the simplified data align with established subtypes of gene expression and phenotypes. These capabilities will be essential when we begin studying the high-dimensional, multi-scale, and highly heterogeneous genomics data of T cells and their relationship to the poorly understood phenotypes of HIV and SUD.
The first method is an unsupervised denoising autoencoder (dAsthma) that obtains simplified and robust representations of gene expression in asthma. Asthma involves complex, non-linear interactions among several biological pathways, and unsupervised, non-linear generative models can obtain stable representations of such data for robust feature extraction. dAsthma generates robust, non-linear representations with clinical relevance. The encoder embeds the original input data into a lower-dimensional space (the hidden layer), and the decoder reconstructs the input from the values of the hidden layer. Five of the resulting hidden units correlated positively or negatively with previously defined transcriptomic endotypes of asthma (TEAs). Within this set of five hidden units, both the positively and the negatively correlated units showed enrichment of asthma-related pathways according to the Kyoto Encyclopedia of Genes and Genomes (KEGG). The enriched pathways were highly similar among the positively correlated units and, separately, highly similar among the negatively correlated units. The top-weighted genes within the hidden units could predict asthma severity level and a clinical lung function assessment. (Lou et al., Gerstein 2020 BMC Bioinformatics)
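To make the encoder/decoder idea concrete, the following is a minimal sketch of a generic denoising autoencoder written with NumPy. It is not the dAsthma implementation: the single sigmoid hidden layer, the linear decoder, masking noise, plain gradient descent, and every name and hyperparameter (train_denoising_autoencoder, encode, n_hidden, noise_rate, lr, epochs) are assumptions chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_denoising_autoencoder(X, n_hidden=5, noise_rate=0.2,
                                lr=0.5, epochs=300, seed=0):
    """Train a one-hidden-layer denoising autoencoder (illustrative only).

    X is a (samples x genes) matrix scaled to roughly [0, 1].
    All hyperparameters are hypothetical defaults, not published settings.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    params = {
        "W_enc": rng.normal(0.0, 0.1, (n_genes, n_hidden)),
        "b_enc": np.zeros(n_hidden),
        "W_dec": rng.normal(0.0, 0.1, (n_hidden, n_genes)),
        "b_dec": np.zeros(n_genes),
    }
    losses = []
    for _ in range(epochs):
        # Corrupt the input by zeroing a random fraction of entries.
        keep = rng.random(X.shape) >= noise_rate
        X_noisy = X * keep
        # Encoder: corrupted input -> hidden layer.
        H = sigmoid(X_noisy @ params["W_enc"] + params["b_enc"])
        # Decoder: hidden layer -> reconstruction.
        X_hat = H @ params["W_dec"] + params["b_dec"]
        # The reconstruction is scored against the clean input.
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ params["W_dec"].T) * H * (1.0 - H)
        grads = {
            "W_dec": H.T @ g_out,
            "b_dec": g_out.sum(axis=0),
            "W_enc": X_noisy.T @ g_pre,
            "b_enc": g_pre.sum(axis=0),
        }
        for name in params:
            params[name] -= lr * grads[name]
    return params, losses

def encode(X, params):
    """Map clean input to its hidden-unit representation."""
    return sigmoid(X @ params["W_enc"] + params["b_enc"])
```

In a setup like this, each column of W_enc holds the gene weights of one hidden unit, so ranking genes by absolute weight within a column is one plausible way to obtain "top-weighted genes" for a unit; whether dAsthma ranks genes this way is not stated here.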
The second method is a two-step workflow for simplifying neural networks, which we applied to identifying gene expression subtypes of acute myeloid leukemia (AML). The workflow combines Activation-based Neuron Tuning (ANT), which discards neurons during training, with Personalized Weight Product (PWP), which interprets the resulting network using products of data and weight paths. ANT examines the distributions of inputs to the activation functions to identify neurons that do not improve the final performance of the network and that may complicate its learning path, potentially making the network less reproducible. ANT turns off these neurons. We applied this workflow of tuning models with ANT and then interpreting them with PWP (ANT-PWP) to the identification of AML subtypes. ANT-PWP outperformed both a standard method that leverages weight matrix products for neural networks (Garson's) and PWP without ANT. (Mohsen et al., Gerstein 2020 The 15th Machine Learning in Computational Biology Workshop)
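For readers unfamiliar with the weight-product baseline named above, the following is a minimal sketch of one commonly cited formulation of Garson's algorithm for a network with one hidden layer and a single output. It is offered only to illustrate what a weight-matrix-product importance score looks like; it is not ANT or PWP, and the exact variant used in the comparison is not specified here. The function name garson_importance and its argument names are hypothetical.

```python
import numpy as np

def garson_importance(W_in, w_out):
    """Relative input importance from absolute weight products (illustrative).

    W_in  : (n_inputs, n_hidden) input-to-hidden weights
    w_out : (n_hidden,) hidden-to-output weights for a single output
    Returns a vector of length n_inputs that sums to 1.
    """
    W_in = np.abs(np.asarray(W_in, dtype=float))
    w_out = np.abs(np.asarray(w_out, dtype=float))
    # Share of each input in the total incoming weight of each hidden unit.
    col_sums = W_in.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0.0] = 1.0  # a unit with no incoming weight contributes 0
    share = W_in / col_sums
    # Weight each hidden unit by the magnitude of its output connection.
    per_input = share @ w_out
    # Normalize so the importances sum to 1.
    return per_input / per_input.sum()
```

Because only absolute weights enter the score, this baseline ignores both the sign of each connection and the input data themselves, which is one reason a data-dependent interpretation method could behave differently.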
Aim 2: Build the comprehensive immune profiling data hub (HeaLTH) for HIV/SUD affected individuals and construct the disease- and cell-type-specific regulatory networks.
In our initial application we noted that we had already generated single-cell multi-omics profiling data from individuals living with HIV. We generated these data using expanded CRISPR-compatible Cellular Indexing of Transcriptomes and Epitopes by Sequencing (ECCITE-seq), which profiles the cellular transcriptome, surface protein markers, T cell receptor (TCR) sequence, and HIV RNA within the same single blood cells. The data reported in the initial application included samples from three individuals (n=3) sampled after acute HIV infection and again one year after immediate antiretroviral therapy (ART) initiation (i.e., ART initiated ~1 month after infection).