Research

The Gerstein lab has been engaged in biomedical data science for the past ~25 years – before the field had a defined name. We initially focused on macromolecular structure and physical simulation due to the availability of data and a well-developed calculational formalism. While we continue to work in these areas, the excitement surrounding the human genome has led us to increasingly focus on genomics. Overall, the lab serves as a connector, bridging the vast data generation in the biomedical sciences with analytic approaches from statistics and computer science, particularly AI-driven methods. Much of our work takes place within large consortia, such as ENCODE and 1000 Genomes.


Currently, our lab conducts analyses across multiple areas:

Genome Annotation, particularly in terms of Biological Networks

Annotating the human genome is a central focus of many in biomedicine. We have contributed significantly to this effort through active participation in worldwide collaborations such as ENCODE, modENCODE, and GENCODE, as well as through the development of computational approaches for processing bulk, single-cell, multi-omic, and spatial data. A main focus has been on identifying regulatory sites (e.g., enhancers), epigenetics, and coding and non-coding RNAs. We have developed integrative methods to put these together, predicting the expression of target genes in specific cell types from their upstream control regions. Additionally, we have contributed to annotating pseudogenes (fossil genes) and determining what they tell us about the history of the human genome. See encode/annotation and pseudogene papers.

Beyond direct annotation, we develop approaches to recast genomic information into molecular networks. In particular, we build and analyze gene-regulatory networks, protein-protein interactions, cell-to-cell communication networks, and metabolic pathways, identifying key nodes, such as information-flow bottlenecks and the apexes of regulatory hierarchies. We integrate these networks with dynamic expression data, three-dimensional protein structures, and other functional data to uncover principles governing biological systems. Given that people often have stronger intuition for social and computer networks than for biological ones, we have found cross-disciplinary comparisons useful in elucidating system-level properties, such as the relationship between connectivity and evolutionary constraint. See network papers.
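As a rough illustration of what an information-flow bottleneck means in a network, the sketch below scores nodes by betweenness centrality, i.e., how many shortest paths pass through them. This is a generic textbook measure computed with Brandes' algorithm, not necessarily the lab's own method, and the toy network with its node names (`TF_A`, `gene_1`, and so on) is hypothetical.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality for a directed, unweighted graph.

    graph: dict mapping each node to a list of its successor nodes.
    Returns a dict mapping each node to its (unnormalized) score.
    """
    nodes = set(graph) | {w for targets in graph.values() for w in targets}
    score = {n: 0.0 for n in nodes}
    for source in nodes:
        # Breadth-first search from the source, counting shortest paths.
        order = []
        preds = {n: [] for n in nodes}
        n_paths = {n: 0 for n in nodes}
        dist = {n: -1 for n in nodes}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += n_paths[v] / n_paths[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score


# Hypothetical toy regulatory network: two upstream factors act on two
# target genes only through a single intermediate factor.
toy_network = {
    "TF_A": ["TF_C"],
    "TF_B": ["TF_C"],
    "TF_C": ["gene_1", "gene_2"],
}
scores = betweenness(toy_network)
bottleneck = max(scores, key=scores.get)
print(bottleneck, scores[bottleneck])
```

In this toy case every path from an upstream factor to a target gene runs through `TF_C`, so it receives the highest score. Real analyses would typically use weighted or much larger networks and an established graph library.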

Disease Genomics (Neurogenomics & Cancer Genomics)

The declining cost of next-generation sequencing has enabled researchers to rapidly investigate the genomic contributions to an individual’s disease. We have contributed to this effort through comprehensive studies and computational approaches designed to link personal genomic variants to disease. Our research has spanned a wide range of diseases, with a particular focus on cancers and brain disorders. Recent efforts in cancer include developing tools to prioritize noncoding driver mutations, examine the collective impact of nondriver mutations, and analyze full mutational spectra. See cancer genomics papers.

In neurogenomics, we have developed a comprehensive functional genomic resource for the human brain in the PsychENCODE project, integrating single-cell data with bulk functional genomics datasets. We have used this resource to identify many eQTLs (expression QTLs) in both bulk and single-cell contexts. Furthermore, the resource has allowed us to construct predictive models linking genomic variants, via chromatin activity and single-cell gene expression, to observed organismal phenotypes for schizophrenia, bipolar disorder, and Alzheimer’s disease. These models enable us to highlight key genes and pathways in these disorders, potentially identifying drug targets. See neurogenomics papers.

Using Packing to Understand Macromolecular Dynamics

While non-coding regions play an important, if underappreciated, role in genome function and disease, we also characterize coding sequences and drill deep into their protein products. By analyzing protein motions, we can better predict how a mutation affects function. Our effort involves devising a system for characterizing motions in a standardized fashion in terms of key statistics, such as the degree of rotation around hinges. It is guided by the fact that tight packing highly restricts protein mobility. See molecular motion and structure papers.
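To make "degree of rotation around hinges" concrete, the sketch below recovers a rotation angle from a 3x3 rotation matrix using the standard trace identity, angle = arccos((trace - 1) / 2). It assumes the rotation relating two conformations of a domain has already been obtained (for example, by structural superposition); the function names and the 30-degree example are hypothetical and are not taken from the lab's software.

```python
import math


def rotation_angle_degrees(rotation):
    """Angle (in degrees) of a proper 3x3 rotation matrix, from its trace."""
    trace = rotation[0][0] + rotation[1][1] + rotation[2][2]
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def rotation_about_z(degrees):
    """Build a rotation matrix about the z-axis (used here as test input)."""
    t = math.radians(degrees)
    return [
        [math.cos(t), -math.sin(t), 0.0],
        [math.sin(t), math.cos(t), 0.0],
        [0.0, 0.0, 1.0],
    ]


# Hypothetical example: one domain rotated 30 degrees relative to another.
hinge_rotation = rotation_about_z(30.0)
angle = rotation_angle_degrees(hinge_rotation)
print(round(angle, 3))
```

The trace identity gives only the magnitude of the rotation; locating the hinge axis and the hinge residues themselves requires additional analysis.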

Interpretable Machine-Learning Tools for Biomedical Data

Much of our work involves developing practical tools and software applications for tackling concrete biomedical problems. Often, these take the form of computational pipelines or web servers that encapsulate various statistical and machine-learning methods. A key distinguishing aspect of our tools is that they are grounded in physical principles and biological mechanisms, which enhances interpretability and ensures alignment with established scientific knowledge. Examples of our recent tools include genomic pipelines for characterizing multi-scale “peaks” in chromatin activity data and identifying enhancers with recurrent neural networks, visualization servers for macromolecular motions and for the gene-regulatory hierarchy, and software for identifying protein misfolding by repurposing existing embeddings from large language models. See key paper tools and Gerstein Lab repository on GitHub for more details.

Privacy of Genomic and Biomedical Data

Increasingly, one of the main limitations in genomic analysis is securing enough individuals for a properly powered analysis. Assembling such cohorts requires keeping many individuals’ biomedical data private. While this may seem straightforward, it is highly complex due to the high dimensionality and large scale of the data, particularly for genomes. We have developed various statistical methods to quantify the extent of private information leakage, including subtle and often overlooked risks (e.g., via linking attacks). Additionally, we have designed approaches to selectively sanitize and share data while minimizing the loss of its utility for downstream analyses. These include secure data-sharing frameworks using homomorphic encryption and blockchain storage. See privacy papers.

Future Directions: Fusing Diverse Biomedical Data Modalities with AI Approaches

Going forward, we are working to integrate the broader range of biomedical data coming online, including image data, biosensor data, and various forms of textual data from publications and electronic health records. We see tremendous value in creatively fusing diverse data types, using the genome as an organizing platform. We have recently made notable progress in this direction, linking genetic variants to biosensor outputs (i.e., using smartwatch outputs in GWAS), developing ensemble machine-learning approaches for cryo-EM image processing, and developing large language models for automatic bioinformatics code generation.

References

See Papers.GersteinLab.org – in particular, Best Papers and listing of Key Contributions.

Some talks giving a quick overview of the lab: 5′ Quick Overview (’25), 5′ animation (’20), 15′ powerpoint (’19)

More information on research interests can also be found here.