Soon, sequencing one’s genome may become as commonplace as getting an X-ray. Consequently, personal genomes will increasingly serve as the lenses through which the public views biology. In response, the Gerstein Lab focuses on interpreting personal genomes, particularly in relation to disorders such as cancer. This endeavor has a number of related aspects, described below. Moreover, the approaches we take have broad connections to a variety of data-intensive fields within the emerging discipline of data science.
Personal Genome Variation: SVs
We are involved in finding variants in personal genomes. We focus on particular types of variants, which involve the re-arrangement of large blocks of the genome (structural variation). It is believed that structural variants involve as many nucleotides in the genome as the better-known SNPs. Moreover, re-arrangements are very prevalent in genomic diseases such as cancer, and we have developed tools for identifying them (e.g. using split reads and fusion genes). See: SV papers.
Human Genome Annotation: Processing Next-Gen Sequencing Data
After one has determined all of the variants in an individual’s genome, the next step is understanding what they mean. This involves genome annotation, where one places each base within a biochemical context. Our focus has been on transcription-factor binding sites and non-coding RNAs (ncRNAs). We have carried out this effort by processing next-generation sequencing data (i.e. RNA-seq and ChIP-seq). We have developed tools to identify ncRNAs and regions of intragenic transcription. We have also developed methods for finding transcription-factor binding sites by processing ChIP-seq reads and for using the level of this binding to statistically predict the expression of target genes. See: Next-Gen and RNAseq papers.
Comparative Genomics: Pseudogenes as Molecular Fossils
Pseudogenes provide a contrasting annotation to binding sites and ncRNAs in being derived from comparative rather than functional genomics data. They provide information about human molecular history. We have developed methods for identifying them. We were one of the first groups to perform comprehensive surveys, illustrating the different pseudogene repertoires in different organisms. Moreover, we have found hints that some supposedly “dead” pseudogenes may actually harbor biochemical activity. See: pseudogene papers.
Protein Structure and Function: Macromolecular Motions
While non-coding regions play an important, if underappreciated, role in genome function and disease, we also work on characterizing coding sequences, drilling deep into their protein products. We have a particular focus on loss-of-function mutations. Moreover, by analyzing protein motions we can better predict how a mutation affects function. This effort involves devising a system for characterizing motions in a standardized fashion in terms of key statistics, such as the degree of rotation about hinges. It is guided by the fact that protein mobility is highly restricted by tight packing. We have developed tools for measuring packing efficiency using specialized geometric constructions (e.g. Voronoi polyhedra). See: molecular motion and structure papers.
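As a small illustration of one such statistic, the sketch below recovers the rotation angle from a 3x3 rotation matrix using the standard identity trace(R) = 1 + 2 cos(theta). It is not the lab's own tool: the function names are hypothetical, and it assumes the rotation relating a domain in two conformations has already been obtained (for example, by superposition).

```python
import math


def rotation_angle_degrees(R):
    """Return the rotation angle (degrees) encoded by a 3x3 rotation matrix.

    Uses the identity trace(R) = 1 + 2*cos(theta). R is assumed to be a
    proper rotation matrix given as a list of three rows.
    """
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def rotation_about_z(deg):
    """Build a rotation matrix about the z-axis (toy stand-in for a hinge axis)."""
    t = math.radians(deg)
    return [
        [math.cos(t), -math.sin(t), 0.0],
        [math.sin(t), math.cos(t), 0.0],
        [0.0, 0.0, 1.0],
    ]


# Toy example: a domain that rotates by 30 degrees about a hinge axis.
hinge_rotation = rotation_about_z(30.0)
angle = rotation_angle_degrees(hinge_rotation)
```

Here `angle` describes how far the domain turns; the hinge axis itself would be a separate statistic.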
Analysis of Diverse Networks
Networks are a way of tying together much of our research. Network representations can be applied consistently to many different types of biological data; thus, we have developed tools to build and analyze regulatory networks, protein-protein interactions and metabolic pathways, identifying key nodes such as hubs and bottlenecks. Moreover, because they are a generic and flexible representation, networks provide an ideal framework for data integration. We have integrated networks with dynamic gene-expression data (identifying transient hubs), 3D-protein structures, and even satellite imagery. Finally, as people have more intuition for commonplace networks, such as those in social and computer systems, we have found cross-disciplinary comparisons helpful in elucidating system-level properties of biological networks, such as the association of greater connectivity with more evolutionary constraint. See: networks papers.
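To make the notions of hubs and bottlenecks concrete, here is a minimal sketch. It assumes the common definitions (hubs as nodes of high degree, bottlenecks as nodes of high betweenness centrality) and runs on a made-up toy network; the node names and helper functions are hypothetical and do not represent the lab's software.

```python
from collections import deque


def degree(adj):
    """Number of neighbors of each node in an undirected graph."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def betweenness(adj):
    """Betweenness centrality via Brandes' algorithm (unweighted, undirected)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                   # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                   # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair is visited from both endpoints, so halve the totals.
    return {v: b / 2 for v, b in bc.items()}


# Toy network: two fully connected clusters joined through one bridge node "B".
edges = [
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
    ("e", "f"), ("e", "g"), ("e", "h"), ("f", "g"), ("f", "h"), ("g", "h"),
    ("a", "B"), ("B", "e"),
]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

deg = degree(adj)
bc = betweenness(adj)
top_bottleneck = max(bc, key=bc.get)   # node with the highest betweenness
max_degree = max(deg.values())
hubs = sorted(v for v in deg if deg[v] == max_degree)
```

In this toy network the bridge node has only two neighbors yet carries every shortest path between the clusters, so it is the top bottleneck without being a hub, while the two highest-degree nodes sit inside the clusters.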
Genomics at the Forefront of Data Science
Overall, the Gerstein lab acts as a connector, bringing quantitative approaches from disciplines such as CS and statistics to bear on practical questions and large-scale data in molecular biology. In particular, we have focused on applying technical approaches in simulation, machine learning, and knowledge-base design. Often, we carry out our work in multi-disciplinary teams. Some of the key collaborative efforts that we are involved in include KBase, Brainspan, ENCODE, modENCODE, 1000 Genomes, PCAWG, the exRNA consortia and the Centers for Mendelian Genomics.
As a discipline, genomics is an exemplar for using big data to construct a resource and answer questions. Consequently, it is at the forefront in the emerging field of data science and provides an ideal training for future data scientists.
Personal genomics also acts as a bridge connecting the biological sciences to larger issues facing other big-data disciplines. For instance, data mining generally poses questions related to privacy. We study the fundamental privacy implications of mining personal genomes, which contain immutable information that is shared among relatives and will become increasingly revealing in generations to come. Also, we have examined how general knowledge-representation issues associated with publishing and digital libraries relate to biological databases. We envision a future of structured literature, with less distinction between databases and journals.