Project 2: Genome-wide Inference of Human Gene Function From Model Organism Data

In Progress



An enormous amount of biological information has been painstakingly accumulated from experiments not only in human cells, but in many other organisms as well. This information has been critical for using genomics experiments to understand cancer progression and potential therapies, but current approaches make use of only a small fraction of the available information. This Project will dramatically increase the amount of biological information available, improving genomics analysis in studies of cancer and other diseases.

Pathway analysis of genomic data—the use of prior knowledge about how genes function together in biological systems—plays an increasingly critical role in gaining biological insights from large-scale genomic studies, and particularly in cancer research. However, even the richest source of computer-accessible biological pathway information, the Gene Ontology (GO), is very incomplete, hampering pathway analyses. Over the past three years, the GO Consortium has developed a project that has shown that, by utilizing a rigorous phylogenetic approach, we can increase the amount of knowledge for human genes by five-fold through careful use of experimental data obtained in model organisms such as the mouse, fruit fly, and yeast. The GOC project, however, relies on expert human biologists, and will not scale to the entire human genome.

The Project develops a computational approach that leverages the experience gained in the GOC project. We are developing an accurate, scalable computational solution to the gene function inference problem. The task is to integrate knowledge obtained from experiments across multiple organisms, in the context of the family tree that relates the genes, by constructing a probabilistic model of function conservation and divergence. The main application of the probabilistic model will be to infer the function of human genes from experiments in other organisms. While each gene family will have a specific model depending on its own unique history, we will estimate only a small number of parameters, shared across all families, to avoid over-fitting. We use the same rigorous model of functional evolution as employed in the GOC project, which is based on evolutionary gain and loss of different kinds of functions (e.g. a catalytic function, a binding function, or even participation in a biological process or pathway), using not only GO annotations but also additional information such as protein domain structure and active sites. We use the manually curated examples from the GO Consortium both as a training set for developing our computational inference method and as a test set for assessing it. We expect that this work will result in a dramatic increase in the number of GO annotations for human genes, resulting in much more informative results from pathway analysis and thus generating additional insights into human disease risk, progression and potential therapies. We focus manual validation on cancer-related pathways in order to ensure applicability specifically in cancer research.
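To make the idea of a gain-and-loss model on a gene family tree more concrete, the following is a minimal sketch and not the Project's actual model. It assumes a single binary function (present or absent), one gain probability and one loss probability per branch shared by every family, and it ignores branch lengths, duplication events, protein domains and active sites. The names `Node`, `transition`, `partial_likelihood`, `tree_likelihood` and `total_log_likelihood` are hypothetical, and the example families and parameter values are invented for illustration. The likelihood is computed with the standard pruning recursion from the leaves to the root.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a gene family tree (hypothetical structure for this sketch)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    state: Optional[int] = None  # 1 = annotated with the function, 0 = lacks it, None = unknown


def transition(parent_state: int, child_state: int, gain: float, loss: float) -> float:
    """Probability of the child's state given the parent's state along one branch."""
    if parent_state == 0:
        return gain if child_state == 1 else 1.0 - gain
    return loss if child_state == 0 else 1.0 - loss


def partial_likelihood(node: Node, gain: float, loss: float) -> List[float]:
    """Return [P(data below node | node = 0), P(data below node | node = 1)]."""
    if not node.children:
        if node.state is None:  # an unannotated leaf is compatible with both states
            return [1.0, 1.0]
        return [1.0 if node.state == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child in node.children:
        child_like = partial_likelihood(child, gain, loss)
        for s in (0, 1):
            result[s] *= sum(
                transition(s, c, gain, loss) * child_like[c] for c in (0, 1)
            )
    return result


def tree_likelihood(root: Node, gain: float, loss: float, root_prior_present: float) -> float:
    """Likelihood of the observed leaf annotations for one gene family."""
    like = partial_likelihood(root, gain, loss)
    return (1.0 - root_prior_present) * like[0] + root_prior_present * like[1]


def total_log_likelihood(families: List[Node], gain: float, loss: float,
                         root_prior_present: float) -> float:
    """Sum over families: every tree differs, but the parameters are shared."""
    return sum(
        math.log(tree_likelihood(f, gain, loss, root_prior_present)) for f in families
    )


# Two invented families; the human genes are unannotated.
family_a = Node("ancestor_a", [
    Node("inner", [Node("human_gene_a"), Node("mouse_gene_a", state=1)]),
    Node("fly_gene_a", state=1),
])
family_b = Node("ancestor_b", [Node("human_gene_b"), Node("yeast_gene_b", state=0)])

print(total_log_likelihood([family_a, family_b], gain=0.05, loss=0.1,
                           root_prior_present=0.5))
```

Because `gain`, `loss` and `root_prior_present` are the only free quantities, adding more families adds data without adding parameters, which is the over-fitting safeguard described above.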

Recent progress includes the submission of our algorithm to PLoS Computational Biology, where the manuscript is currently under revision. The algorithm is a computationally efficient model of the evolution of gene annotations along phylogenies, built on a Bayesian framework that uses Markov chain Monte Carlo for parameter estimation. Unlike previous approaches, our method is able to estimate parameters over many different phylogenetic trees and functions, including large phylogenetic trees. The resulting parameters agree with biological intuition, such as the increased probability of function change following gene duplication. The method performs well under leave-one-out cross-validation, and we further validated some of the predictions against the experimental scientific literature.
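As a rough illustration of Markov chain Monte Carlo parameter estimation, the sketch below runs a random-walk Metropolis sampler for a gain probability and a loss probability. It is deliberately a toy and not the submitted algorithm: it assumes the number of branches on which a function was gained or lost is directly observed and pooled across families, whereas in the real problem ancestral states are hidden and must be summed over on each tree. The counts, the uniform priors, the step size and the names `log_likelihood`, `metropolis` and `posterior_means` are all assumptions made for this example.

```python
import math
import random


def log_likelihood(gain, loss, counts):
    """Binomial log-likelihood of pooled branch transitions (toy data)."""
    return (counts["gain"] * math.log(gain)
            + counts["no_gain"] * math.log(1.0 - gain)
            + counts["loss"] * math.log(loss)
            + counts["no_loss"] * math.log(1.0 - loss))


def metropolis(counts, steps=20000, burn_in=2000, step_size=0.05, seed=0):
    """Random-walk Metropolis sampler with Uniform(0, 1) priors on both parameters."""
    rng = random.Random(seed)
    gain, loss = 0.5, 0.5
    current = log_likelihood(gain, loss, counts)
    samples = []
    for i in range(steps):
        new_gain = gain + rng.gauss(0.0, step_size)
        new_loss = loss + rng.gauss(0.0, step_size)
        # Proposals outside (0, 1) have zero prior density and are rejected.
        if 0.0 < new_gain < 1.0 and 0.0 < new_loss < 1.0:
            proposed = log_likelihood(new_gain, new_loss, counts)
            accept_prob = math.exp(min(0.0, proposed - current))
            if rng.random() < accept_prob:
                gain, loss, current = new_gain, new_loss, proposed
        if i >= burn_in:
            samples.append((gain, loss))
    return samples


def posterior_means(samples):
    """Average the retained samples to summarise the posterior."""
    n = len(samples)
    return (sum(g for g, _ in samples) / n, sum(l for _, l in samples) / n)


# Invented counts: branches starting without the function (gain / no_gain)
# and branches starting with it (loss / no_loss).
toy_counts = {"gain": 5, "no_gain": 95, "loss": 20, "no_loss": 80}
mean_gain, mean_loss = posterior_means(metropolis(toy_counts))
print(mean_gain, mean_loss)
```

In a fuller treatment, `log_likelihood` would be replaced by a sum of per-family tree likelihoods evaluated at the proposed parameters, so that one chain draws on every tree at once.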

We explored to what extent predictions from our model can be used to suggest function. Using the parameter estimates from the model, we calculated posterior probabilities for all the genes involved in a selection of Gene Ontology trees. We then focused on leaves whose state was predicted with a high degree of certainty and compared these predictions to the annotations available from the QuickGO API, regardless of evidence code. Ten of the predictions were for genes in the mouse, a well-studied organism. We searched the literature for evidence of the predicted functions and uncovered evidence for six of the ten predictions. This demonstrates how our method can be used to successfully predict unannotated function.
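The posterior probability for a single leaf can be sketched as follows, under the same simplifying assumptions as a two-state gain and loss model with fixed per-branch probabilities; this is an illustration and not the Project's implementation. The tree is written as nested lists with `(name, state)` tuples for leaves, and the names `prune`, `with_leaf_state` and `leaf_posterior`, the example tree and the parameter values are all hypothetical. The posterior is obtained by fixing the target leaf to each state in turn and normalising the two resulting likelihoods. Applying the same function to a leaf that already has an annotation gives the held-out prediction used in leave-one-out cross-validation.

```python
def prune(tree, gain, loss):
    """Return (P(data | node lacks function), P(data | node has function))."""
    if isinstance(tree, tuple):  # leaf: (name, observed state or None)
        _, state = tree
        if state is None:
            return (1.0, 1.0)
        return (1.0, 0.0) if state == 0 else (0.0, 1.0)
    like0, like1 = 1.0, 1.0
    for child in tree:  # internal node: list of subtrees
        c0, c1 = prune(child, gain, loss)
        like0 *= (1.0 - gain) * c0 + gain * c1
        like1 *= loss * c0 + (1.0 - loss) * c1
    return (like0, like1)


def with_leaf_state(tree, target, state):
    """Copy the tree, overriding the state of the leaf named `target`."""
    if isinstance(tree, tuple):
        name, _ = tree
        return (name, state) if name == target else tree
    return [with_leaf_state(child, target, state) for child in tree]


def leaf_posterior(tree, target, gain, loss, root_prior):
    """P(target leaf has the function | annotations on all other leaves)."""
    def marginal(state):
        l0, l1 = prune(with_leaf_state(tree, target, state), gain, loss)
        return (1.0 - root_prior) * l0 + root_prior * l1

    absent, present = marginal(0), marginal(1)
    return present / (absent + present)


# Invented family: the human gene is unannotated, its relatives are annotated.
family = [[("human_gene", None), ("mouse_gene", 1)], ("fly_gene", 1)]
print(leaf_posterior(family, "human_gene", gain=0.05, loss=0.1, root_prior=0.5))

# Leave-one-out style check: hide the mouse annotation and predict it back.
print(leaf_posterior(family, "mouse_gene", gain=0.05, loss=0.1, root_prior=0.5))
```

Leaves whose posterior is close to 0 or 1 correspond to the high-certainty predictions that were compared against existing annotations.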

In other recent progress, as part of our role in the Gene Ontology project, we were part of an effort to increase the utility of Gene Ontology (GO) annotations for interpretation of genome-wide experimental data by developing GO-CAM, a structured framework for linking multiple GO annotations into an integrated model of a biological system. It is expected that GO-CAM will enable new applications in pathway and network analysis, as well as improve standard GO annotations for traditional GO-based applications. GO-CAM extends the existing annotation paradigm by introducing the concept of a model, which is a collection of connected GO annotations (plus contextual information from other ontologies) linked according to a defined schema. This work can be found here.
