Project 1: Integrated Analysis for Genetic Association

In Progress



Cancer results from a complex series of alterations of the structure, function, and regulation of the genome. Integration of information across these multiple genomic ‘dimensions’ can provide insights into the development and progression of cancer and accelerate the discovery of novel biomarkers for prediction and prognosis. The goal of this Project is to develop novel statistical methods for integrating multiple levels of genomic information to elucidate the complex mechanisms of cancer development and progression and to investigate the determinants and predictors of cancer clinical outcomes. We apply these methods to two studies that have characterized germline and somatic variation in tumors, one of colorectal cancer patients followed for clinical outcomes, and one large consortium of colorectal cancer association studies.

This Project develops analytical tools that integrate data from multiple genomic platforms and incorporate external omic information from publicly available databases. These tools are applicable both to etiological studies geared toward causal discovery and to clinical and translational studies geared toward predictive modeling.

Advances in high-throughput molecular technologies have enabled large-scale omic projects (e.g. ENCODE, The Cancer Genome Atlas, Epigenome Roadmap) to generate vast amounts of information on the structure, function and regulation of the genome. In addition to these publicly available data, individual studies are increasingly generating multiplatform genomic profiles (e.g. genotypes, gene expression, methylation, copy number variation, miRNA) to elucidate the complex mechanisms of cancer development and progression, and to investigate determinants and predictors of health and clinical outcomes.

Integration across these multiple genomic “dimensions” and incorporation of the available external information can increase the ability to discover causal relationships (e.g. cancer-SNP associations), enhance prediction and prognosis modeling (e.g. cancer aggressiveness), and provide insights into biological mechanisms. We exploit two analytic approaches that address the challenges of integrating multiplatform genomic data effectively and of incorporating external information from omic projects.

The first approach (Aim 1) is a Bayesian regression and feature selection method that integrates prior omic information flexibly, allowing the data to ‘speak for itself’ in determining which pieces of external information are relevant for the problem at hand. The method works with individual-level data and also with meta-analytic summaries, making it well suited for analyzing data from large multi-study consortia.

The second approach (Aim 2) is a regularized regression and feature selection method for integrating multiplatform genomic features measured on the same set of individuals. The method is designed to scale to the very large numbers of features typical of genomewide platforms, to account for the different properties of each genomic data type, and to incorporate relevant external information to increase efficiency.

Both approaches can be applied for causal discovery and for developing predictive and prognostic models. We will apply our methods to search for novel risk variants in the CORECT consortium of genomewide association studies, and to construct a prognostic model of colorectal cancer (CRC) recurrence based on genomewide expression and methylation data in the ColoCare consortium cohort of CRC patients. This work will provide new tools for analyzing high-dimensional multi-platform genomic data that can take advantage of available external information.

As an example of recent progress, in collaboration with Project 3 we have published an integrative model for estimating latent unknown clusters (LUCID). The model aims to distinguish the distinct effects of genomic features, exposures, and informative biomarkers or other omic data while jointly estimating subgroups relevant to the outcome of interest. The R package is available on CRAN and GitHub and currently has over 9000 downloads (see Software page). This work has been used in several applied studies with corresponding manuscripts published or under review (Alderete et al. 2019; Jin et al. 2020; Stratakis et al. 2020). We have two additional manuscripts in preparation, one describing the software and one describing a methodological extension in which the method performs inference when measured omic data are missing for a portion of individuals. These two additional manuscripts will be submitted in late 2020.

We have also recently published on the Comprehensive R Archive Network (CRAN) the ‘xrnet’ R package, which implements our hierarchical regularized regression approach for incorporating external information. A paper describing the software has been published in the Journal of Open Source Software (Weaver et al. 2020). The main paper describing the approach will be submitted for publication within the next few months. We have also extended the approach, originally developed for quantitative outcomes in years 1 and 2 and for binary outcomes in year 3, to time-to-event (survival) outcomes via a penalized hierarchical Cox regression. We have a manuscript in preparation describing the approach for survival outcomes and have written the corresponding C++ implementation, which will be incorporated into the ‘xrnet’ package.
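To make the idea of hierarchical regularized regression with external information concrete, the following is a minimal Python sketch under stated assumptions. It is not the ‘xrnet’ implementation or its API. It assumes a quantitative outcome, centered data with no intercept, and ridge penalties at both levels, so that the estimate has a closed form. The function name `hierarchical_ridge`, the penalty parameters `lam_beta` and `lam_alpha`, and the simulated data are all hypothetical.

```python
import numpy as np


def hierarchical_ridge(X, y, Z, lam_beta, lam_alpha):
    """Ridge-ridge sketch of regression with external information.

    Minimizes
        ||y - X b||^2 + lam_beta * ||b - Z a||^2 + lam_alpha * ||a||^2
    over feature effects b (length p) and external-data effects a (length q).
    Z has one row of external annotations per feature (shape p x q).
    """
    _, p = X.shape
    q = Z.shape[1]
    # Reparameterize b = g + Z a, so that y = X g + (X Z) a + noise and the
    # penalty separates into lam_beta * ||g||^2 + lam_alpha * ||a||^2.
    A = np.hstack([X, X @ Z])
    penalty = np.diag(
        np.concatenate([np.full(p, lam_beta), np.full(q, lam_alpha)])
    )
    theta = np.linalg.solve(A.T @ A + penalty, A.T @ y)
    g, a = theta[:p], theta[p:]
    return g + Z @ a, a


# Simulated example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, p, q = 200, 30, 3
X = rng.normal(size=(n, p))
Z = rng.normal(size=(p, q))  # hypothetical annotations, one row per feature
alpha_true = np.array([1.0, -1.0, 0.5])
beta_true = Z @ alpha_true + 0.1 * rng.normal(size=p)
y = X @ beta_true + rng.normal(size=n)

beta_hat, alpha_hat = hierarchical_ridge(X, y, Z, lam_beta=10.0, lam_alpha=1.0)
print(beta_hat.shape, alpha_hat.shape)
```

In this sketch, a very large `lam_alpha` forces the external-data effects toward zero and recovers ordinary ridge regression, while a very large `lam_beta` forces each feature effect to equal its prediction from the external data. Intermediate values let the data decide how much weight the external information receives.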

Additionally, we have published on CRAN the ‘xtune’ R package, which implements our alternative approach for integrating omic annotations by modeling the variance of the coefficients of the omic features as a function of the annotations. The paper describing the approach has been submitted for publication to Bioinformatics. We have also extended the method to handle binary outcomes and the elastic net penalty, which has better selection characteristics for correlated features than the LASSO. We have a manuscript in preparation describing these two extensions and have written the R implementation of both, which will be incorporated into the ‘xtune’ package. We have explored adding the L0 penalty to ‘xrnet’, but in addition to being computationally demanding, it did not show performance improvements in our simulations that merit the computational effort. We will instead pursue incorporation of the non-convex but more tractable penalties SCAD and MCP.
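The idea of letting annotations drive the coefficient variances can be illustrated with a short Python sketch. It is not the ‘xtune’ implementation or its API. It assumes a quantitative outcome, a ridge (Gaussian prior) penalty, a log-linear link between annotations and prior variance, and annotation coefficients `a` that are already given. How those coefficients are estimated from the data is outside this sketch. The function name `annotation_weighted_ridge` and all inputs are hypothetical.

```python
import numpy as np


def annotation_weighted_ridge(X, y, Z, a, sigma2=1.0):
    """Ridge regression with feature-specific penalties set by annotations.

    Z has one row of annotations per feature (shape p x q) and a holds the
    annotation coefficients (length q), assumed known for this sketch.
    """
    # Assumed log-linear model for the prior variance of each coefficient.
    tau2 = np.exp(Z @ a)
    # A Gaussian prior with variance tau2[j] corresponds to a ridge penalty
    # of sigma2 / tau2[j] on coefficient j.
    penalties = sigma2 / tau2
    beta = np.linalg.solve(X.T @ X + np.diag(penalties), X.T @ y)
    return beta, penalties


# Illustrative data: the first five features carry a hypothetical
# "functional" annotation and therefore receive a smaller penalty.
rng = np.random.default_rng(1)
n_obs, n_feat = 100, 10
X_demo = rng.normal(size=(n_obs, n_feat))
Z_demo = np.zeros((n_feat, 1))
Z_demo[:5, 0] = 1.0
beta_demo = np.concatenate([rng.normal(size=5), np.zeros(5)])
y_demo = X_demo @ beta_demo + rng.normal(size=n_obs)

beta_est, pen = annotation_weighted_ridge(X_demo, y_demo, Z_demo, np.array([2.0]))
print(pen)
```

Features whose annotations imply a larger prior variance are shrunk less, which is how external annotations can sharpen estimation for features that are more likely to matter. Setting `a` to zero gives every feature the same penalty and reduces the sketch to ordinary ridge regression.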
