
Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning

Kim, SungHwan (2015) Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

PDF (Submitted Version), 11MB

Abstract

Over the past decades, statistical learning techniques such as supervised learning, unsupervised learning, and dimension reduction have played groundbreaking roles in important biomedical research tasks. More recently, multi-omics data integration analysis has become increasingly popular as a way to answer many intractable biomedical questions, to improve statistical power by exploiting large sample sizes and different types of omics data, and to replicate individual experiments for validation. This dissertation covers several analytic methods and frameworks that tackle practical problems in multi-omics data integration analysis.
Supervised prediction rules have been widely applied to high-throughput omics data to predict disease diagnosis, prognosis, or survival risk. The top scoring pair (TSP) algorithm is a supervised discriminant rule that applies a robust, simple rank-based algorithm to identify gene pairs whose rank order is altered between case and control classes. However, TSP usually suffers greatly reduced accuracy in inter-study prediction (i.e., when the prediction model is established in a training study and applied to an independent test study). In the first part, we introduce a MetaTSP algorithm that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies.
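For readers unfamiliar with the rank-based rule, the following is a minimal sketch of the classic single-study TSP classifier, not the MetaTSP method proposed in the dissertation. The function names `fit_tsp` and `predict_tsp` and the toy data are hypothetical and chosen only for illustration. The sketch scores every gene pair (i, j) by how much the frequency of "gene i is below gene j" differs between cases and controls, keeps the best pair, and classifies new samples by the ordering of that pair alone.

```python
import numpy as np

def fit_tsp(X, y):
    """Pick the single top scoring pair from X (samples x genes) and 0/1 labels y.

    Returns (i, j, case_when_less): predict case (1) when the ordering
    X[:, i] < X[:, j] matches case_when_less, otherwise control (0).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # less[s, i, j] is True when gene i is below gene j in sample s.
    # This needs O(samples * genes^2) memory, so it only suits small examples.
    less = X[:, :, None] < X[:, None, :]
    p_case = less[y == 1].mean(axis=0)   # P(X_i < X_j | case)
    p_ctrl = less[y == 0].mean(axis=0)   # P(X_i < X_j | control)
    score = np.abs(p_case - p_ctrl)
    i, j = np.unravel_index(np.argmax(score), score.shape)
    case_when_less = bool(p_case[i, j] > p_ctrl[i, j])
    return int(i), int(j), case_when_less

def predict_tsp(model, X):
    """Classify samples using only the within-sample ordering of the chosen pair."""
    i, j, case_when_less = model
    X = np.asarray(X, dtype=float)
    less = X[:, i] < X[:, j]
    return np.where(less == case_when_less, 1, 0)

# Toy data (illustrative only): gene 0 is below gene 1 in cases, above it in controls.
X_train = np.array([[1, 5, 3], [2, 6, 1], [0, 4, 9],
                    [5, 1, 3], [6, 2, 1], [4, 0, 9]])
y_train = np.array([1, 1, 1, 0, 0, 0])
model = fit_tsp(X_train, y_train)
```

Because the rule depends only on the relative order of two genes within a sample, predictions are unchanged by any increasing transformation applied to a whole sample, which is one reason rank-based rules are considered robust to platform and normalization differences.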
One important objective of omics data analysis is clustering unlabeled patients in order to identify meaningful disease subtypes. In the second part, we propose a group-structured integrative clustering method that incorporates a sparse overlapping group lasso technique and tight clustering via regularization, in order to integrate inter-omics regulation flow and to encourage outlier samples to scatter away from tight clusters. We show with two real examples and simulated data that the proposed methods improve on existing integrative clustering in clustering accuracy and biological interpretation, and are able to generate coherent tight clusters.
Principal component analysis (PCA) is commonly used to project data into a low-dimensional space for visualization. In the third part, we introduce two meta-analysis frameworks of PCA (Meta-PCA) for analyzing multiple high-dimensional studies in a common principal component space. Theoretically, Meta-PCA identifies the meta principal component (Meta-PC) space (1) by decomposing the sum of variances and (2) by minimizing the sum of squared cosines. Applications to various simulated data show that the Meta-PCAs accurately identify the true principal component space and remain robust to noise features and outlier samples. We also propose sparse Meta-PCAs that penalize principal components in order to selectively accommodate significant principal component projections. In several simulated and real data applications, we found Meta-PCA efficient at detecting significant transcriptomic features and at recognizing visual patterns in multi-omics data sets.
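The abstract does not spell out either Meta-PCA formulation, so the following is only a rough sketch of one plausible reading of approach (1): sum the per-study sample covariance matrices over a shared feature set and take the leading eigenvectors of that sum as a common component space. This reading, the function name `meta_pca_sum_of_variances`, and the toy studies are assumptions made for illustration and are not taken from the dissertation. The sparse variants and approach (2) are not covered.

```python
import numpy as np

def meta_pca_sum_of_variances(studies, n_components=2):
    """Assumed reading of 'decomposing the sum of variances' (illustrative only).

    studies: list of arrays, each (n_samples_k, n_features), sharing the same
    features in the same column order. Returns (loadings, eigenvalues), with
    loadings of shape (n_features, n_components).
    """
    n_features = np.asarray(studies[0]).shape[1]
    total_cov = np.zeros((n_features, n_features))
    for X in studies:
        X = np.asarray(X, dtype=float)
        # np.cov centers each feature; rowvar=False treats columns as variables.
        total_cov += np.cov(X, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(total_cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order]

# Two toy studies (illustrative only) whose variance is dominated by feature 0.
study_a = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
study_b = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
loadings, variances = meta_pca_sum_of_variances([study_a, study_b], n_components=1)
```

Each study can then be projected onto the shared loadings (after centering its features) so that samples from different studies are visualized in the same low-dimensional coordinates.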
In the future, successful data integration analysis will play an important role in revealing the molecular and cellular processes underlying multiple data sets, will facilitate disease subtype discovery and characterization that improve hypothesis generation towards precision medicine, and will potentially advance public health research.


Details

Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors:
Kim, SungHwan; Email: suk73@pitt.edu; Pitt Username: SUK73
ETD Committee:
Committee Chair: Tseng, George C.; Email: ctseng@pitt.edu; Pitt Username: CTSENG
Committee Member: Park, YongSeok; Email: yongpark@pitt.edu; Pitt Username: YONGPARK
Committee Member: Weeks, Daniel E.; Email: weeks@pitt.edu; Pitt Username: WEEKS
Committee Member: Wei, Chen; Email: weichen.mich@gmail.com
Committee Member: Lei, Jing; Email: jinglei@andrew.cmu.edu
Date: 29 June 2015
Date Type: Publication
Defense Date: 9 April 2015
Approval Date: 29 June 2015
Submission Date: 9 April 2015
Access Restriction: 2 year -- Restrict access to University of Pittsburgh for a period of 2 years.
Number of Pages: 117
Institution: University of Pittsburgh
Schools and Programs: Graduate School of Public Health > Biostatistics
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Top scoring pair, machine learning, clustering, dimension reduction
Date Deposited: 29 Jun 2015 15:27
Last Modified: 01 May 2017 05:15
URI: http://d-scholarship.pitt.edu/id/eprint/24698
