Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning

Kim, SungHwan (2015) Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Preview

PDF
Submitted Version
Download (11MB)

Abstract

Over the decades, many statistical learning techniques such as supervised learning, unsupervised learning, dimension reduction technique have played ground breaking roles for important tasks in biomedical research. More recently, multi-omics data integration analysis has become increasingly popular to answer to many intractable biomedical questions, to improve statistical power by exploiting large size samples and different types omics data, and to replicate individual experiments for validation. This dissertation covers the several analytic methods and frameworks to tackle with practical problems in multi-omics data integration analysis.
Supervised prediction rules have been widely applied to high-throughput omics data to predict disease diagnosis, prognosis or survival risk. The top scoring pair (TSP) algorithm is a supervised discriminant rule that applies a robust simple rank-based algorithm to identify rank-altered gene pairs in case/control classes. TSP usually generates greatly reduced accuracy in inter-study prediction (i.e., the prediction model is established in the training study and applied to an independent test study). In the first part, we introduce a MetaTSP algorithm that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies.
One important objective of omics data analysis is clustering unlabeled patients in order to identify meaningful disease subtypes. In the second part, we propose a group structured integrative clustering method to incorporate a sparse overlapping group lasso technique and a tight clustering via regularization to integrate inter-omics regulation flow, and to encourage outlier samples scattering away from tight clusters. We show by two real examples and simulated data that our proposed methods improve the existing integrative clustering in clustering accuracy, biological interpretation, and are able to generate coherent tight clusters.
Principal component analysis (PCA) is commonly used for projection to low-dimensional space for visualization. In the third part, we introduce two meta-analysis frameworks of PCA (Meta-PCA) for analyzing multiple high-dimensional studies in common principal component space. Theoretically, Meta-PCA specializes to identify meta principal component (Meta-PC) space; (1) by decomposing the sum of variances and (2) by minimizing the sum of squared cosines. Applications to various simulated data shows that Meta-PCAs outstandingly identify true principal component space, and retain robustness to noise features and outlier samples. We also propose sparse Meta-PCAs that penalize principal components in order to selectively accommodate significant principal component projections. With several simulated and real data applications, we found Meta-PCA efficient to detect significant transcriptomic features, and to recognize visual patterns for multi-omics data sets.
In the future, the success of data integration analysis will play an important role in revealing the molecular and cellular process inside multiple data, and will facilitate disease subtype discovery and characterization that improve hypothesis generation towards precision medicine, and potentially advance public health research.

Citation/Export:
Social Networking:	Share \|

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Kim, SungHwan	suk73@pitt.edu	SUK73

ETD Committee:

Title	Member	Email Address	Pitt Username	ORCID
Committee Chair	Tseng, George C.	ctseng@pitt.edu	CTSENG
Committee Member	Park, YongSeok	yongpark@pitt.edu	YONGPARK
Committee Member	Weeks, Daniel E.	weeks@pitt.edu	WEEKS	0000-0001-9410-7228
Committee Member	Wei, Chen	weichen.mich@gmail.com
Committee Member	Lei, Jing	jinglei@andrew.cmu.edu

Date:

29 June 2015

Date Type:

Publication

Defense Date:

9 April 2015

Approval Date:

29 June 2015

Submission Date:

9 April 2015

Access Restriction:

2 year -- Restrict access to University of Pittsburgh for a period of 2 years.

Number of Pages:

117

Institution:

University of Pittsburgh

Schools and Programs:

School of Public Health > Biostatistics

Degree:

PhD - Doctor of Philosophy

Thesis Type:

Doctoral Dissertation

Refereed:

Yes

Uncontrolled Keywords:

Top scoring pair, machine learning, clustering, dimension reduction

Date Deposited:

29 Jun 2015 15:27

Last Modified:

30 Jun 2022 15:53

URI:

http://d-scholarship.pitt.edu/id/eprint/24698

Metrics

Monthly Views for the past 3 years

Plum Analytics

Actions (login required)

View Item

My Account

Search

Browse

Information

Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning

Abstract

Share

Details

Metrics

Monthly Views for the past 3 years

Plum Analytics

Actions (login required)

Connect with us

Send Comments or Questions

Feeds