Clustering methods with variable selection for data with mixed variable types or limits of detection

Wang, Shu (2019) Clustering methods with variable selection for data with mixed variable types or limits of detection. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Abstract

Clustering has emerged as one of the most essential and popular techniques for discovering patterns in data. However, applying clustering poses several challenges. First, many existing clustering methods are useful only for data in which all variables are continuous or all are categorical, despite the abundance of data with mixed variable types. Second, clustering algorithms typically require complete data, but measurements of clinical biomarkers are often subject to limits of detection (LOD). In addition, researchers are increasingly interested in variable importance because of the growing number of variables available for clustering. To overcome these challenges, this dissertation proposes clustering methods for mixed data that can perform variable selection and handle censored biomarker variables.

In the first section, we propose a hybrid density- and partition-based (HyDaP) algorithm for mixed data. The HyDaP algorithm involves two steps: a variable selection step and a clustering step. In the first step, variables that contribute substantially to clustering are selected; in the second step, a novel dissimilarity measure is applied to the selected variables to obtain the final clustering results. Simulations and real data analyses were conducted to compare the performance of the HyDaP algorithm with that of other commonly used clustering algorithms.
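
To illustrate what a dissimilarity over mixed variable types restricted to selected variables can look like, here is a minimal sketch. It uses a generic Gower-style measure (range-scaled absolute difference for continuous variables, simple mismatch for categorical ones); this is an assumption made for illustration and is not the novel HyDaP dissimilarity, which the abstract does not define. The function name `mixed_dissimilarity` and its parameters are hypothetical.

```python
def mixed_dissimilarity(a, b, is_categorical, ranges, selected):
    """Gower-style dissimilarity between records a and b.

    Only the variables whose indices appear in `selected` are used,
    mimicking a clustering step that follows variable selection.
    `ranges[j]` is the observed range of continuous variable j
    (ignored for categorical variables).
    """
    total = 0.0
    for j in selected:
        if is_categorical[j]:
            # Simple matching: 0 if the categories agree, 1 otherwise.
            total += 0.0 if a[j] == b[j] else 1.0
        else:
            # Range-scaled absolute difference; a zero range contributes 0.
            total += abs(a[j] - b[j]) / ranges[j] if ranges[j] > 0 else 0.0
    return total / len(selected)


# Two hypothetical patient records: (continuous, categorical, continuous).
a = (1.0, "A", 10.0)
b = (3.0, "B", 10.0)
is_categorical = (False, True, False)
ranges = (4.0, None, 20.0)

d_selected = mixed_dissimilarity(a, b, is_categorical, ranges, selected=[0, 1])
d_all = mixed_dissimilarity(a, b, is_categorical, ranges, selected=[0, 1, 2])
print(d_selected)  # 0.75
print(d_all)       # 0.5
```

A pairwise matrix built from such a measure could then be passed to any partition-based method that accepts precomputed dissimilarities. Dropping the uninformative third variable raises the dissimilarity from 0.5 to 0.75, which shows how variable selection can sharpen the separation between records.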

In the second section, we propose a Bayesian finite mixture model that simultaneously conducts variable selection, accounts for biomarker LOD, and obtains clustering results. We place a spike-and-slab type prior on each variable to obtain variable importance. To account for LOD, we add a step to the Gibbs sampler that iteratively fills in censored biomarker values. The same simulation settings and real data were used to evaluate its clustering performance.
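
The extra Gibbs step that fills in censored values can be sketched as a data-augmentation draw. The sketch below assumes Gaussian cluster components and left-censoring at a single known LOD, so each censored value is redrawn from its current cluster's normal distribution truncated at the LOD. These modeling choices, the function name `impute_below_lod`, and all parameter names are assumptions for illustration; the abstract does not specify the sampler's details.

```python
import random
from statistics import NormalDist


def impute_below_lod(values, censored, labels, means, sds, lod, rng):
    """Redraw each censored value from its cluster's normal distribution,
    truncated to lie at or below the limit of detection (lod)."""
    imputed = list(values)
    for i, is_censored in enumerate(censored):
        if not is_censored:
            continue  # observed values are left untouched
        k = labels[i]  # current cluster assignment of observation i
        dist = NormalDist(means[k], sds[k])
        # Inverse-CDF sampling: draw u uniformly on (0, P(X <= lod)).
        u = rng.uniform(0.0, dist.cdf(lod))
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
        imputed[i] = min(dist.inv_cdf(u), lod)
    return imputed


rng = random.Random(0)
values = [0.5, 2.0, 0.5, 3.1]  # censored entries hold the LOD as a placeholder
censored = [True, False, True, False]
labels = [0, 0, 1, 1]
filled = impute_below_lod(values, censored, labels,
                          means=[1.0, 3.0], sds=[1.0, 1.0], lod=0.5, rng=rng)
```

In a full sampler this sweep would alternate with updates of the cluster labels, component parameters, and variable-selection indicators, so that the imputed values reflect the current state of the chain at every iteration.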

PUBLIC HEALTH SIGNIFICANCE: This dissertation proposes clustering algorithms that can be applied to any mixed data, with or without censored biomarkers, such as electronic health record (EHR) data and other clinical data. The identified patient subgroups could give medical experts more knowledge of patient heterogeneity, and the selected important variables could help them understand where that heterogeneity comes from. This information could thus help develop precision medicine for better patient care.

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Wang, Shu	shw97@pitt.edu	shw97

ETD Committee:

Title	Member	Email Address	Pitt Username
Committee Chair	Yabes, Jonathan	jgy2@pitt.edu	jgy2
Committee CoChair	Chang, Chung-Chou	changj@pitt.edu	changj
Committee Member	Anderson, Stewart	sja@pitt.edu	sja
Committee Member	Mi, Qi	qi.mi@pitt.edu	qi.mi
Committee Member	Seymour, Christopher	seymourc@pitt.edu	seymourc

Date:

27 June 2019

Date Type:

Publication

Defense Date:

15 April 2019

Approval Date:

27 June 2019

Submission Date:

3 April 2019

Access Restriction:

3 year -- Restrict access to University of Pittsburgh for a period of 3 years.

Number of Pages:

Institution:

University of Pittsburgh

Schools and Programs:

School of Public Health > Biostatistics

Degree:

PhD - Doctor of Philosophy

Thesis Type:

Doctoral Dissertation

Refereed:

Yes

Uncontrolled Keywords:

Clustering; Mixed data; Variable selection

Date Deposited:

27 Jun 2019 20:40

Last Modified:

01 May 2022 05:15

URI:

http://d-scholarship.pitt.edu/id/eprint/36243
