
Clustering methods with variable selection for data with mixed variable types or limits of detection

Wang, Shu (2019) Clustering methods with variable selection for data with mixed variable types or limits of detection. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Submitted Version



Clustering has emerged as one of the most essential and popular techniques for discovering patterns in data. However, applying clustering in practice poses challenges. First, many existing clustering methods work only for data in which all variables are continuous or all are categorical, despite the abundance of data with mixed variable types. Second, clustering algorithms typically require complete data, but measurements of clinical biomarkers are often subject to limits of detection (LOD). In addition, researchers are increasingly interested in variable importance because of the growing number of variables available for clustering. To overcome these challenges, this dissertation proposes clustering methods for mixed data that can perform variable selection and handle censored biomarker variables.

In the first section, we propose a hybrid density- and partition-based (HyDaP) algorithm for mixed data. The HyDaP algorithm involves two steps: a variable selection step and a clustering step. In the first step, variables that contribute substantially to clustering are selected; in the second step, a novel dissimilarity measure is applied to the selected variables to obtain the final clustering results. Simulations and real data analyses were conducted to compare the performance of the HyDaP algorithm with that of other commonly used clustering algorithms.
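
To make the idea of a dissimilarity measure for mixed data concrete, the sketch below computes the well-known Gower dissimilarity between records that combine continuous and categorical variables. This is not the HyDaP measure, which the abstract does not specify; it is a standard stand-in chosen for illustration. The function name, the record layout, and the example data are all assumptions.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_dissimilarity(
    a: Sequence[Value],
    b: Sequence[Value],
    ranges: Sequence[Optional[float]],
) -> float:
    """Gower dissimilarity between two mixed-type records.

    ranges[k] is the observed range (max - min) of variable k if it is
    continuous, or None if variable k is categorical.
    NOTE: illustrative stand-in only; this is NOT the HyDaP measure.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical variable: simple mismatch (0 if equal, 1 otherwise).
            total += 0.0 if x == y else 1.0
        else:
            # Continuous variable: absolute difference scaled by the range.
            total += abs(x - y) / r if r > 0 else 0.0
    return total / len(ranges)


# Hypothetical records: (biomarker level, treatment flag).
patients = [(1.0, "yes"), (3.0, "no"), (5.0, "yes")]
ranges = (4.0, None)  # biomarker spans 1.0..5.0; second variable is categorical

d01 = gower_dissimilarity(patients[0], patients[1], ranges)  # 0.75
d02 = gower_dissimilarity(patients[0], patients[2], ranges)  # 0.5
```

A pairwise matrix of such values could be passed to a partition-based method such as k-medoids. How HyDaP weights or selects variables before this step is described in the dissertation itself, not here.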

In the second section, we propose a Bayesian finite mixture model that simultaneously conducts variable selection, accounts for biomarker LOD, and obtains clustering results. We place a spike-and-slab type prior on each variable to obtain variable importance. To account for LOD, we add a step to the Gibbs sampler that iteratively fills in censored biomarker values. The same simulation settings and real data were used to evaluate its clustering performance.
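
The step that fills in censored values can be pictured as a data-augmentation draw. The sketch below is a minimal illustration, assuming the biomarker is normally distributed within a cluster and left-censored at the LOD, so each censored value is redrawn from a normal distribution truncated above at the LOD. The abstract does not state the distributional form used in the dissertation; the function names, the use of `None` to mark censored readings, and the parameter values are assumptions.

```python
import random
from statistics import NormalDist
from typing import List, Optional


def impute_below_lod(mu: float, sigma: float, lod: float,
                     rng: random.Random) -> float:
    """Draw one value from Normal(mu, sigma) truncated to (-inf, lod].

    Uses inverse-CDF sampling: draw u uniformly on (0, P(X < lod)) and
    map it back through the normal quantile function.
    """
    dist = NormalDist(mu, sigma)
    p_lod = dist.cdf(lod)            # probability mass below the LOD
    u = rng.uniform(0.0, p_lod)
    # inv_cdf requires 0 < p < 1, so keep u strictly inside that interval.
    # (If p_lod itself is below 1e-12 this clamp is a crude approximation.)
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return dist.inv_cdf(u)


def fill_censored(values: List[Optional[float]], lod: float,
                  mu: float, sigma: float,
                  rng: random.Random) -> List[float]:
    """Replace censored entries (marked None) with truncated-normal draws.

    In a Gibbs sampler, mu and sigma would be the current draws of the
    parameters for the cluster each observation is assigned to; here they
    are fixed numbers for illustration.
    """
    return [impute_below_lod(mu, sigma, lod, rng) if v is None else v
            for v in values]


rng = random.Random(0)
observed = [0.5, None, 1.2, None]   # None = reading below the LOD
filled = fill_censored(observed, lod=0.0, mu=1.0, sigma=1.0, rng=rng)
```

Within one Gibbs iteration, a draw like this would alternate with updates of the cluster assignments and model parameters, so the imputed values change from iteration to iteration rather than being fixed once.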

PUBLIC HEALTH SIGNIFICANCE: This dissertation proposes clustering algorithms that can be applied to any mixed data, with or without censored biomarkers, such as electronic health record (EHR) data and other clinical data. The identified patient subgroups could give medical experts more knowledge of patient heterogeneity, and the selected important variables could help them better understand where the heterogeneity comes from. This information could thus help develop precision medicine for better patient care.




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Wang, Shu (Email: shw97@pitt.edu; Pitt Username: shw97)
ETD Committee:
Committee Chair: Yabes, Jonathan (Email: jgy2@pitt.edu; Pitt Username: jgy2)
Committee CoChair: Chang, Chung-Chou (Email: changj@pitt.edu; Pitt Username: changj)
Committee Member: Anderson, Stewart (Email: sja@pitt.edu; Pitt Username: sja)
Committee Member: Mi, Qi (Email: qi.mi@pitt.edu; Pitt Username: qi.mi)
Committee Member: Seymour, Christopher (Email: seymourc@pitt.edu; Pitt Username: seymourc)
Date: 27 June 2019
Date Type: Publication
Defense Date: 15 April 2019
Approval Date: 27 June 2019
Submission Date: 3 April 2019
Access Restriction: 3 year -- Restrict access to University of Pittsburgh for a period of 3 years.
Number of Pages: 80
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Clustering; Mixed data; Variable selection
Date Deposited: 27 Jun 2019 20:40
Last Modified: 01 May 2022 05:15

