Enhancements of Sparse Clustering with Resampling and Considerations on Tuning Parameter

Bi, Wenzhu (2012) Enhancements of Sparse Clustering with Resampling and Considerations on Tuning Parameter. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

This is the latest version of this item.


Clustering methods are widely used to explore subgroupings in data when the true group membership is unknown. These techniques are particularly useful for identifying potential subpopulations of interest in medical and public health settings. Examples of such subpopulations include subjects who have certain gene expression profiles related to a cancer subtype, and subjects who are in the very early, asymptomatic phase of a chronic illness. Both examples are of great public health relevance.
Many of the datasets of interest arise from the development of new technologies and are subject to the common problem where p, the number of variables, is much larger than the sample size, n. The relatively small sample size may result from the difficulties of subject recruitment and/or the financial burden of data collection in fields such as imaging and genetic analysis. Earlier approaches to clustering treat all of the variables equally, which may not work well when only some of them are relevant to the subgroupings. Clustering methods with variable selection, also called sparse clustering, have recently been developed to deal with this problem. We propose adding resampling to sparse clustering to improve upon the current clustering methodology. The addition of resampling results in more accurate variable selection. The method is also used to assign an “observed proportion of cluster membership” to each observation, providing a new metric by which to measure membership certainty. The performance of the method is studied via simulation and illustrated in the motivating data example.
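The idea of an "observed proportion of cluster membership" can be illustrated with a minimal sketch. The abstract does not specify the resampling scheme, so everything here is an assumption: the sketch subsamples without replacement, uses plain k-means with a deterministic farthest-first start as a stand-in for the sparse clustering step, and aligns cluster labels to a reference clustering by brute-force permutation (practical only for small k). The names `sqdist`, `kmeans`, `align`, and `membership_proportions` are hypothetical and not taken from the dissertation.

```python
import random
from itertools import permutations


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100):
    """Plain k-means (stand-in for sparse clustering); returns cluster labels."""
    # Farthest-first initialisation: deterministic and adequate for a sketch.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def align(labels, reference, k):
    """Relabel `labels` with the permutation that agrees most with `reference`."""
    best = max(permutations(range(k)),
               key=lambda perm: sum(perm[l] == r for l, r in zip(labels, reference)))
    return [best[l] for l in labels]


def membership_proportions(points, k, n_resamples=100, frac=0.8, seed=0):
    """For each observation, the share of resamples assigning it to each cluster."""
    rng = random.Random(seed)
    n = len(points)
    reference = kmeans(points, k)          # clustering of the full data
    counts = [[0] * k for _ in range(n)]   # counts[i][c]: times obs i fell in cluster c
    drawn = [0] * n                        # times obs i appeared in a resample
    for _ in range(n_resamples):
        idx = rng.sample(range(n), int(frac * n))
        sub = align(kmeans([points[i] for i in idx], k),
                    [reference[i] for i in idx], k)
        for i, l in zip(idx, sub):
            counts[i][l] += 1
            drawn[i] += 1
    props = [[c / d if d else float("nan") for c in row]
             for row, d in zip(counts, drawn)]
    return reference, props
```

An observation whose largest proportion is close to 1 is assigned consistently across resamples, while values nearer 1/k suggest uncertain membership. In a sparse clustering setting, the same loop could also record how often each variable receives a nonzero weight, which is one plausible way resampling could inform variable selection.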
We also propose an alternative approach for choosing the tuning parameter, based on an adjusted Bayesian Information Criterion (BIC). Variable selection in sparse clustering is realized by applying Lasso or related penalties, and the tuning parameter for these penalties has to be determined beforehand. The gap statistic, a distance-based approach, chooses the tuning parameter through permutation and may behave poorly at times. The proposed BIC approach is an alternative developed under the more sophisticated model-based likelihood framework. Its performance is evaluated with simulations.
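For orientation, the role of the tuning parameter can be written out for one well-known formulation of this kind, sparse k-means as proposed by Witten and Tibshirani. The abstract does not state which formulation the dissertation uses, so this is an assumed example. Here w_j is the weight on variable j, BCSS_j is the between-cluster sum of squares for variable j, and s is the tuning parameter controlling how many weights are nonzero. The second line is the standard (unadjusted) BIC, with estimated likelihood, d free parameters, and n observations; the adjustment proposed in the dissertation is not given in the abstract and is not reproduced here.

```latex
\[
\max_{C_1,\dots,C_K,\; w}\ \sum_{j=1}^{p} w_j \,\mathrm{BCSS}_j(C_1,\dots,C_K)
\quad \text{subject to} \quad
\|w\|_2^2 \le 1,\quad \|w\|_1 \le s,\quad w_j \ge 0 \ \ (j = 1,\dots,p)
\]
\[
\mathrm{BIC} = -2 \log \hat{L} + d \log n
\]
```

In this formulation a smaller s forces more of the weights w_j to zero, so choosing s amounts to choosing how many variables drive the clustering.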



Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators:
  • Bi, Wenzhu (Email: web10@pitt.edu; Pitt Username: WEB10)
ETD Committee:
  • Committee Chair: Weissfeld, Lisa A. (Email: lweis@pitt.edu; Pitt Username: LWEIS)
  • Committee Member: Tseng, George C. (Email: ctseng@pitt.edu; Pitt Username: CTSENG)
  • Committee Member: Lin, Yan (Email: yal14@pitt.edu; Pitt Username: YAL14)
  • Committee Member: Price, Julie C. (Email: pricjc@UPMC.EDU)
Date: 29 June 2012
Date Type: Completion
Defense Date: 9 April 2012
Approval Date: 29 June 2012
Submission Date: 2 April 2012
Access Restriction: 5 year -- Restrict access to University of Pittsburgh for a period of 5 years.
Number of Pages: 57
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Sparse clustering, Resampling, High-dimension but small sample size, Imaging, Microarray, Tuning Parameter
Date Deposited: 29 Jun 2012 18:15
Last Modified: 29 Jun 2017 05:15

Available Versions of this Item

  • Enhancements of Sparse Clustering with Resampling and Considerations on Tuning Parameter. (deposited 29 Jun 2012 18:15) [Currently Displayed]

