A stability analysis of sparse K-means

Apfel, Abraham (2017) A stability analysis of sparse K-means. Doctoral Dissertation, University of Pittsburgh. (Unpublished)


Abstract

Sparse K-Means clustering is an established method for simultaneously excluding uninformative features and clustering the observations. This is particularly useful in high-dimensional settings such as microarray data. However, the subset of features selected is often inaccurate when clusters overlap, which adversely affects the clustering results. The current method also tends to be inconsistent, yielding high variability in the number of features selected.
We propose combining a stability analysis with Sparse K-Means by performing Sparse K-Means on subsamples of the original data to yield accurate and consistent feature selection. After reducing the dimensions to an accurate, small subset of features, the standard K-Means clustering procedure is performed to yield accurate clustering results. Compared with the previously established Sparse K-Means methodology, our method improves accuracy and reduces variability, providing consistent feature selection and a lower clustering error rate (CER). Our method continues to perform well in situations with strong cluster overlap, where the previous methods were unsuccessful.
Public health significance: Cluster analysis of transcriptomic data has shown success in disease phenotyping and subgroup discovery. However, with current methodology there is a lack of confidence in the accuracy and reliability of the results, as they can be highly variable. With our methodology, we hope to allow researchers to use cluster analysis for disease phenotyping and subgroup discovery with confidence that they are uncovering accurate and stable results, so that reliable public health decisions can be made from their work.
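The procedure outlined in the abstract (Sparse K-Means on subsamples, then standard K-Means on the features that are selected consistently) can be illustrated with a minimal sketch. It is not the dissertation's implementation. It assumes the feature-weight update of the usual Sparse K-Means formulation (soft-thresholded per-feature between-cluster sums of squares under an L1 bound `s`), and every name and default below (`stable_features`, `n_subsamples`, `frac`, `threshold`, the farthest-point initialisation, the toy data) is a hypothetical choice made for illustration; the abstract does not state the subsampling scheme or the selection cutoff actually used.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-point start (an assumption)."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    labels = None
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def bcss_per_feature(X, labels, k):
    """Between-cluster sum of squares per feature: total SS minus within-cluster SS."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for c in range(k):
        Xc = X[labels == c]
        if len(Xc):
            within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return total - within

def update_weights(a, s):
    """Soft-threshold a, scale to unit L2 norm, and pick the threshold by
    bisection so that the L1 norm of the weights is approximately at most s."""
    a = np.maximum(a, 0.0)

    def normed(delta):
        w = np.maximum(a - delta, 0.0)
        n = np.linalg.norm(w)
        return w / n if n > 0 else w

    if normed(0.0).sum() <= s:
        return normed(0.0)
    lo, hi = 0.0, a.max()
    for _ in range(60):
        mid = (lo + hi) / 2
        if normed(mid).sum() > s:
            lo = mid
        else:
            hi = mid
    return normed(lo)

def sparse_kmeans(X, k, s, n_outer=10):
    """Alternate weighted K-means and the weight update; return feature weights."""
    p = X.shape[1]
    w = np.full(p, 1 / np.sqrt(p))
    for _ in range(n_outer):
        labels = kmeans(X * np.sqrt(w), k)
        w = update_weights(bcss_per_feature(X, labels, k), s)
    return w

def stable_features(X, k, s, n_subsamples=20, frac=0.5, threshold=0.8, seed=0):
    """Keep features given nonzero weight in at least `threshold` of the subsamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += sparse_kmeans(X[idx], k, s) > 0
    freq = counts / n_subsamples
    return np.flatnonzero(freq >= threshold), freq

# Toy data (hypothetical): 40 observations, 10 features, only the first two informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
X[20:, :2] += 10.0

selected, freq = stable_features(X, k=2, s=1.2)
final_labels = kmeans(X[:, selected], k=2)  # standard K-means on the stable features
```

The selection frequencies in `freq` are what make the result inspectable: a feature chosen in nearly every subsample is a more trustworthy candidate than one chosen only occasionally, which is the kind of consistency the abstract is concerned with.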

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Apfel, Abraham	aba44@pitt.edu	aba44	0000-0003-4839-0979

ETD Committee:

Title	Member	Email Address	Pitt Username
Committee Chair	Anderson, Stewart	sja@pitt.edu	sja
Committee Member	Tseng, George	ctseng@pitt.edu	ctseng
Committee Member	Lin, Yan	yal14@pitt.edu	yal14
Committee Member	Tudorascu, Dana	dlt30@pitt.edu	dlt30

Date:

31 August 2017

Date Type:

Publication

Defense Date:

5 May 2017

Approval Date:

31 August 2017

Submission Date:

7 May 2017

Access Restriction:

2 year -- Restrict access to University of Pittsburgh for a period of 2 years.

Number of Pages:

Institution:

University of Pittsburgh

Schools and Programs:

School of Public Health > Biostatistics

Degree:

PhD - Doctor of Philosophy

Thesis Type:

Doctoral Dissertation

Refereed:

Yes

Uncontrolled Keywords:

cluster analysis sparse k-means high-dimensional

Date Deposited:

31 Aug 2017 15:19

Last Modified:

01 Jul 2019 05:15

URI:

http://d-scholarship.pitt.edu/id/eprint/32551
