Ran, Xinhui (2019) Using the dimension reduction method FAMD in the data pre-processing step for risk prediction and for unsupervised clustering. Master's Thesis, University of Pittsburgh. (Unpublished)
Abstract
High-dimensional data from sources such as electronic health records (EHRs), Medicare, and Medicaid are used in many research areas, including public health and medical research. However, working with high-dimensional data is no easy task because of the methodological challenges involved. Dimensionality reduction techniques transform high-dimensional data into a lower-dimensional space while preserving meaningful characteristics of the original data. Principal component analysis (PCA) is the most widely used method for dimension reduction. However, it is limited by its linearity assumption and is unsuitable for data containing both numeric and categorical variables. Factor analysis of mixed data (FAMD) is a dimension reduction method that can be used for data with mixed variable types. Dimension reduction is often used as a data pre-processing step prior to further analyses. However, this approach should be used with caution, because its suitability depends on the purpose of the application. In this thesis, I demonstrate that using FAMD in the data pre-processing step for risk prediction can achieve prediction performance comparable to that of the traditional variable selection procedure. However, when individuals are classified into similar groups using unsupervised clustering techniques, the clustering results based on the principal components generated by FAMD are substantially different from those based on the original variables.
PUBLIC HEALTH SIGNIFICANCE: High-dimensional data often present challenges in building a risk prediction model or in classifying individuals into groups with more homogeneous characteristics. Dimension reduction techniques can be incorporated in the data pre-processing step for high-dimensional data collected from public health or medical records. The results of this thesis show that a dimension reduction method (e.g., FAMD for mixed variable types) should be used with caution as a data pre-processing step.
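For readers unfamiliar with FAMD, the following is a minimal sketch of the usual textbook formulation, not code taken from the thesis: numeric variables are standardized, each categorical variable is expanded into indicator columns scaled by the inverse square root of the category proportion, and an ordinary PCA (via SVD) is run on the combined matrix. The function name `famd_scores` and the toy data are hypothetical, and the sketch assumes no numeric column is constant.

```python
import numpy as np

def famd_scores(numeric, categorical, n_components=2):
    """Illustrative FAMD: returns component scores and explained-variance ratios.

    numeric:     (n, p) array of numeric variables (assumed non-constant).
    categorical: list of length-n label sequences, one per categorical variable.
    """
    numeric = np.asarray(numeric, dtype=float)
    # Standardize numeric variables (z-scores), as in PCA on a correlation matrix.
    z = (numeric - numeric.mean(axis=0)) / numeric.std(axis=0)
    blocks = [z]
    for col in categorical:
        labels, codes = np.unique(np.asarray(col), return_inverse=True)
        indicator = np.eye(len(labels))[codes]   # one-hot encoding, shape (n, k)
        prop = indicator.mean(axis=0)            # proportion of each category
        scaled = indicator / np.sqrt(prop)       # rarer categories get more weight
        blocks.append(scaled - scaled.mean(axis=0))  # center each column
    x = np.hstack(blocks)
    # PCA on the combined, centered matrix via singular value decomposition.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Toy example with made-up data: 3 numeric and 2 categorical variables.
rng = np.random.default_rng(0)
num = rng.normal(size=(20, 3))
cat = [rng.choice(["a", "b", "c"], size=20), rng.choice(["x", "y"], size=20)]
scores, ratio = famd_scores(num, cat, n_components=2)
print(scores.shape, ratio)
```

The resulting scores could then be passed to a prediction model or a clustering routine in place of the original variables, which is the pre-processing use the abstract describes and cautions about.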
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Ran, Xinhui
Date: 25 June 2019
Date Type: Publication
Defense Date: 12 April 2019
Approval Date: 25 June 2019
Submission Date: 19 April 2019
Access Restriction: 3 year -- Restrict access to University of Pittsburgh for a period of 3 years.
Number of Pages: 41
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: Dimension reduction, high dimensional analysis
Date Deposited: 25 Jun 2019 17:24
Last Modified: 01 May 2022 05:15
URI: http://d-scholarship.pitt.edu/id/eprint/36564