
Using the dimension reduction method FAMD in the data pre-processing step for risk prediction and for unsupervised clustering

Ran, Xinhui (2019) Using the dimension reduction method FAMD in the data pre-processing step for risk prediction and for unsupervised clustering. Master's Thesis, University of Pittsburgh. (Unpublished)

Submitted Version



High-dimensional data generated from various sources, including electronic health records (EHRs), Medicare, and Medicaid, are used in multiple research areas such as public health and medical research. However, working with high-dimensional data is no easy task because of methodological challenges. Dimensionality reduction techniques have been used to transform high-dimensional data into a lower-dimensional space while preserving meaningful characteristics of the original data. Principal component analysis (PCA) is the most widely used method for dimension reduction. However, it is limited by its linearity assumption and is unsuitable for data containing both numeric and categorical variables. Factor analysis of mixed data (FAMD) is a dimension reduction method that can be used for data with mixed types of variables. Dimension reduction is often used as a data pre-processing step prior to further analyses. However, this approach should be used with caution, as its suitability depends on the purpose of the application. In this thesis, I demonstrate that using the dimension reduction method FAMD in the data pre-processing step for risk prediction can achieve prediction performance comparable to that of the traditional variable selection procedure; however, when classifying individuals into similar groups using unsupervised clustering techniques, the clustering results obtained from the principal components generated by FAMD are substantially different from those obtained from the original variables.
PUBLIC HEALTH SIGNIFICANCE: High-dimensional data often present challenges in building a risk prediction model or in classifying individuals into groups with more homogeneous characteristics. Dimension reduction techniques can be incorporated in the data pre-processing step for high-dimensional data collected from public health or medical records. The results of the thesis show that a dimension reduction method (e.g., FAMD for mixed variable types) should be used with caution as a data pre-processing step.
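
As a rough illustration of the pre-processing idea summarized above, the following Python sketch follows the standard textbook construction of FAMD: numeric variables are standardized, each level of a categorical variable becomes an indicator column divided by the square root of that level's proportion, and an ordinary PCA is then run on the centered result. This is not the code used in the thesis. The toy data and the function names `famd_matrix` and `first_component` are hypothetical, and only the first component is extracted (by power iteration) to keep the example dependency-free.

```python
import math

def famd_matrix(numeric, categorical):
    """Build the centered, scaled matrix on which FAMD runs an ordinary PCA.

    numeric: list of numeric columns (assumed non-constant).
    categorical: list of columns of category labels.
    """
    n = len(numeric[0]) if numeric else len(categorical[0])
    columns = []
    # Numeric variables: z-scores (population standard deviation).
    for col in numeric:
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        columns.append([(x - mean) / sd for x in col])
    # Categorical variables: one indicator per level, divided by
    # sqrt(level proportion), then centered.
    for col in categorical:
        for level in sorted(set(col)):
            p = col.count(level) / n
            scaled = [(1.0 if x == level else 0.0) / math.sqrt(p) for x in col]
            mean = sum(scaled) / n
            columns.append([v - mean for v in scaled])
    return [list(row) for row in zip(*columns)]

def first_component(Z, iterations=200):
    """Leading eigenvalue, loading vector, and scores via power iteration."""
    n, p = len(Z), len(Z[0])
    # C = Z^T Z / n, the covariance of the scaled columns.
    C = [[sum(Z[i][a] * Z[i][b] for i in range(n)) / n for b in range(p)]
         for a in range(p)]
    v = [float(j + 1) for j in range(p)]  # arbitrary non-uniform start
    for _ in range(iterations):
        w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(C[a][b] * v[b] for b in range(p))
                     for a in range(p))
    scores = [sum(Z[i][j] * v[j] for j in range(p)) for i in range(n)]
    return eigenvalue, v, scores

# Made-up toy records: two numeric variables and one categorical variable.
age = [34, 51, 47, 62, 29, 58]
bmi = [22.1, 30.4, 27.8, 31.0, 21.5, 29.3]
smoker = ["no", "yes", "yes", "yes", "no", "no"]

Z = famd_matrix([age, bmi], [[*smoker]])
eigenvalue, loadings, scores = first_component(Z)
```

In a pre-processing workflow of the kind the abstract describes, component scores such as `scores` would replace the original variables as inputs to a risk prediction model or a clustering algorithm; in practice several components would be retained rather than one.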




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Ran, Xinhui (Email: xir7@pitt.edu; Pitt Username: xir7)
ETD Committee:
Committee Chair: Chang,
Committee Member: Yabes,
Committee Member: Mayr,
Date: 25 June 2019
Date Type: Publication
Defense Date: 12 April 2019
Approval Date: 25 June 2019
Submission Date: 19 April 2019
Access Restriction: 3 year -- Restrict access to University of Pittsburgh for a period of 3 years.
Number of Pages: 41
Institution: University of Pittsburgh
Schools and Programs: School of Public Health > Biostatistics
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: Dimension reduction, high dimensional analysis
Date Deposited: 25 Jun 2019 17:24
Last Modified: 01 May 2022 05:15

