Missing value imputation in high-dimensional phenomic data: Imputable or not, and how?

Liao, SG and Lin, Y and Kang, DD and Chandra, D and Bon, J and Kaminski, N and Sciurba, FC and Tseng, GC (2014) Missing value imputation in high-dimensional phenomic data: Imputable or not, and how? BMC Bioinformatics, 15 (1).

Abstract

Background: In modern biomedical research on complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which preclude the application of most of these methods. Though several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation.
Results: In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept, the "imputability measure" (IM), to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package, "phenomeImpute", is made publicly available.
Conclusions: Simulations and applications to real datasets showed that MICE often did not perform well, whereas KNN-A, KNN-H and random forest were among the top performers, although no method universally performed best. Imputing missing values with low imputability measures greatly increased imputation errors and could degrade downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating the candidate methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author's publication website.
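
The idea behind the STS scheme, as the abstract describes it, is to hide some values that were actually observed, impute them with each candidate method, and keep the method with the smallest error. The following is a minimal Python sketch of that idea under stated assumptions; it is not the authors' implementation and not the API of the "phenomeImpute" R package. It handles continuous variables only, whereas the paper addresses mixed data types. The function names, the column-mean baseline, the distance measure, the 10% masking fraction and the RMSE criterion are all illustrative choices that do not come from the paper.

```python
import numpy as np

def impute_column_mean(X):
    """Baseline (assumed for illustration): fill each NaN with its column mean."""
    out = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = col_means[cols]
    return out

def impute_knn_subjects(X, k=3):
    """Simplified subject-wise KNN: fill each NaN with the mean of that column
    over the k nearest rows that observe it. Distance is the root mean squared
    difference over columns observed in both rows (an illustrative choice)."""
    out = X.copy()
    fallback = np.nanmean(X, axis=0)
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for r in range(X.shape[0]):
            if r == i or np.isnan(X[r, j]):
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if shared.any():
                d = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                candidates.append((d, X[r, j]))
        if candidates:
            candidates.sort(key=lambda t: t[0])
            out[i, j] = np.mean([v for _, v in candidates[:k]])
        else:
            # No comparable subject: fall back to the column mean.
            out[i, j] = fallback[j]
    return out

def self_training_select(X, methods, mask_frac=0.1, n_rounds=5, seed=0):
    """Second layer of missingness: mask observed entries, impute them with
    each method, and return the method with the lowest mean RMSE."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(X))
    n_mask = max(1, int(mask_frac * len(observed)))
    errors = {name: [] for name in methods}
    for _ in range(n_rounds):
        pick = observed[rng.choice(len(observed), size=n_mask, replace=False)]
        rows, cols = pick[:, 0], pick[:, 1]
        X_masked = X.copy()
        X_masked[rows, cols] = np.nan      # hide values whose truth is known
        truth = X[rows, cols]
        for name, fn in methods.items():
            est = fn(X_masked)[rows, cols]
            errors[name].append(np.sqrt(np.mean((est - truth) ** 2)))
    mean_err = {name: float(np.mean(e)) for name, e in errors.items()}
    best = min(mean_err, key=mean_err.get)
    return best, mean_err
```

Calling `self_training_select(X, {"column_mean": impute_column_mean, "knn_subjects": impute_knn_subjects})` on a numeric matrix `X` with `NaN` for missing entries returns the name of the lower-error method together with the per-method errors. The selected method would then be applied to the original missing values.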


Details

Item Type: Article
Status: Published
Creators/Authors:
Creators | Email | Pitt Username | ORCID
Liao, SG | | |
Lin, Y | yal14@pitt.edu | YAL14 | 0000-0001-9413-3960
Kang, DD | | |
Chandra, D | dic15@pitt.edu | DIC15 |
Bon, J | | |
Kaminski, N | | |
Sciurba, FC | fcs@pitt.edu | FCS |
Tseng, GC | ctseng@pitt.edu | CTSENG |
Date: 5 November 2014
Date Type: Publication
Journal or Publication Title: BMC Bioinformatics
Volume: 15
Number: 1
DOI or Unique Handle: 10.1186/s12859-014-0346-6
Schools and Programs: School of Public Health > Biostatistics
School of Public Health > Human Genetics
School of Medicine > Computational and Systems Biology
School of Medicine > Medicine
Refereed: Yes
Date Deposited: 21 Dec 2016 20:47
Last Modified: 10 Jun 2023 11:55
URI: http://d-scholarship.pitt.edu/id/eprint/29473
