Ziegler, Melissa Lynn (2006) Variable Selection when Confronted with Missing Data. Doctoral Dissertation, University of Pittsburgh. (Unpublished)
Abstract
Variable selection is a common problem in linear regression. Stepwise methods, such as forward selection, are popular and are easily available in most statistical packages. The models selected by these methods have a number of drawbacks: they are often unstable, with changes in the set of variables selected due to small changes in the data, and they provide upwardly biased regression coefficient estimates. Recently proposed methods, such as the lasso, provide accurate predictions via a parsimonious, interpretable model.
Missing data values are also a common problem, especially in longitudinal studies. One approach to account for missing data is multiple imputation. Simulation studies were conducted comparing the lasso to standard variable selection methods under different missing data conditions, including the percentage of missing values and the missing data mechanism. Under missing at random mechanisms, missing data were created at the 25 and 50 percent levels with two types of regression parameters, one containing large effects and one containing several small, but nonzero, effects. Five correlation structures were used in generating the data: independent, autoregressive with correlation 0.25 and 0.50, and equicorrelated, again with correlation 0.25 and 0.50. Three different missing data mechanisms were used to create the missing data: linear, convex and sinister. These mechanisms
Least angle regression performed well under all conditions when the true regression parameter vector contained large effects, with its dominance increasing as the correlation between the predictor variables increased. This is consistent with complete data simulation studies suggesting the lasso performed poorly in situations where the true beta vector contained small, nonzero effects. When the true beta vector contained small, nonzero effects, the performance of the variable selection methods considered was situation dependent.
Ordinary least squares had superior performance in terms of confidence interval coverage under the independent correlation structure and with correlated data when the true regression parameter vector consisted of small, nonzero effects. A variety of methods performed well when the regression parameter vector consisted of large effects and the predictor variables were correlated, depending on the missing data situation.
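As an illustration of the five correlation structures named in the abstract, the following Python sketch builds the corresponding predictor correlation matrices. The function names, the dictionary labels and the dimension p = 4 are assumptions made for illustration and do not come from the dissertation. The autoregressive structure is assumed here to be first-order, with correlation rho^|i-j| between predictors i and j; the abstract does not state the order.

```python
def independent(p):
    """Identity matrix: all predictors are uncorrelated."""
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]


def autoregressive(p, rho):
    """Assumed first-order form: corr(x_i, x_j) = rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def equicorrelated(p, rho):
    """Every pair of distinct predictors shares the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


def correlation_structures(p):
    """The five structures listed in the abstract, under hypothetical labels."""
    return {
        "independent": independent(p),
        "ar_0.25": autoregressive(p, 0.25),
        "ar_0.50": autoregressive(p, 0.50),
        "equi_0.25": equicorrelated(p, 0.25),
        "equi_0.50": equicorrelated(p, 0.50),
    }


if __name__ == "__main__":
    # Print each matrix for p = 4 predictors (an arbitrary choice).
    for name, matrix in correlation_structures(4).items():
        print(name)
        for row in matrix:
            print("  " + " ".join(f"{v:.4f}" for v in row))
```

Under the autoregressive form the correlation decays with the distance between predictor indices, while under the equicorrelated form it stays constant. This may help in reading the abstract's separate results for independent and correlated predictors.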
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Ziegler, Melissa Lynn
Date: 2 October 2006
Date Type: Completion
Defense Date: 2 June 2006
Approval Date: 2 October 2006
Submission Date: 16 August 2006
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Institution: University of Pittsburgh
Schools and Programs: Dietrich School of Arts and Sciences > Statistics
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: MICE; missing data; multiple imputation; lasso; variable selection
Other ID: http://etd.library.pitt.edu/ETD/available/etd-08162006-001054/, etd-08162006-001054
Date Deposited: 10 Nov 2011 19:59
Last Modified: 15 Nov 2016 13:49
URI: http://d-scholarship.pitt.edu/id/eprint/9122