
Variable Selection when Confronted with Missing Data

Ziegler, Melissa Lynn (2006) Variable Selection when Confronted with Missing Data. Doctoral Dissertation, University of Pittsburgh. (Unpublished)


Variable selection is a common problem in linear regression. Stepwise methods, such as forward selection, are popular and are easily available in most statistical packages. The models selected by these methods have a number of drawbacks: they are often unstable, with changes in the set of variables selected due to small changes in the data, and they provide upwardly biased regression coefficient estimates. Recently proposed methods, such as the lasso, provide accurate predictions via a parsimonious, interpretable model.

Missing data values are also a common problem, especially in longitudinal studies. One approach to account for missing data is multiple imputation. Simulation studies were conducted comparing the lasso to standard variable selection methods under different missing data conditions, including the percentage of missing values and the missing data mechanism. Under missing at random mechanisms, missing data were created at the 25 and 50 percent levels with two types of regression parameters, one containing large effects and one containing several small, but nonzero, effects. Five correlation structures were used in generating the data: independent, autoregressive with correlation 0.25 and 0.50, and equicorrelated, again with correlation 0.25 and 0.50. Three different missing data mechanisms were used to create the missing data: linear, convex and sinister. These mechanisms [...]

Least angle regression performed well under all conditions when the true regression parameter vector contained large effects, with its dominance increasing as the correlation between the predictor variables increased. This is consistent with complete data simulation studies suggesting the lasso performed poorly in situations where the true beta vector contained small, nonzero effects. When the true beta vector contained small, nonzero effects, the performance of the variable selection methods considered was situation dependent. Ordinary least squares had superior performance in terms of confidence interval coverage under the independent correlation structure and with correlated data when the true regression parameter vector consists of small, nonzero effects. A variety of methods performed well when the regression parameter vector consisted of large effects and the predictor variables were correlated, depending on the missing data situation.
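
As a rough illustration of the five correlation structures named in the abstract, the sketch below builds each one as a plain correlation matrix. It is not taken from the dissertation: the function names, the number of predictors p, and the reading of "autoregressive" as a first-order structure with entries rho^|i-j| are all assumptions made for illustration.

```python
def ar1_corr(p, rho):
    """Assumed AR(1) form: entry (i, j) equals rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def equi_corr(p, rho):
    """Equicorrelated form: 1 on the diagonal, rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(p)] for i in range(p)]


def correlation_structures(p):
    """Return the five structures listed in the abstract, keyed by a
    hypothetical label. p (number of predictors) is an assumption; the
    abstract does not state it."""
    return {
        "independent": equi_corr(p, 0.0),  # identity matrix
        "autoregressive_0.25": ar1_corr(p, 0.25),
        "autoregressive_0.50": ar1_corr(p, 0.50),
        "equicorrelated_0.25": equi_corr(p, 0.25),
        "equicorrelated_0.50": equi_corr(p, 0.50),
    }


if __name__ == "__main__":
    for name, matrix in correlation_structures(4).items():
        print(name)
        for row in matrix:
            print("  ", row)
```

Under the autoregressive form assumed here, correlation decays with the distance between predictor indices, while the equicorrelated form keeps every pair of predictors equally correlated.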




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Ziegler, Melissa
ETD Committee:
Committee Chair: Iyengar, Satish (Email: si@stat.pitt.edu; Pitt Username: SSI)
Committee Member: Williamson, Douglas
Committee Member: Block, Henry (Email: hwb@stat.pitt.edu; Pitt Username: HWB)
Committee Member: Gleser, Leon J (Email: ljg@stat.pitt.edu; Pitt Username: GLESER)
Date: 2 October 2006
Date Type: Completion
Defense Date: 2 June 2006
Approval Date: 2 October 2006
Submission Date: 16 August 2006
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Institution: University of Pittsburgh
Schools and Programs: Dietrich School of Arts and Sciences > Statistics
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: MICE; missing data; multiple imputation; lasso; variable selection
Other ID: etd-08162006-001054
Date Deposited: 10 Nov 2011 19:59
Last Modified: 15 Nov 2016 13:49

