Rospleszcz, Susanne; Janitza, Silke; Boulesteix, Anne-Laure
(7 August 2014):
Categorical variables with many categories are preferentially selected in model selection procedures for multivariable regression models on bootstrap samples.
Department of Statistics: Technical Reports, No. 164
Automated variable selection procedures such as backward elimination are commonly employed for model selection in multivariable regression. However, these procedures are known to be highly unstable. Their stability can be investigated using bootstrap-based procedures: model selection is performed successively on a large number of bootstrap samples, and the resulting models are examined, for instance in terms of the inclusion of specific predictor variables. However, such bootstrap-based procedures are known from the literature to yield misleading results in some cases. In this paper we aim to thoroughly investigate a particularly important facet of these problems. More precisely, we assess the behaviour of regression models fitted with an automated variable selection procedure based on the likelihood ratio test, both on bootstrap samples drawn with replacement and on subsamples drawn without replacement, with respect to the number and type of included predictor variables. Our study includes both extensive simulations and a real data example from the NHANES study. The results indicate that models derived from bootstrap samples include more predictor variables than models fitted on original samples, and that categorical predictor variables with many categories are preferentially selected over categorical predictor variables with fewer categories and over metric predictor variables. We conclude that using bootstrap samples to select variables for multivariable regression models may lead to overly complex models with a preferential selection of categorical predictor variables with many categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.
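The two resampling schemes compared in the abstract, and the tallying of how often each predictor is selected, can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the `select` callback (a stand-in for any variable selection routine, such as backward elimination with a likelihood ratio test) and the subsample fraction of 0.632 are assumptions made here for the example and are not taken from the report.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def subsample_indices(n, rng, fraction=0.632):
    """Draw round(fraction * n) distinct row indices without replacement.

    The default fraction is an assumption for this sketch.
    """
    m = round(fraction * n)
    return rng.sample(range(n), m)


def inclusion_frequencies(rows, select, draw, n_rep=200, seed=1):
    """Apply `select` to n_rep resampled data sets.

    `select` is a hypothetical user-supplied function that takes a list of
    rows and returns the names of the selected predictor variables.
    `draw` is bootstrap_indices or subsample_indices.
    Returns the share of repetitions in which each variable was selected.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_rep):
        idx = draw(len(rows), rng)
        sample = [rows[i] for i in idx]
        counts.update(set(select(sample)))
    return {name: c / n_rep for name, c in counts.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    n = 1000
    boot = bootstrap_indices(n, rng)
    sub = subsample_indices(n, rng)
    # Share of distinct observations in the bootstrap sample; on average
    # this is close to 1 - 1/e because some rows are drawn repeatedly.
    print(len(set(boot)) / n)
    # A subsample never contains the same observation twice.
    print(len(set(sub)) == len(sub))  # True
```

Passing `bootstrap_indices` and then `subsample_indices` as `draw`, with the same `select` function, would give two sets of inclusion frequencies that can be compared per predictor, in the spirit of the comparison described above.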