Bernau, Christoph and Boulesteix, AnneLaure
(28. January 2010):
Variable Selection and Parameter Tuning in HighDimensional Prediction.
Department of Statistics: Technical Reports, No.76

Preview 

PDF
448kB 
Abstract
In the context of classification using highdimensional data such as microarray gene expression data, it is often useful to perform preliminary variable selection. For example, the knearestneighbors classification procedure yields a much higher accuracy when applied on variables with high discriminatory power. Typical (univariate) variable selection methods for binary classification are, e.g., the twosample tstatistic or the MannWhitney test.
In small sample settings, the classification error rate is often estimated using crossvalidation (CV) or related approaches. The variable selection procedure has then to be applied for each considered training set anew, i.e. for each CV iteration successively. Performing variable selection based on the whole sample before the CV procedure would yield a downwardly biased error rate estimate. CV may also be used to tune parameters involved in a classification method. For instance, the penalty parameter in penalized regression or the cost in support vector machines are most often selected using CV. This type of CV is usually denoted as "internal CV" in contrast to the "external CV" performed to estimate the error rate, while the term "nested CV" refers to the whole procedure embedding two CV loops.
While variable selection and parameter tuning have been widely investigated in the context of highdimensional classification, it is still unclear how they should be combined if a classification method involves both variable selection and parameter tuning. For example, the knearestneighbors method usually requires variable selection and involves a tuning parameter: the number k of neighbors. It is wellknown that variable selection should be repeated for each external CV iteration. But should we also repeat variable selection for each it internal CV iteration or rather perform tuning based on fixed subset of variables? While the first variant seems more natural, it implies a huge computational expense and its benefit in terms of error rate remains unknown.
In this paper, we assess both variants quantitatively using real microarray data sets. We focus on two representative examples: knearestneighbors (with k as tuning parameter) and Partial Least Squares dimension reduction followed by linear discriminant analysis (with the number of components as tuning parameter). We conclude that the more natural but computationally expensive variant with repeated variable selection does not necessarily lead to better accuracy and point out the potential pitfalls of both variants.