Hornung, Roman; Bernau, Christoph; Truntzer, Caroline; Stadler, Thomas; Boulesteix, Anne-Laure
(16 April 2014):
Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation.
Department of Statistics: Technical Reports, No. 159
In practical applications of supervised statistical learning, the separation of training and test data is often violated by performing one or several analysis steps prior to estimating the prediction error with cross-validation (CV) procedures. We refer to such practices as incomplete CV. For the special case of preliminary variable selection in high-dimensional microarray data, the corresponding error estimate is well known to be strongly downwardly biased, leading to over-optimistic conclusions about the prediction accuracy of the fitted models. However, while other data preparation steps may be affected by the same type of problem, their impact on error estimation is far less acknowledged in the literature. In this paper we shed light on these issues. We present a new measure quantifying the impact of incomplete CV, based on the ratio between the errors estimated by incomplete CV and by a formally correct "full CV." The new measure is illustrated through applications to several low- and high-dimensional biomedical data sets and various data preparation steps, including preliminary variable selection, choice of tuning parameters, normalization of gene expression microarray data, and imputation of missing values. It may be used in biometrical applications to determine whether specific data preparation steps can safely be performed once before running the CV procedure, or whether they must be repeated within each CV iteration.
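The contrast between incomplete and full CV can be illustrated with a small simulation. The sketch below (not the authors' code; all function names and the nearest-centroid classifier are illustrative choices) applies preliminary variable selection to pure-noise data with random labels. In the incomplete variant the features are selected once on the whole data set before CV; in the full variant the selection is repeated inside every fold. On such null data the true error is 50%, so the incomplete estimate is strongly downwardly biased and the ratio of the two estimates falls well below 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic null data: 40 samples, 1000 pure-noise features, random labels.
n, p, k_sel = 40, 1000, 10
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n)

def select_features(X, y, k):
    """Keep the k features with the largest absolute class-mean difference."""
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(diff)[-k:]

def nearest_centroid_error(X_tr, y_tr, X_te, y_te):
    """Error rate of a simple nearest-centroid classifier."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(X_te - c1, axis=1) <
            np.linalg.norm(X_te - c0, axis=1)).astype(int)
    return np.mean(pred != y_te)

def cv_error(X, y, k_sel, full, n_folds=5):
    """5-fold CV error. full=False pre-selects features on ALL data
    (incomplete CV, information leaks from the test folds); full=True
    repeats the selection within each training fold (full CV)."""
    folds = np.array_split(rng.permutation(n), n_folds)
    if not full:
        sel = select_features(X, y, k_sel)  # leakage happens here
    errs = []
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        s = select_features(X[tr], y[tr], k_sel) if full else sel
        errs.append(nearest_centroid_error(X[tr][:, s], y[tr],
                                           X[te][:, s], y[te]))
    return np.mean(errs)

e_incomplete = cv_error(X, y, k_sel, full=False)
e_full = cv_error(X, y, k_sel, full=True)
ratio = e_incomplete / e_full  # ratio-type measure; well below 1 on null data
print(f"incomplete CV: {e_incomplete:.2f}, full CV: {e_full:.2f}")
```

The full-CV estimate hovers around the true 50% error, while the incomplete estimate is far smaller, because the pre-selected features were chosen using the test observations as well.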