Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation

Hornung, Roman; Bernau, Christoph; Truntzer, Caroline; Stadler, Thomas and Boulesteix, Anne-Laure (16 April 2014): Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation. Department of Statistics: Technical Reports, No. 159 [PDF, 580kB]

DOI: 10.5282/ubm/epub.20682

Abstract

In practical applications of supervised statistical learning, the separation of training and test data is often violated by performing one or more analysis steps before estimating the prediction error by cross-validation (CV). We refer to such practices as incomplete CV. For the special case of preliminary variable selection in high-dimensional microarray data, the corresponding error estimate is well known to be strongly biased downward, leading to over-optimistic conclusions about the prediction accuracy of the fitted models. However, while other data preparation steps may be affected by the same problem, their impact on error estimation is far less acknowledged in the literature. In this paper we shed light on these issues. We present a new measure quantifying the impact of incomplete CV that is based on the ratio between the errors estimated by incomplete CV and by a formally correct "full CV." The new measure is illustrated through applications to several low- and high-dimensional biomedical data sets and various data preparation steps, including preliminary variable selection, choice of tuning parameters, normalization of gene expression microarray data, and imputation of missing values. It may be used in biometrical applications to determine whether specific data preparation steps can safely be performed once before running the CV procedure, or whether they should be repeated on the training data in each CV iteration.
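The distinction between incomplete and full CV can be illustrated with a minimal sketch, which is not the authors' code. It assumes pure-noise data, a simple nearest-centroid classifier, and variable selection by absolute difference of class means. All function names and parameter values are hypothetical, and the plain ratio printed at the end only mirrors the abstract's wording; the exact definition of the measure in the report may differ.

```python
import random

def make_noise_data(n, p, seed):
    """Pure-noise features with balanced 0/1 labels: no real signal exists."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y

def select_variables(X, y, rows, k):
    """Return the k columns with the largest absolute class-mean difference,
    computed only on the given rows."""
    scores = []
    for j in range(len(X[0])):
        sums, counts = [0.0, 0.0], [0, 0]
        for i in rows:
            sums[y[i]] += X[i][j]
            counts[y[i]] += 1
        scores.append(abs(sums[0] / counts[0] - sums[1] / counts[1]))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def centroid_predict(X, y, train, test, cols):
    """Nearest-centroid classification restricted to the selected columns."""
    centroids = {}
    for cls in (0, 1):
        members = [i for i in train if y[i] == cls]
        centroids[cls] = [sum(X[i][j] for i in members) / len(members) for j in cols]
    preds = []
    for i in test:
        dist = {cls: sum((X[i][j] - centroids[cls][t]) ** 2
                         for t, j in enumerate(cols)) for cls in (0, 1)}
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds

def cv_error(X, y, n_folds, k, full, seed):
    """Misclassification rate from n_folds-fold CV.
    full=True:  variable selection is repeated on each training fold.
    full=False: selection is done once on ALL data (incomplete CV)."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[f::n_folds] for f in range(n_folds)]
    global_cols = select_variables(X, y, order, k)  # sees the test folds too
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        cols = select_variables(X, y, train, k) if full else global_cols
        preds = centroid_predict(X, y, train, test, cols)
        errors += sum(pred != y[i] for pred, i in zip(preds, test))
    return errors / len(X)

if __name__ == "__main__":
    X, y = make_noise_data(n=40, p=500, seed=1)
    err_incomplete = cv_error(X, y, n_folds=5, k=10, full=False, seed=2)
    err_full = cv_error(X, y, n_folds=5, k=10, full=True, seed=2)
    print("incomplete CV error:", err_incomplete)
    print("full CV error:", err_full)
    if err_full > 0:
        print("ratio incomplete / full:", err_incomplete / err_full)
```

Because the labels are unrelated to the features, an honest error estimate should be close to chance. A ratio well below 1 in this setting would suggest that the selection step should be repeated inside each CV iteration rather than performed once beforehand.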

Document type:	Paper
Keywords:	Cross-validation, Over-optimism, Good practice, Error estimation, Practical guidelines, Supervised learning
Faculty:	Mathematics, Computer Science and Statistics > Statistics > Technical Reports
Subjects:	500 Natural sciences and mathematics > 510 Mathematics
URN:	urn:nbn:de:bvb:19-epub-20682-6
Language:	English
Document ID:	20682
Date published on Open Access LMU:	23 Apr. 2014 14:07
Last modified:	04 Nov. 2020 13:01
