Boulesteix, Anne-Laure and Strobl, Carolin (08. May 2009): Optimal classifier selection and negative bias in error rate estimation: An empirical study on high-dimensional prediction. Department of Statistics: Technical Reports, No. 58

Abstract

In biometric practice, researchers often apply many different methods in a "trial-and-error" strategy to get as much as possible out of their data and, under publication pressure or pressure from a consulting customer, report only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which we assess quantitatively in this manuscript. Our focus is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias. We consider a total of 124 classifier variants (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. We then assess the minimal misclassification rate over the classifier variants to quantify the bias that arises when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of classification method are examined both separately and jointly. We conclude that presenting only the optimal result is not acceptable, and we suggest alternative approaches for properly reporting classification accuracy.
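
The selection effect described in the abstract can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' study design: it uses simulated Gaussian noise instead of real microarray data, a simple nearest-centroid rule instead of the 124 classifier variants, and six gene-selection settings as the "variants". All function and parameter names (`class_means`, `cv_error`, `simulate`, `gene_counts`) are hypothetical. Because the class labels are permuted, no variant can truly do better than 50% error, so any gap between the average and the minimum cross-validated error reflects selection alone.

```python
import random
import statistics


def class_means(X, y, idx, label):
    """Per-feature mean over the rows in idx that carry the given label."""
    rows = [X[i] for i in idx if y[i] == label]
    return [sum(col) / len(rows) for col in zip(*rows)]


def cv_error(X, y, n_genes, n_folds=5):
    """Cross-validated error of a nearest-centroid rule on the top n_genes features."""
    n = len(y)
    wrong = 0
    for f in range(n_folds):
        test = list(range(f, n, n_folds))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        m0 = class_means(X, y, train, 0)
        m1 = class_means(X, y, train, 1)
        # Gene selection uses the training fold only, so the per-variant
        # estimate itself is not contaminated by the held-out samples.
        genes = sorted(range(len(m0)), key=lambda j: -abs(m0[j] - m1[j]))[:n_genes]
        for i in test:
            d0 = sum((X[i][j] - m0[j]) ** 2 for j in genes)
            d1 = sum((X[i][j] - m1[j]) ** 2 for j in genes)
            predicted = 1 if d1 < d0 else 0
            wrong += int(predicted != y[i])
    return wrong / n


def simulate(n_runs=20, n_samples=40, n_features=100,
             gene_counts=(1, 2, 5, 10, 20, 50), seed=0):
    """Return (average error over variants, average minimal error over variants)."""
    rng = random.Random(seed)
    # Assumption: pure-noise predictors stand in for a real data set.
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    labels = [0, 1] * (n_samples // 2)
    mean_errors, min_errors = [], []
    for _ in range(n_runs):
        y = labels[:]
        rng.shuffle(y)  # permuted labels: predictors keep their structure
        errs = [cv_error(X, y, k) for k in gene_counts]
        mean_errors.append(statistics.mean(errs))
        min_errors.append(min(errs))  # the "report only the best" strategy
    return statistics.mean(mean_errors), statistics.mean(min_errors)


if __name__ == "__main__":
    avg_mean, avg_min = simulate()
    print(f"average error over variants: {avg_mean:.3f}")
    print(f"average minimal error:       {avg_min:.3f}")
```

In each run the minimal error cannot exceed the average over variants, so the reported minimum is at best equal to, and typically lower than, what any single pre-specified variant would yield. The size of the gap in this toy setting says nothing about the magnitudes reported in the paper.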

Item Type: Paper (Technical Report)
Published in: BMC Medical Research Methodology (accepted for publication)
Keywords: Class prediction, supervised classification, microarray data, false research findings, noise discovery, trial-and-error
Subjects: Mathematics, Computer Science and Statistics > Statistics > Technical Reports
Dewey Classification: 500 Natural sciences and mathematics > 510 Mathematics
URN: urn:nbn:de:bvb:19-epub-10606-9
Language: English
ID Code: 10606
Deposited On: 08. May 2009 18:30
Last Modified: 12. Jan 2012 16:56