Abstract
We revisit resampling procedures for error estimation in binary classification from the perspective of U-statistics. In particular, we exploit the fact that the error rate estimator averaging over all learning-testing splits is a U-statistic, so several standard theorems on U-statistics apply: the estimator has minimal variance among all unbiased estimators and is asymptotically normally distributed. Moreover, an unbiased estimator of this minimal variance exists whenever the total sample size is at least twice the learning set size plus two. In that case, we exhibit such an estimator, which is itself a U-statistic; it again enjoys various optimality properties and yields an asymptotically exact hypothesis test of the equality of error rates when two learning algorithms are compared. Our statements apply to any deterministic learning algorithm under weak non-degeneracy assumptions. In an application to tuning parameter choice in lasso regression on a gene expression data set, the test does not reject the null hypothesis of equal error rates between two different parameter values.
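The estimator discussed in the abstract averages the test error over all learning-testing splits of a fixed learning set size, which is what makes it a U-statistic (with the trained classifier's test error as kernel). The following sketch illustrates this complete-enumeration estimator on toy data; the `nearest_mean` learner and all variable names are hypothetical illustrations, not taken from the paper, and the brute-force loop is only feasible for small samples.

```python
import itertools
import numpy as np

def complete_cv_error(X, y, learn, g):
    """Average test error over all C(n, g) learning/testing splits.

    Because the average runs over every subset of size g, the result is
    a U-statistic; `learn` must be a deterministic learning algorithm
    mapping (X_train, y_train) to a prediction function.
    """
    n = len(y)
    errors = []
    for idx in itertools.combinations(range(n), g):
        train = list(idx)
        test = [i for i in range(n) if i not in idx]
        predict = learn(X[train], y[train])
        errors.append(np.mean(predict(X[test]) != y[test]))
    return float(np.mean(errors))

# A deterministic toy learner (hypothetical): classify by nearest class mean.
def nearest_mean(X, y):
    means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    classes = sorted(means)
    def predict(X_new):
        # Distance of each new point to each class mean; pick the closest.
        d = np.stack([np.linalg.norm(X_new - means[c], axis=1) for c in classes])
        return np.array(classes)[d.argmin(axis=0)]
    return predict
```

Since the learner is deterministic, repeated calls with the same data return the identical estimate, consistent with the paper's assumption of deterministic learning algorithms.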
| Item Type: | Paper |
|---|---|
| Keywords: | Unbiased Estimator; Penalized Regression Model; U-Statistic; Cross-Validation; Machine Learning |
| Faculties: | Mathematics, Computer Science and Statistics > Statistics > Technical Reports |
| Subjects: | 500 Science > 510 Mathematics; 600 Technology > 610 Medicine and health |
| URN: | urn:nbn:de:bvb:19-epub-17654-2 |
| Language: | English |
| Item ID: | 17654 |
| Date Deposited: | 18 Dec 2013, 17:12 |
| Last Modified: | 04 Nov 2020, 12:59 |