Abstract
We consider simulation studies on supervised learning that measure the performance of a classification or regression method based on i.i.d. samples randomly drawn from a pre-specified distribution. In a typical setting, a large number of data sets are generated and each is split into a training set used to fit the model and a test set used to evaluate it. Here, we consider the problem of choosing an adequate number of test observations. In this setting, the expectation of the method's performance is independent of this choice, but the variance, and hence the convergence speed, may depend substantially on the trade-off between the number of test observations and the number of simulation iterations. Choosing the number of test observations carefully is therefore important for computational efficiency. We show that this problem can be formulated as a well-defined optimization problem whose solution is a simple closed-form expression. We give examples to show that the relative contribution of each term can vary considerably between data sets and settings. We also discuss the statistical estimation of the solution and give a confidence interval for the optimal number of test observations.
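The trade-off described above can be illustrated with a toy simulation. The following is a minimal sketch and not the paper's setup: the two-Gaussian classification problem, the midpoint-threshold classifier, and the names `simulate_once` and `summarize` are all assumptions made for illustration. It shows only that the mean of the per-iteration error estimate stays roughly constant across test-set sizes while its variance shrinks as the test set grows; it does not implement the closed-form solution.

```python
import random
import statistics


def simulate_once(rng, n_train, n_test):
    """One simulation iteration: draw data, fit a toy classifier, return its test error."""

    def draw(n):
        # Hypothetical data-generating process: label 0/1 with equal probability,
        # feature drawn from N(-1, 1) for class 0 and N(+1, 1) for class 1.
        data = []
        for _ in range(n):
            y = rng.randint(0, 1)
            x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)
            data.append((x, y))
        return data

    train = draw(n_train)
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    if xs0 and xs1:
        # Toy classifier: threshold at the midpoint of the two class means.
        threshold = (statistics.fmean(xs0) + statistics.fmean(xs1)) / 2
    else:
        # Fall back to 0 if one class is absent from the training set.
        threshold = 0.0

    test = draw(n_test)
    # Predict class 1 when x exceeds the threshold; count misclassifications.
    errors = sum((x > threshold) != (y == 1) for x, y in test)
    return errors / n_test


def summarize(n_train, n_test, n_iter, seed=0):
    """Mean and sample variance of the test-error estimate over n_iter iterations."""
    rng = random.Random(seed)
    estimates = [simulate_once(rng, n_train, n_test) for _ in range(n_iter)]
    return statistics.fmean(estimates), statistics.variance(estimates)


if __name__ == "__main__":
    for n_test in (10, 100, 1000):
        mean_err, var_err = summarize(n_train=20, n_test=n_test, n_iter=1000)
        print(f"n_test={n_test:5d}  mean error={mean_err:.4f}  variance={var_err:.6f}")
```

In this sketch, enlarging the test set reduces the variance of each iteration's estimate but makes each iteration more expensive, which is the trade-off against the number of iterations that the abstract refers to.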
Document type: | Paper |
---|---|
Keywords: | simulation study, supervised learning |
Faculty: | Mathematics, Computer Science and Statistics > Statistics > Technical Reports |
Subject areas: | 000 Computer science, information and general works > 000 Computer science, knowledge, systems; 000 Computer science, information and general works > 004 Computer science; 500 Natural sciences and mathematics > 510 Mathematics |
URN: | urn:nbn:de:bvb:19-epub-26870-1 |
Language: | English |
Document ID: | 26870 |
Date published on Open Access LMU: | 18 Jan 2016, 19:12 |
Last modified: | 04 Nov 2020, 13:07 |