Abstract
We consider simulation studies on supervised learning that measure the performance of a classification or regression method based on i.i.d. samples randomly drawn from a pre-specified distribution. In a typical setting, a large number of data sets are generated and split into training and test sets, which are used to train and evaluate models, respectively. Here, we consider the problem of choosing an adequate number of test observations. In this setting, the expectation of the method’s performance is independent of this choice, but the variance, and hence the convergence speed, may depend substantially on the trade-off between the number of test observations and the number of simulation iterations. It is therefore important, as a matter of computational convenience, to choose this number carefully. We show that this problem can be formulated as a well-defined optimization problem whose solution is a simple closed-form expression. We give examples to show that the relative contributions of each term can vary considerably between data sets and settings. We discuss the statistical estimation of the solution, giving a confidence interval for the optimal number of test observations.
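The trade-off described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the report: a fixed computational budget, a cost per iteration of `cost_train + cost_test * n_test`, and an estimator variance of `(var_between + var_within / n_test) / n_iter`, where `var_between` is the variance of the conditional error across training sets and `var_within` is the average variance of the loss of a single test observation. All names are hypothetical, and the closed form below is the minimizer for this assumed model only; it may differ from the expression derived in the report.

```python
import math


def optimal_n_test(var_between, var_within, cost_train, cost_test):
    """Return the real-valued n_test minimizing the assumed variance model.

    Under the assumed model, the variance at a fixed budget is proportional to
    (var_between + var_within / n) * (cost_train + cost_test * n).
    Setting the derivative with respect to n to zero gives
    n* = sqrt(var_within * cost_train / (var_between * cost_test)).
    """
    return math.sqrt((var_within * cost_train) / (var_between * cost_test))


def estimator_variance(n_test, budget, var_between, var_within,
                       cost_train, cost_test):
    """Variance of the averaged performance estimate under the assumed model."""
    # Number of simulation iterations the budget allows (treated as continuous).
    n_iter = budget / (cost_train + cost_test * n_test)
    return (var_between + var_within / n_test) / n_iter


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the report.
    params = dict(var_between=0.01, var_within=0.25,
                  cost_train=100.0, cost_test=1.0)
    n_star = optimal_n_test(**params)
    print("optimal number of test observations:", n_star)
    for n in (10, round(n_star), 250):
        print(n, estimator_variance(n, budget=1e5, **params))
```

With these illustrative values the optimum is about 50 test observations per iteration. Using far fewer or far more test observations gives a larger variance for the same budget, which is the trade-off the abstract refers to.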
Item Type: | Paper |
---|---|
Keywords: | simulation study, supervised learning |
Faculties: | Mathematics, Computer Science and Statistics > Statistics > Technical Reports |
Subjects: | 000 Computer science, information and general works > 000 Computer science, knowledge, and systems; 000 Computer science, information and general works > 004 Data processing computer science; 500 Science > 510 Mathematics |
URN: | urn:nbn:de:bvb:19-epub-26870-1 |
Language: | English |
Item ID: | 26870 |
Date Deposited: | 18. Jan 2016, 19:12 |
Last Modified: | 04. Nov 2020, 13:07 |
References: | [1] Dougherty ER, Zollanvari A, Braga-Neto UM. The illusion of distribution-free small-sample classification in genomics. Current Genomics. 2011;12(5):333. [2] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. 2nd ed. Springer Series in Statistics. New York: Springer; 2009. Available from: http://dx.doi.org/10.1007/978-0-387-84858-7. |