Fuchs, Mathias; Jiang, Xiaoyu; Boulesteix, Anne-Laure (14. January 2016): The computationally optimal test set size in simulation studies on supervised learning. Department of Statistics: Technical Reports, No. 189

Abstract
We consider simulation studies on supervised learning which measure the performance of a classification or regression method based on i.i.d. samples randomly drawn from a pre-specified distribution. In a typical setting, a large number of data sets are generated and split into training and test sets used to train and evaluate models, respectively. Here, we consider the problem of the choice of an adequate number of test observations. In this setting, the expectation of the method’s performance is independent of this choice, but the variance and hence the convergence speed may depend substantially on the tradeoff between the number of test observations and the number of simulation iterations. Therefore, it is an important matter of computational convenience to choose it carefully. Here, we show that this problem can be formulated in terms of a well-defined optimization problem that possesses a solution in terms of a simple closed-form expression. We give examples to show that the relative contributions of each term can vary considerably between data sets and settings. We discuss the statistical estimation of the solution, giving a confidence interval for the optimal number of test observations.
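The tradeoff described in the abstract can be illustrated with a minimal sketch. It is not taken from the report: it assumes a standard two-component variance model, in which each iteration's test error has variance `var_between + var_within / n_test`, and a linear cost model, in which one iteration costs `cost_train + cost_test * n_test`. All function names, parameter names, and numeric values below are hypothetical, and the number of iterations is treated as continuous for simplicity. Under these assumptions, minimizing the variance for a fixed computational budget gives a closed-form test set size.

```python
import math


def optimal_test_size(var_between, var_within, cost_train, cost_test):
    """Test set size minimising (var_between + var_within / n) * (cost_train + cost_test * n).

    Setting the derivative with respect to n to zero gives
    n = sqrt(var_within * cost_train / (var_between * cost_test)).
    This is an assumed model, not the formula from the report.
    """
    return math.sqrt(var_within * cost_train / (var_between * cost_test))


def variance_for_budget(n_test, budget, var_between, var_within,
                        cost_train, cost_test):
    """Variance of the averaged test error when the whole budget is spent.

    The number of iterations is budget / (cost per iteration) and is
    treated as a continuous quantity here.
    """
    n_iter = budget / (cost_train + cost_test * n_test)
    return (var_between + var_within / n_test) / n_iter


# Hypothetical inputs, chosen only for illustration.
params = dict(var_between=0.0004, var_within=0.2,
              cost_train=1.0, cost_test=0.001)
n_star = optimal_test_size(**params)
v_star = variance_for_budget(n_star, budget=1000.0, **params)
print(f"optimal test size: {n_star:.1f}, variance at optimum: {v_star:.3e}")
```

In this sketch, a larger training cost or a larger within-iteration variance pushes the optimum toward more test observations per iteration, while a larger between-iteration variance pushes it toward more iterations with smaller test sets.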
Item Type:  Paper (Technical Report) 

Keywords:  simulation study, supervised learning
Faculties:  Mathematics, Computer Science and Statistics > Statistics > Technical Reports 
Subjects:  000 Computer science, information and general works > 000 Computer science, knowledge, and systems; 000 Computer science, information and general works > 004 Data processing computer science; 500 Science > 510 Mathematics
URN:  urn:nbn:de:bvb:19-epub-26870-1
Language:  English 
ID Code:  26870 
Deposited On:  18. Jan 2016 19:12 
Last Modified:  18. Jan 2016 19:12 