Fuchs, Mathias; Jiang, Xiaoyu; Boulesteix, Anne-Laure (14 January 2016): The computationally optimal test set size in simulation studies on supervised learning. Department of Statistics: Technical Reports, No. 189




We consider simulation studies on supervised learning which measure the performance of a classification or regression method based on i.i.d. samples randomly drawn from a pre-specified distribution. In a typical setting, a large number of data sets are generated and split into training and test sets, which are used to train and evaluate models, respectively. Here, we consider the problem of choosing an adequate number of test observations. In this setting, the expectation of the method’s performance is independent of this choice, but the variance, and hence the convergence speed, may depend substantially on the trade-off between the number of test observations and the number of simulation iterations. Choosing the number of test observations carefully is therefore important for computational efficiency. We show that this problem can be formulated as a well-defined optimization problem whose solution is a simple closed-form expression. We give examples showing that the relative contributions of the individual terms can vary considerably between data sets and settings. We discuss the statistical estimation of the solution, giving a confidence interval for the optimal number of test observations.
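
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the formula from the report, which the abstract does not state. It assumes a simple two-stage sampling model in which each iteration costs c_train + c_test * n_test, and the per-iteration variance splits into a part that does not depend on the test set (var_between) plus a part that shrinks as var_within / n_test. All function and parameter names, the cost model, and the numerical values are hypothetical and chosen for illustration only.

```python
import math


def variance_of_estimate(n_test, budget, c_train, c_test, var_between, var_within):
    """Variance of the averaged performance estimate under a fixed compute budget.

    Assumed model (not taken from the report): one iteration costs
    c_train + c_test * n_test, and its variance is
    var_between + var_within / n_test.
    """
    # Number of simulation iterations affordable with this test set size.
    n_iter = budget / (c_train + c_test * n_test)
    # Averaging over n_iter independent iterations divides the variance by n_iter.
    return (var_between + var_within / n_test) / n_iter


def optimal_test_size(c_train, c_test, var_between, var_within):
    """Minimizer of variance_of_estimate over n_test, treating n_test as continuous.

    Setting the derivative of (var_between + var_within / n)(c_train + c_test * n)
    with respect to n to zero gives n = sqrt(c_train * var_within / (c_test * var_between)).
    """
    return math.sqrt((c_train * var_within) / (c_test * var_between))


if __name__ == "__main__":
    # Hypothetical values: training is 100 times as costly as one test evaluation.
    params = dict(c_train=100.0, c_test=1.0, var_between=0.01, var_within=0.25)
    n_star = optimal_test_size(**params)
    print(f"optimal n_test under the assumed model: {n_star:.1f}")
    for n in (5, 50, 500):
        print(n, variance_of_estimate(n, budget=1e6, **params))
```

Under this assumed model the optimum does not depend on the total budget, only on the ratio of training to testing cost and the ratio of the two variance components. In practice both variance components are unknown and have to be estimated, which is why the optimal test set size itself carries estimation uncertainty.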