The computationally optimal test set size in simulation studies on supervised learning

Fuchs, Mathias; Jiang, Xiaoyu and Boulesteix, Anne-Laure (14 January 2016): The computationally optimal test set size in simulation studies on supervised learning. Department of Statistics: Technical Reports, No. 189 [PDF, 432kB]

DOI: 10.5282/ubm/epub.26870

Abstract

We consider simulation studies on supervised learning which measure the performance of a classification or regression method based on i.i.d. samples randomly drawn from a pre-specified distribution. In a typical setting, a large number of data sets are generated and split into training and test sets used to train and evaluate models, respectively. Here, we consider the problem of choosing an adequate number of test observations. In this setting, the expectation of the method’s performance is independent of this choice, but the variance and hence the convergence speed may depend substantially on the trade-off between the number of test observations and the number of simulation iterations. Therefore, it is an important matter of computational convenience to choose this number carefully. We show that this problem can be formulated as a well-defined optimization problem that possesses a solution in terms of a simple closed-form expression. We give examples to show that the relative contributions of each term can vary considerably between data sets and settings. We discuss the statistical estimation of the solution, giving a confidence interval for the optimal number of test observations.
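
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not taken from the report: the toy classifier, the linear cost model (a fixed cost `c_train` per iteration plus `c_test` per test observation) and all function names are assumptions made for illustration. Under these assumptions, the law of total variance gives the variance of the averaged test error over `n_iter` iterations as `(a + b / n_test) / n_iter`, where `a` is the variance of the conditional error across training sets and `b` is the expected per-observation loss variance for a fixed model. Minimizing this under a fixed budget yields a square-root expression, which may or may not coincide with the closed form derived in the report.

```python
import math
import random
import statistics


def simulate_once(rng, n_train, n_test):
    """One simulation iteration: draw data, fit a toy classifier, return test error."""
    def draw(n):
        # Two equally likely classes; x ~ N(+1, 1) for class True, N(-1, 1) otherwise.
        labels = [rng.random() < 0.5 for _ in range(n)]
        return [(rng.gauss(1.0 if y else -1.0, 1.0), y) for y in labels]

    train = draw(n_train)
    pos = [x for x, y in train if y]
    neg = [x for x, y in train if not y]
    # Threshold at the midpoint of the class means; fall back to 0 if a class is empty.
    if pos and neg:
        threshold = (statistics.fmean(pos) + statistics.fmean(neg)) / 2
    else:
        threshold = 0.0
    test = draw(n_test)
    errors = sum((x > threshold) != y for x, y in test)
    return errors / n_test


def estimator_variance(a, b, n_iter, n_test):
    """Variance of the mean test error over n_iter independent iterations."""
    return (a + b / n_test) / n_iter


def variance_under_budget(a, b, c_train, c_test, budget, n_test):
    """Same variance when the budget fixes the number of affordable iterations."""
    # Assumed cost model: each iteration costs c_train + n_test * c_test.
    n_iter = budget / (c_train + n_test * c_test)
    return estimator_variance(a, b, n_iter, n_test)


def optimal_test_size(a, b, c_train, c_test):
    """Minimizer of (a + b/n) * (c_train + n * c_test) over n > 0."""
    return math.sqrt(b * c_train / (a * c_test))


if __name__ == "__main__":
    rng = random.Random(1)
    for n_test in (5, 50, 500):
        errs = [simulate_once(rng, 20, n_test) for _ in range(2000)]
        # The mean stays roughly constant; the per-iteration variance shrinks with n_test.
        print(n_test, statistics.fmean(errs), statistics.variance(errs))
```

In practice `a` and `b` are unknown and would have to be estimated from pilot iterations, which is where a confidence interval for the optimal number of test observations becomes relevant.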

Document type:	Paper
Keywords:	simulation study supervised learning
Faculty:	Mathematics, Computer Science and Statistics > Statistics > Technical Reports
Subjects:	000 Computer science, information and general works > 000 Computer science, knowledge, systems; 000 Computer science, information and general works > 004 Computer science; 500 Natural sciences and mathematics > 510 Mathematics
URN:	urn:nbn:de:bvb:19-epub-26870-1
Language:	English
Document ID:	26870
Date published on Open Access LMU:	18 Jan. 2016 19:12
Last modified:	04 Nov. 2020 13:07
