Abstract
It is common knowledge that certain characteristics of data sets, such as linear separability or sample size, determine the performance of learning algorithms. In this paper we propose a formal framework for investigating this relationship.
The framework combines three methods, each well established in its respective scientific discipline. Benchmark experiments are the method of choice in machine and statistical learning for comparing algorithms with respect to a given performance measure on particular data sets. To model the interaction between data sets and algorithms, the data sets are characterized using statistical and information-theoretic measures, a common approach in meta-learning for deciding which algorithms are suited to particular data sets. Finally, the performance ranking of the algorithms on groups of data sets with similar characteristics is determined by means of recursively partitioned Bradley-Terry models, which are commonly used in psychology to study the preferences of human subjects. The result is a tree whose splits are data set characteristics that significantly change the performance ranking of the algorithms. The main advantage is the automatic detection of these important characteristics.
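To make the preference-scaling step concrete, below is a minimal sketch of the Bradley-Terry estimation that underlies such trees: given pairwise "win" counts between algorithms (e.g., how often one algorithm beats another across benchmark replications), it estimates worth parameters via the classic MM iteration. This is not the authors' implementation (their online resources are separate); the algorithm names, the `wins` matrix, and the `fit_bradley_terry` function are hypothetical illustration material, and the recursive partitioning over data set characteristics is omitted.

```python
import numpy as np

def fit_bradley_terry(wins, tol=1e-10, max_iter=10_000):
    """Estimate Bradley-Terry worth parameters via the MM iteration.

    wins[i, j] = number of comparisons in which item i beat item j
    (diagonal is zero). Returns pi normalized to sum to 1; a larger
    pi[i] means item i is preferred more strongly.
    """
    n = wins.shape[0]
    n_ij = wins + wins.T            # total comparisons per pair (diagonal stays 0)
    w = wins.sum(axis=1)            # total wins per item
    pi = np.full(n, 1.0 / n)        # uniform starting values
    for _ in range(max_iter):
        # MM update: pi_i <- w_i / sum_{j != i} n_ij / (pi_i + pi_j)
        denom = n_ij / (pi[:, None] + pi[None, :])
        new_pi = w / denom.sum(axis=1)
        new_pi /= new_pi.sum()
        if np.max(np.abs(new_pi - pi)) < tol:
            return new_pi
        pi = new_pi
    return pi

# Hypothetical pairwise-win counts for three algorithms on one group
# of data sets (rows beat columns).
algorithms = ["svm", "rf", "knn"]
wins = np.array([[0, 7, 9],
                 [3, 0, 8],
                 [1, 2, 0]])

pi = fit_bradley_terry(wins)
for name, p in sorted(zip(algorithms, pi), key=lambda t: -t[1]):
    print(f"{name}: {p:.3f}")
```

In the full framework, a fit of this kind would be recomputed within each candidate group of data sets, and a split on a characteristic is made where the fitted preference scaling changes significantly.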
The framework is introduced using a simple artificial example. Its real-world usage is demonstrated by an application example consisting of thirteen well-known data sets and six common learning algorithms. All resources needed to replicate the examples are available online.
| Field | Value |
|---|---|
| Item Type | Paper |
| Keywords | Benchmark experiments, data set characterization, recursive partitioning, preference scaling, Bradley-Terry model |
| Faculties | Mathematics, Computer Science and Statistics > Statistics > Technical Reports |
| Subjects | 300 Social sciences > 310 Statistics |
| URN | urn:nbn:de:bvb:19-epub-11425-9 |
| Language | English |
| Item ID | 11425 |
| Date Deposited | 09. Mar 2010, 13:54 |
| Last Modified | 04. Nov 2020, 12:52 |