Abstract
The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides formal support for the variable selection bias in favor of variables with a high number of missing values when the Gini gain is used as the split selection criterion, and we suggest using the resulting p-value as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulation and real-data studies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extendable to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy criterion.
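For orientation, the following is a minimal sketch of the empirical quantity the abstract refers to: the Gini gain maximized over all cut points of a continuous predictor for a binary response. It does not implement the paper's exact-distribution or p-value derivation; the function and variable names are illustrative only.

```python
import numpy as np

def gini_impurity(y):
    """Gini impurity 2*p*(1-p) of a binary 0/1 response vector y."""
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def max_gini_gain(x, y):
    """Maximally selected Gini gain of continuous predictor x for binary y.

    Every midpoint between consecutive distinct ordered values of x is
    considered as a candidate cut point; the gain of the best cut point
    and the cut point itself are returned.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y_sorted)
    parent = gini_impurity(y_sorted)

    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical predictor values cannot be separated
        left, right = y_sorted[:i], y_sorted[i:]
        gain = parent - (i / n) * gini_impurity(left) \
                      - ((n - i) / n) * gini_impurity(right)
        if gain > best_gain:
            best_gain = gain
            best_cut = 0.5 * (x_sorted[i - 1] + x_sorted[i])
    return best_gain, best_cut
```

Because this statistic is a maximum over all admissible cut points, its distribution depends on the number of available (non-missing) observations, so raw gains are not directly comparable across predictors with different numbers of missing values; the exact p-value derived in the paper is proposed as the criterion that corrects for this.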
| Item Type: | Paper |
|---|---|
| Faculties: | Mathematics, Computer Science and Statistics > Statistics > Collaborative Research Center 386 Special Research Fields > Special Research Field 386 |
| Subjects: | 500 Science > 510 Mathematics |
| URN: | urn:nbn:de:bvb:19-epub-1833-1 |
| Language: | English |
| Item ID: | 1833 |
| Date Deposited: | 11. Apr 2007 |
| Last Modified: | 04. Nov 2020, 12:45 |