Abstract
Diversity forests are a class of random forest-type prediction methods that modify the split selection procedure of conventional random forests to allow for complex split procedures. While random forests show strong prediction performance with conventional univariate, binary splitting, this procedure still has disadvantages; for example, interactions between features are not exploited effectively. The split selection procedure of diversity forests consists of choosing the best splits from sets of 'nsplits' candidate splits obtained by random selection from repeatedly sampled, specifically structured collections of splits. This makes complex split procedures computationally tractable while avoiding overfitting. This paper focuses on introducing diversity forests and evaluating their performance for univariate, binary splitting; specific, complex split procedures will be the focus of future work. Using a collection of 220 real data sets with binary target variables, diversity forests are compared with conventional random forests and random forests using extremely randomized trees. Randomizing the split selection in the way diversity forests do leads to slight improvements in prediction performance, and this performance is quite robust with regard to the specified 'nsplits' value. These results indicate that diversity forests are well suited for realizing complex split procedures in random forests.
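To make the "best of 'nsplits' candidate splits" idea concrete, the following is a minimal sketch of a node-level split selection of this kind. It is illustrative only, not the authors' algorithm or the diversityForest software: the paper draws candidates from specifically structured collections of splits, whereas this sketch uses a simpler scheme (a random feature and a uniformly drawn split point per candidate) and assumes the Gini impurity as the split criterion. All function and parameter names are hypothetical.

```python
import numpy as np

def sample_candidate_splits(X, nsplits, rng):
    """Draw 'nsplits' candidate univariate, binary splits.
    Each candidate is a randomly chosen feature and a split point drawn
    uniformly between that feature's observed minimum and maximum
    (a simplified stand-in for the structured sampling in the paper)."""
    n_features = X.shape[1]
    candidates = []
    for _ in range(nsplits):
        j = rng.integers(n_features)
        lo, hi = X[:, j].min(), X[:, j].max()
        candidates.append((j, rng.uniform(lo, hi)))
    return candidates

def gini_impurity(y):
    """Gini impurity of a binary (0/1) label vector."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def best_of_nsplits(X, y, nsplits=30, rng=None):
    """Evaluate the sampled candidates and keep the split with the
    largest impurity decrease ('best of nsplits' selection)."""
    rng = np.random.default_rng() if rng is None else rng
    parent = gini_impurity(y)
    best, best_gain = None, -np.inf
    for j, t in sample_candidate_splits(X, nsplits, rng):
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        if len(left) == 0 or len(right) == 0:
            continue  # degenerate split, skip
        child = (len(left) * gini_impurity(left) +
                 len(right) * gini_impurity(right)) / len(y)
        gain = parent - child
        if gain > best_gain:
            best, best_gain = (j, t), gain
    return best, best_gain

# Toy usage with interaction-driven labels (illustrative data only)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
print(best_of_nsplits(X, y, nsplits=30, rng=rng))
```

The 'nsplits' parameter controls how strongly the split selection is randomized; the abstract reports that prediction performance is quite robust to its value.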
| Item Type: | Paper |
|---|---|
| Keywords: | Random forest; Ensemble learning; Classification; Decision trees; Multivariate splitting |
| Faculties: | Mathematics, Computer Science and Statistics > Statistics > Technical Reports |
| Subjects: | 500 Science > 500 Science |
| URN: | urn:nbn:de:bvb:19-epub-73377-3 |
| Language: | English |
| Item ID: | 73377 |
| Date Deposited: | 08. Sep 2020, 12:13 |
| Last Modified: | 04. Nov 2020, 13:53 |