Abstract
Although interaction effects can be exploited to improve predictions and to gain valuable insights into covariate interplay, they often receive little attention in applied analyses. We introduce interaction forests, a variant of random forests for categorical, continuous, and survival outcomes that explicitly considers quantitative and qualitative interaction effects in the bivariable splits performed by the trees constituting the forests. The new effect importance measure (EIM) associated with interaction forests allows the covariate pairs to be ranked with respect to the importance of their interaction effects for prediction. Using EIM, separate importance value lists are obtained for univariable effects, quantitative interaction effects, and qualitative interaction effects. In the spirit of interpretable machine learning, the bivariable split types of interaction forests target interaction effects that are well interpretable and easy to communicate. To learn about the nature of the interplay between identified interacting covariate pairs, it is convenient to visualise their estimated bivariable influence. We provide functions that perform this task in the R package diversityForest, which implements interaction forests. In a large-scale empirical study using 220 data sets, interaction forests tended to deliver better predictions than conventional random forests and competing random forest variants that use multivariable splitting. In a simulation study, EIM delivered considerably better rankings for the relevant quantitative and qualitative interaction effects than competing approaches. These results indicate that interaction forests are suitable tools for the challenging task of identifying and making use of well interpretable interaction effects in predictive modelling.
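The abstract distinguishes two kinds of interaction effect that EIM ranks separately. A minimal sketch of that distinction (a toy illustration only, not the paper's method or the diversityForest API; all function names here are hypothetical): in a quantitative interaction, one covariate changes the *strength* of another covariate's effect without reversing its direction, whereas in a qualitative interaction the direction itself flips.

```python
# Toy illustration (not from the paper) of the two interaction types
# that interaction forests distinguish. All names are hypothetical.

def effect_quantitative(x1, x2):
    # x2 amplifies the effect of x1 but never reverses its direction:
    # the slope in x1 is 1 when x2 = 0 and 3 when x2 = 1 (same sign).
    return x1 * (1 + 2 * x2)

def effect_qualitative(x1, x2):
    # x2 flips the direction of the effect of x1:
    # the slope in x1 is +1 when x2 = 0 and -1 when x2 = 1 (opposite signs).
    return x1 * (1 - 2 * x2)

def slope_in_x1(f, x2, h=1.0):
    """Finite-difference slope of f with respect to x1 at a fixed x2."""
    return (f(h, x2) - f(0.0, x2)) / h

# Quantitative interaction: both slopes positive, only magnitudes differ.
s0, s1 = slope_in_x1(effect_quantitative, 0.0), slope_in_x1(effect_quantitative, 1.0)
print(s0, s1)  # 1.0 3.0

# Qualitative interaction: the slopes have opposite signs.
t0, t1 = slope_in_x1(effect_qualitative, 0.0), slope_in_x1(effect_qualitative, 1.0)
print(t0, t1)  # 1.0 -1.0
```

Qualitative interactions are the harder case for single-variable splitting, since the marginal effect of `x1` can average out to near zero, which is why bivariable splits that see both covariates at once are useful.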
Document type: | Paper |
---|---|
Keywords: | Interaction effects, Random forests, Feature importance, Non-parametric modelling, Machine learning |
Faculty: | Mathematics, Informatics and Statistics > Statistics > Technical Reports |
Subject areas: | 500 Natural Sciences and Mathematics > 500 Natural Sciences |
URN: | urn:nbn:de:bvb:19-epub-75432-0 |
Language: | English |
Document ID: | 75432 |
Date published on Open Access LMU: | 24 Mar 2021, 08:12 |
Last modified: | 24 Mar 2021, 08:12 |