Janitza, Silke; Tutz, Gerhard; Boulesteix, Anne-Laure
(1 December 2014):
Random Forests for Ordinal Response Data: Prediction and Variable Selection.
Department of Statistics: Technical Reports, No. 174
The random forest method is a commonly used tool for classification with high-dimensional data that is able to rank candidate predictors through its inbuilt variable importance measures (VIMs). It can be applied to various kinds of regression problems, including those with nominal, metric and survival response variables. While classification and regression problems using random forest methodology have been extensively investigated in the past, there seems to be a lack of literature on handling ordinal regression problems, that is, problems in which the response categories have an inherent ordering. The classical random forest version of Breiman ignores the ordering in the levels and implements standard classification trees. Alternatively, if the response is treated as a metric variable, regression trees are used, which, however, are not appropriate for ordinal response data. Further compounding the difficulties, the currently existing VIMs for nominal or metric responses have not been shown to be appropriate for ordinal responses. The random forest version of Hothorn et al. utilizes a permutation test framework that is applicable to problems where both predictors and response are measured on arbitrary scales. It is therefore a promising tool for handling ordinal regression problems. However, for this random forest version there is also no specific VIM for ordinal response variables, and the appropriateness of the error-rate-based VIM computed by default has not yet been investigated for ordinal responses in the literature. We performed simulation studies using random forests based on conditional inference trees to explore whether incorporating the ordering information yields any improvement in prediction performance or variable selection. We present two novel permutation VIMs that are reasonable alternatives to the currently implemented VIM, which was developed for nominal responses and makes no use of the ordering in the levels of an ordinal response variable.
Results based on simulated and real data suggest that predictor rankings can be improved by using our new permutation VIMs, which explicitly use the ordering in the response levels, in combination with the ordinal regression trees suggested by Hothorn et al. With respect to prediction accuracy in our studies, the performance of ordinal regression trees was similar to, and in most settings even slightly better than, that of classification trees. An explanation for the better performance is that ordinal regression trees have a higher probability of selecting relevant variables for a split. The code implementing our studies and our novel permutation VIMs for the statistical software R is available at http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html.
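The contrast drawn in the abstract, between a permutation VIM based on the error rate and one that uses the ordering of the response levels, can be illustrated with a minimal sketch. The abstract does not name the error measures behind the two new VIMs, so the mean absolute distance between integer-coded levels used here is only an assumed example of an ordering-aware measure. The sketch is in Python rather than the authors' R code, it is model-agnostic, and it skips the per-tree out-of-bag computation a random forest would use. All function names and the toy data are hypothetical.

```python
import random


def error_rate(y_true, y_pred):
    """Fraction of misclassified observations; ignores the level ordering."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_abs_error(y_true, y_pred):
    """Mean absolute distance between integer-coded levels; uses the ordering.

    Assumed example of an ordering-aware measure, not taken from the abstract.
    """
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, column, loss, n_repeats=20, seed=0):
    """Average increase in `loss` after randomly permuting one predictor column."""
    rng = random.Random(seed)
    baseline = loss(y, [predict(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        # Shuffle the values of one column, leaving all other columns intact.
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [v] + row[column + 1:]
                  for row, v in zip(X, shuffled)]
        permuted = loss(y, [predict(row) for row in X_perm])
        increases.append(permuted - baseline)
    return sum(increases) / n_repeats


def toy_predict(row):
    """Hypothetical fitted model: three ordered levels driven by column 0 only."""
    x = row[0]
    return 1 if x < 1.0 else (2 if x < 2.0 else 3)


# Toy data: column 0 is relevant, column 1 is pure noise for this model.
X = [[i * 0.1, float(i % 4)] for i in range(30)]
y = [toy_predict(row) for row in X]

for name, loss in [("error rate", error_rate), ("mean abs. error", mean_abs_error)]:
    scores = [permutation_importance(toy_predict, X, y, c, loss) for c in (0, 1)]
    print(name, scores)
```

In this toy setting the noise column has importance zero under both measures, while the relevant column scores at least as high under the ordering-aware measure, because a prediction that is off by two levels counts twice as much as one that is off by a single level.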