Janitza, Silke; Celik, Ender and Boulesteix, Anne-Laure (22 October 2015): A computationally fast variable importance test for random forests for high-dimensional data. Department of Statistics: Technical Reports, No. 185

Abstract

Random forests are a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on so-called variable importance measures. Of the various importance measures for ranking predictor variables, the two most common are the Gini importance and the permutation importance. The latter has been found to be more reliable than the Gini importance. It is computed from the change in prediction accuracy when any association between the response and a predictor variable is removed, with large changes indicating that the predictor variable is important. A drawback of these variable importance measures is that there is no natural cutoff that can be used to discriminate between important and non-important variables. Several approaches, for example approaches based on hypothesis testing, have been developed to address this problem. The existing testing approaches are permutation-based and require the repeated computation of forests. While those permutation-based approaches might be computationally tractable in low-dimensional settings, the computing time is enormous in high-dimensional settings, which typically include thousands of candidate predictors. A new, computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data where many variables do not carry any information. The testing approach is based on a modified version of the permutation variable importance measure, which is inspired by cross-validation procedures. The novel testing approach is evaluated and compared to the permutation-based testing approach of Altmann and colleagues using studies on complex high-dimensional binary classification settings. In our studies, the new approach controlled the type I error and had at least comparable power with a substantially shorter computation time. The new variable importance test is implemented in the R package vita.
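
To make the permutation importance described in the abstract concrete, the following is a minimal sketch under stated assumptions, not the procedure from the report and not the API of the vita package. It uses a hand-written nearest-centroid classifier as a stand-in for a random forest, and the function names (`fit_centroids`, `predict`, `permutation_importance`) and the simulated data are hypothetical. It illustrates only the basic idea: permute one predictor to break its association with the response and record the resulting drop in prediction accuracy.

```python
import numpy as np


def fit_centroids(X, y):
    """Stand-in classifier (NOT a random forest): one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(model, X):
    """Assign each row to the class with the nearest centroid."""
    classes, centroids = model
    # Squared Euclidean distance from every row to every class centroid.
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]


def permutation_importance(model, X, y, rng, n_repeats=20):
    """Mean drop in accuracy after permuting each predictor column in turn."""
    base_acc = np.mean(predict(model, X) == y)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Permuting column j removes its association with the response.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(base_acc - np.mean(predict(model, X_perm) == y))
        importance[j] = np.mean(drops)
    return importance


# Simulated data (an assumption for illustration): only column 0 is informative.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 5))
X[:, 0] += 2.0 * y

# Fit on the first half, compute importance on the held-out second half.
train = np.arange(n) < 200
test = ~train
model = fit_centroids(X[train], y[train])
importance = permutation_importance(model, X[test], y[test], rng)
print(importance)
```

In this sketch the informative column should receive a clearly positive score, while the scores of the uninformative columns scatter around zero. That scatter is why, as the abstract notes, the raw scores provide no natural cutoff and a test is needed to separate important from non-important variables.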
