A computationally fast variable importance test for random forests for high-dimensional data

Janitza, Silke; Celik, Ender and Boulesteix, Anne-Laure (2018): A computationally fast variable importance test for random forests for high-dimensional data. In: Advances in Data Analysis and Classification, Vol. 12, No. 4: pp. 885-915

Full text not available on 'Open Access LMU'.

DOI: 10.1007/s11634-016-0276-4

Abstract

Random forests are a commonly used tool for classification and for ranking candidate predictors based on the so-called variable importance measures. These measures attribute scores to the variables reflecting their importance. A drawback of variable importance measures is that there is no natural cutoff that can be used to discriminate between important and non-important variables. Several approaches, for example approaches based on hypothesis testing, were developed for addressing this problem. The existing testing approaches require the repeated computation of random forests. While for low-dimensional settings those approaches might be computationally tractable, for high-dimensional settings typically including thousands of candidate predictors, computing time is enormous. In this article a computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data where many variables do not carry any information. The testing approach is based on a modified version of the permutation variable importance, which is inspired by cross-validation procedures. The new approach is tested and compared to the approach of Altmann and colleagues using simulation studies, which are based on real data from high-dimensional binary classification settings. The new approach controls the type I error and has at least comparable power at a substantially smaller computation time in the studies. Thus, it might be used as a computationally fast alternative to existing procedures for high-dimensional data settings where many variables do not carry any information. The new approach is implemented in the R package vita.
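
The abstract does not spell out how the test obtains p-values without recomputing forests. As a rough illustration only, the sketch below assumes that a null distribution is built from the non-positive importance scores, mirrored around zero, and that each variable's score is compared against it. This construction, the function name `mirrored_null_pvalues`, and the input vector are assumptions made for illustration; they are not taken from this record and are not the API of the R package vita. The importance scores themselves (the modified, cross-validation-inspired permutation importance) are treated as given input.

```python
import numpy as np

def mirrored_null_pvalues(importances):
    """Heuristic p-values from a mirrored null distribution (assumed construction)."""
    vi = np.asarray(importances, dtype=float)
    negatives = vi[vi < 0]
    zeros = vi[vi == 0]
    # Null sample: negative scores, zero scores, and the negatives reflected
    # onto the positive side, so the null is symmetric around zero.
    null = np.concatenate([negatives, zeros, -negatives])
    if null.size == 0:
        raise ValueError("no non-positive scores available to build a null distribution")
    null_sorted = np.sort(null)
    # p-value = share of null scores that are greater than or equal to the observed score.
    n_strictly_below = np.searchsorted(null_sorted, vi, side="left")
    return 1.0 - n_strictly_below / null.size

# Hypothetical importance scores for five variables.
p_values = mirrored_null_pvalues([-0.2, -0.1, 0.0, 0.05, 0.5])
print(p_values)
```

Under this assumption a single set of importance scores suffices, which is consistent with the abstract's claim of much lower computation time than approaches that recompute forests repeatedly. The sketch also only makes sense when many variables carry no information, since those variables supply the non-positive scores that form the null sample.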

Document type:	Journal article
Faculty:	Medicine > Institute for Medical Information Processing, Biometry and Epidemiology
Subjects:	600 Technology, Medicine, Applied Sciences > 610 Medicine and Health
ISSN:	1862-5347
Language:	English
Document ID:	61838
Date deposited on Open Access LMU:	09 May 2019 11:40
Last modified:	04 Nov 2020 13:39