Abstract
Random Forests are commonly applied for data prediction and interpretation. The latter purpose is supported by variable importance measures that rate the relevance of predictors. Yet existing measures cannot be computed when the data contain missing values. Possible solutions are given by imputation methods, complete case analysis and a newly suggested importance measure. However, it is unknown to what extent these approaches are able to provide a reliable estimate of a variable's relevance. An extensive simulation study was performed to investigate this property for a variety of missing data generating processes. Findings and recommendations: Complete case analysis should not be applied, as it inappropriately penalized variables that were completely observed. The new importance measure is far better able to reflect decreased information exclusively for variables with missing values and should therefore be used to evaluate actual data situations. By contrast, multiple imputation allows for an estimation of the importances one would potentially observe in complete data situations.
Item Type: | Paper |
---|---|
Form of publication: | Preprint |
Keywords: | Random Forests, variable importance measures, missing data, multiple imputation, surrogates, complete case analysis |
Faculties: | Mathematics, Computer Science and Statistics > Statistics > Technical Reports |
Subjects: | 500 Science > 510 Mathematics |
URN: | urn:nbn:de:bvb:19-epub-12757-8 |
Language: | English |
Item ID: | 12757 |
Date Deposited: | 15. Feb 2012, 17:10 |
Last Modified: | 04. Nov 2020, 12:53 |