Improving Variable Importance in Random Forests
Author Information
Author(s): Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, Achim Zeileis
Primary Institution: Ludwig-Maximilians-Universität München
Hypothesis
The original (marginal) permutation-based variable importance measures in random forests are biased in favor of correlated predictor variables.
Conclusion
The new conditional variable importance measure reflects the true impact of each predictor variable more reliably than the original marginal approach.
Supporting Evidence
- Random forests are popular for their high predictive accuracy in high-dimensional problems.
- The study identifies mechanisms that cause bias in variable importance measures.
- A new conditional permutation scheme is proposed to improve the reliability of variable importance.
Takeaway
When random forests are used to judge which predictors matter, accounting for the correlations among the predictor variables gives more trustworthy importance scores than treating each predictor in isolation.
Methodology
The study developed a new conditional permutation scheme for computing variable importance in random forests and evaluated it in simulation studies.
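The difference between a marginal and a conditional permutation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names (`quantile_edges`, `conditional_permute`, `marginal_permute`, `pearson`), the `n_bins` parameter, and the use of a simple quantile grid on a single correlated predictor `z` are all hypothetical choices made for this example. The sketch covers only the permutation step; no forest is fitted and no importance score is computed.

```python
import random
from bisect import bisect_right


def quantile_edges(z, n_bins):
    """Interior cutpoints splitting z into n_bins roughly equal-sized strata."""
    zs = sorted(z)
    return [zs[(len(zs) * k) // n_bins] for k in range(1, n_bins)]


def conditional_permute(x, z, n_bins=4, seed=None):
    """Shuffle x only among observations that fall in the same stratum of z.

    Assumption: the conditioning grid is a plain quantile grid on z.
    """
    rng = random.Random(seed)
    edges = quantile_edges(z, n_bins)
    strata = {}
    for i, zi in enumerate(z):
        strata.setdefault(bisect_right(edges, zi), []).append(i)
    x_perm = list(x)
    for idx in strata.values():
        shuffled = idx[:]
        rng.shuffle(shuffled)
        for target, source in zip(idx, shuffled):
            x_perm[target] = x[source]
    return x_perm


def marginal_permute(x, seed=None):
    """Shuffle x across all observations, ignoring z entirely."""
    rng = random.Random(seed)
    x_perm = list(x)
    rng.shuffle(x_perm)
    return x_perm


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5


# Simulated data: x is strongly correlated with z.
data_rng = random.Random(0)
z = [data_rng.gauss(0, 1) for _ in range(2000)]
x = [zi + 0.3 * data_rng.gauss(0, 1) for zi in z]

r_marginal = pearson(marginal_permute(x, seed=1), z)
r_conditional = pearson(conditional_permute(x, z, seed=1), z)
print(f"correlation with z after marginal permutation:    {r_marginal:.3f}")
print(f"correlation with z after conditional permutation: {r_conditional:.3f}")
```

In this sketch the marginal shuffle breaks the association between `x` and `z` along with any association between `x` and the response, whereas the conditional shuffle keeps most of the `x`-`z` dependence intact. A finer grid (larger `n_bins`) preserves more of that dependence but leaves fewer observations to permute within each stratum.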
Potential Biases
The original permutation importance overestimates the importance of correlated predictor variables.
Limitations
The conditional permutation scheme cannot entirely eliminate the preference for correlated predictor variables.