Improving Variable Importance in Random Forests

Sample size: 310 publication Evidence: moderate

Author Information

Author(s): Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, Achim Zeileis

Primary Institution: Ludwig-Maximilians-Universität München

Hypothesis

The original variable importance measures in random forests are biased towards correlated predictor variables.

Conclusion

The new conditional variable importance measure reflects the true impact of each predictor variable more reliably than the original marginal approach.

Supporting Evidence

Random forests are popular for their high predictive accuracy in high-dimensional problems.
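The study identifies two mechanisms that cause bias in variable importance measures: a preference for correlated predictors during tree building, and an additional advantage for correlated predictors from the unconditional permutation scheme.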
A new conditional permutation scheme is proposed to improve the reliability of variable importance.

Takeaway

This study shows that variable importance estimates from random forests become more reliable when the permutation step accounts for how predictor variables are correlated with each other.

Methodology

The study developed a new conditional permutation scheme for computing variable importance in random forests, tested through simulation studies.
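To make the idea of a conditional permutation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`conditional_permute`, `permutation_importance`), the hand-written `predict` stand-in for a fitted model, and the toy data are all assumptions made for illustration. In particular, the conditioning groups are passed in explicitly here, whereas the study derives its conditioning grid from the fitted forest itself.

```python
import random
from collections import defaultdict


def conditional_permute(values, groups, rng):
    """Shuffle `values` only among observations that share a group label."""
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    permuted = list(values)
    for indices in by_group.values():
        shuffled = [values[i] for i in indices]
        rng.shuffle(shuffled)
        for i, v in zip(indices, shuffled):
            permuted[i] = v
    return permuted


def mse(predict, rows, y):
    """Mean squared error of `predict` over the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)


def permutation_importance(predict, rows, y, j, groups=None,
                           n_repeats=20, seed=0):
    """Average increase in MSE after permuting column j.

    With groups=None all rows form one group, which gives the usual
    marginal (unconditional) permutation. With group labels, column j
    is permuted only within each group (conditional permutation).
    """
    rng = random.Random(seed)
    if groups is None:
        groups = [0] * len(rows)
    base = mse(predict, rows, y)
    column = [r[j] for r in rows]
    total = 0.0
    for _ in range(n_repeats):
        new_col = conditional_permute(column, groups, rng)
        permuted_rows = [r[:j] + [v] + r[j + 1:]
                         for r, v in zip(rows, new_col)]
        total += mse(predict, permuted_rows, y) - base
    return total / n_repeats


# Toy data (hypothetical): x2 duplicates x1, and y depends on x1 only.
x1 = [float(i % 2) for i in range(200)]
rows = [[a, a] for a in x1]
y = list(x1)


def predict(row):
    # Stand-in for a fitted model that splits its weight across
    # the two perfectly correlated predictors.
    return 0.5 * row[0] + 0.5 * row[1]


# Marginal importance of x2: permute it across all rows.
marginal = permutation_importance(predict, rows, y, j=1)
# Conditional importance of x2: permute it only within levels of x1.
conditional = permutation_importance(predict, rows, y, j=1, groups=x1)

print("marginal importance of x2:   ", marginal)
print("conditional importance of x2:", conditional)
```

In this toy setting the marginal scheme credits x2 with importance even though it carries no information beyond x1, while permuting x2 within levels of x1 leaves the predictions unchanged. Real predictors are rarely perfect duplicates, so in practice the conditional measure shrinks rather than vanishes.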

Potential Biases

The original permutation importance overestimates the importance of correlated predictor variables.

Limitations

The conditional permutation scheme cannot entirely eliminate the preference for correlated predictor variables.

Digital Object Identifier (DOI)

10.1186/1471-2105-9-307
