Given a complex tree ensemble such as a Random Forest, the authors construct a single decision tree of minimum depth that represents the same function over the entire feature space. In other words, you input a Random Forest, which is an ensemble of trees, and get as output a single decision tree with exactly the same behaviour over the entire feature space.
This simpler representation gives greater interpretability without a loss in accuracy. A simple yet powerful model!
Cell (z): when you split a feature at one or more values, you create axis-aligned hyperplanes. A cell is the box contained between two consecutive hyperplanes along each feature.
Region (z_l, z_r): a region is a collection of cells in the feature space. Its span is given by a bottom-left cell z_l and a top-right cell z_r, and the region contains those two cells and every cell between them.
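As a toy illustration (the split values below are made up, just to show what cells and regions look like in code):

```python
import itertools

# Toy illustration (split values are made up): the hyperplanes created by an
# ensemble's splits cut the feature space into a grid of cells.
splits = {0: [2.5, 7.0],   # feature 0 is split at 2.5 and 7.0 -> 3 intervals
          1: [1.0]}        # feature 1 is split at 1.0         -> 2 intervals

def cell_box(z):
    """Map a cell index vector z to its box (lower, upper) per feature."""
    box = []
    for j, idx in enumerate(z):
        edges = [float("-inf")] + splits[j] + [float("inf")]
        box.append((edges[idx], edges[idx + 1]))
    return box

def cells_in_region(z_l, z_r):
    """All cells between the bottom-left cell z_l and top-right cell z_r."""
    return list(itertools.product(*(range(lo, hi + 1) for lo, hi in zip(z_l, z_r))))

print(cell_box((1, 0)))                 # [(2.5, 7.0), (-inf, 1.0)]
print(cells_in_region((0, 0), (1, 1)))  # the 4 cells spanned by those two corners
```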
The authors use a Dynamic Programming approach to construct an optimal single-tree representation of the ensemble. The idea is to use the property of optimal substructure: we can limit the search to those optimal born-again trees whose left and right sub-trees are themselves optimal born-again trees for their respective regions.
This allows us to frame the search as a recursion.
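Paraphrasing the recursion (the symbols below are my own shorthand, not the paper's exact notation): write Φ(z_l, z_r) for the depth of an optimal born-again tree over the region spanned by cells z_l and z_r, and ID(z_l, z_r) for the condition that the region is already homogeneous. Then

$$
\Phi(z_l, z_r) =
\begin{cases}
0, & \text{if } \mathrm{ID}(z_l, z_r),\\[4pt]
1 + \min\limits_{j}\ \min\limits_{z_{l,j} \le v < z_{r,j}} \max\Big\{\Phi\big(z_l,\; z_r|_{j \to v}\big),\ \Phi\big(z_l|_{j \to v+1},\; z_r\big)\Big\}, & \text{otherwise,}
\end{cases}
$$

where $z|_{j \to v}$ means cell $z$ with its $j$-th coordinate replaced by $v$. Splitting the region on feature $j$ at level $v$ gives a left sub-region capped at $v$ and a right sub-region starting at $v + 1$, and the tree needs a depth of one plus the deeper of the two optimal sub-trees.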
In the DP equation, evaluating the base case requires checking whether the class is homogeneous over the whole region (the ID condition) every time. We can avoid repeating this work by integrating the check into the main recursion and memoising it, so it is evaluated just once per region and reused whenever needed.
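Here is a minimal Python sketch of that memoised recursion. It is my own toy version (a small grid of cells and a stand-in for the forest's majority vote), not the authors' implementation:

```python
import itertools
from functools import lru_cache

# Toy setting (made up for illustration): 2 features whose splits create a
# 3 x 4 grid of cells, and a placeholder for the forest's majority vote per cell.
LEVELS = (3, 4)

def forest_class(cell):
    return int(cell[0] + cell[1] >= 3)   # stand-in for the ensemble's prediction

@lru_cache(maxsize=None)
def region_class(zl, zr):
    """Majority class if the region [zl, zr] is homogeneous, else None (memoised)."""
    classes = {forest_class(c) for c in itertools.product(
        *(range(lo, hi + 1) for lo, hi in zip(zl, zr)))}
    return classes.pop() if len(classes) == 1 else None

@lru_cache(maxsize=None)
def phi(zl, zr):
    """Depth of an optimal born-again tree for the region [zl, zr]."""
    if region_class(zl, zr) is not None:          # homogeneity check folded in, cached
        return 0
    best = float("inf")
    for j in range(len(zl)):                      # try every feature...
        for v in range(zl[j], zr[j]):             # ...and every split level on it
            left  = (zl, zr[:j] + (v,) + zr[j + 1:])
            right = (zl[:j] + (v + 1,) + zl[j + 1:], zr)
            best = min(best, 1 + max(phi(*left), phi(*right)))
    return best

print(phi((0, 0), (LEVELS[0] - 1, LEVELS[1] - 1)))   # depth of the born-again tree
```

Both phi and region_class are cached on the region, so the homogeneity of a region is computed at most once even though many recursive paths reach it.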
In the DP equation, we also search over all possible hyperplanes for each feature. It turns out we do not need this linear scan: a number of those candidates can be discarded, and a binary search can find the best hyperplane for a feature. The authors give a detailed proof of this.
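A hedged sketch of the binary-search idea, reusing phi and the tuple regions from the sketch above (my paraphrase of the argument, not the paper's exact procedure): enlarging the left child (growing v) can only increase its optimal depth, and shrinking the right child can only decrease its, so the two depth curves cross once and max(left, right) is minimised at or right next to that crossing.

```python
def best_split_on_feature(zl, zr, j):
    """Find the best split level on feature j by binary search instead of a scan."""
    assert zl[j] < zr[j], "feature j must span at least two cells in this region"
    def left(v):  return phi(zl, zr[:j] + (v,) + zr[j + 1:])      # sub-region capped at v
    def right(v): return phi(zl[:j] + (v + 1,) + zl[j + 1:], zr)  # sub-region from v + 1
    lo, hi = zl[j], zr[j] - 1
    while lo < hi:                          # find the first v where the curves cross
        mid = (lo + hi) // 2
        if left(mid) < right(mid):
            lo = mid + 1
        else:
            hi = mid
    candidates = {lo, max(lo - 1, zl[j])}   # optimum sits at the crossing or just before
    best_v = min(candidates, key=lambda v: max(left(v), right(v)))
    return best_v, 1 + max(left(best_v), right(best_v))
```

Swapping this into phi's inner loop turns the linear scan over split levels into a logarithmic number of evaluations per feature.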
1. Post pruning
Since the born-again decision tree tries to replicate the exact behaviour of the random forest over the entire feature space, it ends up encoding conditions that are either impossible in real life or never satisfied by any training data. This happens primarily because the trees in the random forest are independent, and a single tree that replicates all their joint conditions will contain some improbable cases. So, once the born-again tree is built, pruning is done from the bottom up: nodes that have no training point falling into them are removed and replaced by the child node of the branch that does have samples.
However, after this step we can no longer claim that the born-again decision tree replicates the random forest over the entire feature space. The good news is that this has very little effect on performance (a 0.0001% accuracy delta on test data) and it also leads to a much more interpretable tree, with lower depth and fewer leaves.
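A minimal sketch of that pruning rule, assuming a simple dict-based node layout (this layout and the function below are mine, not the paper's):

```python
def prune(node, X):
    """Collapse any split where one side receives no training samples.

    Assumed node layout: a leaf is {"class": c}; an internal node is
    {"feature": j, "threshold": t, "left": ..., "right": ...}.
    """
    if "class" in node:                          # leaf: nothing to prune
        return node
    j, t = node["feature"], node["threshold"]
    left_rows  = [x for x in X if x[j] <= t]
    right_rows = [x for x in X if x[j] > t]
    if not left_rows:                            # no training point goes left:
        return prune(node["right"], right_rows)  # replace the node by its right child
    if not right_rows:                           # no training point goes right
        return prune(node["left"], left_rows)
    node["left"] = prune(node["left"], left_rows)
    node["right"] = prune(node["right"], right_rows)
    return node
```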
2. Heuristic
Early on in the paper, the authors prove that finding the born-again tree is NP-Hard, via a reduction from the 3-SAT problem. The computation time becomes impractical once the number of features grows beyond 20. To deal with this, they suggest a heuristic in which the born-again tree is still completely faithful to the Random Forest over the entire feature space but is no longer guaranteed to be of minimum depth. The heuristic is as follows:
a. For a region (z_l, z_r), instead of going over each cell and starting the recursion for each one, check the class of Nc = 1000 randomly picked cells.
b. If all those Nc cells have the same class, use Integer Programming to confirm whether the entire region (z_l, z_r) is homogeneous. If it is, we declare it a leaf and end the recursion; otherwise, we continue splitting until we find homogeneous regions.
This heuristic reduces the computation time significantly on large datasets (1381 seconds -> 1.14 seconds), at the cost of a 20-22% increase in the number of leaves and in depth, which is a fair trade-off.
I really liked the way the paper is presented: it starts with the theoretical proofs and the optimal algorithm, while acknowledging that this alone won't be that useful for practical use cases, and then goes on the journey of adding tricks and heuristics to make it practically applicable. With sound proofs, detailed experimental results and nifty tricks, it was a really engaging and interesting read!
Footnote: the ID(z_l, z_r) condition in the DP recursion is True when every cell enclosed in the region gives the same majority class, i.e. the region is homogeneous.
Week 2 paper: "Born-Again Tree Ensembles" by Thibaut Vidal and Maximilian Schiffer.