In two of my previous blog posts, I explained how the black box of a random forest can be opened up by tracking decision paths along the trees and computing feature contributions. This way, any prediction can be decomposed into contributions from features, such that prediction = bias + feature_1 contribution + … + feature_n contribution.
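As a quick recap, here is a minimal sketch of that per-feature breakdown (treeinterpreter usage as in the earlier posts; the Boston housing data is just an arbitrary example, and load_boston only ships with older scikit-learn versions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from treeinterpreter import treeinterpreter as ti

boston = load_boston()
rf = RandomForestRegressor()
rf.fit(boston.data[:300], boston.target[:300])

prediction, bias, contributions = ti.predict(rf, boston.data[[300]])

# The breakdown is exact: prediction = bias + sum of per-feature contributions
assert np.allclose(prediction, bias + np.sum(contributions, axis=1))
```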
However, this linear breakdown is inherently imperfect, since a linear combination of features cannot capture interactions between them. A classic example of a relation where a linear combination of inputs cannot capture the output is exclusive or (XOR), defined as:
X1 | X2 | OUT |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
In this case, neither X1 nor X2 provides anything towards predicting the outcome in isolation. Their values only become predictive in conjunction with the other input feature.
A decision tree can easily learn a function to classify the XOR data correctly via a two-level tree (depicted below). However, if we consider feature contributions at each node, then at the first step through the tree (when we have looked only at X1), we haven't yet moved away from the bias, so the best we can predict at that stage is still "don't know", i.e. 0.5. If we had to write down the contribution from the feature at the root of the tree, we would (incorrectly) say that it is 0. After the next step down the tree we can make the correct prediction, and at that stage we might say that the second feature provided all the predictive power, since we moved from a coin flip (predicting 0.5) to a concrete and correct prediction, either 0 or 1. But attributing this to the second-level variable alone is clearly wrong: the contribution comes from both features and should be attributed to both equally.
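As a quick illustration, here is a minimal sketch with scikit-learn, where the dataset is just the four rows of the truth table above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The XOR truth table from above
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# A two-level tree is enough to separate all four points
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)
print(tree.predict(X))  # [0 1 1 0] -- classifies the XOR data correctly
```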
This information is of course available along the tree paths. We simply need to gather together all the conditions (and thus features) along the path that leads to a given node.
As you can see, the contribution of the first feature at the root of the tree is 0 (the value stays at 0.5), while observing the second feature provides the full information needed for the prediction. We can now combine the features along the decision path, and correctly state that X1 and X2 together create the contribution towards the prediction.
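To make that concrete, here is a minimal sketch that walks a single decision path by hand using scikit-learn's tree internals, attributing each change in node value to the set of features observed so far (a regression tree on the XOR data, so the node values are the 0/1 predictions directly):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0.0, 1.0, 1.0, 0.0])
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

t = tree.tree_
x = X[0]                 # explain the prediction for the row [0, 0]
bias = t.value[0][0][0]  # mean target at the root: 0.5
node, seen, contribs = 0, set(), {}
while t.children_left[node] != -1:  # -1 marks a leaf
    seen.add(t.feature[node])
    if x[t.feature[node]] <= t.threshold[node]:
        child = t.children_left[node]
    else:
        child = t.children_right[node]
    # Attribute the change in value to all features seen along the path so far
    key = tuple(sorted(seen))
    contribs[key] = contribs.get(key, 0.0) + t.value[child][0][0] - t.value[node][0][0]
    node = child

# The root feature alone contributes 0; the X1/X2 pair jointly contributes -0.5,
# moving the prediction from the 0.5 bias down to the correct 0.
print(bias, contribs)
```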
The joint contribution calculation is supported by v0.2 of the treeinterpreter package (clone or install via pip). Joint contributions can be obtained by passing the joint_contributions argument to the predict method, which returns the triple [prediction, contributions, bias], where contributions maps tuples of feature indices to absolute contributions.

Here's an example, comparing two subsets of the Boston housing data and calculating which feature combinations contribute to the difference in estimated prices.
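First, the setup, as a minimal sketch (load_boston only ships with older scikit-learn versions, and since the forest is not seeded, the exact numbers will vary slightly between runs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

boston = load_boston()
rf = RandomForestRegressor()
rf.fit(boston.data[:300], boston.target[:300])

# Two disjoint subsets of the remaining data to compare
ds1 = boston.data[300:400]
ds2 = boston.data[400:]

print(np.mean(rf.predict(ds1)))
print(np.mean(rf.predict(ds2)))
```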
The average predicted price differs between the two datasets: 21.9329 versus 17.8863207547. We can break down why, and check the joint feature contributions for both.
Since the biases are equal for both datasets (the model is the same), the difference between the average predicted values has to come only from the (joint) feature contributions. In other words, the sum of the feature contribution differences should be equal to the difference in average prediction. We can make use of the aggregated_contributions convenience method, which takes the contributions for individual predictions and aggregates them together for the whole dataset.
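Here is a sketch of that check, continuing from the snippet above. It assumes the v0.2 API exactly as described in this post: the joint_contributions keyword, the [prediction, contributions, bias] return order, and an aggregated_contributions helper exposed by the package (treat these names as belonging to that version).

```python
from treeinterpreter import treeinterpreter as ti

# Keyword and return order as described above for treeinterpreter v0.2
prediction1, contributions1, bias1 = ti.predict(rf, ds1, joint_contributions=True)
prediction2, contributions2, bias2 = ti.predict(rf, ds2, joint_contributions=True)

# Aggregate the per-prediction contribution maps over each whole dataset
aggregated1 = ti.aggregated_contributions(contributions1)
aggregated2 = ti.aggregated_contributions(contributions2)

# The summed contribution differences should equal the difference
# between the two datasets' average predictions
print(sum(aggregated1.values()) - sum(aggregated2.values()))
print(np.mean(prediction1) - np.mean(prediction2))
```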
Both sums come out to 4.04657924528. Indeed, the contributions exactly match the difference, as they should.
Finally, we can check which feature combinations contributed how much to the difference in predictions between the two datasets:
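Continuing the sketch, we can map the feature index tuples back to feature names and rank the combinations by the magnitude of their contribution difference:

```python
# Contribution difference for every feature combination seen in either dataset
res = []
for k in set(aggregated1) | set(aggregated2):
    names = [boston.feature_names[i] for i in k]
    res.append((names, aggregated1.get(k, 0.0) - aggregated2.get(k, 0.0)))

# The ten combinations with the largest absolute difference
for names, diff in sorted(res, key=lambda x: -abs(x[1]))[:10]:
    print((names, diff))
```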
(['RM', 'LSTAT'], 2.0317570671740883)
(['RM'], 0.69252072064203141)
(['CRIM', 'RM', 'LSTAT'], 0.37069750747155134)
(['RM', 'AGE'], 0.11572468903150034)
(['INDUS', 'RM', 'AGE', 'LSTAT'], 0.054158313631716165)
(['CRIM', 'RM', 'AGE', 'LSTAT'], -0.030778806073267474)
(['CRIM', 'RM', 'PTRATIO', 'LSTAT'], 0.022935961564662693)
(['CRIM', 'INDUS', 'RM', 'AGE', 'TAX', 'LSTAT'], 0.022200426774483421)
(['CRIM', 'RM', 'DIS', 'LSTAT'], 0.016906509656987388)
(['CRIM', 'INDUS', 'RM', 'AGE', 'LSTAT'], -0.016840238405056267)
The majority of the difference came from the number of rooms feature (RM), in conjunction with the demographics data (LSTAT).
Making random forest predictions interpretable is pretty straightforward, and leads to a level of interpretability similar to that of linear models. However, in some cases tracking the feature interactions can be important, and representing the results as a linear combination of features can then be misleading. By using the joint_contributions keyword for prediction in the treeinterpreter package, one can trivially take feature interactions into account when breaking down the contributions.