The goal is to provide robust explanations for the predictions made by a machine learning model. Notice the emphasis on the word robust. Unlike other explainability methods, this method specifically optimizes for the robustness of the explanation. It does so by relating the concept of distributional robustness to that of adversarial robustness, and thereby using adversarial training.
Post Hoc Explanation: Explaining the predictions made by a model after it has already been trained. The paper proposes a post-hoc method.
Adversarial Robustness: In general, if your model is adversarially robust, then making small changes to an input point will not change the output of the model considerably. In the context of explainability, the explanations for two points should not vary significantly if I make a small change to one point's features. Note that this is at the point level.
Distribution Shift: This means that the training and test distributions differ. For instance, you use training data from last year to learn a credit model, but for some reason, say a pandemic, your current test distribution changes drastically. Here, you have a complete distribution shift scenario. Current explainability methods do not deal well with this distribution shift. Note that this is at the distribution level.
Adversarial Training: Instead of simply minimizing the training loss on each sample/batch, in adversarial training we minimize the loss around a perturbation generated near the sample/batch. Also, instead of a single optimization problem where we just minimize the loss, this uses a min-max formulation: first we find the perturbation around the point that maximizes the loss, and then we minimize the loss under that worst-case perturbation. This is what leads to robustness in the region.
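As a rough illustration of this min-max loop, here is a minimal sketch of one adversarial training step in PyTorch. It is not the paper's code; the PGD-style inner loop, the L-infinity ball of radius eps, and the function names are all illustrative choices.

```python
import torch

def adversarial_training_step(model, loss_fn, optimizer, x, y,
                              eps=0.1, alpha=0.02, steps=5):
    """One min-max step: the inner loop searches for a worst-case
    perturbation delta inside an L-infinity ball of radius eps, the
    outer step minimizes the loss at that perturbed input."""
    delta = torch.zeros_like(x, requires_grad=True)

    # Inner maximization: gradient ascent on the loss w.r.t. delta (PGD).
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # move toward higher loss
            delta.clamp_(-eps, eps)              # project back into the ball
        delta.grad.zero_()

    # Outer minimization: ordinary gradient step on the worst-case loss.
    optimizer.zero_grad()
    loss = loss_fn(model(x + delta.detach()), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```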
The authors propose a method called ROPE (RObust Post hoc Explanations). Similar to other post hoc methods like LIME, ROPE uses a surrogate model that is inherently interpretable to provide explanations for a complex black box model. The key difference lies in the optimization procedure for that surrogate model: ROPE uses adversarial training. So, instead of a standard learning procedure for a linear surrogate model that just minimizes the loss between the explanations provided by the surrogate model and the output of the black box model, ROPE also searches for a worst-case perturbation around the point before minimizing the loss.
Mathematically, this is how methods like LIME generate their explanations-
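In rough form (a reconstruction, with ℓ denoting the pointwise loss, p the data distribution, and B* the black box model):

$$
\hat{E} \;=\; \arg\min_{E} \; \mathbb{E}_{x \sim p}\Big[\, \ell\big(E(x),\, B^{*}(x)\big) \,\Big]
$$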
Here the loss is between the explanation E(x) and the black box output B*(x), and we minimize it in expectation over the data distribution p.
However, for ROPE we first define a shifted distribution. Let p be a distribution over X, and let δ ∈ R^n. Then the δ-shifted distribution is p_δ(x) = p(x − δ). We then formulate the explanation over this shifted distribution like so-
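Roughly, the objective takes this min-max form (again a reconstruction in the same notation, with Δ the set of allowed shifts):

$$
\hat{E} \;=\; \arg\min_{E} \; \max_{\delta \in \Delta} \; \mathbb{E}_{x \sim p_{\delta}}\Big[\, \ell\big(E(x),\, B^{*}(x)\big) \,\Big]
$$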
Here, we have an inner maximization problem where we find the shift δ (within the allowed set Δ) that maximizes the loss, and an outer problem where we minimize the loss under that worst-case shift.
The authors define a general class of distribution shifts against which they want their explanations to be stable and robust. This is the set Δ over which the inner maximization over δ is performed.
The authors also prove that, to generate robust linear explanations, this framework can be reduced to one similar to standard adversarial training, and they then perform gradient descent to learn these robust linear surrogate models.
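A minimal sketch of what such a procedure could look like for a linear surrogate with a squared loss. Everything here (the function name, the L-infinity shift set, the alternating inner/outer gradient steps, and treating the black box output as locally constant with respect to δ) is an illustrative assumption, not the paper's exact algorithm.

```python
import numpy as np

def fit_robust_linear_surrogate(black_box, X, eps=0.5, lr=0.01,
                                inner_steps=10, outer_steps=500):
    """Fit a linear surrogate w by alternating:
    - inner maximization: gradient ascent on a shift delta (kept inside an
      L-infinity ball of radius eps) to maximize the surrogate's squared
      error against the black box on the shifted data,
    - outer minimization: gradient descent on w under that worst-case shift.
    `black_box` is assumed to map an (n, d) array to n predictions."""
    n, d = X.shape
    w = np.zeros(d)
    delta = np.zeros(d)

    def loss_grads(w, delta):
        Xs = X + delta                      # data under the delta-shifted distribution
        r = Xs @ w - black_box(Xs)          # surrogate residuals vs. the black box
        grad_w = 2 * Xs.T @ r / n
        grad_delta = 2 * np.mean(r) * w     # treats black_box(Xs) as locally constant
        return grad_w, grad_delta

    for _ in range(outer_steps):
        for _ in range(inner_steps):        # inner maximization over delta
            _, g_delta = loss_grads(w, delta)
            delta = np.clip(delta + lr * g_delta, -eps, eps)
        g_w, _ = loss_grads(w, delta)       # outer minimization over w
        w -= lr * g_w
    return w
```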
To generate robust rule-based explanations, they propose a sampling-based strategy, since adversarial training using gradient descent cannot be performed in this case.
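A hedged sketch of what one such sampling-based selection could look like. The candidate rule models, the shift sampler, and the 0-1 disagreement loss are all illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_robust_rules(candidate_rule_models, black_box, X, shift_sampler,
                        n_shifts=50):
    """Approximate the inner max by sampling shifts from Delta, then keep the
    candidate rule model whose worst sampled-shift disagreement with the
    black box is smallest. Candidates are assumed to expose `predict(X)`."""
    deltas = [shift_sampler() for _ in range(n_shifts)]
    best_rules, best_worst_case = None, np.inf
    for rules in candidate_rule_models:
        worst = max(np.mean(rules.predict(X + d) != black_box(X + d))
                    for d in deltas)         # worst case over sampled shifts
        if worst < best_worst_case:
            best_rules, best_worst_case = rules, worst
    return best_rules, best_worst_case
```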
The experiments were carried out using thoughtfully chosen datasets. The authors use 3 real-world datasets, each of which has two groups of data. For instance, they use electronic health records data that was collected at two different hospitals. To test the method, they train the model and build explanations on one group (training data) and then evaluate those explanations on the other (test data). They show that their method has the smallest drop in fidelity from one group to the other, showing that their explanations hold up even on shifted data. The comparisons are made with methods like LIME, SHAP and MUSE. They also run similar experiments on synthetic data, producing shifted test distributions manually by adding correlation, variance and mean shifts.

Experiments are also performed to check the correctness of the explanations. To do this, black box models that are themselves interpretable are trained on the training data. Then, an explanation is generated on the shifted data. If this explanation model is structurally similar to the interpretable black box model, then the explanations generated are robust to being used on shifted data. An example of this is to first choose a linear model as the black box and train it on the training data. Then, adversarially train a linear surrogate model using ROPE on the shifted data (test data) and see how much the coefficients of the two linear models differ. The less they differ, the better!
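A toy version of that last check might look like this (hypothetical data and an artificially mean-shifted test group; it reuses the fit_robust_linear_surrogate sketch from above, so none of this is the paper's actual experimental code):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X_train = rng.normal(size=(500, 5))
X_test = X_train + 0.5                      # manually mean-shifted "test" group

# 1. Train an interpretable (linear) black box on the training group.
black_box = Ridge().fit(X_train, X_train @ true_w)

# 2. Fit a robust linear surrogate on the shifted group.
w = fit_robust_linear_surrogate(black_box.predict, X_test)

# 3. Compare coefficients: the smaller the gap, the more faithful the explanation.
print(np.abs(black_box.coef_ - w).sum())
```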
by Himabindu Lakkaraju, Nino Arsov, Osbert Bastani at ICML '20
If you did not have much context on post hoc explainability methods before reading this article, it might be helpful for you to read the linked paper summary first and then come back to this!
Week 3.