Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The initial focus of the library is on blackbox, instance-based model explanations.
Goals¶
Provide high quality reference implementations of blackbox ML model explanation and interpretation algorithms
Define a consistent API for interpretable ML methods
Support multiple use cases (e.g. tabular, text and image data classification, regression)
Getting Started¶
Features¶
Alibi is a Python package designed to help explain the predictions of machine learning models, gauge the confidence of predictions and eventually support wider capabilities of inspecting the performance of models with respect to concept drift and algorithmic bias. The focus of the library is to support the widest range of models using blackbox methods where possible.
To get a list of the latest available model explanation algorithms, you can type:
import alibi
alibi.explainers.__all__
['AnchorTabular',
'AnchorText',
'AnchorImage',
'CEM',
'CounterFactual',
'CounterFactualProto',
'KernelShap']
For gauging model confidence:
alibi.confidence.__all__
['linearity_measure',
'LinearityMeasure',
'TrustScore']
Detailed information on each method can be found in the method-specific documentation pages.
Basic Usage¶
We will use the Anchor method on tabular data to illustrate the usage of explainers in Alibi.
First, we import the explainer:
from alibi.explainers import AnchorTabular
Next, we initialize it by passing it a prediction function and any other necessary arguments:
explainer = AnchorTabular(predict_fn, feature_names)
Some methods require an additional .fit
step which requires access to the training set the model
was trained on:
explainer.fit(X_train)
AnchorTabular(meta={
'name': 'AnchorTabular',
'type': ['blackbox'],
'explanations': ['local'],
'params': {'seed': None, 'disc_perc': (25, 50, 75)}
})
Finally, we can call the explainer on a test instance which will return an Explanation
object containing the
explanation and any additional metadata returned by the computation:
explanation = explainer.explain(x)
The returned Explanation
object has meta
and data
attributes which are dictionaries containing any explanation
metadata (e.g. parameters, type of explanation) and the explanation itself respectively:
explanation.meta
{'name': 'AnchorTabular',
'type': ['blackbox'],
'explanations': ['local'],
'params': {'seed': None,
'disc_perc': (25, 50, 75),
'threshold': 0.95,
'delta': ...truncated output...
explanation.data
{'anchor': ['petal width (cm) > 1.80', 'sepal width (cm) <= 2.80'],
'precision': 0.9839228295819936,
'coverage': 0.31724137931034485,
'raw': {'feature': [3, 1],
'mean': [0.6453362255965293, 0.9839228295819936],
'precision': [0.6453362255965293, 0.9839228295819936],
'coverage': [0.20689655172413793, 0.31724137931034485],
'examples': ...truncated output...
The top level keys of both meta
and data
dictionaries are also exposed as attributes for ease of use of the explanation:
explanation.anchor
['petal width (cm) > 1.80', 'sepal width (cm) <= 2.80']
The exact details will vary slightly from method to method, so we encourage the reader to become familiar with the types of algorithms supported in Alibi.
Algorithm overview¶
This page provides a high-level overview of the algorithms and their features currently implemented in Alibi.
Model Explanations¶
These algorithms provide instance-specific (sometimes also called local) explanations of ML model predictions. Given a single instance and a model prediction they aim to answer the question “Why did my model make this prediction?” The following algorithms all work with blackbox models meaning that the only requirement is to have access to a prediction function (which could be an API endpoint for a model in production). Note that local explanations can be combined to give insights into global model behaviour, but this comes at the expense of a significant runtime increase.
The following table summarizes the capabilities of the current algorithms:
| Explainer | Model types | Classification | Categorical data | Tabular | Text | Images | Need training set |
|---|---|---|---|---|---|---|---|
| Anchors | blackbox | ✔ | ✔ | ✔ | ✔ | ✔ | For Tabular |
| CEM | blackbox, TF/Keras | ✔ | ✘ | ✔ | ✘ | ✔ | Optional |
| Counterfactual Instances | blackbox, TF/Keras | ✔ | ✘ | ✔ | ✘ | ✔ | No |
| Kernel SHAP | blackbox | ✔ | ✔ | ✔ | ✘ | ✘ | ✔ |
| Prototype Counterfactuals | blackbox, TF/Keras | ✔ | ✔ | ✔ | ✘ | ✔ | Optional |
Anchor explanations: produce an “anchor”  a small subset of features and their ranges that will almost always result in the same model prediction. Documentation, tabular example, text classification, image classification.
Contrastive explanation method (CEM): produce a pertinent positive (PP) and a pertinent negative (PN) instance. The PP instance finds the features that should be minimally and sufficiently present to predict the same class as the original prediction (a PP acts as the “most compact” representation of the instance to keep the same prediction). The PN instance identifies the features that should be minimally and necessarily absent to maintain the original prediction (a PN acts as the closest instance that would result in a different prediction). Documentation, tabular example, image classification.
Counterfactual instances: generate counterfactual examples using a simple loss function. Documentation, image classification.
Kernel Shapley Additive Explanations (SHAP): attribute the change of a model output with respect to a given baseline (e.g., the average over a training set) to each of the model features. This is achieved for each feature in turn, by averaging the difference in the model output observed when excluding a feature from the input. The exclusion of a feature is achieved by replacing it with values from the background dataset. Documentation, continuous data, more continuous data, categorical data.
Prototype Counterfactuals: generate counterfactuals guided by nearest class prototypes other than the class predicted on the original instance. It can use both an encoder or kd trees to define the prototypes. This method can speed up the search, especially for black box models, and create interpretable counterfactuals. Documentation, tabular example, tabular example with categorical features, image classification.
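To make the Kernel SHAP averaging idea above concrete, here is a toy sketch (not the Alibi or shap library API; `shapley_2feat` and the additive model are made up for illustration) that computes exact Shapley values for a two-feature model by averaging each feature's marginal contribution over both orderings, replacing "absent" features with baseline values:

```python
def shapley_2feat(f, x, baseline):
    """Exact Shapley values for a 2-feature model: average each feature's
    marginal contribution over both orderings, replacing 'absent' features
    with their baseline values."""
    b1, b2 = baseline
    x1, x2 = x
    # feature 1: contribution when added first and when added second
    phi1 = 0.5 * ((f(x1, b2) - f(b1, b2)) + (f(x1, x2) - f(b1, x2)))
    phi2 = 0.5 * ((f(b1, x2) - f(b1, b2)) + (f(x1, x2) - f(x1, b2)))
    return phi1, phi2

# toy additive model: attributions recover each feature's own effect
f = lambda a, b: a + 2 * b
phi1, phi2 = shapley_2feat(f, x=(3.0, 1.0), baseline=(0.0, 0.0))
```

For the additive model the attributions sum to the difference between the model output on the instance and on the baseline, as Shapley values must.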
Model Confidence¶
These algorithms provide instance-specific scores measuring the model confidence for making a particular prediction.
| Algorithm | Model types | Classification | Regression | Categorical data | Tabular | Text | Images | Need training set |
|---|---|---|---|---|---|---|---|---|
| Trust Scores | blackbox | ✔ | ✘ | ✘ | ✔ | ✔1 | ✔2 | Yes |
| Linearity Measure | blackbox | ✔ | ✔ | ✘ | ✔ | ✘ | ✔ | Optional |
Trust scores: produce a “trust score” of a classifier’s prediction. The trust score is the ratio between the distance to the nearest class different from the predicted class and the distance to the predicted class; higher scores correspond to more trustworthy predictions. Documentation, tabular example, image classification
Linearity measure: produces a score quantifying how linear the model is around a test instance. The linearity score measures the model linearity around a test instance by feeding the model linear superpositions of inputs and comparing the outputs with the linear combination of outputs from predictions on single inputs. Tabular example, image classification
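The linearity measure idea can be sketched in a few lines of NumPy (a hypothetical `linearity_score` helper for illustration, not Alibi's `LinearityMeasure` implementation): feed the model a linear superposition of two inputs and compare the output with the same superposition of the individual outputs.

```python
import numpy as np

def linearity_score(predict, x1, x2, alpha=0.5):
    """Compare the model output on a linear superposition of two inputs
    with the same superposition of the individual outputs. A score of 0
    indicates perfectly linear behaviour between x1 and x2."""
    superposed = predict(alpha * x1 + (1 - alpha) * x2)
    combined = alpha * predict(x1) + (1 - alpha) * predict(x2)
    return float(np.linalg.norm(superposed - combined))

# a linear model scores (numerically) zero
W = np.array([[1.0, -2.0], [0.5, 3.0]])
linear_model = lambda x: W @ x
score = linearity_score(linear_model, np.array([1.0, 2.0]), np.array([-3.0, 0.5]))
```

A non-linear model (e.g. one with a ReLU between the matrix multiplications) would generally yield a strictly positive score.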
Roadmap¶
Alibi aims to be the go-to library for ML model interpretability and monitoring. There are multiple challenges for developing a high-quality, production-ready library that achieves this. In addition to having high quality reference implementations of the most promising algorithms, we need extensive documentation and case studies comparing the different interpretability methods and their respective pros and cons. A clean and usable API is also a priority.
Short term¶
Ongoing optimizations of existing algorithms (speed, parallelisation, explanation quality)
Finalize a unified API (Github PR)
Initial visualizations and visualization backends (Github issue)
Whitebox explanation methods (e.g. Integrated Gradients)
Support both TensorFlow and PyTorch for whitebox methods
Medium term¶
Migrate counterfactual methods to TensorFlow 2.0 (Github issue)
Additional blackbox explanation methods (ALE)
Additional model confidence/calibration methods
Long term¶
Explanations for regression models
Explanations for sequential and structured data
This page was generated from doc/source/methods/Anchors.ipynb.
Anchors¶
Overview¶
The anchor algorithm is based on the Anchors: High-Precision Model-Agnostic Explanations paper by Ribeiro et al. and builds on the open source code from the paper’s first author.
The algorithm provides model-agnostic (black box) and human interpretable explanations suitable for classification models applied to images, text and tabular data. The idea behind anchors is to explain the behaviour of complex models with high-precision rules called anchors. These anchors are locally sufficient conditions to ensure a certain prediction with a high degree of confidence.
Anchors address a key shortcoming of local explanation methods like LIME which proxy the local behaviour of the model in a linear way. It is however unclear to what extent the explanation holds up in the region around the instance to be explained, since both the model and data can exhibit non-linear behaviour in the neighborhood of the instance. This approach can easily lead to overconfidence in the explanation and misleading conclusions on unseen but similar instances. The anchor algorithm tackles this issue by incorporating coverage, the region where the explanation applies, into the optimization problem. A simple example from sentiment classification illustrates this (Figure 1). Depending on the sentence, the occurrence of the word not is interpreted as positive or negative for the sentiment by LIME. It is clear that the explanation using not is very local. Anchors however aim to maximize the coverage, and require not to occur together with good or bad to ensure respectively negative or positive sentiment.
Ribeiro et al., Anchors: High-Precision Model-Agnostic Explanations, 2018
As highlighted by the above example, an anchor explanation consists of if-then rules, called the anchors, which sufficiently guarantee the explanation locally and try to maximize the area for which the explanation holds. This means that as long as the anchor holds, the prediction should remain the same regardless of the values of the features not present in the anchor. Going back to the sentiment example: as long as not good is present, the sentiment is negative, regardless of the other words in the movie review.
Text¶
For text classification, an interpretable anchor consists of the words that need to be present to ensure a prediction, regardless of the other words in the input. The words that are not present in a candidate anchor can be sampled in 2 ways:
Replace word token by UNK token.
Replace word token by sampled token from a corpus with the same POS tag and probability proportional to the similarity in the embedding space. By sampling similar words, we keep more context than simply using the UNK token.
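The first sampling strategy can be sketched in plain Python (the `perturb_with_unk` helper is illustrative, not Alibi's internal code): tokens outside the candidate anchor are replaced by the UNK token with some sample probability, while anchor tokens are always kept.

```python
import random

def perturb_with_unk(tokens, anchor_idx, sample_proba=0.5, unk='UNK'):
    """Replace each token outside the candidate anchor with the UNK token
    with probability `sample_proba`; anchor tokens are always kept."""
    out = []
    for i, tok in enumerate(tokens):
        if i in anchor_idx or random.random() >= sample_proba:
            out.append(tok)
        else:
            out.append(unk)
    return out

random.seed(0)
tokens = 'This is a good book .'.split()
# candidate anchor = {'good'} (index 3): it survives every perturbation
perturbed = perturb_with_unk(tokens, anchor_idx={3})
```

The model is then queried on many such perturbed sentences to check whether the prediction stays the same, which is what the anchor's precision measures.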
Tabular Data¶
Anchors are also suitable for tabular data with both categorical and continuous features. The continuous features are discretized into quantiles (e.g. deciles), so they become more interpretable. The features in a candidate anchor are kept constant (same category or bin for discretized features) while we sample the other features from a training set. As a result, anchors for tabular data need access to training data. Let’s illustrate this with an example. Say we want to predict whether a person makes less or more than £50,000 per year based on the person’s characteristics including age (continuous variable) and marital status (categorical variable). The following would then be a potential anchor: Hugo makes more than £50,000 because he is married and his age is between 35 and 45 years.
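The quantile discretization step can be sketched with NumPy (an illustrative sketch, not Alibi's internals; the percentiles match the default disc_perc=(25, 50, 75)):

```python
import numpy as np

# discretize a continuous feature (age) into quartile bins, as the tabular
# anchor does with disc_perc=(25, 50, 75)
rng = np.random.default_rng(0)
ages = rng.uniform(18, 90, size=1000)
percentiles = np.percentile(ages, [25, 50, 75])
bins = np.digitize(ages, percentiles)  # 0..3: which quartile each age falls in
```

An anchor condition like "age is between 35 and 45" then corresponds to fixing the bin index of the discretized feature while the remaining features are resampled from the training set.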
Images¶
Similar to LIME, images are first segmented into superpixels, maintaining local image structure. The interpretable representation then consists of the presence or absence of each superpixel in the anchor. It is crucial to generate meaningful superpixels in order to arrive at interpretable explanations. The algorithm supports a number of standard image segmentation algorithms (felzenszwalb, slic and quickshift) and allows the user to provide a custom segmentation function.
The superpixels not present in a candidate anchor can be masked in 2 ways:
Take the average value of that superpixel.
Use the pixel values of a superimposed picture over the masked superpixels.
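Both masking strategies above can be sketched with NumPy (the `mask_superpixels` helper is illustrative, not Alibi's internal code): superpixels outside the candidate anchor are replaced either by their average value or by the pixels of a superimposed background image.

```python
import numpy as np

def mask_superpixels(image, segments, keep, background=None):
    """Replace every superpixel not in `keep` with either its average value
    or the corresponding pixels of a superimposed background image."""
    out = image.copy().astype(float)
    for s in np.unique(segments):
        if s in keep:
            continue
        mask = segments == s
        out[mask] = image[mask].mean() if background is None else background[mask]
    return out

# toy 2x2 image split into two vertical superpixels; keep only segment 0
image = np.array([[1.0, 10.0], [3.0, 20.0]])
segments = np.array([[0, 1], [0, 1]])
masked = mask_superpixels(image, segments, keep={0})
```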
Ribeiro et al., Anchors: HighPrecision ModelAgnostic Explanations, 2018
Efficiently Computing Anchors¶
The anchor needs to return the same prediction as the original instance with a minimal confidence of e.g. 95%. If multiple candidate anchors satisfy this constraint, we go with the anchor that has the largest coverage. Because the number of potential anchors is exponential in the feature space, we need a faster approximate solution.
The anchors are constructed bottom-up in combination with beam search. We start with an empty rule or anchor, and incrementally add an if-then rule in each iteration until the minimal confidence constraint is satisfied. If multiple valid anchors are found, the one with the largest coverage is returned.
In order to select the best candidate anchors for the beam width efficiently during each iteration, we formulate the problem as a pure exploration multi-armed bandit problem. This limits the number of model prediction calls, which can be a computational bottleneck.
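The quantity the bandit estimates for each candidate anchor, its precision over a batch of perturbed samples, can be sketched as follows (the helper names and toy model are illustrative, not Alibi's internals):

```python
import numpy as np

def estimate_precision(predict, x, anchor_idx, sample_fn, target, n_samples=1000):
    """Estimate the fraction of perturbed samples, with the anchor features
    pinned to x's values, that keep the model's original prediction `target`."""
    X = sample_fn(n_samples)      # draw perturbed instances
    X[:, anchor_idx] = x[anchor_idx]  # pin the anchor features
    return float(np.mean(predict(X) == target))

# toy model: class 1 iff feature 0 > 0; anchoring feature 0 gives precision 1
rng = np.random.default_rng(0)
predict = lambda X: (X[:, 0] > 0).astype(int)
x = np.array([0.7, -0.2])
precision = estimate_precision(predict, x, anchor_idx=[0],
                               sample_fn=lambda n: rng.normal(size=(n, 2)),
                               target=1)
```

The bandit formulation decides how many such samples each candidate anchor receives before the beam keeps only the most promising ones.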
For more details, we refer the reader to the original paper.
Usage¶
While each data type has specific requirements to initialize the explainer and return explanations, the underlying algorithm to construct the anchors is the same.
In order to efficiently generate anchors, the following hyperparameters need to be set to sensible values when calling the explain method:

- threshold: the previously discussed minimal confidence level. threshold defines the minimum fraction of samples for a candidate anchor that need to lead to the same prediction as the original instance. A higher value gives more confidence in the anchor, but also leads to more computation time. The default value is 0.95.
- tau: determines when we assume convergence for the multi-armed bandit. A bigger value for tau means faster convergence but also looser anchor conditions. By default equal to 0.15.
- beam_size: the size of the beam width. A bigger beam width can lead to a better overall anchor at the expense of more computation time.
- batch_size: the batch size used for sampling. A bigger batch size gives more confidence in the anchor, again at the expense of computation time since it involves more model prediction calls. The default value is 100.
- coverage_samples: number of samples used to compute the coverage of the anchor. By default set to 10000.
Text¶
Initialization¶
Since the explainer works on black box models, only access to a predict function is needed. The model below is a simple logistic regression trained on movie reviews with negative or positive sentiment and preprocessed with a CountVectorizer:
predict_fn = lambda x: clf.predict(vectorizer.transform(x))
If we choose to sample similar words from a corpus, we first need to load a spaCy model:
import spacy
from alibi.utils.download import spacy_model
model = 'en_core_web_md'
spacy_model(model=model)
nlp = spacy.load(model)
We can now initialize our explainer:
explainer = AnchorText(nlp, predict_fn)
Explanation¶
Let’s define the instance we want to explain and verify that the sentiment prediction on the original instance is positive:
text = 'This is a good book .'
class_names = ['negative', 'positive']
pred = class_names[predict_fn([text])[0]]
Now we can explain the instance:
explanation = explainer.explain(text, threshold=0.95, use_similarity_proba=False,
use_unk=True, sample_proba=0.5)
We set the confidence threshold at 95%. use_unk equals True means that we replace words outside of the candidate anchor with UNK tokens with a sample probability equal to sample_proba. Instead of using UNK tokens, we can sample from the top_n similar words to the ground truth word in the corpus by setting use_unk to False.
explanation = explainer.explain(text, threshold=0.95, use_unk=False, sample_proba=0.5, top_n=100)
It is also possible to sample words from the corpus proportional to the word similarity with the ground truth word by setting use_similarity_proba to True and use_unk to False. We can put more weight on similar words by decreasing the temperature argument. The following explanation perturbs original tokens with probability equal to sample_proba. The perturbed tokens are then sampled from the top_n most similar tokens in the corpus with sample probability proportional to the word similarity with the original token.
explanation = explainer.explain(text, threshold=0.95, use_similarity_proba=True, use_unk=False,
sample_proba=0.5, top_n=20, temperature=0.2)
The explain method returns a dictionary that contains key: value pairs for:

- names: the words in the anchor.
- precision: the fraction of times the sampled instances where the anchor holds yield the same prediction as the original instance. The precision will always be \(\geq\) threshold for a valid anchor.
- coverage: the coverage of the anchor over a sampled part of the training set.
Under the raw key, the dictionary also contains example instances where the anchor holds and the prediction is the same as on the original instance, as well as examples where the anchor holds but the prediction changed, to give the user a sense of where the anchor fails. raw also stores information on the names, precision and coverage of partial anchors. This allows the user to track the improvement in, for instance, the precision as more features (words in the case of text) are added to the anchor.
Tabular Data¶
Initialization and fit¶
To initialize the explainer, we provide a predict function, a list with the feature names to make the anchors easy to understand, as well as an optional mapping from the encoded categorical features to a description of the category. An example for categorical_names would be category_map = {0: ['married', 'divorced'], 3: ['high school diploma', "master's degree"]}. Each key in category_map refers to the column index in the input for the relevant categorical variable, while the values are lists with the options for each categorical variable. To make this easy, we provide a utility function gen_category_map to generate this map automatically from a Pandas dataframe:
from alibi.utils.data import gen_category_map
category_map = gen_category_map(df)
Then initialize the explainer:
predict_fn = lambda x: clf.predict(preprocessor.transform(x))
explainer = AnchorTabular(predict_fn, feature_names, categorical_names=category_map)
Tabular data requires a fit step to map the ordinal features into quantiles and therefore needs access to a representative set of the training data. disc_perc
is a list with percentiles used for binning:
explainer.fit(X_train, disc_perc=[25, 50, 75])
Explanation¶
Let’s check the prediction of the model on the original instance and explain:
class_names = ['<=50K', '>50K']
pred = class_names[explainer.predict_fn(X)[0]]
explanation = explainer.explain(X, threshold=0.95)
The returned explanation dictionary contains the same key: value pairs as the text explainer, so you could explain a prediction as follows:
Prediction: <=50K
Anchor: Marital Status = Never-Married AND Relationship = Own-child
Precision: 1.00
Coverage: 0.13
Images¶
Initialization¶
Besides the predict function, we also need to specify either a built-in or custom superpixel segmentation function. The built-in methods are felzenszwalb, slic and quickshift. It is important to create sensible superpixels in order to speed up convergence and generate interpretable explanations. Tuning the hyperparameters of the segmentation method is recommended.
explainer = AnchorImage(predict_fn, image_shape, segmentation_fn='slic',
segmentation_kwargs={'n_segments': 15, 'compactness': 20, 'sigma': .5},
images_background=None)
Example of superpixels generated for the Persian cat picture using the slic method:
The following function would be an example of a custom segmentation function dividing the image into rectangles.
import numpy as np

def superpixel(image, size=(4, 7)):
    segments = np.zeros([image.shape[0], image.shape[1]])
    row_idx, col_idx = np.where(segments == 0)
    for i, j in zip(row_idx, col_idx):
        segments[i, j] = int((image.shape[1] / size[1]) * (i // size[0]) + j // size[1])
    return segments
The images_background parameter allows the user to provide images that are superimposed on the masked superpixels not present in the candidate anchor, instead of taking the average value of the masked superpixel. The superimposed images need to have the same shape as the explained instance.
Explanation¶
We can then explain the instance in the usual way:
explanation = explainer.explain(image, p_sample=.5)
p_sample
determines the fraction of superpixels that are either changed to the average superpixel value or that are superimposed.
The explanation dictionary again contains information about the anchor’s precision, coverage and examples where the anchor does or does not hold. On top of that, it also contains a masked image with only the anchor superpixels visible under the anchor key (see image below) as well as the image’s superpixels under segments.
This page was generated from doc/source/methods/CEM.ipynb.
Contrastive Explanation Method¶
Overview¶
The Contrastive Explanation Method (CEM) is based on the paper Explanations based on the Missing: Towards Contrastive Explanations with Pertinent Negatives and extends the code open sourced by the authors. CEM generates instance-based local black box explanations for classification models in terms of Pertinent Positives (PP) and Pertinent Negatives (PN). For a PP, the method finds the features that should be minimally and sufficiently present (e.g. important pixels in an image) to predict the same class as on the original instance. PN’s on the other hand identify what features should be minimally and necessarily absent from the instance to be explained in order to maintain the original prediction class. The aim of PN’s is not to provide a full set of characteristics that should be absent in the explained instance, but to provide a minimal set that differentiates it from the closest different class. Intuitively, the Pertinent Positives could be compared to Anchors while Pertinent Negatives are similar to Counterfactuals. As the authors of the paper state, CEM can generate clear explanations of the form: “An input x is classified in class y because features \(f_{i}\), …, \(f_{k}\) are present and because features \(f_{m}\), …, \(f_{p}\) are absent.” The current implementation is most suitable for images and tabular data without categorical features.
In order to create interpretable PP’s and PN’s, feature-wise perturbation needs to be done in a meaningful way. To keep the perturbations sparse and close to the original instance, the objective function contains an elastic net (\(\beta L_{1} + L_{2}\)) regularizer. Optionally, an autoencoder can be trained to reconstruct instances of the training set. We can then introduce the \(L_{2}\) reconstruction error of the perturbed instance as an additional loss term in our objective function. As a result, the perturbed instance lies close to the training data manifold.
The ability to add or remove features to arrive at respectively PN’s or PP’s implies that there are feature values that contain no information with regards to the model’s predictions. Consider for instance the MNIST image below where the pixels are scaled between 0 and 1. The pixels with values close to 1 define the number in the image while the background pixels have value 0. We assume that perturbations towards the background value 0 are equivalent to removing features, while perturbations towards 1 imply adding features.
It is intuitive to understand that adding features to get a PN means changing 0’s into 1’s until a different number is formed, in this case changing a 4 into a 9.
To find the PP, we do the opposite and change 1’s from the original instance into 0’s, the background value, and only keep a vague outline of the original 4.
It is however often not trivial to find these non-informative feature values and domain knowledge becomes very important.
For more details, we refer the reader to the original paper.
Usage¶
Initialization¶
The optimizer is defined in TensorFlow (TF) internally. We first load our MNIST classifier and the (optional) autoencoder. The example below uses Keras or TF models. This allows optimization of the objective function to run entirely with automatic differentiation because the TF graph has access to the underlying model architecture. For models built in different frameworks (e.g. scikit-learn), the gradients of part of the loss function with respect to the input features need to be evaluated numerically. We’ll handle this case later.
# define models
cnn = load_model('mnist_cnn.h5')
ae = load_model('mnist_ae.h5')
We can now initialize the CEM explainer:
# initialize CEM explainer
shape = (1,) + x_train.shape[1:]
mode = 'PN'
cem = CEM(cnn, mode, shape, kappa=0., beta=.1,
          feature_range=(x_train.min(), x_train.max()),
          gamma=100, ae_model=ae, max_iterations=1000,
          c_init=1., c_steps=10, learning_rate_init=1e-2,
          clip=(-1000., 1000.), no_info_val=-1.)
Besides passing the predictive and autoencoder models, we set a number of hyperparameters …
… general:
- mode: ‘PN’ or ‘PP’.
- shape: shape of the instance to be explained, starting with the batch dimension. Currently only single explanations are supported, so the batch dimension should be equal to 1.
- feature_range: global or feature-wise min and max values for the perturbed instance.
… related to the optimizer:
- max_iterations: number of loss optimization steps for each value of c; the multiplier of the first loss term.
- learning_rate_init: initial learning rate, follows polynomial decay.
- clip: min and max gradient values.
… related to the noninformative value:
- no_info_val: as explained in the previous section, it is important to define which feature values are considered background and not crucial for the class predictions. For MNIST images scaled between 0 and 1 or -0.5 and 0.5 as in the notebooks, pixel perturbations in the direction of the (low) background pixel value can be seen as removing features, moving towards the non-informative value. As a result, the no_info_val parameter is set at a low value like -1. no_info_val can be defined globally or feature-wise. For most applications, domain knowledge becomes very important here. If a representative sample of the training set is available, we can always (naively) infer a no_info_val by taking the feature-wise median or mean:
cem.fit(x_train, no_info_type='median')
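Assuming x_train is a NumPy array, the feature-wise median inference can be sketched as (illustrative, not Alibi's fit implementation):

```python
import numpy as np

# naive non-informative value: the feature-wise median of a training sample
x_train = np.array([[0.0, 1.0], [0.2, 0.9], [0.1, 1.1]])
no_info_val = np.median(x_train, axis=0)
```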
… related to the objective function:
- c_init and c_steps: the multiplier \(c\) of the first loss term is updated for c_steps iterations, starting at c_init. The first loss term encourages the perturbed instance to be predicted as a different class for a PN and the same class for a PP. If we find a candidate PN or PP for the current value of \(c\), we reduce the value of \(c\) for the next optimization cycle to put more emphasis on the regularization terms and improve the solution. If we cannot find a solution, \(c\) is increased to put more weight on the prediction class restrictions of the PN and PP before focusing on the regularization.
- kappa: the first term in the loss function is defined by a difference between the predicted probabilities for the perturbed instance of the original class and the max of the other classes. \(\kappa \geq 0\) defines a cap for this difference, limiting its impact on the overall loss to be optimized. Similar to the original paper, we set \(\kappa\) to 0 in the examples.
- beta: \(\beta\) is the \(L_{1}\) loss term multiplier. A higher value for \(\beta\) means more weight on the sparsity restrictions of the perturbations. Similar to the paper, we set \(\beta\) to 0.1 for the MNIST and Iris datasets.
- gamma: multiplier for the optional \(L_{2}\) reconstruction error. A higher value for \(\gamma\) means more emphasis on the reconstruction error penalty defined by the autoencoder. Similar to the paper, we set \(\gamma\) to 100 when we have an autoencoder available.
While the paper’s default values for the loss term coefficients worked well for the simple examples provided in the notebooks, it is recommended to test their robustness for your own applications.
Explanation¶
We can finally explain the instance:
explanation = cem.explain(X)
The explain method returns a dictionary with the following key: value pairs:

- X: original instance
- X_pred: predicted class of the original instance
- PN or PP: Pertinent Negative or Pertinent Positive
- PN_pred or PP_pred: predicted class of the PN or PP
- grads_graph: gradient values computed from the TF graph with respect to the input features at the PN or PP
- grads_num: numerical gradient values with respect to the input features at the PN or PP
Numerical Gradients¶
So far, the whole optimization problem could be defined within the internal TF graph, making autodiff possible. It is however possible that we do not have access to the model architecture and weights, and are only provided with a predict
function returning probabilities for each class. We initialize the CEM in the same way as before:
# define model
lr = load_model('iris_lr.h5')
predict_fn = lambda x: lr.predict(x)
# initialize CEM explainer
shape = (1,) + x_train.shape[1:]
mode = 'PP'
cem = CEM(predict_fn, mode, shape, kappa=0., beta=.1,
          feature_range=(x_train.min(), x_train.max()),
          eps=(1e-2, 1e-2), update_num_grad=100)
In this case, we need to evaluate the gradients of the loss function with respect to the input features numerically via the chain rule:

\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial p} \frac{\partial p}{\partial x}\]

where \(L\) is the loss function, \(p\) the predict function and \(x\) the input features to optimize. There are now 2 additional hyperparameters to consider:
- eps: a tuple to define the perturbation size used to compute the numerical gradients. eps[0] and eps[1] are used respectively for \(\frac{\partial L}{\partial p}\) and \(\frac{\partial p}{\partial x}\). eps[0] and eps[1] can be a combination of float values or numpy arrays. For eps[0], the array dimension should be (1 x nb of prediction categories) and for eps[1] it should be (1 x nb of features). For the Iris dataset, eps could look as follows:

eps0 = np.array([[1e-2, 1e-2, 1e-2]])  # 3 prediction categories, equivalent to eps0=1e-2
eps1 = np.array([[1e-2, 1e-2, 1e-2, 1e-2]])  # 4 features, also equivalent to eps1=1e-2
eps = (eps0, eps1)

- update_num_grad: for complex models with a high number of parameters and a high dimensional feature space (e.g. Inception on ImageNet), evaluating numerical gradients can be expensive as they involve prediction calls for each perturbed instance. The update_num_grad parameter allows you to set a batch size on which to evaluate the numerical gradients, reducing the number of prediction calls required.
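A minimal central-difference sketch illustrates what such a numerical gradient involves and why it costs one pair of prediction calls per input dimension (illustrative only; CEM chains \(\partial L/\partial p\) and \(\partial p/\partial x\) rather than differentiating the loss directly):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-2):
    """Central-difference estimate of df/dx: one pair of function
    evaluations per input dimension."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        d = np.zeros_like(x, dtype=float)
        d.flat[i] = eps
        grad.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return grad

# check on f(x) = sum(x^2), whose exact gradient is 2x
x = np.array([1.0, -2.0, 0.5])
g = numerical_gradient(lambda z: np.sum(z ** 2), x)
```

For a high-dimensional input, the loop above translates into many model prediction calls per optimization step, which is the cost update_num_grad amortizes by batching.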
This page was generated from doc/source/methods/CF.ipynb.
Counterfactual Instances¶
Overview¶
A counterfactual explanation of an outcome or a situation \(Y\) takes the form “If \(X\) had not occurred, \(Y\) would not have occurred” (Interpretable Machine Learning). In the context of a machine learning classifier \(X\) would be an instance of interest and \(Y\) would be the label predicted by the model. The task of finding a counterfactual explanation is then to find some \(X^\prime\) that is in some way related to the original instance \(X\) but leading to a different prediction \(Y^\prime\). Reasoning in counterfactual terms is very natural for humans, e.g. asking what should have been done differently to achieve a different result. As a consequence, counterfactual instances for machine learning predictions are a promising method for human-interpretable explanations.
The counterfactual method described here is the most basic way of defining the problem of finding such \(X^\prime\). Our algorithm loosely follows Wachter et al. (2017): Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR. For an extension to the basic method which provides ways of finding higher quality counterfactual instances \(X^\prime\) in a quicker time, please refer to Counterfactuals Guided by Prototypes.
We can reason that the most basic requirements for a counterfactual \(X^\prime\) are as follows:
The predicted class of \(X^\prime\) is different from the predicted class of \(X\)
The difference between \(X\) and \(X^\prime\) should be human-interpretable.
While the first condition is straightforward, the second does not immediately translate into a mathematical condition, as we first need to define “interpretability” in a mathematical sense. For this method we restrict ourselves to a particular definition by asserting that \(X^\prime\) should be as close as possible to \(X\) without violating the first condition. The main issue with this definition of “interpretability” is that the difference between \(X^\prime\) and \(X\) required to change the model prediction might be so small as to be uninterpretable to the human eye, in which case we need a more sophisticated approach.
That being said, we can now cast the search for \(X^\prime\) as a simple optimization problem with the following loss:
\[L(X^\prime|X) = L_{\text{pred}}(X^\prime) + \lambda L_{\text{dist}}(X^\prime, X),\]
where the first loss term \(L_{\text{pred}}\) guides the search towards points \(X^\prime\) which would change the model prediction and the second term \(\lambda L_{\text{dist}}\) ensures that \(X^\prime\) is close to \(X\). This form of loss has a single hyperparameter \(\lambda\) weighing the contributions of the two competing terms.
The specific loss in our implementation is as follows:
\[L(X^\prime|X) = (f_t(X^\prime) - p_t)^2 + \lambda L_1(X^\prime, X).\]
Here \(t\) is the desired target class for \(X^\prime\), which can either be specified in advance or left up to the optimization algorithm to find, \(p_t\) is the target probability of this class (typically \(p_t=1\)), \(f_t\) is the model prediction on class \(t\) and \(L_1\) is the distance between the proposed counterfactual instance \(X^\prime\) and the instance to be explained \(X\). The use of the \(L_1\) distance should ensure that \(X^\prime\) is a sparse counterfactual, minimizing the number of features to be changed in order to change the prediction.
The optimal value of the hyperparameter \(\lambda\) will vary from dataset to dataset, and even within a dataset for each instance to be explained and each desired target class. As such it is difficult to set in advance, so we learn it as part of the optimization algorithm, i.e. we want to optimize
\[\min_{X^\prime} \lambda L_{\text{dist}}(X^\prime, X)\]
subject to
\[|f_t(X^\prime) - p_t| \leq \epsilon,\]
where \(\epsilon\) is a tolerance parameter. In practice this is done in two steps. On the first pass we sweep a broad range of \(\lambda\), e.g. \(\lambda\in(10^{-1},\dots,10^{-10})\), to find lower and upper bounds \(\lambda_{\text{lb}}, \lambda_{\text{ub}}\) where counterfactuals exist. Then we use bisection to find the maximum \(\lambda\in[\lambda_{\text{lb}}, \lambda_{\text{ub}}]\) such that the counterfactual constraint still holds. The result is a set of counterfactual instances \(X^\prime\) with varying distance from the test instance \(X\).
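The two-step search over \(\lambda\) described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `counterfactual_exists` is a hypothetical stand-in for running the inner optimization at a fixed \(\lambda\) and checking whether the counterfactual constraint holds.

```python
import numpy as np

def find_max_lambda(counterfactual_exists, lam_lb, lam_ub, n_steps=10):
    """Bisect in log-space for the largest lambda that still yields a counterfactual."""
    for _ in range(n_steps):
        lam = np.sqrt(lam_lb * lam_ub)  # geometric midpoint between the bounds
        if counterfactual_exists(lam):
            lam_lb = lam  # constraint still satisfied, push lambda up
        else:
            lam_ub = lam  # constraint violated, pull lambda down
    return lam_lb

# toy check: pretend counterfactuals exist only for lambda <= 2.0
best = find_max_lambda(lambda lam: lam <= 2.0, 1e-1, 1e2)
```

Geometric bisection is used because \(\lambda\) is swept over orders of magnitude; after 10 steps the bound is within a fraction of a percent of the true threshold.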
Usage¶
Initialization¶
The counterfactual (CF) explainer method works on fully black-box models, meaning it can work with arbitrary functions that take arrays and return arrays. However, if the user has access to a full TensorFlow (TF) or Keras model, this can be passed in as well to take advantage of the automatic differentiation in TF to speed up the search. This section describes the initialization for a TF/Keras model; for fully black-box models refer to numerical gradients.
First we load the TF/Keras model:
model = load_model('my_model.h5')
Then we can initialize the counterfactual object:
shape = (1,) + x_train.shape[1:]
cf = CounterFactual(model, shape, distance_fn='l1', target_proba=1.0,
                    target_class='other', max_iter=1000, early_stop=50, lam_init=1e-1,
                    max_lam_steps=10, tol=0.05, learning_rate_init=0.1,
                    feature_range=(-1e10, 1e10), eps=0.01, init='identity',
                    decay=True, write_dir=None, debug=False)
Besides passing the model, we set a number of hyperparameters …
… general:
shape: shape of the instance to be explained, starting with the batch dimension. Currently only single explanations are supported, so the batch dimension should be equal to 1.
feature_range: global or feature-wise min and max values for the perturbed instance.
write_dir: write directory for TensorBoard logging of the loss terms. It can be helpful when tuning the hyperparameters for your use case. It makes it easy to verify that e.g. no single loss term dominates the optimization, that the number of iterations is OK, etc. You can access TensorBoard by running tensorboard --logdir {write_dir} in the terminal.
debug: flag to enable/disable writing to TensorBoard.
… related to the optimizer:
max_iter: number of loss optimization steps for each value of \(\lambda\), the multiplier of the distance loss term.
learning_rate_init: initial learning rate, follows linear decay.
decay: flag to disable learning rate decay if desired.
early_stop: early stopping criterion for the search. If no counterfactuals are found for this many steps, or if this many counterfactuals are found in a row, we change \(\lambda\) accordingly and continue the search.
init: how to initialize the search; currently only "identity" is supported, meaning the search starts from the original instance.
… related to the objective function:
distance_fn: distance function between the test instance \(X\) and the proposed counterfactual \(X^\prime\); currently only "l1" is supported.
target_proba: desired target probability for the returned counterfactual instance. Defaults to 1.0, but it could be useful to reduce it to allow a looser definition of a counterfactual instance.
tol: the tolerance within target_proba; this works in tandem with target_proba to specify a range of acceptable predicted probability values for the counterfactual.
target_class: desired target class for the returned counterfactual instance. Can be either an integer denoting the specific class membership or the string "other", which will find a counterfactual instance whose predicted class is anything other than the class of the test instance.
lam_init: initial value of the hyperparameter \(\lambda\). This is set to a high value \(\lambda=10^{-1}\) and annealed during the search to find good bounds for \(\lambda\); for most applications the default should be fine.
max_lam_steps: the number of steps (outer loops) to search over with a different value of \(\lambda\).
While the default values for the loss term coefficients worked well for the simple examples provided in the notebooks, it is recommended to test their robustness for your own applications.
Fit¶
The method is purely unsupervised so no fit method is necessary.
Explanation¶
We can now explain the instance \(X\):
explanation = cf.explain(X)
The explain
method returns a dictionary with the following key: value pairs:
cf: dictionary containing the counterfactual instance found with the smallest distance to the test instance, it has the following keys:
X: the counterfactual instance
distance: distance to the original instance
lambda: value of \(\lambda\) corresponding to the counterfactual
index: the step in the search procedure when the counterfactual was found
class: predicted class of the counterfactual
proba: predicted class probabilities of the counterfactual
loss: counterfactual loss
orig_class: predicted class of original instance
orig_proba: predicted class probabilities of the original instance
all: dictionary of all instances encountered during the search that satisfy the counterfactual constraint but have higher distance to the original instance than the returned counterfactual. This is organized by levels of \(\lambda\), i.e.
explanation['all'][0]
will be a list of dictionaries corresponding to instances satisfying the counterfactual condition found in the first iteration over \(\lambda\) during bisection.
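To illustrate the structure above, here is a minimal sketch of navigating the returned dictionary. The `explanation` below is a mocked stand-in with the documented keys and made-up values, not output from a real search:

```python
# mocked explanation dictionary with the keys documented above (values are illustrative)
explanation = {
    'cf': {'X': [[5.1, 3.5, 1.4, 0.2]], 'distance': 0.42, 'lambda': 0.1,
           'index': 17, 'class': 1, 'proba': [[0.1, 0.8, 0.1]], 'loss': 0.05},
    'orig_class': 0,
    'orig_proba': [[0.9, 0.05, 0.05]],
    'all': {0: [], 1: []},
}

# the counterfactual flips the prediction away from the original class
flipped = explanation['cf']['class'] != explanation['orig_class']

# candidates from the first lambda iteration during bisection
first_iteration_candidates = explanation['all'][0]
```

In practice the same key accesses apply to the dictionary returned by `cf.explain(X)`.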
Numerical Gradients¶
So far, the whole optimization problem could be defined within the TF graph, making automatic differentiation possible. It is however possible that we do not have access to the model architecture and weights, and are only provided with a predict
function returning probabilities for each class. The counterfactual can then be initialized in the same way as before, but using a prediction function:
# define model
cnn = load_model('mnist_cnn.h5')
predict_fn = lambda x: cnn.predict(x)

# initialize explainer
shape = (1,) + x_train.shape[1:]
cf = CounterFactual(predict_fn, shape, distance_fn='l1', target_proba=1.0,
                    target_class='other', max_iter=1000, early_stop=50, lam_init=1e-1,
                    max_lam_steps=10, tol=0.05, learning_rate_init=0.1,
                    feature_range=(-1e10, 1e10), eps=0.01, init='identity')
In this case, we need to evaluate the gradients of the loss function with respect to the input features \(X\) numerically:
\[\frac{\delta L_{\text{pred}}}{\delta X} = \frac{\delta L_{\text{pred}}}{\delta p} \frac{\delta p}{\delta X},\]
where \(L_\text{pred}\) is the predict function loss term, \(p\) the predict function and \(X\) the input features to optimize. There is now an additional hyperparameter to consider:
eps: a float or an array of floats to define the perturbation size used to compute the numerical gradients of \(^{\delta p}/_{\delta X}\). If a single float, the same perturbation size is used for all features; if the array dimension is (1 x nb of features), then a separate perturbation value can be used for each feature. For the Iris dataset, eps could look as follows:
eps = np.array([[1e-2, 1e-2, 1e-2, 1e-2]])  # 4 features, equivalent to eps=1e-2
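The role of eps can be illustrated with a central-difference gradient, which is the standard way to approximate \(^{\delta p}/_{\delta X}\) numerically. This is a hedged sketch with a toy `predict_fn`, not the library's internal routine:

```python
import numpy as np

def numerical_gradient(predict_fn, x, eps):
    """Central-difference gradient of predict_fn w.r.t. each feature of x (shape 1 x n)."""
    grads = np.zeros_like(x)
    for i in range(x.shape[1]):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[0, i] += eps    # perturb feature i up by eps
        x_minus[0, i] -= eps   # perturb feature i down by eps
        grads[0, i] = (predict_fn(x_plus) - predict_fn(x_minus)) / (2 * eps)
    return grads

predict_fn = lambda x: (x ** 2).sum()   # toy "model"; its true gradient is 2x
x = np.array([[1.0, 2.0, 3.0, 4.0]])
g = numerical_gradient(predict_fn, x, eps=1e-2)
```

Each feature requires two prediction calls per gradient evaluation, which is why these calls dominate the cost for high-dimensional inputs.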
Examples¶
This page was generated from doc/source/methods/CFProto.ipynb.
Counterfactuals Guided by Prototypes¶
Overview¶
This method is based on the Interpretable Counterfactual Explanations Guided by Prototypes paper which proposes a fast, model-agnostic method to find interpretable counterfactual explanations for classifier predictions by using class prototypes.
Humans often think about how they can alter the outcome of a situation. What do I need to change for the bank to approve my loan? is a common example. This form of counterfactual reasoning comes naturally to us and explains how to arrive at a desired outcome in an interpretable manner. Moreover, examples of counterfactual instances resulting in a different outcome can give powerful insights into what is important to the underlying decision process. This makes it a compelling method to explain predictions of machine learning models. In the context of predictive models, a counterfactual instance describes the necessary change in input features of a test instance that alters the prediction to a predefined output (e.g. a prediction class). The counterfactual is found by iteratively perturbing the input features of the test instance during an optimization process until the desired output is achieved.
A high quality counterfactual instance \(x_{cf}\) should have the following desirable properties:
The model prediction on \(x_{cf}\) needs to be close to the predefined output.
The perturbation \(\delta\) changing the original instance \(x_{0}\) into \(x_{cf} = x_{0} + \delta\) should be sparse.
The counterfactual \(x_{cf}\) should be interpretable. This implies that \(x_{cf}\) needs to lie close to both the overall and counterfactual class specific data distribution.
The counterfactual \(x_{cf}\) needs to be found fast enough so it can be used in a real life setting.
We can obtain those properties by incorporating additional loss terms in the objective function that is optimized using gradient descent. A basic loss function for a counterfactual can look like this:
\(Loss = c L_{pred} + \beta L_{1} + L_{2}\)
The first loss term, \(c L_{pred}\), encourages the perturbed instance to be predicted as another class than the original instance. The \(\beta L_{1} + L_{2}\) terms act as an elastic net regularizer and introduce sparsity by penalizing the size of the difference between the counterfactual and the original instance. While we can obtain sparse counterfactuals using this objective function, these are often not very interpretable because the training data distribution is not taken into account, and the perturbations are not necessarily meaningful.
The Contrastive Explanation Method (CEM) uses an autoencoder which is trained to reconstruct instances of the training set. We can then add the \(L_{2}\) reconstruction error of the perturbed instance as a loss term to keep the counterfactual close to the training data distribution. The loss function becomes:
\(Loss = c L_{pred} + \beta L_{1} + L_{2} + L_{AE}\)
The \(L_{AE}\) term does however not necessarily lead to interpretable solutions or speed up the counterfactual search. The lack of interpretability occurs because the overall data distribution is followed, but not the class specific one. That’s where the prototype loss term \(L_{proto}\) comes in. To define the prototype for each prediction class, we can use the encoder part of the previously mentioned autoencoder. We also need the training data, or at least a representative sample. We use the model to make predictions on this data set. For each predicted class, we encode the instances belonging to that class. The class prototype is simply the average of the k closest encodings in that class to the encoding of the instance that we want to explain. When we want to generate a counterfactual, we first find the nearest prototype other than the one for the predicted class on the original instance. The \(L_{proto}\) loss term tries to minimize the \(L_{2}\) distance between the counterfactual and the nearest prototype. As a result, the perturbations are guided to the closest prototype, speeding up the counterfactual search and making the perturbations more meaningful as they move towards a typical in-distribution instance. If we do not have a trained encoder available, we can build class representations using kd trees for each class. The prototype is then the k-nearest instance from a kd tree other than the tree which represents the predicted class on the original instance. The loss function now looks as follows:
\(Loss = c L_{pred} + \beta L_{1} + L_{2} + L_{AE} + L_{proto}\)
The method allows us to select specific prototype classes to guide the counterfactual to. For example, in MNIST the closest prototype to a 9 could be a 4. However, we can specify that we want to move towards the 7 prototype and avoid 4.
In order to help interpretability, we can also add a trust score constraint on the proposed counterfactual. The trust score is defined as the ratio of the distance between the encoded counterfactual and the prototype of the class predicted on the original instance, and the distance between the encoded counterfactual and the prototype of the class predicted for the counterfactual instance. Intuitively, a high trust score implies that the counterfactual is far from the originally predicted class compared to the counterfactual class. For more info on trust scores, please check out the documentation.
Because of the \(L_{proto}\) term, we can actually remove the prediction loss term and still obtain an interpretable counterfactual. This is especially relevant for fully black box models. When we provide the counterfactual search method with a Keras or TensorFlow model, it is incorporated in the TensorFlow graph and evaluated using automatic differentiation. However, if we only have access to the model’s prediction function, the gradient updates are numerical and typically require a large number of prediction calls because of \(L_{pred}\). These prediction calls can slow the search down significantly and become a bottleneck. We can represent the gradient of the loss term as follows:
\[\frac{\delta L_{pred}}{\delta x} = \frac{\delta L_{pred}}{\delta p} \frac{\delta p}{\delta x},\]
where \(p\) is the prediction function and \(x\) the input features to optimize. For a 28 by 28 MNIST image, the \(^{\delta p}/_{\delta x}\) term alone would require a prediction call with batch size 28x28x2 = 1568. By using the prototypes to guide the search however, we can remove the prediction loss term and only make a single prediction at the end of each gradient update to check whether the predicted class on the proposed counterfactual is different from the original class.
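The class prototype definition used above (the average of the k closest encodings in a class to the encoding of the instance being explained) can be sketched in a few lines. The encodings below are made up for illustration; in the library this would operate on the encoder's output:

```python
import numpy as np

def class_prototype(encodings, instance_enc, k):
    """Mean of the k encodings in a class closest to the encoded instance."""
    dists = np.linalg.norm(encodings - instance_enc, axis=1)  # L2 distance to each encoding
    nearest = encodings[np.argsort(dists)[:k]]                # k closest encodings
    return nearest.mean(axis=0)

# toy 2-d encodings for one class and an encoded instance to explain
class_encodings = np.array([[0., 0.], [1., 0.], [10., 10.]])
instance_enc = np.array([0.5, 0.])
proto = class_prototype(class_encodings, instance_enc, k=2)  # mean of the 2 nearest
```

Repeating this per predicted class and picking the nearest prototype of a different class gives the target that \(L_{proto}\) pulls the counterfactual towards.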
Categorical Variables¶
It is crucial for many machine learning applications to deal with both continuous numerical and categorical data. Explanation methods which rely on perturbations or sampling of the input features need to make sure those perturbations are meaningful and capture the underlying structure of the data. If not done properly, the perturbed or sampled instances are possibly out of distribution compared to the training data and result in misleading explanations. The perturbation or sampling process becomes tricky for categorical features. For instance, random perturbations across possible categories, or enforcing a ranking between categories based on frequency of occurrence in the training data, do not capture this structure.
Our method first computes the pairwise distances between categories of a categorical variable based on either the model predictions (MVDM) or the context provided by the other variables in the dataset (ABDM). For MVDM, we use the difference between the conditional model prediction probabilities of each category. This method is based on the Modified Value Difference Metric (MVDM) by Cost et al (1993). ABDM stands for Association-Based Distance Metric, a categorical distance measure introduced by Le et al (2005). ABDM infers context from the presence of other variables in the data and computes a dissimilarity measure based on the Kullback-Leibler divergence. Both methods can also be combined as ABDM-MVDM. We can then apply multidimensional scaling to project the pairwise distances into Euclidean space. More details will be provided in a forthcoming paper.
The different use cases are highlighted in the example notebooks linked at the bottom of the page.
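The MVDM idea above can be illustrated with a toy computation: the distance between two categories is derived from the difference between their conditional model prediction probabilities. The probability table below is made up for illustration, and a plain L1 difference is used as the dissimilarity:

```python
import numpy as np

# made-up conditional probabilities p(class | category) for a 3-category variable
# and a 2-class model
cond_proba = np.array([[0.9, 0.1],   # category 0
                       [0.8, 0.2],   # category 1
                       [0.1, 0.9]])  # category 2

# pairwise L1 distances between the category probability vectors:
# categories that lead to similar predictions end up close together
pairwise = np.abs(cond_proba[:, None, :] - cond_proba[None, :, :]).sum(axis=-1)
```

Categories 0 and 1 produce similar predictions and are therefore close, while category 2 is far from both; a projection such as multidimensional scaling then maps these distances into a numerical feature space.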
Usage¶
Initialization¶
The counterfactuals guided by prototypes method works on fully black-box models. This means that it can work with arbitrary functions that take arrays and return arrays. However, if the user has access to a full TensorFlow (TF) or Keras model, this can be passed in as well to take advantage of the automatic differentiation in TF to speed up the search. This section describes the initialization for a TF/Keras model. Please see the numerical gradients section for black-box models.
We first load our MNIST classifier and the (optional) autoencoder and encoder:
cnn = load_model('mnist_cnn.h5')
ae = load_model('mnist_ae.h5')
enc = load_model('mnist_enc.h5')
We can now initialize the counterfactual:
shape = (1,) + x_train.shape[1:]
cf = CounterFactualProto(cnn, shape, kappa=0., beta=.1, gamma=100., theta=100.,
                         ae_model=ae, enc_model=enc, max_iterations=500,
                         feature_range=(-.5, .5), c_init=1., c_steps=5,
                         learning_rate_init=1e-2, clip=(-1000., 1000.), write_dir='./cf')
Besides passing the predictive model and the (optional) autoencoder and encoder models, we set a number of hyperparameters …
… general:
shape: shape of the instance to be explained, starting with the batch dimension. Currently only single explanations are supported, so the batch dimension should be equal to 1.
feature_range: global or feature-wise min and max values for the perturbed instance.
write_dir: write directory for TensorBoard logging of the loss terms. It can be helpful when tuning the hyperparameters for your use case. It makes it easy to verify that e.g. no single loss term dominates the optimization, that the number of iterations is OK, etc. You can access TensorBoard by running tensorboard --logdir {write_dir} in the terminal. The figure below for example shows the loss to be optimized over different \(c\) iterations. It is clear that within each iteration, the number of max_iterations steps is too high and we can speed up the search.
… related to the optimizer:
max_iterations: number of loss optimization steps for each value of \(c\), the multiplier of the first loss term.
learning_rate_init: initial learning rate, follows polynomial decay.
clip: min and max gradient values.
… related to the objective function:
c_init and c_steps: the multiplier \(c\) of the first loss term is updated for c_steps iterations, starting at c_init. The first loss term encourages the perturbed instance to be predicted as a different class than the original instance. If we find a candidate counterfactual for the current value of \(c\), we reduce the value of \(c\) for the next optimization cycle to put more emphasis on the other loss terms and improve the solution. If we cannot find a solution, \(c\) is increased to put more weight on the prediction class restrictions of the counterfactual.
kappa: the first term in the loss function is defined by a difference between the predicted probabilities for the perturbed instance of the original class and the max of the other classes. \(\kappa \geq 0\) defines a cap for this difference, limiting its impact on the overall loss to be optimized. Similar to CEM, we set \(\kappa\) to 0 in the examples.
beta: \(\beta\) is the \(L_{1}\) loss term multiplier. A higher value for \(\beta\) means more weight on the sparsity restrictions of the perturbations. \(\beta\) equal to 0.1 works well for the example datasets.
gamma: multiplier for the optional \(L_{2}\) reconstruction error. A higher value for \(\gamma\) means more emphasis on the reconstruction error penalty defined by the autoencoder. A value of 100 is reasonable for the examples.
theta: multiplier for the \(L_{proto}\) loss term. A higher \(\theta\) means more emphasis on the gradients guiding the counterfactual towards the nearest class prototype. A value of 100 worked well for the examples.
When the dataset contains categorical variables, we need to additionally pass the following arguments:
cat_vars: if the categorical variables have ordinal encodings, this is a dictionary with as keys the categorical columns and as values the number of categories for each categorical variable in the dataset. If one-hot encoding is applied to the data, then the keys of the cat_vars dictionary represent the column where each categorical variable starts while the values still return the number of categories.
ohe: a flag (True or False) indicating whether the categories are one-hot encoded.
It is also important to remember that the perturbations are applied in the numerical feature space, after the categorical variables have been transformed into numerical features. This has to be reflected by the dimension and values of feature_range. Imagine for example that we have a dataset with 10 columns. Two of the features are categorical and one-hot encoded. They can both take 3 values each. As a result, the number of columns in the dataset is reduced to 6 when we transform those categorical features to numerical features, so feature_range needs to contain the upper and lower ranges for 6 features.
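The column bookkeeping in the example above can be made explicit. The names and ranges below are purely illustrative; the point is that feature_range must be sized for the transformed (numerical) feature space:

```python
import numpy as np

n_numerical = 4                # numerical columns in the original dataset
n_categories = [3, 3]          # two one-hot encoded categorical variables, 3 values each

n_onehot_cols = n_numerical + sum(n_categories)   # columns before the transform
n_features = n_numerical + len(n_categories)      # columns after the transform

# feature_range must then cover 6 features, e.g. everything in [-1, 1]
feature_range = (np.full((1, n_features), -1.), np.full((1, n_features), 1.))
```

With 4 numerical columns plus two 3-column one-hot blocks, the original dataset has 10 columns but only 6 numerical features after the transformation.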
While the default values for the loss term coefficients worked well for the simple examples provided in the notebooks, it is recommended to test their robustness for your own applications.
Fit¶
If we use an encoder to find the class prototypes, we need an additional fit
step on the training data:
cf.fit(x_train)
We also need the fit
step if the data contains categorical features so we can compute the numerical transformations. In practice, most of these optional arguments do not need to be changed.
cf.fit(x_train, d_type='abdm', w=None, disc_perc=[25, 50, 75], standardize_cat_vars=False,
smooth=1., center=True, update_feature_range=True)
d_type: the distance metric used to compute the pairwise distances between the categories of each categorical variable. As discussed in the introduction, the options are "abdm", "mvdm" or "abdm-mvdm".
w: if the combined metric "abdm-mvdm" is used, w is the weight (between 0 and 1) given to abdm.
disc_perc: for abdm, we infer context from the other features. If there are continuous numerical features present, these are binned according to the quartiles in disc_perc before computing the similarity metric.
standardize_cat_vars: whether to return the standardized values for the numerical distances of each categorical feature.
smooth: if the difference in the distances between the categorical variables is too large, then a lower value of the smooth argument (0, 1) can smoothen out this difference. This would only be relevant if one categorical variable has significantly larger differences between its categories than others; as a result, the counterfactual search process would likely leave that categorical variable unchanged.
center: whether to center the numerical distances of the categorical variables between the min and max feature ranges.
update_feature_range: whether to update the feature_range parameter for the categorical variables based on the min and max values computed in the fit step.
Explanation¶
We can now explain the instance:
explanation = cf.explain(X, Y=None, target_class=None, k=20, k_type='mean',
threshold=0., verbose=True, print_every=100, log_every=100)
X: original instance.
Y: one-hot encoding of the class label for X; inferred from the prediction on X if None.
target_class: classes considered for the nearest class prototype. Either a list with class indices or None.
k: number of nearest instances used to define the prototype for a class. Defaults to using all instances belonging to the class.
k_type: use either the average encoding of the k nearest instances in a class as the class prototype (k_type='mean') or the k-nearest encoding in the class (k_type='point'). This parameter is only relevant if an encoder is used to define the prototypes.
threshold: threshold level for the ratio between the distance of the counterfactual to the prototype of the predicted class for the original instance over the distance to the prototype of the predicted class for the counterfactual. If the trust score is below the threshold, the proposed counterfactual does not meet the requirements and is rejected.
verbose: if True, print progress of the counterfactual search every print_every steps.
log_every: if write_dir for TensorBoard is specified, then log losses every log_every steps.
The explain
method returns a dictionary with the following key: value pairs:
cf: a dictionary with the overall best counterfactual found; explanation['cf'] has the following key: value pairs:
X: the counterfactual instance
class: predicted class for the counterfactual
proba: predicted class probabilities for the counterfactual
grads_graph: gradient values computed from the TF graph with respect to the input features at the counterfactual
grads_num: numerical gradient values with respect to the input features at the counterfactual
orig_class: predicted class for original instance
orig_proba: predicted class probabilities for original instance
all: a dictionary with the iterations as keys and, for each iteration, a list of the counterfactuals found in that iteration as values. For instance, explanation['all'][0] contains the counterfactuals found during the first iteration. Initially we typically find fairly noisy counterfactuals that improve over the course of an iteration. The counterfactuals in subsequent iterations then need to be better (sparser) than the previous best counterfactual, so over the next few iterations we tend to find fewer but better solutions.
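A minimal sketch of scanning this nested structure, e.g. to rank candidates by the sparsity of their perturbation. The `all_cfs` dictionary below mimics the shape of explanation['all'] with mocked values, not real search output:

```python
import numpy as np

# mocked stand-in for explanation['all']: iteration index -> list of candidates
all_cfs = {
    0: [{'X': np.array([[0., 1., 0.3]])}, {'X': np.array([[0., 1., 0.]])}],
    1: [{'X': np.array([[0., 0.5, 0.]])}],
}
x_orig = np.zeros((1, 3))   # mocked original instance

# flatten the per-iteration lists and rank candidates by how many features changed
candidates = [cf for cfs in all_cfs.values() for cf in cfs]
sparsity = [int(np.count_nonzero(cf['X'] - x_orig)) for cf in candidates]
sparsest = candidates[int(np.argmin(sparsity))]
```

The same pattern works on the real output, where each candidate dictionary also carries its class, probabilities and loss.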
Numerical Gradients¶
So far, the whole optimization problem could be defined within the TF graph, making automatic differentiation possible. It is however possible that we do not have access to the model architecture and weights, and are only provided with a predict
function returning probabilities for each class. The counterfactual can then be initialized in the same way:
# define model
cnn = load_model('mnist_cnn.h5')
predict_fn = lambda x: cnn.predict(x)
ae = load_model('mnist_ae.h5')
enc = load_model('mnist_enc.h5')

# initialize explainer
shape = (1,) + x_train.shape[1:]
cf = CounterFactualProto(predict_fn, shape, gamma=100., theta=100.,
                         ae_model=ae, enc_model=enc, max_iterations=500,
                         feature_range=(-.5, .5), c_init=1., c_steps=4,
                         eps=(1e-2, 1e-2), update_num_grad=100)
In this case, we need to evaluate the gradients of the loss function with respect to the input features numerically:
\[\frac{\delta L_{pred}}{\delta x} = \frac{\delta L_{pred}}{\delta p} \frac{\delta p}{\delta x},\]
where \(L_{pred}\) is the loss term related to the prediction function, \(p\) is the prediction function and \(x\) are the input features to optimize. There are now 2 additional hyperparameters to consider:
eps: a tuple to define the perturbation sizes used to compute the numerical gradients. eps[0] and eps[1] are used respectively for \(^{\delta L_{pred}}/_{\delta p}\) and \(^{\delta p}/_{\delta x}\). eps[0] and eps[1] can be a combination of float values or numpy arrays. For eps[0], the array dimension should be (1 x nb of prediction categories) and for eps[1] it should be (1 x nb of features). For the Iris dataset, eps could look as follows:
eps0 = np.array([[1e-2, 1e-2, 1e-2]])  # 3 prediction categories, equivalent to eps0=1e-2
eps1 = np.array([[1e-2, 1e-2, 1e-2, 1e-2]])  # 4 features, equivalent to eps1=1e-2
eps = (eps0, eps1)
update_num_grad: for complex models with a high number of parameters and a high dimensional feature space (e.g. Inception on ImageNet), evaluating numerical gradients can be expensive as they involve prediction calls for each perturbed instance. The update_num_grad parameter allows you to set a batch size on which to evaluate the numerical gradients, reducing the number of prediction calls required.
We can also remove the prediction loss term by setting c_init to 0 and running only 1 c_steps iteration, and still obtain an interpretable counterfactual. This dramatically speeds up the counterfactual search (e.g. by 100x in the MNIST example notebook):
cf = CounterFactualProto(predict_fn, shape, gamma=100., theta=100.,
                         ae_model=ae, enc_model=enc, max_iterations=500,
                         feature_range=(-.5, .5), c_init=0., c_steps=1)
kd trees¶
So far, we assumed that we have a trained encoder available to find the nearest class prototype. This is however not a hard requirement. As mentioned in the Overview section, we can use kd trees to build the class representations, find prototypes by querying the trees for each class, and return the k-nearest class instance as the closest prototype. We can run the counterfactual as follows:
cf = CounterFactualProto(cnn, shape, use_kdtree=True, theta=10., feature_range=(-.5, .5))
cf.fit(x_train, trustscore_kwargs=None)
explanation = cf.explain(X, k=2)
trustscore_kwargs
: keyword arguments for the trust score object used to define the kd trees for each class. Please check the trust scores documentation for more info.
This page was generated from doc/source/methods/KernelSHAP.ipynb.
Kernel SHAP¶
Overview¶
The kernel SHAP (SHapley Additive exPlanations) algorithm is based on the paper A Unified Approach to Interpreting Model Predictions by Lundberg et al. and builds on the open source shap library from the paper’s first author.
The algorithm provides model-agnostic (black box), human-interpretable explanations suitable for regression and classification models applied to tabular data. This method is a member of the additive feature attribution methods class; feature attribution refers to the fact that the change of an outcome to be explained (e.g., a class probability in a classification problem) with respect to a baseline (e.g., average prediction probability for that class in the training set) can be attributed in different proportions to the model input features.
A simple illustration of the explanation process is shown in Figure 1. Here we see depicted a model which takes as an input features such as Age
, BMI
or Sex
and outputs a continuous value. We know that the average value of that output in a dataset of interest is 0.1; this is the base rate. Using the kernel SHAP algorithm, we attribute the 0.3 difference between the model output for this instance and the base rate to the input features. Because the sum of the attribution values equals output - base rate, this method is additive. We can see for example
that the Sex
feature contributes negatively to this prediction whereas the remainder of the features have a positive contribution. For explaining this particular data point, the BMI
feature seems to be the most important. See our examples on how to perform explanations with this algorithm and visualise the results using the shap
library visualisations here, here and
here.
Figure 1: Cartoon illustration of black-box explanation models with SHAP
Image Credit: Scott Lundberg (see source here)
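The additivity property in Figure 1 can be checked with a few lines of arithmetic. The individual attribution values below are made up for illustration (the text only fixes the 0.1 base rate and the 0.3 difference):

```python
# Hypothetical attributions for the cartoon example: base rate 0.1, and a
# model output 0.3 above it. The per-feature values are illustrative only.
base_rate = 0.1
output = 0.4
phi = {'Age': 0.15, 'BMI': 0.2, 'Sex': -0.05}  # made-up shap values

# Additivity: the attributions sum exactly to output - base rate
assert abs(sum(phi.values()) - (output - base_rate)) < 1e-12
```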
How it works¶
Consider a model \(f\) that takes as an input \(M\) features and an explanation model \(g\). Assume that we want to explain the output of the model \(f\) when applied to an input \(x\). In reality, we may explain the change of the model output, \(f(x)\), with respect to the average output over a set of records, \(\mathcal{D}\). Since the explanation model is additive, we can write
\[f(x) - \mathbb{E}_{\mathcal{D}}[f(y)] = \sum \limits_{i=1}^{M} \phi_i,\]
where \(\mathcal{D}\) is also known as a background dataset and \(\phi_i\) is the portion of the change attributed to the \(i\)th feature. This portion is also known as feature importance, effect or simply shap value.
One can conceptually imagine the estimation process for the shap value of the \(i^{th}\) feature \(x_i\) as consisting of the following steps:
enumerate all subsets \(S\) of the set \(F \setminus \{i\}\), where \(F = \{1, ..., M\}\) is the set of all features
for each \(S \subseteq F \setminus \{i\}\), compute the contribution of feature \(i\) as \(C(i|S) = f(S \cup \{i\}) - f(S)\)
compute the shap value according to
\[\phi_i := \frac{1}{M} \sum \limits_{{S \subseteq F \setminus \{i\}}} \frac{1}{\binom{M-1}{|S|}} C(i|S).\]
The semantics of \(f(S)\) in the above is to compute \(f\) by treating \(\bar{S}\) as missing inputs. Thus, we can imagine the process of computing the SHAP explanation as starting with \(S\) that does not contain our feature, adding feature \(i\) and then observing the difference in the function value. For a nonlinear function the value obtained will depend on which features are already in \(S\), so we average the contribution over all possible ways to choose a subset of size \(|S|\) and over all subset sizes. The issues with this method are that:
the summation contains \(2^M\) terms, so the algorithm complexity is \(O(M2^M)\)
since most models cannot accept an arbitrary pattern of missing inputs at inference time, calculating \(f(S)\) would involve retraining the model an exponential number of times
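The exact enumeration in the steps above can be sketched as a brute-force implementation, feasible only for small \(M\). Here value_fn is a hypothetical stand-in for \(f(S)\), mapping a set of feature indices to a model output:

```python
from itertools import combinations
from math import comb

def shapley_value(value_fn, M, i):
    """Exact Shapley value of feature i for a set-value function value_fn.

    value_fn maps a frozenset of feature indices to a model output f(S).
    Complexity is O(2^M), so this is only feasible for small M.
    """
    rest = [j for j in range(M) if j != i]
    phi = 0.0
    for size in range(M):
        for S in combinations(rest, size):
            S = frozenset(S)
            # weighted contribution C(i|S) = f(S + {i}) - f(S)
            phi += (value_fn(S | {i}) - value_fn(S)) / (M * comb(M - 1, size))
    return phi

# Toy additive value function: f(S) = sum of included feature values.
# For such an f, the Shapley value of feature i is exactly x[i].
x = [1.0, 2.0, 3.0]
f = lambda S: sum(x[j] for j in S)
```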
To overcome this issue, the following approximations are made:
the missing features are simulated by replacing them with values from the background dataset
the feature attributions are estimated instead by solving the weighted least-squares problem
\[\phi = \underset{\phi_0, ..., \phi_M}{\operatorname{arg\,min}} \sum \limits_{S \subseteq F} \left[ f(S) - \left( \phi_0 + \sum \limits_{i \in S} \phi_i \right) \right]^2 \pi_x(S),\]
where
\[\pi_x(S) = \frac{M - 1}{\binom{M}{|S|} |S| (M - |S|)}\]
is the Shapley kernel (Figure 2).
Figure 2: Shapley Kernel
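The kernel weighting can be sketched as a small helper function (a hypothetical name, mirroring the Shapley kernel shown in Figure 2, not the library implementation):

```python
from math import comb

def shapley_kernel_weight(M, s):
    """Shapley kernel weight for a subset of size s out of M features.

    The weight diverges for s == 0 and s == M (fully excluded/included
    subsets), which are handled as constraints in practice.
    """
    if s == 0 or s == M:
        return float('inf')
    return (M - 1) / (comb(M, s) * s * (M - s))

# The kernel favours small and large subset sizes:
weights = [shapley_kernel_weight(5, s) for s in range(1, 5)]
```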
Note that the optimisation objective above implies an exponential number of terms. In practice, one considers a finite number of samples n, selecting n subsets \(S_1, ..., S_n\) according to the probability distribution induced by the kernel weights. We can see that the kernel favours either small or large subset sizes, since most of the information about the effect of a particular feature on an outcome change can be obtained by excluding that feature, or excluding all the features except for it, from the input set.
Therefore, KernelShap returns an approximation of the true Shapley values, whose variability depends on factors such as the size and structure of the background dataset used to estimate the feature attributions and the number of subsets of missing features sampled. Whenever possible, algorithms specialised for specific model structures (e.g., Tree SHAP, Linear SHAP, integrated gradients) should be used since they are faster and more accurate.
Comparison to other methods¶
Like LIME, this method provides local explanations, in the sense that the attributions are estimated to explain the change from a baseline for a given data point, \(x\). LIME computes the feature attributions by optimising the following objective in order to obtain a locally accurate explanation model (i.e., one that approximates the model to be explained well around an instance \(x\)):
\[\xi = \underset{g \in \mathcal{G}}{\operatorname{arg\,min}} \; L(f, g, \pi_{x}) + \Omega(g).\]
Here \(f\) is the model to be explained, \(g\) is the explanation model (assumed linear), \(\pi_x\) is a local kernel around instance \(x\) (usually cosine or \(\ell_2\) kernel) and \(\Omega(g)\) penalises explanation model complexity. The choices for \(L, \pi_x\) and \(\Omega\) in LIME are heuristic, which can lead to unintuitive behaviour (see Section 5 of Lundberg et al. for a study). Instead, by computing the shap values according to the weighted regression in the previous section, the feature attributions estimated by kernel SHAP have desirable properties such as local accuracy, consistency and missingness, detailed in Section 3 of Lundberg et al..
Although, in general, local explanations are limited in that it is not clear to what extent a given explanation applies around an instance \(x\) (see the anchors algorithm overview here for a discussion), insights into global model behaviour can be drawn by aggregating the results from local explanations (see the work of Lundberg et al. here). In the future, a distributed version of the kernel SHAP algorithm will be available in order to reduce the runtime requirements necessary for explaining large datasets.
Usage¶
In order to compute the shap values, the following hyperparameters can be set when calling the explain method:
nsamples: determines the number of subsets used for the estimation of the shap values. A default of 2*M + 2**11 is provided, where M is the number of features. One is encouraged to experiment with the number of samples in order to determine a value that balances explanation accuracy and runtime.
l1_reg: can take values 0 or False to disable, auto for automatic regularisation selection, bic or aic to use \(\ell_1\) regularised regression with the Bayes/Akaike information criteria for regularisation parameter selection, num_features(10) to specify the number of feature effects to be returned, or a float value that is used as the regularisation coefficient for the \(\ell_1\) penalised regression. The default option auto uses the least angle regression algorithm with the Akaike Information Criterion if a fraction smaller than 0.2 of the total number of subsets is enumerated.
If the dataset to be explained contains categorical variables, then the following options can be specified unless the categorical variables have been grouped (see example below):
summarise_result: if True, the shap values estimated for the dimensions of an encoded categorical variable are summed and a single shap value is returned for the categorical variable. This requires that both arguments below are specified:
cat_vars_start_idx: a list containing the column indices where categorical variables start. For example, if the feature matrix contains a categorical feature starting at index 0 and one at index 10, then cat_vars_start_idx=[0, 10].
cat_vars_enc_dim: a list containing the dimension of the encoded categorical variables. The number of columns specified in this list is summed for each categorical variable starting with the corresponding index in cat_vars_start_idx. So if cat_vars_start_idx=[0, 10] and cat_vars_enc_dim=[3, 5], then the columns with indices 0, 1 and 2, and 10, 11, 12, 13 and 14, will be combined to return one shap value for each categorical variable, as opposed to 3 and 5.
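The summarisation of encoded dimensions can be sketched in plain numpy, using the example indices from above. This is a simplified stand-in for what summarise_result does, not the library implementation:

```python
import numpy as np

def summarise_shap_values(shap_vals, start_idx, enc_dim):
    """Sum shap values over one-hot encoded dimensions.

    shap_vals: (n_samples, n_encoded_cols) array of shap values
    start_idx: column index where each categorical variable starts
    enc_dim:   number of encoded columns for each categorical variable
    """
    shap_vals = np.atleast_2d(shap_vals)
    out, col, cat = [], 0, 0
    while col < shap_vals.shape[1]:
        if cat < len(start_idx) and col == start_idx[cat]:
            # categorical variable: sum its encoded columns into one value
            out.append(shap_vals[:, col:col + enc_dim[cat]].sum(axis=1))
            col += enc_dim[cat]
            cat += 1
        else:
            # continuous column: keep its shap value as-is
            out.append(shap_vals[:, col])
            col += 1
    return np.stack(out, axis=1)
```

For the example above (categoricals at columns 0 and 10 with 3 and 5 encoded dimensions over 15 columns), 15 shap values collapse to 9.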
Explaining continuous datasets¶
Initialisation and fit¶
The explainer is initialised by specifying:
a predict function.
optionally, setting link='logit' if the model to be explained is a classifier that outputs probabilities. This will apply the logit function to convert outputs to margin space.
optionally, providing a list of feature_names.
Hence, assuming the classifier takes in 4 inputs and returns probabilities of 3 classes, we initialise its explainer as:
from alibi.explainers import KernelShap
predict_fn = lambda x: clf.predict_proba(x)
explainer = KernelShap(predict_fn, link='logit', feature_names=['a','b','c','d'])
To fit our classifier, we simply pass our background or ‘reference’ dataset to the explainer:
explainer.fit(X_reference)
Note that X_reference is expected to have a samples x features layout.
Explanation¶
To explain an instance X, we simply pass it to the explain method:
explanation = explainer.explain(X)
The returned explanation object has the following fields:
explanation.meta:
{'name': 'KernelShap',
'type': 'blackbox',
'explanations': ['local', 'global'],
'params': {'groups': None,
'group_names': None,
'weights': None,
'summarise_background': False
}
}
This field contains metadata such as the explainer name and type as well as the type of explanations this method can generate. In this case, the params attribute shows that none of the fit method optional parameters have been set.
explanation.data:
{'shap_values': [array([ 0.8340445 , 0.12000589, 0.07984099, 0.61758141]),
array([0.71522546, 0.31749045, 0.3146705 , 0.13365639]),
array([0.12984616, 0.47194649, 0.23036243, 0.52314911])],
'expected_value': array([0.74456904, 1.05058744, 1.15837362]),
'link': 'logit',
'feature_names': ['a', 'b', 'c', 'd'],
'categorical_names': {},
'raw': {
'raw_prediction': array([ 2.23635984, 0.83386654, 0.19693058]),
'prediction': array([0]),
'instances': array([ 0.93884707, 0.63216607, 0.4350103 , 0.91969562]),
'importances': {
'0': {'ranked_effect': array([0.8340445 , 0.61758141, 0.12000589, 0.07984099]),
'names': ['a', 'd', 'b', 'c']},
'1': {'ranked_effect': array([0.71522546, 0.31749045, 0.3146705 , 0.13365639]),
'names': ['a', 'b', 'c', 'd']},
'2': {'ranked_effect': array([0.52314911, 0.47194649, 0.23036243, 0.12984616]),
'names': ['d', 'b', 'c', 'a']},
'aggregated': {'ranked_effect': array([1.67911611, 1.27438691, 0.90944283, 0.62487392]),
'names': ['a', 'd', 'b', 'c']}
}
}
}
This field contains:
shap_values: a list of length equal to the number of model outputs, where each entry is an array of dimension samples x features of shap values. For the example above, only one instance with 4 features has been explained, so the shap values for each class are of dimension 1 x 4.
expected_value: a list of the expected value for each model output across X_reference.
link: which function has been applied to the model output prior to computing the expected_value and estimation of the shap_values.
feature_names: a list with the feature names, if provided. Defaults to a list containing strings with the format feature_{} if no names are passed.
categorical_names: a mapping of the categorical variables (represented by indices in the shap_values columns) to the description of the category.
raw: this field contains:
raw_prediction: a samples x n_outputs array of predictions for each instance to be explained. Note that this is calculated by applying the link function specified in link to the output of pred_fn.
prediction: a samples array containing the index of the maximum value in the raw_prediction array.
instances: a samples x n_features array of instances which have been explained.
importances: a dictionary where each entry is a dictionary containing the sorted average magnitude of the shap value (ranked_effect) along with a list of feature names corresponding to the reordered shap values (names). There are n_outputs + 1 keys, corresponding to n_outputs and to the aggregated output (obtained by summing all the arrays in shap_values).
Please see our examples on how to visualise these outputs using the shap library visualisations here, here and here.
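As a sanity check, the aggregated importances in the example output above can be reproduced directly from the shap values (a sketch; the library computes these internally):

```python
import numpy as np

# Shap value magnitudes from the example output above
# (one explained instance, 4 features, 3 model outputs):
shap_values = [
    np.array([[0.8340445 , 0.12000589, 0.07984099, 0.61758141]]),
    np.array([[0.71522546, 0.31749045, 0.3146705 , 0.13365639]]),
    np.array([[0.12984616, 0.47194649, 0.23036243, 0.52314911]]),
]
feature_names = ['a', 'b', 'c', 'd']

# Aggregate: sum the average shap value magnitudes over all model outputs,
# then rank features in descending order of importance.
avg_mag = sum(np.abs(sv).mean(axis=0) for sv in shap_values)
order = np.argsort(avg_mag)[::-1]
ranked_effect = avg_mag[order]
names = [feature_names[j] for j in order]
```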
Explaining heterogeneous (continuous and categorical) datasets¶
When the dataset contains both continuous and categorical variables, categorical_names, an optional mapping from the encoded categorical features to a description of the category, can be passed in addition to the feature_names list. This mapping is currently used for determining what type of summarisation should be applied if X_reference is large and the fit argument summarise_background='auto' or summarise_background=True, but in the future it might be used for annotating visualisations. The definition of the map depends on what method is used to handle the categorical variables.
By grouping categorical data¶
By grouping categorical data we estimate a single shap value for each categorical variable.
Initialisation and Fit¶
Assume that we have a dataset with features such as Marital Status (first column), Age (2nd column), Income (3rd column) and Education (4th column). The 2nd and 3rd columns are continuous variables, whereas the 1st and 4th are categorical ones.
The mapping of categorical variables could be generated from a Pandas dataframe using the utility gen_category_map, imported from alibi.utils.data. For this example the output could look like:
category_map = {
0: ["married", "divorced"],
3: ["high school diploma", "master's degree"],
}
Hence, using the same predict function as before, we initialise the explainer as:
explainer = KernelShap(
predict_fn,
link='logit',
feature_names=["Marital Status", "Age", "Income", "Education"],
categorical_names=category_map,
)
To group our data, we have to provide the groups list, which contains lists with indices that are grouped together. In our case this would be:
groups = [[0, 1], [2], [3], [4, 5]]
Similarly, the group_names are the same as the feature names:
group_names = ["Marital Status", "Age", "Income", "Education"]
Note that, in this case, the keys of the category_map are indices into groups. To fit our explainer, we pass one-hot encoded data to the explainer along with the grouping information.
explainer.fit(
X_reference,
group_names=group_names,
groups=groups,
)
Explanation¶
To perform an explanation, we pass one-hot encoded instances X to the explain method:
explanation = explainer.explain(X)
The explanation returned will contain the grouping information in its meta attribute:
{'name': 'KernelShap',
'type': 'blackbox',
'explanations': ['local', 'global'],
'params': {'groups': [[0, 1], [2], [3], [4, 5]],
'group_names': ["Marital Status", "Age", "Income", "Education"],
'weights': None,
'summarise_background': False
}
}
whereas inspecting the data attribute shows that one shap value is estimated for each of the four groups:
{'shap_values': [array([ 0.8340445 , 0.12000589, 0.07984099, 0.61758141]),
array([0.71522546, 0.31749045, 0.3146705 , 0.13365639]),
array([0.12984616, 0.47194649, 0.23036243, 0.52314911])],
'expected_value': array([0.74456904, 1.05058744, 1.15837362]),
'link': 'logit',
'feature_names': ["Marital Status", "Age", "Income", "Education"],
'categorical_names': {},
'raw': {
'raw_prediction': array([ 2.23635984, 0.83386654, 0.19693058]),
'prediction': array([0]),
'instances': array([ 0.93884707, 0.63216607, 0.4350103 , 0.91969562]),
'importances': {
'0': {'ranked_effect': array([0.8340445 , 0.61758141, 0.12000589, 0.07984099]),
'names': ['Marital Status', 'Education', 'Age', 'Income']},
'1': {'ranked_effect': array([0.71522546, 0.31749045, 0.3146705 , 0.13365639]),
'names': ['Marital Status', 'Age', 'Income', 'Education']},
'2': {'ranked_effect': array([0.52314911, 0.47194649, 0.23036243, 0.12984616]),
'names': ['Education', 'Age', 'Income', 'Marital Status']},
'aggregated': {'ranked_effect': array([1.67911611, 1.27438691, 0.90944283, 0.62487392]),
'names': ['Marital Status', 'Education', 'Age', 'Income']}
}
}
}
By summing output¶
An alternative to grouping, with a higher runtime cost, is to estimate one shap value for each dimension of the one-hot encoded data and sum the shap values of the encoded dimensions to obtain only one shap value per categorical variable.
Initialisation and Fit¶
The initialisation step is as before:
explainer = KernelShap(
predict_fn,
link='logit',
feature_names=["Marital Status", "Age", "Income", "Education"],
categorical_names=category_map,
)
However, note that the keys of the category_map have to correspond to the locations of the categorical variables after the effects for the encoded dimensions have been summed up (see details below).
The fit step requires one-hot encoded data and simply takes the reference dataset:
explainer.fit(X_reference)
Explanation¶
To obtain a single shap value per categorical variable, we have to specify the following arguments to the explain method:
summarise_result: indicates that some shap values will be summed.
cat_vars_start_idx: the column indices where the first encoded dimension is for each categorical variable.
cat_vars_enc_dim: the length of the encoding dimension for each categorical variable.
explanation = explainer.explain(
X,
summarise_result=True,
cat_vars_start_idx=[0, 4],
cat_vars_enc_dim=[2, 2],
)
In our case, Marital Status starts at column 0 and occupies 2 columns, Age and Income occupy columns 2 and 3, and Education occupies columns 4 and 5.
By combining preprocessor and predictor¶
Finally, an alternative is to combine the preprocessor and the predictor together in the same object, and fit the explainer on data before preprocessing.
Initialisation and fit¶
To do so, we first redefine our predict function as
predict_fn = lambda x: clf.predict_proba(preprocessor.transform(x))
The explainer can be initialised as:
explainer = KernelShap(
predict_fn,
link='logit',
feature_names=["Marital Status", "Age", "Income", "Education"],
categorical_names=category_map,
)
Then, the explainer should be fitted on unprocessed data:
explainer.fit(X_reference_unprocessed)
Explanation¶
We can explain unprocessed records simply by calling explain:
explanation = explainer.explain(X_unprocessed)
Miscellaneous¶
Runtime considerations¶
For a given instance, the runtime of the algorithm depends on:
the size of the reference dataset
the dimensionality of the data
the number of samples used to estimate the shap values
Adjusting the size of the reference dataset¶
The algorithm automatically warns the user if a background dataset of more than 300 samples is passed. If the runtime of an explanation with the original dataset is too large, then the algorithm can automatically subsample the background dataset during the fit step. This can be achieved by specifying the fit step as:
explainer.fit(
X_reference,
summarise_background=True,
n_background_samples=150,
)
or
explainer.fit(
X_reference,
summarise_background='auto'
)
The auto option will select 300 examples, whereas using the boolean argument allows the user to directly control the size of the reference set. If categorical variables or grouping options are specified, the algorithm subsamples the data. Otherwise, a k-means clustering algorithm is used to select the background dataset, and the samples are weighted according to the frequency of occurrence of the cluster they are assigned to, which is reflected in the expected_value attribute of the explainer.
As described above, the explanations are performed with respect to the expected (or weighted-average) output over this dataset, so the shap values will be affected by the dataset selection. We recommend experimenting with various ways to choose the background dataset before deploying explanations.
The dimensionality of the data and the number of samples used in shap value estimation¶
The dimensionality of the data has a slight impact on the runtime, since by default the number of samples used for estimation is 2*n_features + 2**11. In our experiments, we found that either grouping the data or fitting the explainer on unprocessed data resulted in runtime savings (but we did not run rigorous comparison experiments). If grouping/fitting on unprocessed data alone does not give enough runtime savings, the background dataset could be adjusted. Additionally (or alternatively), the number of samples could be reduced as follows:
explanation = explainer.explain(X, nsamples=500)
We recommend experimenting with this setting to understand the variance in the shap values before deploying such configurations.
Imbalanced datasets¶
In some situations, the reference dataset might be imbalanced, so one might wish to perform an explanation of the model behaviour around \(x\) with respect to \(\sum_{i} w_i f(y_i)\) as opposed to \(\mathbb{E}_{\mathcal{D}}[f(y)]\). This can be achieved by passing a list or a 1D numpy array containing a weight for each data point in X_reference as the weights argument of the fit method.
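The effect of weighting on the baseline can be sketched as follows: the expected value becomes a weighted average of the model outputs over the reference set. The predict function and data below are toy stand-ins:

```python
import numpy as np

def expected_value(predict_fn, X_reference, weights=None):
    """Baseline output: the (weighted) average of the model output over
    the background dataset (a sketch, not the library implementation)."""
    preds = predict_fn(X_reference)  # shape: (n_samples, n_outputs)
    return np.average(preds, axis=0, weights=weights)

predict_fn = lambda X: X @ np.array([[1.0], [2.0]])   # toy linear model
X_ref = np.array([[1.0, 0.0], [0.0, 1.0]])            # outputs 1.0 and 2.0

unweighted = expected_value(predict_fn, X_ref)                  # [1.5]
upweighted = expected_value(predict_fn, X_ref, weights=[3, 1])  # [1.25]
```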
Examples¶
Continuous Data¶
This page was generated from doc/source/methods/LinearityMeasure.ipynb.
Measuring the linearity of machine learning models¶
Overview¶
Machine learning models in general include both linear and nonlinear operations: neural networks may include several layers consisting of linear algebra operations followed by nonlinear activation functions, while models based on decision trees are by nature highly nonlinear. The linearity measure function and class provide an operational definition for the amount of nonlinearity of a map acting on vector spaces. Roughly speaking, the amount of nonlinearity of the map is defined based on how much the output of the map applied to a linear superposition of input vectors differs from the linear superposition of the map’s outputs for each individual vector. In the context of supervised learning, this definition is immediately applicable to machine learning models, which are fundamentally maps from an input vector space (the feature space) to an output vector space that may represent probabilities (for classification models) or actual values of quantities of interest (for regression models).
Given an input vector space \(V\), an output vector space \(W\) and a map \(M: V \rightarrow W\), the amount of nonlinearity of the map \(M\) in a region \(\beta\) of the input space \(V\) and relative to some coefficients \(\alpha(v)\) is defined as
\[L_{\beta, \alpha}^{(M)} = \left\| \int_{\beta} \alpha(v) M(v) \, dv - M\left( \int_{\beta} \alpha(v) v \, dv \right) \right\|,\]
where \(v \in V\) and \(\|\cdot\|\) denotes the norm of a vector. If we consider a finite number of vectors \(N\), the amount of nonlinearity can be defined as
\[L_{\beta, \alpha}^{(M)} = \left\| \sum_{i=1}^{N} \alpha_i M(v_i) - M\left( \sum_{i=1}^{N} \alpha_i v_i \right) \right\|,\]
where, with an abuse of notation, \(\beta\) is no longer a continuous region in the input space but a collection of input vectors \(\{v_i\}\) and \(\alpha\) is no longer a function but a collection of real coefficients \(\{\alpha_i \}\) with \(i \in \{1, ..., N\}\). Note that the second expression may be interpreted as an approximation of the integral quantity defined in the first expression, where the vectors \(\{v_i\}\) are sampled uniformly in the region \(\beta\).
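The finite-sample definition above can be sketched in a few lines of numpy (a sketch of the definition, not the library implementation; the maps below are toy stand-ins):

```python
import numpy as np

def nonlinearity(M, vs, alphas):
    """Amount of nonlinearity of a map M over vectors vs with coefficients
    alphas: || sum_i alpha_i M(v_i) - M(sum_i alpha_i v_i) ||.
    """
    vs = np.asarray(vs, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    superposed_outputs = sum(a * M(v) for a, v in zip(alphas, vs))
    output_of_superposition = M(alphas @ vs)
    return float(np.linalg.norm(superposed_outputs - output_of_superposition))

linear_map = lambda v: 3.0 * v   # exactly linear map -> measure is zero
square_map = lambda v: v ** 2    # nonlinear map -> positive measure
vs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
alphas = [0.5, 0.5]
```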
Application to machine learning models¶
In supervised learning, a model can be considered as a function \(M\) mapping vectors from the input space (feature vectors) to vectors in the output space. The output space may represent probabilities in the case of a classification model or values of the target quantities in the case of a regression model. The definition of the linearity measure given above can be applied to the case of a regression model (either a single target regression or a multi target regression) in a straightforward way.
In the case of a classifier, let us denote by \(z\) the logits vector of the model such that the probabilities of the model \(M\) are given by \(\text{softmax}(z).\) Since the activation function of the last layer is usually highly nonlinear, it is convenient to apply the definition of linearity given above to the logits vector \(z.\) In the “white box” scenario, in which we have access to the internal architecture of the model, the vector \(z\) is accessible and the amount of nonlinearity can be calculated immediately. On the other hand, if the only accessible quantities are the output probabilities (the “black box” scenario), we need to invert the last layer’s activation function in order to retrieve \(z.\) In other words, that means defining a new map \(M^\prime = f^{-1} \circ M(v)\) where \(f\) is the activation function at the last layer and considering \(L_{\beta, \alpha}^{(M^\prime)}\) as a measure of the nonlinearity of the model. The activation function of the last layer is usually a sigmoid function for binary classification tasks or a softmax function for multiclass classification. The inversion of the sigmoid function does not present any particular challenge, and the map \(M^\prime\) can be written as
\[M^\prime(v) = \log \left( \frac{M(v)}{1 - M(v)} \right).\]
On the other hand, the softmax probabilities \(p\) are defined in terms of the vector \(z\) as \(p_j = e^{z_j}/\sum_k{e^{z_k}},\) where \(z_j\) are the components of \(z\). The inverse of the softmax function is thus defined up to a constant \(C\) which does not depend on \(j\) but might depend on the input vector \(v.\) The inverse map \(M^\prime = \text{softmax}^{-1} \circ M(v)\) is then given by:
\[M^\prime_j(v) = \log M_j(v) + C(v),\]
where \(C(v)\) is an arbitrary constant depending in general on the input vector \(v.\)
Since in the black box scenario it is not possible to assess the value of \(C\), henceforth we will ignore it and define the amount of nonlinearity of a machine learning model whose output is a probability distribution as
\[L_{\beta, \alpha}^{(M)} := L_{\beta, \alpha}^{(\log \circ M)}.\]
It must be noted that the quantity above may in general be different from the “actual” amount of nonlinearity of the model, i.e. the quantity calculated by accessing the activation vectors \(z\) directly.
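The activation inversions described above can be sketched in numpy. The sigmoid inversion is exact, while the softmax inversion recovers the logits only up to an additive constant \(C(v)\):

```python
import numpy as np

def sigmoid_inverse(p):
    """Logit function: exact inverse of the sigmoid."""
    return np.log(p / (1 - p))

def softmax_inverse(p):
    """Equal to the logits z only up to an additive constant C(v)."""
    return np.log(p)

# Sigmoid: recovered logit equals the original value.
z = 1.3
p = 1 / (1 + np.exp(-z))
recovered = sigmoid_inverse(p)

# Softmax: recovered values differ from the original logits by a constant.
logits = np.array([0.5, 1.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()
shifted = softmax_inverse(probs)
```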
Implementation¶
Sampling¶
The module implements two different methods for the sampling of vectors in a neighbourhood of the instance of interest \(v.\)
The first sampling method, grid, consists of defining the region \(\beta\) as a discrete lattice of a given size around the instance of interest, with the size defined in terms of the L1 distance in the lattice; the vectors are then sampled from the lattice according to a uniform distribution. The density and the size of the lattice are controlled by the resolution parameter res and the size parameter epsilon. This method is highly efficient and scalable from a computational point of view.
The second sampling method, knn, consists of sampling from the same probability distribution the instance \(v\) was drawn from; this method is implemented by simply selecting the \(K\) nearest neighbours to \(v\) from a training set, when this is available. The knn method imposes the constraint that the neighbourhood of \(v\) must include only vectors from the training set, and as a consequence it will exclude out-of-distribution instances from the computation of linearity.
Pairwise vs global linearity¶
The module implements two different methods to associate a value of the linearity measure to \(v.\)
The first method consists of measuring the global linearity in a region around \(v.\) This means that we sample \(N\) vectors \(\{v_i\}\) from a region \(\beta\) of the input space around \(v\) and apply
\begin{equation} L_{\beta, \alpha}^{(M)} = \left\| \sum_{i=1}^N \alpha_{i} M(v_i) - M\left(\sum_{i=1}^N \alpha_i v_i \right) \right\|. \end{equation}
The second method consists of measuring the pairwise linearity between the instance of interest and other vectors close to it, averaging over all such pairs. In other words, we sample \(N\) vectors \(\{v_i\}\) from \(\beta\) as in the global method, but in this case we calculate the amount of nonlinearity \(L_{(v,v_i),\alpha}\) for every pair of vectors \((v, v_i)\) and average over all the pairs. Given two coefficients \(\{\alpha_0, \alpha_1\}\) such that \(\alpha_0 + \alpha_1 = 1,\) we can define the pairwise linearity measure relative to the instance of interest \(v\) as
\begin{equation}\label{pairwiselin} L^{(M)} = \frac{1}{N} \sum_{i=1}^N \left\|\alpha_0 M(v) + \alpha_1 M(v_i) - M(\alpha_0 v + \alpha_1 v_i)\right\|. \end{equation}
The two methods are slightly different from a conceptual point of view: the global linearity measure combines all \(N\) vectors sampled in \(\beta\) in a single superposition, and can be conceptually regarded as a direct approximation of the integral quantity. Thus, the quantity is strongly linked to the model behavior in the whole region \(\beta.\) On the other hand, the pairwise linearity measure is an averaged quantity over pairs of superimposed vectors, with the instance of interest \(v\) included in each pair. For that reason, it is conceptually more tied to the instance \(v\) itself rather than the region \(\beta\) around it.
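The pairwise measure can be sketched analogously to the global one (a sketch of the definition, with toy maps standing in for a real model):

```python
import numpy as np

def pairwise_nonlinearity(M, v, vs, alphas=(0.5, 0.5)):
    """Pairwise linearity measure: nonlinearity of each pair (v, v_i),
    averaged over all pairs, as defined above."""
    a0, a1 = alphas
    scores = [
        np.linalg.norm(a0 * M(v) + a1 * M(vi) - M(a0 * v + a1 * vi))
        for vi in vs
    ]
    return float(np.mean(scores))

lin_map = lambda v: 3.0 * v   # exactly linear -> measure is zero
sq_map = lambda v: v ** 2     # nonlinear -> positive measure
v0 = np.array([1.0, 0.0])
neighbours = [np.array([0.0, 1.0]), np.array([1.0, 1.0])]
```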
Usage¶
LinearityMeasure class¶
Given a model class with a predict method that returns a probability distribution in the case of a classifier or numeric values in the case of a regressor, the linearity measure \(L\) around an instance of interest \(X\) can be calculated using the class LinearityMeasure as follows:
from alibi.confidence.model_linearity import LinearityMeasure
predict_fn = lambda x: model.predict(x)
lm = LinearityMeasure(method='grid',
epsilon=0.04,
nb_samples=10,
res=100,
alphas=None,
model_type='classifier',
agg='pairwise',
verbose=False)
lm.fit(X_train)
L = lm.score(predict_fn, X)
Here X_train is the dataset the model was trained on. The feature_range is inferred from X_train in the fit step.
linearity_measure function¶
Given a model class with a predict method that returns a probability distribution in the case of a classifier or numeric values in the case of a regressor, the linearity measure \(L\) around an instance of interest \(X\) can also be calculated using the linearity_measure function as follows:
from alibi.confidence.model_linearity import linearity_measure, _infer_feature_range
predict_fn = lambda x: model.predict(x)
feature_range = _infer_feature_range(X_train)
L = linearity_measure(predict_fn,
X,
feature_range=feature_range,
method='grid',
X_train=None,
epsilon=0.04,
nb_samples=10,
res=100,
alphas=None,
agg='global',
model_type='classifier')
Note that in this case the feature_range must be explicitly passed to the function, having been inferred beforehand.
This page was generated from doc/source/methods/TrustScores.ipynb.
Trust Scores¶
Overview¶
It is important to know when a machine learning classifier’s predictions can be trusted. Relying on the classifier’s (uncalibrated) prediction probabilities is not optimal and can be improved upon. Enter trust scores. Trust scores measure the agreement between the classifier and a modified nearest neighbor classifier on the predicted instances. The trust score is the ratio between the distance of the instance to the nearest class different from the predicted class and the distance to the predicted class. A score of 1 would mean that the distance to the predicted class is the same as to the nearest other class. Higher scores correspond to more trustworthy predictions. The original paper on which the algorithm is based is called To Trust Or Not To Trust A Classifier. Our implementation borrows heavily from and extends the authors’ open source code.
The method requires labeled training data to build kd trees for each prediction class. When the classifier makes predictions on a test instance, we measure the distance of the instance to each of the trees. The trust score is then calculated by taking the ratio of the smallest distance to any other class than the predicted class and the distance to the predicted class. The distance is measured to the \(k\)th nearest neighbor in each tree or by using the average distance from the first to the \(k\)th neighbor.
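The ratio described above can be sketched with a brute-force nearest-neighbour search (the library builds kd trees per class and supports filtering; the helper below is a simplified stand-in):

```python
import numpy as np

def trust_scores(X_train, y_train, X_test, y_pred, k=2):
    """Minimal sketch of the trust score: the ratio of the distance to the
    k-th nearest training point of the closest *other* class to the distance
    to the k-th nearest point of the *predicted* class."""
    classes = np.unique(y_train)
    scores = []
    for x, c in zip(X_test, y_pred):
        d_kth = {}
        for cls in classes:
            pts = X_train[y_train == cls]
            dists = np.sort(np.linalg.norm(pts - x, axis=1))
            d_kth[cls] = dists[min(k, len(dists)) - 1]
        d_other = min(d for cls, d in d_kth.items() if cls != c)
        scores.append(d_other / d_kth[c])
    return np.array(scores)

# Two well-separated classes: a point near class 0 predicted as 0 gets a
# high trust score; the same point predicted as 1 would get a low one.
X_tr = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_tr = np.array([0, 0, 1, 1])
scores = trust_scores(X_tr, y_tr, np.array([[0.0, 0.5]]), y_pred=[0], k=1)
```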
In order to filter out the impact of outliers in the training data, they can optionally be removed using 2 filtering techniques. The first technique builds a kd tree for each class and removes a fraction \(\alpha\) of the training instances with the largest k nearest neighbor (kNN) distance to the other instances in the class. The second fits a kNN classifier to the training set, and removes a fraction \(\alpha\) of the training instances with the highest prediction class disagreement. Be aware that the first method operates on the prediction class level while the second method runs on the whole training set. It is also important to keep in mind that kNN methods might not be suitable when there are significant scale differences between the input features.
Trust scores can for instance be used as a warning flag for machine learning predictions. If the score drops below a certain value and there is disagreement between the model probabilities and the trust score, the prediction can be explained using techniques like anchors or contrastive explanations.
Trust scores work best for low to medium dimensional feature spaces. When working with high dimensional observations like images, dimensionality reduction methods (e.g. autoencoders or PCA) could be applied as a preprocessing step before computing the scores. This is demonstrated by the following example notebook.
Usage¶
Initialization and fit¶
At initialization, the optional filtering method used to remove outliers during the fit stage needs to be specified:
from alibi.confidence import TrustScore
ts = TrustScore(alpha=.05,
filter_type='distance_knn',
k_filter=10,
leaf_size=40,
metric='euclidean',
dist_filter_type='point')
All the hyperparameters are optional:
alpha
: target fraction of instances to filter out.
filter_type
: filter method; one of None (no filtering), distance_knn (first technique discussed in Overview) or probability_knn (second technique).
k_filter
: number of neighbors used for the distance or probability based filtering method.
leaf_size
: affects the speed and memory usage to build the kd trees. The memory scales with the ratio between the number of samples and the leaf size.
metric
: distance metric used for the kd trees. Euclidean by default.
dist_filter_type
: point uses the distance to the \(k\)th nearest point while mean uses the average distance from the 1st to the \(k\)th nearest point during filtering.
In this example, we use the distance_knn method to filter out 5% of the instances of each class with the largest distance to its 10th nearest neighbor in that class:
ts.fit(X_train, y_train, classes=3)
classes
: equals the number of prediction classes.
X_train is the training set and y_train represents the training labels, either using one-hot encoding (OHE) or simple class labels.
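The distance_knn filtering step itself is easy to sketch: per class, drop the fraction alpha of points with the largest distance to their \(k\)th nearest in-class neighbor. The helper below is our own illustration of the idea, not alibi's implementation:

```python
import numpy as np
from sklearn.neighbors import KDTree

def filter_distance_knn(X, y, alpha=0.05, k=10):
    """Per class, keep the (1 - alpha) fraction of points with the
    smallest distance to their k-th nearest in-class neighbor."""
    keep = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        tree = KDTree(X[idx])
        # query k + 1 neighbors because the nearest neighbor of each point is itself
        dist, _ = tree.query(X[idx], k=k + 1)
        knn_dist = dist[:, -1]
        n_keep = int(np.ceil((1 - alpha) * len(idx)))
        keep.append(idx[np.argsort(knn_dist)[:n_keep]])
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.repeat([0, 1], 100)
X_f, y_f = filter_distance_knn(X, y, alpha=0.1, k=10)
print(X_f.shape)  # (180, 2): 10% of each class removed
```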
Scores¶
The trust scores are calculated through the score
method, which also returns the class labels of the closest non-predicted class as a numpy array:
score, closest_class = ts.score(X_test,
y_pred,
k=2,
dist_type='point')
y_pred can again be represented using either OHE or class labels.
k
: \(k\)th nearest neighbor used to compute the distance to for each class.
dist_type
: similar to the filtering step, we can compute the distance to each class either to the \(k\)th nearest point (point) or by using the average distance from the 1st to the \(k\)th nearest point (mean).
This page was generated from examples/anchor_tabular_adult.ipynb.
Anchor explanations for income prediction¶
In this example, we will explain predictions of a Random Forest classifier whether a person will make more or less than $50k based on characteristics like age, marital status, gender or occupation. The features are a mixture of ordinal and categorical data and will be preprocessed accordingly.
[1]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from alibi.explainers import AnchorTabular
from alibi.datasets import fetch_adult
Load adult dataset¶
The fetch_adult
function returns a Bunch
object containing the features, the targets, the feature names and a mapping of categorical variables to numbers which are required for formatting the output of the Anchor explainer.
[2]:
adult = fetch_adult()
adult.keys()
[2]:
dict_keys(['data', 'target', 'feature_names', 'target_names', 'category_map'])
[3]:
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map = adult.category_map
Note that for your own datasets you can use our utility function gen_category_map to create the category map:
[4]:
from alibi.utils.data import gen_category_map
Define shuffled training and test set
[5]:
np.random.seed(0)
data_perm = np.random.permutation(np.c_[data, target])
data = data_perm[:,:-1]
target = data_perm[:,-1]
[6]:
idx = 30000
X_train,Y_train = data[:idx,:], target[:idx]
X_test, Y_test = data[idx+1:,:], target[idx+1:]
Create feature transformation pipeline¶
Create feature preprocessor. It needs to have ‘fit’ and ‘transform’ methods. Different types of preprocessing can be applied to all or part of the features. In the example below we will standardize ordinal features and apply one-hot encoding to categorical features.
Ordinal features:
[7]:
ordinal_features = [x for x in range(len(feature_names)) if x not in list(category_map.keys())]
ordinal_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])
Categorical features:
[8]:
categorical_features = list(category_map.keys())
categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
Combine and fit:
[9]:
preprocessor = ColumnTransformer(transformers=[('num', ordinal_transformer, ordinal_features),
('cat', categorical_transformer, categorical_features)])
preprocessor.fit(X_train)
[9]:
ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
transformer_weights=None,
transformers=[('num',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('scaler',
StandardScaler(copy=True,
with_mean=True,
with_std=True))],
verbose=False),
[0, 8, 9, 10]),
('cat',
Pipeline(memory=None,
steps=[('imputer',
SimpleImputer(add_indicator=False,
copy=True,
fill_value=None,
missing_values=nan,
strategy='median',
verbose=0)),
('onehot',
OneHotEncoder(categories='auto',
drop=None,
dtype=<class 'numpy.float64'>,
handle_unknown='ignore',
sparse=True))],
verbose=False),
[1, 2, 3, 4, 5, 6, 7, 11])],
verbose=False)
Train Random Forest model¶
Fit on preprocessed (imputing, OHE, standardizing) data.
[10]:
np.random.seed(0)
clf = RandomForestClassifier(n_estimators=50)
clf.fit(preprocessor.transform(X_train), Y_train)
[10]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
Define predict function
[11]:
predict_fn = lambda x: clf.predict(preprocessor.transform(x))
print('Train accuracy: ', accuracy_score(Y_train, predict_fn(X_train)))
print('Test accuracy: ', accuracy_score(Y_test, predict_fn(X_test)))
Train accuracy: 0.9655333333333334
Test accuracy: 0.855859375
Initialize and fit anchor explainer for tabular data¶
[12]:
explainer = AnchorTabular(predict_fn, feature_names, categorical_names=category_map, seed=1)
Discretize the ordinal features into quartiles
[13]:
explainer.fit(X_train, disc_perc=[25, 50, 75])
[13]:
AnchorTabular(meta={
'name': 'AnchorTabular',
'type': ['blackbox'],
'explanations': ['local'],
'params': {'seed': 1, 'disc_perc': [25, 50, 75]}
})
Getting an anchor¶
Below, we get an anchor for the prediction of the first observation in the test set. An anchor is a sufficient condition: when the anchor holds, the prediction should be the same as the prediction for this instance.
[14]:
idx = 0
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predictor(X_test[idx].reshape(1, -1))[0]])
Prediction: <=50K
We set the precision threshold to 0.95. This means that predictions on observations where the anchor holds will be the same as the prediction on the explained instance at least 95% of the time.
[15]:
explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation.anchor)))
print('Precision: %.2f' % explanation.precision)
print('Coverage: %.2f' % explanation.coverage)
Anchor: Marital Status = Separated AND Sex = Female
Precision: 0.95
Coverage: 0.18
…or not?¶
Let’s try getting an anchor for a different observation in the test set, one for which the prediction is >50K.
[16]:
idx = 6
class_names = adult.target_names
print('Prediction: ', class_names[explainer.predictor(X_test[idx].reshape(1, -1))[0]])
explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation.anchor)))
print('Precision: %.2f' % explanation.precision)
print('Coverage: %.2f' % explanation.coverage)
Prediction: >50K
Could not find a result satisfying the 0.95 precision constraint. Now returning the best non-eligible result.
Anchor: Capital Loss > 0.00 AND Relationship = Husband AND Marital Status = Married AND Age > 37.00 AND Race = White AND Country = United-States AND Sex = Male
Precision: 0.71
Coverage: 0.05
Notice how no anchor is found!
This is due to the imbalanced dataset (roughly 25:75 high:low earner proportion), so during the sampling stage feature ranges corresponding to low earners will be oversampled. This is a feature because it can point out an imbalanced dataset, but it can also be fixed by producing balanced datasets to enable anchors to be found for either class.
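A minimal way to produce such a balanced dataset, assuming integer class labels, is to downsample every class to the size of the smallest one. The helper below is our own illustration, not part of alibi:

```python
import numpy as np

def balance_classes(X, y, seed=0):
    """Downsample each class to the size of the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=n, replace=False)
                          for c in classes])
    rng.shuffle(idx)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)
X_bal, y_bal = balance_classes(X, y)
print(np.bincount(y_bal))  # [3 3]
```

The balanced X_bal could then be passed to the explainer's fit step so that sampling covers both classes evenly.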
This page was generated from examples/anchor_tabular_iris.ipynb.
Anchor explanations on the Iris dataset¶
[1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from alibi.explainers import AnchorTabular
Load iris dataset¶
[2]:
dataset = load_iris()
feature_names = dataset.feature_names
class_names = list(dataset.target_names)
Define training and test set
[3]:
idx = 145
X_train,Y_train = dataset.data[:idx,:], dataset.target[:idx]
X_test, Y_test = dataset.data[idx+1:,:], dataset.target[idx+1:]
Train Random Forest model¶
[4]:
np.random.seed(0)
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_train, Y_train)
[4]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
Define predict function
[5]:
predict_fn = lambda x: clf.predict_proba(x)
Initialize and fit anchor explainer for tabular data¶
[6]:
explainer = AnchorTabular(predict_fn, feature_names)
Discretize the ordinal features into quartiles
[7]:
explainer.fit(X_train, disc_perc=(25, 50, 75))
[7]:
AnchorTabular(meta={
'name': 'AnchorTabular',
'type': ['blackbox'],
'explanations': ['local'],
'params': {'seed': None, 'disc_perc': (25, 50, 75)}
})
Getting an anchor¶
Below, we get an anchor for the prediction of the first observation in the test set. An anchor is a sufficient condition: when the anchor holds, the prediction should be the same as the prediction for this instance.
[8]:
idx = 0
print('Prediction: ', class_names[explainer.predictor(X_test[idx].reshape(1, -1))[0]])
Prediction: virginica
We set the precision threshold to 0.95. This means that predictions on observations where the anchor holds will be the same as the prediction on the explained instance at least 95% of the time.
[9]:
explanation = explainer.explain(X_test[idx], threshold=0.95)
print('Anchor: %s' % (' AND '.join(explanation.anchor)))
print('Precision: %.2f' % explanation.precision)
print('Coverage: %.2f' % explanation.coverage)
Anchor: petal width (cm) > 1.80 AND sepal width (cm) <= 2.80
Precision: 0.98
Coverage: 0.32
This page was generated from examples/anchor_text_movie.ipynb.
Anchor explanations for movie sentiment¶
In this example, we will explain why a certain sentence is classified by a logistic regression as having negative or positive sentiment. The logistic regression is trained on negative and positive movie reviews.
[1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import spacy
from alibi.explainers import AnchorText
from alibi.datasets import fetch_movie_sentiment
from alibi.utils.download import spacy_model
Load movie review dataset¶
The fetch_movie_sentiment
function returns a Bunch
object containing the features, the targets and the target names for the dataset.
[2]:
movies = fetch_movie_sentiment()
movies.keys()
[2]:
dict_keys(['data', 'target', 'target_names'])
[3]:
data = movies.data
labels = movies.target
target_names = movies.target_names
Define shuffled training, validation and test set
[4]:
train, test, train_labels, test_labels = train_test_split(data, labels, test_size=.2, random_state=42)
train, val, train_labels, val_labels = train_test_split(train, train_labels, test_size=.1, random_state=42)
train_labels = np.array(train_labels)
test_labels = np.array(test_labels)
val_labels = np.array(val_labels)
Apply CountVectorizer to training set¶
[5]:
vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(train)
[5]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
Fit model¶
[6]:
np.random.seed(0)
clf = LogisticRegression(solver='liblinear')
clf.fit(vectorizer.transform(train), train_labels)
[6]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)
Define prediction function¶
[7]:
predict_fn = lambda x: clf.predict(vectorizer.transform(x))
Make predictions on train and test sets¶
[8]:
preds_train = predict_fn(train)
preds_val = predict_fn(val)
preds_test = predict_fn(test)
print('Train accuracy', accuracy_score(train_labels, preds_train))
print('Validation accuracy', accuracy_score(val_labels, preds_val))
print('Test accuracy', accuracy_score(test_labels, preds_test))
Train accuracy 0.9801624284382905
Validation accuracy 0.7544910179640718
Test accuracy 0.7589841878294202
Load spaCy model¶
English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parses and named entities.
[9]:
model = 'en_core_web_md'
spacy_model(model=model)
nlp = spacy.load(model)
Initialize anchor text explainer¶
[10]:
explainer = AnchorText(nlp, predict_fn)
Explain a prediction¶
[11]:
class_names = movies.target_names
[12]:
text = data[4]
print(text)
a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification .
Prediction:
[13]:
pred = class_names[predict_fn([text])[0]]
alternative = class_names[1 - predict_fn([text])[0]]
print('Prediction: %s' % pred)
Prediction: negative
Explanation:
[14]:
np.random.seed(0)
explanation = explainer.explain(text, threshold=0.95, use_unk=True)
use_unk=True means we will perturb examples by replacing words with UNKs. Let us now take a look at the anchor. The word ‘flashy’ basically guarantees a negative prediction.
[15]:
print('Anchor: %s' % (' AND '.join(explanation.anchor)))
print('Precision: %.2f' % explanation.precision)
print('\nExamples where anchor applies and model predicts %s:' % pred)
print('\n'.join([x for x in explanation.raw['examples'][1]['covered_true']]))
print('\nExamples where anchor applies and model predicts %s:' % alternative)
print('\n'.join([x for x in explanation.raw['examples'][1]['covered_false']]))
Anchor: flashy
Precision: 0.99
Examples where anchor applies and model predicts negative:
a UNK flashy UNK UNK opaque and emotionally vapid exercise in style UNK mystification .
a UNK flashy UNK UNK UNK and emotionally UNK exercise UNK UNK and UNK UNK
a UNK flashy UNK narratively opaque UNK UNK UNK exercise in style and UNK UNK
UNK visually flashy UNK narratively UNK and emotionally UNK UNK UNK UNK UNK mystification .
UNK UNK flashy UNK UNK opaque and emotionally UNK UNK in UNK and UNK .
a visually flashy but UNK UNK and UNK UNK UNK in style UNK mystification .
a visually flashy but UNK opaque UNK emotionally vapid UNK in UNK and mystification .
a UNK flashy but narratively UNK UNK emotionally vapid exercise in style UNK mystification UNK
a UNK flashy but narratively opaque UNK emotionally vapid exercise in style and mystification .
a visually flashy UNK UNK opaque UNK UNK UNK exercise in UNK UNK UNK .
Examples where anchor applies and model predicts positive:
UNK UNK flashy but narratively UNK and UNK UNK UNK in style and UNK UNK
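The UNK perturbation visible in the examples above can be sketched in a few lines: each word is independently replaced by the placeholder token with some probability. This is our own simplification of the sampling, not alibi's implementation:

```python
import numpy as np

def perturb_unk(text, sample_proba=0.5, seed=0):
    """Replace each word independently with 'UNK' with probability sample_proba."""
    rng = np.random.default_rng(seed)
    words = text.split()
    mask = rng.random(len(words)) < sample_proba
    return ' '.join('UNK' if m else w for w, m in zip(words, mask))

print(perturb_unk('a visually flashy but narratively opaque exercise'))
```

Words belonging to the candidate anchor are kept fixed in the real algorithm; only the remaining positions are eligible for replacement.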
Changing the perturbation distribution¶
Let’s try this with another perturbation distribution, namely one that replaces words by similar words instead of UNKs.
Explanation:
[16]:
np.random.seed(0)
explanation = explainer.explain(text, threshold=0.95, use_unk=False, sample_proba=0.5)
The anchor now shows that we need more to guarantee the negative prediction:
[17]:
print('Anchor: %s' % (' AND '.join(explanation.anchor)))
print('Precision: %.2f' % explanation.precision)
print('\nExamples where anchor applies and model predicts %s:' % pred)
print('\n'.join([x for x in explanation.raw['examples'][1]['covered_true']]))
print('\nExamples where anchor applies and model predicts %s:' % alternative)
print('\n'.join([x for x in explanation.raw['examples'][1]['covered_false']]))
Anchor: exercise AND emotionally
Precision: 0.97
Examples where anchor applies and model predicts negative:
a visually ballsy but philosophically opaque and emotionally vapid exercise if signature and negation .
a instantly flashy but narratively opaque and emotionally vapid exercise than style and mystification .
a visually flashy but narratively opaque and emotionally vapid exercise in style and fear .
some visually quirky but narratively dainty and emotionally patronising exercise until style and hopelessness .
another visually decorate but narratively opaque and emotionally vapid exercise in outfit and mystification .
both curiously unwieldy but wonderfully hollow and emotionally ridiculous exercise in style and mystification .
a visually artsy but gloriously opaque and emotionally vapid exercise in style and mystification .
a visually eclectic but narratively opaque and emotionally vapid exercise in vogue and mystification .
another stunningly flashy but narratively bright and emotionally unscientific exercise on style and falsehood .
a exceptionally whimsical but disturbingly opaque and emotionally vapid exercise about vibe and mystification .
Examples where anchor applies and model predicts positive:
both visually unconventional but socially opaque and emotionally caricature exercise around style and woe .
both wonderfully artsy but similarly opaque and emotionally vapid exercise in style and mystification .
a perfectly groovy but supremely smooth and emotionally vapid exercise towards style and mystification .
a visually stylish but narratively truthful and emotionally moronic exercise in style and oxymoron .
any remarkably inventive but narratively opaque and emotionally babble exercise despite style and naivety .
We can make the token perturbation distribution sample words that are more similar to the ground truth word via the top_n argument. Smaller values (default=100) should result in sentences that are more coherent and thus more in the distribution of natural language, which could influence the returned anchor. By setting use_similarity_proba to True, the sampling distribution for perturbed tokens is proportional to the similarity score between the possible perturbations and the original word. We can also put more weight on similar words via the temperature argument: lower values of temperature increase the sampling weight of more similar words. The following example will perturb tokens in the original sentence with probability equal to sample_proba. The sampling distribution for the perturbed tokens is proportional to the similarity score between the ground truth word and each of the top_n words.
[18]:
np.random.seed(0)
explanation = explainer.explain(text, threshold=0.95, use_similarity_proba=True, sample_proba=0.5,
use_unk=False, top_n=20, temperature=.2)
print('Anchor: %s' % (' AND '.join(explanation.anchor)))
print('Precision: %.2f' % explanation.precision)
print('\nExamples where anchor applies and model predicts %s:' % pred)
print('\n'.join([x for x in explanation.raw['examples'][1]['covered_true']]))
print('\nExamples where anchor applies and model predicts %s:' % alternative)
print('\n'.join([x for x in explanation.raw['examples'][1]['covered_false']]))
Anchor: exercise AND emotionally
Precision: 0.98
Examples where anchor applies and model predicts negative:
a visually exquisite but narratively opaque and emotionally vapid exercise before style and mystification .
each mechanically eccentric but narratively transparent and emotionally unremarkable exercise in style and falsehood .
a incredibly extravagant but artistically bright and emotionally vapid exercise of style and mystification .
any visually shiny but artistically glide and emotionally vapid exercise around temperament and materialism .
another clearly flashy but aesthetically opaque and emotionally vapid exercise whether flair and mystification .
a visually snazzy but narratively opaque and emotionally mindless exercise within style and negation .
a visually ingenious but narratively opaque and emotionally unimaginative exercise in artistry and mystification .
a visually flashy but narratively colorful and emotionally vapid exercise than style and mystification .
a graphically punchy but narratively opaque and emotionally vapid exercise of vibe and insanity .
a artistically flashy but narratively opaque and emotionally vapid exercise in ballroom and mystification .
Examples where anchor applies and model predicts positive:
any vividly outlandish but supremely opaque and emotionally vapid exercise throughout streetwear and mystification .
another precisely elaborate but delightfully realistic and emotionally muddled exercise in brevity and paranoia .
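One plausible way to implement the temperature weighting described above (a sketch under our own assumptions, not necessarily alibi's exact formula) is a softmax over the similarity scores of the top_n candidate words:

```python
import numpy as np

def sampling_weights(similarities, temperature=0.2):
    """Softmax of similarity / temperature: lower temperature concentrates
    the sampling distribution on the most similar candidate words."""
    logits = np.asarray(similarities) / temperature
    logits -= logits.max()            # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

sims = [0.9, 0.5, 0.1]
print(sampling_weights(sims, temperature=0.2))   # sharply peaked on the 0.9 candidate
print(sampling_weights(sims, temperature=2.0))   # much flatter distribution
```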
This page was generated from examples/anchor_image_imagenet.ipynb.
Anchor explanations for ImageNet¶
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input, decode_predictions
from alibi.datasets import fetch_imagenet
from alibi.explainers import AnchorImage
Load InceptionV3 model pretrained on ImageNet¶
[2]:
model = InceptionV3(weights='imagenet')
Download and preprocess some images from ImageNet¶
The fetch_imagenet function takes as arguments one of the following ImageNet categories: ‘Persian cat’, ‘volcano’, ‘strawberry’, ‘jellyfish’ or ‘centipede’ as well as the number of images to return and the target size of the image.
[3]:
category = 'Persian cat'
image_shape = (299, 299, 3)
data, labels = fetch_imagenet(category, nb_images=10, target_size=image_shape[:2], seed=2, return_X_y=True)
print('Images shape: {}'.format(data.shape))
Images shape: (10, 299, 299, 3)
Apply image preprocessing, make predictions and map predictions back to categories. The output label is a tuple which consists of the class name, description and the prediction probability.
[4]:
images = preprocess_input(data)
preds = model.predict(images)
label = decode_predictions(preds, top=3)
print(label[0])
[('n02123394', 'Persian_cat', 0.909348), ('n03207941', 'dishwasher', 0.0027691855), ('n03832673', 'notebook', 0.0020055235)]
Define prediction function¶
[5]:
predict_fn = lambda x: model.predict(x)
Initialize anchor image explainer¶
The segmentation function will be used to generate superpixels. It is important to have meaningful superpixels in order to generate a useful explanation. Please check scikit-image's segmentation methods (felzenszwalb, slic and quickshift are built into the explainer) for more information.
In the example, the pixels not in the proposed anchor will take the average value of their superpixel. Another option is to superimpose the pixel values from other images which can be passed as a numpy array to the images_background argument.
[6]:
segmentation_fn = 'slic'
kwargs = {'n_segments': 15, 'compactness': 20, 'sigma': .5}
explainer = AnchorImage(predict_fn, image_shape, segmentation_fn=segmentation_fn,
segmentation_kwargs=kwargs, images_background=None)
Explain a prediction¶
Explaining the image below returns a mask with the superpixels that constitute the anchor.
[7]:
i = 0
plt.imshow(data[i]);
The threshold, p_sample and tau parameters are also key to generate a sensible explanation and ensure fast enough convergence. The threshold defines the minimum fraction of samples for a candidate anchor that need to lead to the same prediction as the original instance. While a higher threshold gives more confidence in the anchor, it also leads to longer computation time. p_sample determines the fraction of superpixels that are changed to either the average value of the superpixel or the pixel value for the superimposed image. The pixels in the proposed anchors are of course unchanged. The parameter tau determines when we assume convergence. A bigger value for tau means faster convergence but also looser anchor restrictions.
[8]:
image = images[i]
np.random.seed(0)
explanation = explainer.explain(image, threshold=.95, p_sample=.5, tau=0.25)
Superpixels in the anchor:
[9]:
plt.imshow(explanation.anchor);
A visualization of all the superpixels:
[10]:
plt.imshow(explanation.segments);
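The replacement of out-of-anchor superpixels by their average value, described earlier, can be sketched as follows (a simplified illustration; the helper name and toy data are our own):

```python
import numpy as np

def mask_superpixels(image, segments, anchor_segments):
    """Keep pixels in anchor superpixels; replace every other superpixel
    by its channel-wise mean value."""
    out = image.copy()
    for s in np.unique(segments):
        if s not in anchor_segments:
            out[segments == s] = image[segments == s].mean(axis=0)
    return out

# toy 2x2 single-channel image with two superpixels; segment 1 is the anchor
image = np.array([[[0.], [5.]],
                  [[2.], [5.]]])
segments = np.array([[0, 1],
                     [0, 1]])
masked = mask_superpixels(image, segments, anchor_segments={1})
print(masked[:, :, 0])
```

Segment 0 (values 0 and 2) is flattened to its mean of 1, while the anchor segment keeps its original pixel values.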
This page was generated from examples/anchor_image_fashion_mnist.ipynb.
Anchor explanations for fashion MNIST¶
[1]:
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from alibi.explainers import AnchorImage
Load and prepare fashion MNIST data¶
[2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
x_train shape: (60000, 28, 28) y_train shape: (60000,)
[3]:
idx = 0
plt.imshow(x_train[idx]);
Scale, reshape and categorize data
[4]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
Define CNN model¶
[5]:
def model():
x_in = Input(shape=(28, 28, 1))
x = Conv2D(filters=64, kernel_size=2, padding='same', activation='relu')(x_in)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x)
x = MaxPooling2D(pool_size=2)(x)
x = Dropout(0.3)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x_out = Dense(10, activation='softmax')(x)
cnn = Model(inputs=x_in, outputs=x_out)
cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return cnn
[6]:
cnn = model()
cnn.summary()
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
conv2d (Conv2D) (None, 28, 28, 64) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 14, 14, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 14, 14, 32) 8224
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 7, 7, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 1568) 0
_________________________________________________________________
dense (Dense) (None, 256) 401664
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 2570
=================================================================
Total params: 412,778
Trainable params: 412,778
Nontrainable params: 0
_________________________________________________________________
Train model¶
[7]:
cnn.fit(x_train, y_train, batch_size=64, epochs=3)
Train on 60000 samples
Epoch 1/3
60000/60000 [==============================] - 29s 481us/sample - loss: 0.5932 - acc: 0.7819
Epoch 2/3
60000/60000 [==============================] - 33s 542us/sample - loss: 0.4066 - acc: 0.8506
Epoch 3/3
60000/60000 [==============================] - 32s 525us/sample - loss: 0.3624 - acc: 0.8681
[7]:
<tensorflow.python.keras.callbacks.History at 0x7fae6dd5cb70>
[8]:
# Evaluate the model on test set
score = cnn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.8867
Define superpixels¶
Function to generate rectangular superpixels for a given image. Alternatively, use one of the built-in methods. It is important to have meaningful superpixels in order to generate a useful explanation. Please check scikit-image's segmentation methods (felzenszwalb, slic and quickshift are built into the explainer) for more information on the built-in methods.
[9]:
def superpixel(image, size=(4, 7)):
segments = np.zeros([image.shape[0], image.shape[1]])
row_idx, col_idx = np.where(segments == 0)
for i, j in zip(row_idx, col_idx):
segments[i, j] = int((image.shape[1]/size[1]) * (i//size[0]) + j//size[1])
return segments
[10]:
segments = superpixel(x_train[idx])
plt.imshow(segments);
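A quick sanity check of the grid arithmetic (restating the helper so the snippet is self-contained): for a 28×28 image with size=(4, 7), the function produces 7 rows of 4 rectangular blocks, i.e. 28 distinct superpixels.

```python
import numpy as np

def superpixel(image, size=(4, 7)):
    """Rectangular superpixels of height size[0] and width size[1]."""
    segments = np.zeros(image.shape[:2])
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            segments[i, j] = int((image.shape[1] / size[1]) * (i // size[0]) + j // size[1])
    return segments

seg = superpixel(np.zeros((28, 28, 1)))
print(int(seg.max()) + 1)  # 28 distinct segments
```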
Define prediction function¶
[11]:
predict_fn = lambda x: cnn.predict(x)
Initialize anchor image explainer¶
[12]:
image_shape = x_train[idx].shape
explainer = AnchorImage(predict_fn, image_shape, segmentation_fn=superpixel)
Explain a prediction¶
The explanation returns a mask with the superpixels that constitute the anchor.
Image to be explained:
[13]:
i = 1
image = x_test[i]
plt.imshow(image[:,:,0]);
Model prediction:
[14]:
cnn.predict(image.reshape(1, 28, 28, 1)).argmax()
[14]:
2
The predicted category correctly corresponds to the class Pullover
:
Label   Description
0       T-shirt/top
1       Trouser
2       Pullover
3       Dress
4       Coat
5       Sandal
6       Shirt
7       Sneaker
8       Bag
9       Ankle boot
Generate explanation:
[15]:
explanation = explainer.explain(image, threshold=.95, p_sample=.8, seed=0)
Show anchor:
[16]:
plt.imshow(explanation.anchor[:,:,0]);
From the example, it looks like the end of the sleeve alone is sufficient to predict a pullover.
This page was generated from examples/cem_mnist.ipynb.
Contrastive Explanations Method (CEM) applied to MNIST¶
The Contrastive Explanation Method (CEM) can generate black box model explanations in terms of pertinent positives (PP) and pertinent negatives (PN). For PP, it finds what should be minimally and sufficiently present (e.g. important pixels in an image) to justify the classification. PN, on the other hand, identify what should be minimally and necessarily absent from the explained instance in order to maintain the original prediction.
The original paper on which the algorithm is based can be found on arXiv.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
import tensorflow.keras as keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input, UpSampling2D
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from alibi.explainers import CEM
Load and prepare MNIST data¶
[2]:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
plt.gray()
plt.imshow(x_test[4]);
x_train shape: (60000, 28, 28) y_train shape: (60000,)
Prepare data: scale, reshape and categorize
[3]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
[4]:
xmin, xmax = -.5, .5
x_train = ((x_train - x_train.min()) / (x_train.max() - x_train.min())) * (xmax - xmin) + xmin
x_test = ((x_test - x_test.min()) / (x_test.max() - x_test.min())) * (xmax - xmin) + xmin
Define and train CNN model¶
[5]:
def cnn_model():
    x_in = Input(shape=(28, 28, 1))
    x = Conv2D(filters=64, kernel_size=2, padding='same', activation='relu')(x_in)
    x = MaxPooling2D(pool_size=2)(x)
    x = Dropout(0.3)(x)
    x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x)
    x = MaxPooling2D(pool_size=2)(x)
    x = Dropout(0.3)(x)
    x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x)
    x = MaxPooling2D(pool_size=2)(x)
    x = Dropout(0.3)(x)
    x = Flatten()(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
    x_out = Dense(10, activation='softmax')(x)
    cnn = Model(inputs=x_in, outputs=x_out)
    cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return cnn
[6]:
cnn = cnn_model()
cnn.summary()
cnn.fit(x_train, y_train, batch_size=64, epochs=5, verbose=1)
cnn.save('mnist_cnn.h5', save_format='h5')
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
conv2d (Conv2D) (None, 28, 28, 64) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 14, 14, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 14, 14, 32) 8224
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 7, 7, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 7, 7, 32) 4128
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 3, 3, 32) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 3, 3, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 288) 0
_________________________________________________________________
dense (Dense) (None, 256) 73984
_________________________________________________________________
dropout_3 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 2570
=================================================================
Total params: 89,226
Trainable params: 89,226
Non-trainable params: 0
_________________________________________________________________
Evaluate the model on test set
[7]:
cnn = load_model('mnist_cnn.h5')
score = cnn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.9867
Define and train autoencoder¶
[8]:
def ae_model():
    x_in = Input(shape=(28, 28, 1))
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x_in)
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    x = MaxPooling2D((2, 2), padding='same')(x)
    encoded = Conv2D(1, (3, 3), activation=None, padding='same')(x)
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(encoded)
    x = UpSampling2D((2, 2))(x)
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    decoded = Conv2D(1, (3, 3), activation=None, padding='same')(x)
    autoencoder = Model(x_in, decoded)
    autoencoder.compile(optimizer='adam', loss='mse')
    return autoencoder
[9]:
ae = ae_model()
ae.summary()
ae.fit(x_train, x_train, batch_size=128, epochs=4, validation_data=(x_test, x_test), verbose=0)
ae.save('mnist_ae.h5', save_format='h5')
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 28, 28, 16) 160
_________________________________________________________________
conv2d_4 (Conv2D) (None, 28, 28, 16) 2320
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 16) 0
_________________________________________________________________
conv2d_5 (Conv2D) (None, 14, 14, 1) 145
_________________________________________________________________
conv2d_6 (Conv2D) (None, 14, 14, 16) 160
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 28, 28, 16) 0
_________________________________________________________________
conv2d_7 (Conv2D) (None, 28, 28, 16) 2320
_________________________________________________________________
conv2d_8 (Conv2D) (None, 28, 28, 1) 145
=================================================================
Total params: 5,250
Trainable params: 5,250
Non-trainable params: 0
_________________________________________________________________
Compare original with decoded images
[10]:
ae = load_model('mnist_ae.h5')
decoded_imgs = ae.predict(x_test)
n = 5
plt.figure(figsize=(20, 4))
for i in range(1, n+1):
    # display original
    ax = plt.subplot(2, n, i)
    plt.imshow(x_test[i].reshape(28, 28))
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    # display reconstruction
    ax = plt.subplot(2, n, i + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28))
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
Generate contrastive explanation with pertinent negative¶
Explained instance:
[11]:
idx = 15
X = x_test[idx].reshape((1,) + x_test[idx].shape)
[12]:
plt.imshow(X.reshape(28, 28));
Model prediction:
[13]:
cnn.predict(X).argmax(), cnn.predict(X).max()
[13]:
(5, 0.9999138)
CEM parameters:
[14]:
mode = 'PN' # 'PN' (pertinent negative) or 'PP' (pertinent positive)
shape = (1,) + x_train.shape[1:] # instance shape
kappa = 0. # minimum difference needed between the prediction probability for the perturbed instance on the
# class predicted by the original instance and the max probability on the other classes
# in order for the first loss term to be minimized
beta = .1 # weight of the L1 loss term
gamma = 100 # weight of the optional autoencoder loss term
c_init = 1. # initial weight c of the loss term encouraging to predict a different class (PN) or
# the same class (PP) for the perturbed instance compared to the original instance to be explained
c_steps = 10 # nb of updates for c
max_iterations = 1000 # nb of iterations per value of c
feature_range = (x_train.min(),x_train.max()) # feature range for the perturbed instance
clip = (-1000.,1000.) # gradient clipping
lr = 1e-2 # initial learning rate
no_info_val = -1. # a value, float or feature-wise, which can be seen as containing no info to make a prediction
# perturbations towards this value mean removing features, and away from it adding features
# for our MNIST images, the background (-0.5) is the least informative,
# so positive/negative perturbations imply adding/removing features
Generate pertinent negative:
[15]:
# initialize CEM explainer and explain instance
cem = CEM(cnn, mode, shape, kappa=kappa, beta=beta, feature_range=feature_range,
gamma=gamma, ae_model=ae, max_iterations=max_iterations,
c_init=c_init, c_steps=c_steps, learning_rate_init=lr, clip=clip, no_info_val=no_info_val)
explanation = cem.explain(X)
Pertinent negative:
[16]:
print('Pertinent negative prediction: {}'.format(explanation.PN_pred))
plt.imshow(explanation.PN.reshape(28, 28));
Pertinent negative prediction: 8
Generate pertinent positive¶
[17]:
mode = 'PP'
[18]:
# initialize CEM explainer and explain instance
cem = CEM(cnn, mode, shape, kappa=kappa, beta=beta, feature_range=feature_range,
gamma=gamma, ae_model=ae, max_iterations=max_iterations,
c_init=c_init, c_steps=c_steps, learning_rate_init=lr, clip=clip, no_info_val=no_info_val)
explanation = cem.explain(X)
Pertinent positive:
[19]:
print('Pertinent positive prediction: {}'.format(explanation.PP_pred))
plt.imshow(explanation.PP.reshape(28, 28));
Pertinent positive prediction: 5
Clean up:
[ ]:
os.remove('mnist_cnn.h5')
os.remove('mnist_ae.h5')
This page was generated from examples/cem_iris.ipynb.
Contrastive Explanations Method (CEM) applied to Iris dataset¶
The Contrastive Explanation Method (CEM) can generate black box model explanations in terms of pertinent positives (PP) and pertinent negatives (PN). For PP, it finds what should be minimally and sufficiently present (e.g. important pixels in an image) to justify its classification. PN on the other hand identifies what should be minimally and necessarily absent from the explained instance in order to maintain the original prediction.
The original paper on which the algorithm is based can be found on arXiv.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from alibi.explainers import CEM
Load and prepare Iris dataset¶
[2]:
dataset = load_iris()
feature_names = dataset.feature_names
class_names = list(dataset.target_names)
Scale data
[3]:
dataset.data = (dataset.data - dataset.data.mean(axis=0)) / dataset.data.std(axis=0)
Define training and test set
[4]:
idx = 145
x_train,y_train = dataset.data[:idx,:], dataset.target[:idx]
x_test, y_test = dataset.data[idx+1:,:], dataset.target[idx+1:]
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
Define and train logistic regression model¶
[5]:
def lr_model():
    x_in = Input(shape=(4,))
    x_out = Dense(3, activation='softmax')(x_in)
    lr = Model(inputs=x_in, outputs=x_out)
    lr.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
    return lr
[6]:
lr = lr_model()
lr.summary()
lr.fit(x_train, y_train, batch_size=16, epochs=500, verbose=0)
lr.save('iris_lr.h5', save_format='h5')
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 4)] 0
_________________________________________________________________
dense (Dense) (None, 3) 15
=================================================================
Total params: 15
Trainable params: 15
Non-trainable params: 0
_________________________________________________________________
Generate contrastive explanation with pertinent negative¶
Explained instance:
[7]:
idx = 0
X = x_test[idx].reshape((1,) + x_test[idx].shape)
print('Prediction on instance to be explained: {}'.format(class_names[np.argmax(lr.predict(X))]))
print('Prediction probabilities for each class on the instance: {}'.format(lr.predict(X)))
Prediction on instance to be explained: virginica
Prediction probabilities for each class on the instance: [[2.2114810e-04 2.4995686e-01 7.4982196e-01]]
CEM parameters:
[8]:
mode = 'PN' # 'PN' (pertinent negative) or 'PP' (pertinent positive)
shape = (1,) + x_train.shape[1:] # instance shape
kappa = .2 # minimum difference needed between the prediction probability for the perturbed instance on the
# class predicted by the original instance and the max probability on the other classes
# in order for the first loss term to be minimized
beta = .1 # weight of the L1 loss term
c_init = 10. # initial weight c of the loss term encouraging to predict a different class (PN) or
# the same class (PP) for the perturbed instance compared to the original instance to be explained
c_steps = 10 # nb of updates for c
max_iterations = 1000 # nb of iterations per value of c
feature_range = (x_train.min(axis=0).reshape(shape)-.1, # feature range for the perturbed instance
                 x_train.max(axis=0).reshape(shape)+.1) # can be either a float or array of shape (1 x features)
clip = (-1000.,1000.) # gradient clipping
lr_init = 1e-2 # initial learning rate
Generate pertinent negative:
[9]:
# define model
lr = load_model('iris_lr.h5')
# initialize CEM explainer and explain instance
cem = CEM(lr, mode, shape, kappa=kappa, beta=beta, feature_range=feature_range,
max_iterations=max_iterations, c_init=c_init, c_steps=c_steps,
learning_rate_init=lr_init, clip=clip)
# we need to define what feature values contain the least info wrt predictions;
# here we will naively assume that the feature-wise median contains no info.
# Domain knowledge helps!
cem.fit(x_train, no_info_type='median')
explanation = cem.explain(X, verbose=False)
[10]:
print('Original instance: {}'.format(explanation.X))
print('Predicted class: {}'.format(class_names[explanation.X_pred]))
Original instance: [[ 0.55333328 -1.28296331  0.70592084  0.92230284]]
Predicted class: virginica
[11]:
print('Pertinent negative: {}'.format(explanation.PN))
print('Predicted class: {}'.format(class_names[explanation.PN_pred]))
Pertinent negative: [[ 0.5533333 -1.2829633  0.33608788  0.92230284]]
Predicted class: versicolor
Store explanation to plot later on:
[12]:
expl = {}
expl['PN'] = explanation.PN
expl['PN_pred'] = explanation.PN_pred
Generate pertinent positive¶
[13]:
mode = 'PP'
Generate pertinent positive:
[14]:
# define model
lr = load_model('iris_lr.h5')
# initialize CEM explainer and explain instance
cem = CEM(lr, mode, shape, kappa=kappa, beta=beta, feature_range=feature_range,
max_iterations=max_iterations, c_init=c_init, c_steps=c_steps,
learning_rate_init=lr_init, clip=clip)
cem.fit(x_train, no_info_type='median')
explanation = cem.explain(X, verbose=False)
[15]:
print('Pertinent positive: {}'.format(explanation.PP))
print('Predicted class: {}'.format(class_names[explanation.PP_pred]))
Pertinent positive: [[7.44469730e-09 3.47054341e-08 4.19345584e-01 8.38366147e-01]]
Predicted class: virginica
[16]:
expl['PP'] = explanation.PP
expl['PP_pred'] = explanation.PP_pred
Visualize PN and PP¶
Let’s visualize the generated explanations to check if the perturbed instances make sense.
Create dataframe from standardized data:
[17]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['species'] = np.array([dataset.target_names[i] for i in dataset.target])
Highlight explained instance and add pertinent negative and positive to the dataset:
[18]:
pn = pd.DataFrame(expl['PN'], columns=dataset.feature_names)
pn['species'] = 'PN_' + class_names[expl['PN_pred']]
pp = pd.DataFrame(expl['PP'], columns=dataset.feature_names)
pp['species'] = 'PP_' + class_names[expl['PP_pred']]
orig_inst = pd.DataFrame(explanation.X, columns=dataset.feature_names)
orig_inst['species'] = 'orig_' + class_names[explanation.X_pred]
df = df.append([pn, pp, orig_inst], ignore_index=True)
Pair plots between the features show that the pertinent negative is pushed from the original instance (virginica) into the versicolor distribution, while the pertinent positive keeps only the petal features needed to remain classified as virginica.
[19]:
fig = sns.pairplot(df, hue='species', diag_kind='hist');
Use numerical gradients in CEM¶
If we do not have access to the Keras or TensorFlow model weights, we can use numerical gradients for the first term in the loss function that needs to be minimized (eq. 1 and 4 in the paper).
CEM parameters:
[20]:
mode = 'PN'
If numerical gradients are used to compute:
\(\frac{\partial L}{\partial x} = \frac{\partial L}{\partial p} \cdot \frac{\partial p}{\partial x}\)
with \(L\) the loss function, \(p\) the predict function and \(x\) the parameter to optimize, then the tuple eps can be used to define the perturbation used to compute the derivatives. eps[0] is used to calculate the first partial derivative term and eps[1] is used for the second term. eps[0] and eps[1] can be a combination of float values or numpy arrays. For eps[0], the array dimension should be (1 x nb of prediction categories) and for eps[1] it should be (1 x nb of features).
[21]:
eps0 = np.array([[1e-2, 1e-2, 1e-2]]) # 3 prediction categories, equivalent to 1e-2
eps1 = np.array([[1e-2, 1e-2, 1e-2, 1e-2]]) # 4 features, also equivalent to 1e-2
eps = (eps0, eps1)
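To illustrate how such a perturbation is used, here is a minimal sketch of central finite differences for the \(\frac{\partial p}{\partial x}\) factor; num_grad_p_x and its arguments are hypothetical names for illustration, not the alibi internals.

```python
import numpy as np

def num_grad_p_x(predict_fn, x, eps1):
    """dp/dx with shape (nb of classes, nb of features), using per-feature steps eps1."""
    n_feat = x.shape[0]
    p0 = predict_fn(x.reshape(1, -1))[0]
    grad = np.zeros((p0.shape[0], n_feat))
    for j in range(n_feat):
        e = np.zeros(n_feat)
        e[j] = eps1[0, j]
        p_plus = predict_fn((x + e).reshape(1, -1))[0]
        p_minus = predict_fn((x - e).reshape(1, -1))[0]
        grad[:, j] = (p_plus - p_minus) / (2 * eps1[0, j])  # central difference
    return grad

# for a linear "model" the numerical gradient recovers the weights exactly
W = np.array([[1., 2., 0., -1.], [0., 1., 1., 0.], [2., 0., 0., 1.]])
g = num_grad_p_x(lambda X: X @ W.T, np.array([0.5, -1., 0.3, 2.]), np.full((1, 4), 1e-2))
print(np.allclose(g, W))  # True
```

Each column of the result costs two prediction calls, which is why numerical gradients become expensive in high-dimensional feature spaces.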
For complex models with a high number of parameters and a high dimensional feature space (e.g. Inception on ImageNet), evaluating numerical gradients can be expensive as they involve multiple prediction calls for each perturbed instance. The update_num_grad parameter specifies after how many iterations the numerical gradients are updated; setting it higher than 1 drastically reduces the number of prediction calls required.
[22]:
update_num_grad = 1
Generate pertinent negative:
[23]:
# define model
lr = load_model('iris_lr.h5')
predict_fn = lambda x: lr.predict(x) # only pass the predict fn which takes numpy arrays to CEM
# explainer can no longer minimize wrt model weights
# initialize CEM explainer and explain instance
cem = CEM(predict_fn, mode, shape, kappa=kappa, beta=beta,
feature_range=feature_range, max_iterations=max_iterations,
eps=eps, c_init=c_init, c_steps=c_steps, learning_rate_init=lr_init,
clip=clip, update_num_grad=update_num_grad)
cem.fit(x_train, no_info_type='median')
explanation = cem.explain(X, verbose=False)
[24]:
print('Original instance: {}'.format(explanation.X))
print('Predicted class: {}'.format(class_names[explanation.X_pred]))
Original instance: [[ 0.55333328 -1.28296331  0.70592084  0.92230284]]
Predicted class: virginica
[25]:
print('Pertinent negative: {}'.format(explanation.PN))
print('Predicted class: {}'.format(class_names[explanation.PN_pred]))
Clean up:
[26]:
os.remove('iris_lr.h5')
This page was generated from examples/cf_mnist.ipynb.
Counterfactual instances on MNIST¶
Given a test instance \(X\), this method can generate counterfactual instances \(X^\prime\) given a desired counterfactual class \(t\) which can either be a class specified upfront or any other class that is different from the predicted class of \(X\).
The loss function for finding counterfactuals is the following:
\(L(X^\prime|X) = (f_t(X^\prime) - p_t)^2 + \lambda L_1(X^\prime, X)\)
The first loss term guides the search towards instances \(X^\prime\) for which the predicted class probability \(f_t(X^\prime)\) is close to a pre-specified target class probability \(p_t\) (typically \(p_t=1\)). The second loss term ensures that the counterfactuals are close in the feature space to the original test instance.
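The two loss terms described above can be sketched numerically. This is a toy illustration, not the alibi implementation; predict_proba, t, p_t and lam are hypothetical names mirroring the quantities in the text.

```python
import numpy as np

def cf_loss(X_prime, X, predict_proba, t, p_t=1.0, lam=0.1):
    """Counterfactual loss sketch: prediction term plus lambda-weighted L1 distance."""
    pred_term = (predict_proba(X_prime)[t] - p_t) ** 2   # (f_t(X') - p_t)^2
    dist_term = lam * np.abs(X_prime - X).sum()          # lambda * L1(X', X)
    return pred_term + dist_term

# toy predictor returning fixed class probabilities, for illustration only
probs = np.array([0.2, 0.8])
loss = cf_loss(np.ones(3), np.zeros(3), lambda x: probs, t=1, p_t=1.0, lam=0.1)
print(round(loss, 4))  # (0.8 - 1)^2 + 0.1 * 3 = 0.34
```

In the actual search, \(\lambda\) is tuned iteratively so that the counterfactual both flips the prediction and stays close to the original instance.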
In this notebook we illustrate the usage of the basic counterfactual algorithm on the MNIST dataset.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from time import time
from alibi.explainers import CounterFactual
Load and prepare MNIST data¶
[2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
plt.gray()
plt.imshow(x_test[1]);
x_train shape: (60000, 28, 28) y_train shape: (60000,)
Prepare data: scale, reshape and categorize
[3]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
[4]:
xmin, xmax = -.5, .5
x_train = ((x_train - x_train.min()) / (x_train.max() - x_train.min())) * (xmax - xmin) + xmin
x_test = ((x_test - x_test.min()) / (x_test.max() - x_test.min())) * (xmax - xmin) + xmin
Define and train CNN model¶
[5]:
def cnn_model():
    x_in = Input(shape=(28, 28, 1))
    x = Conv2D(filters=64, kernel_size=2, padding='same', activation='relu')(x_in)
    x = MaxPooling2D(pool_size=2)(x)
    x = Dropout(0.3)(x)
    x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x)
    x = MaxPooling2D(pool_size=2)(x)
    x = Dropout(0.3)(x)
    x = Flatten()(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
    x_out = Dense(10, activation='softmax')(x)
    cnn = Model(inputs=x_in, outputs=x_out)
    cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return cnn
[6]:
cnn = cnn_model()
cnn.summary()
cnn.fit(x_train, y_train, batch_size=64, epochs=3, verbose=0)
cnn.save('mnist_cnn.h5')
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 28, 28, 1)] 0
_________________________________________________________________
conv2d (Conv2D) (None, 28, 28, 64) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 14, 14, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 14, 14, 32) 8224
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 7, 7, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 1568) 0
_________________________________________________________________
dense (Dense) (None, 256) 401664
_________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 2570
=================================================================
Total params: 412,778
Trainable params: 412,778
Non-trainable params: 0
_________________________________________________________________
Evaluate the model on test set
[7]:
cnn = load_model('mnist_cnn.h5')
score = cnn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.9887
Generate counterfactuals¶
Original instance:
[8]:
X = x_test[0].reshape((1,) + x_test[0].shape)
plt.imshow(X.reshape(28, 28));
Counterfactual parameters:
[9]:
shape = (1,) + x_train.shape[1:]
target_proba = 1.0
tol = 0.01 # want counterfactuals with p(class)>0.99
target_class = 'other' # any class other than 7 will do
max_iter = 1000
lam_init = 1e-1
max_lam_steps = 10
learning_rate_init = 0.1
feature_range = (x_train.min(),x_train.max())
Run counterfactual:
[10]:
# initialize explainer
cf = CounterFactual(cnn, shape=shape, target_proba=target_proba, tol=tol,
target_class=target_class, max_iter=max_iter, lam_init=lam_init,
max_lam_steps=max_lam_steps, learning_rate_init=learning_rate_init,
feature_range=feature_range)
start_time = time()
explanation = cf.explain(X)
print('Explanation took {:.3f} sec'.format(time() - start_time))
Explanation took 8.739 sec
Results:
[11]:
pred_class = explanation.cf['class']
proba = explanation.cf['proba'][0][pred_class]
print(f'Counterfactual prediction: {pred_class} with probability {proba}')
plt.imshow(explanation.cf['X'].reshape(28, 28));
Counterfactual prediction: 9 with probability 0.9919260144233704
The counterfactual starting from a 7 moves towards the closest class as determined by the model and the data, in this case a 9. The evolution of the counterfactual during the iterations over \(\lambda\) can be seen below (note that all of the following examples satisfy the counterfactual condition):
[12]:
n_cfs = np.array([len(explanation.all[iter_cf]) for iter_cf in range(max_lam_steps)])
examples = {}
for ix, n in enumerate(n_cfs):
    if n > 0:
        examples[ix] = {'ix': ix, 'lambda': explanation.all[ix][0]['lambda'],
                        'X': explanation.all[ix][0]['X']}
columns = len(examples) + 1
rows = 1
fig = plt.figure(figsize=(16, 6))
for i, key in enumerate(examples.keys()):
    ax = plt.subplot(rows, columns, i+1)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    plt.imshow(examples[key]['X'].reshape(28, 28))
    plt.title(f'Iteration: {key}')
Typically, the first few iterations find counterfactuals that are out of distribution, while the later iterations make the counterfactual more sparse and interpretable.
Let’s now try to steer the counterfactual to a specific class:
[13]:
target_class = 1
cf = CounterFactual(cnn, shape=shape, target_proba=target_proba, tol=tol,
target_class=target_class, max_iter=max_iter, lam_init=lam_init,
max_lam_steps=max_lam_steps, learning_rate_init=learning_rate_init,
feature_range=feature_range)
start_time = time()
explanation = cf.explain(X)
print('Explanation took {:.3f} sec'.format(time() - start_time))
Explanation took 6.558 sec
Results:
[14]:
pred_class = explanation.cf['class']
proba = explanation.cf['proba'][0][pred_class]
print(f'Counterfactual prediction: {pred_class} with probability {proba}')
plt.imshow(explanation.cf['X'].reshape(28, 28));
Counterfactual prediction: 1 with probability 0.998101532459259
As you can see, by specifying a class, the search process can’t go towards the closest class to the test instance (in this case a 9 as we saw previously), so the resulting counterfactual might be less interpretable. We can gain more insight by looking at the difference between the counterfactual and the original instance:
[15]:
plt.imshow((explanation.cf['X'] - X).reshape(28, 28));
This shows that the counterfactual is stripping out the top part of the 7 to result in a prediction of 1, which is not very surprising as the dataset has a lot of examples of diagonally slanted 1’s.
Clean up:
[16]:
os.remove('mnist_cnn.h5')
This page was generated from examples/cfproto_mnist.ipynb.
Counterfactuals guided by prototypes on MNIST¶
This method is described in the Interpretable Counterfactual Explanations Guided by Prototypes paper and can generate counterfactual instances guided by class prototypes. It means that for a certain instance X, the method builds a prototype for each prediction class using either an autoencoder or k-d trees. The nearest prototype class other than the originally predicted class is then used to guide the counterfactual search. For example, in MNIST the closest class to a 7 could be a 9. As a result, the prototype loss term will try to minimize the distance between the proposed counterfactual and the prototype of a 9. This speeds up the search towards a satisfactory counterfactual by steering it towards an interpretable solution from the start of the optimization. It also helps to avoid out-of-distribution counterfactuals, with the perturbations driven to a prototype of another class.
The loss function to be optimized is the following:
\(Loss = c \cdot L_{pred} + \beta \cdot L_{1} + L_{2} + L_{AE} + L_{proto}\)
The first loss term relates to the model’s prediction function, the following 2 terms define the elastic net regularization while the last 2 terms are optional. The aim of \(L_{AE}\) is to penalize out-of-distribution counterfactuals while \(L_{proto}\) guides the counterfactual to a prototype. When we only have access to the model’s prediction function and cannot fully enjoy the benefits of automatic differentiation, the prototypes allow us to drop the prediction function loss term \(L_{pred}\) and still generate high quality counterfactuals. This drastically reduces the number of prediction calls made during the numerical gradient update step and again speeds up the search.
Other options include generating counterfactuals for specific classes or including trust score constraints to ensure that the counterfactual is close enough to the newly predicted class compared to the original class. Different use cases are illustrated throughout this notebook.
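The prototype construction described above can be sketched in a few lines. This is a minimal illustration assuming an encoder function enc_fn mapping instances to a latent space; the helper names are hypothetical and not the CounterFactualProto API (which handles this in its fit step, or via k-d trees).

```python
import numpy as np

def class_prototypes(enc_fn, X, y):
    """Mean encoding per class label: {label: prototype vector}."""
    Z = enc_fn(X)
    return {c: Z[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_other_prototype(protos, z, original_class):
    """Closest prototype (L2 distance) among classes other than the original one."""
    others = {c: p for c, p in protos.items() if c != original_class}
    return min(others, key=lambda c: np.linalg.norm(z - others[c]))

# toy 2D "latent space" with three well-separated classes and an identity encoder
X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.], [0., 10.], [1., 10.]])
y = np.array([0, 0, 1, 1, 2, 2])
protos = class_prototypes(lambda x: x, X, y)
print(nearest_other_prototype(protos, np.array([8., 1.]), original_class=0))  # 1
```

The prototype selected this way is what the \(L_{proto}\) term pulls the counterfactual towards.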
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D, Input, UpSampling2D
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from time import time
from alibi.explainers import CounterFactualProto
Load and prepare MNIST data¶
[2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print('x_train shape:', x_train.shape, 'y_train shape:', y_train.shape)
plt.gray()
plt.imshow(x_test[1]);
x_train shape: (60000, 28, 28) y_train shape: (60000,)
Prepare data: scale, reshape and categorize
[3]:
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = np.reshape(x_train, x_train.shape + (1,))
x_test = np.reshape(x_test, x_test.shape + (1,))
print('x_train shape:', x_train.shape, 'x_test shape:', x_test.shape)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print('y_train shape:', y_train.shape, 'y_test shape:', y_test.shape)
x_train shape: (60000, 28, 28, 1) x_test shape: (10000, 28, 28, 1)
y_train shape: (60000, 10) y_test shape: (10000, 10)
[4]:
xmin, xmax = -.5, .5
x_train = ((x_train - x_train.min()) / (x_train.max() - x_train.min())) * (xmax - xmin) + xmin
x_test = ((x_test - x_test.min()) / (x_test.max() - x_test.min())) * (xmax - xmin) + xmin
Define and train CNN model¶
[5]:
def cnn_model():
    x_in = Input(shape=(28, 28, 1))
    x = Conv2D(filters=32, kernel_size=2, padding='same', activation='relu')(x_in)
    x = MaxPooling2D(pool_size=2)(x)
    x = Dropout(0.3)(x)
    x = Conv2D(filters=64, kernel_size=2, padding='same', activation='relu')(x)
    x = MaxPooling2D(pool_size=2)(x)
    x = Dropout(0.3)(x)
    x = Flatten()(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
    x_out = Dense(10, activation='softmax')(x)
    cnn = Model(inputs=x_in, outputs=x_out)
    cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return cnn
[6]:
cnn = cnn_model()
cnn.fit(x_train, y_train, batch_size=32, epochs=3, verbose=0)
cnn.save('mnist_cnn.h5', save_format='h5')
Evaluate the model on test set
[7]:
cnn = load_model('mnist_cnn.h5')
score = cnn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.9887
Define and train autoencoder¶
[8]:
def ae_model():
    # encoder
    x_in = Input(shape=(28, 28, 1))
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x_in)
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    x = MaxPooling2D((2, 2), padding='same')(x)
    encoded = Conv2D(1, (3, 3), activation=None, padding='same')(x)
    encoder = Model(x_in, encoded)
    # decoder
    dec_in = Input(shape=(14, 14, 1))
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(dec_in)
    x = UpSampling2D((2, 2))(x)
    x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
    decoded = Conv2D(1, (3, 3), activation=None, padding='same')(x)
    decoder = Model(dec_in, decoded)
    # autoencoder = encoder + decoder
    x_out = decoder(encoder(x_in))
    autoencoder = Model(x_in, x_out)
    autoencoder.compile(optimizer='adam', loss='mse')
    return autoencoder, encoder, decoder
[9]:
ae, enc, dec = ae_model()
ae.fit(x_train, x_train, batch_size=128, epochs=4, validation_data=(x_test, x_test), verbose=0)
ae.save('mnist_ae.h5', save_format='h5')
enc.save('mnist_enc.h5', save_format='h5')
Compare original with decoded images
[10]:
ae = load_model('mnist_ae.h5')
enc = load_model('mnist_enc.h5', compile=False)
decoded_imgs = ae.predict(x_test)
n = 5
plt.figure(figsize=(20, 4))
for i in range(1, n+1):
# display original
ax = plt.subplot(2, n, i)
plt.imshow(x_test[i].reshape(28, 28))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
# display reconstruction
ax = plt.subplot(2, n, i + n)
plt.imshow(decoded_imgs[i].reshape(28, 28))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
Generate counterfactual guided by the nearest class prototype¶
Original instance:
[11]:
X = x_test[0].reshape((1,) + x_test[0].shape)
plt.imshow(X.reshape(28, 28));
Counterfactual parameters:
[12]:
shape = (1,) + x_train.shape[1:]
gamma = 100.
theta = 100.
c_init = 1.
c_steps = 2
max_iterations = 1000
feature_range = (x_train.min(), x_train.max())
Run counterfactual:
[25]:
# initialize explainer, fit and generate counterfactual
cf = CounterFactualProto(cnn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
start_time = time()
cf.fit(x_train) # find class prototypes
print('Time to find prototypes for each class: {:.3f} sec'.format(time() - start_time))
start_time = time()
explanation = cf.explain(X)
print('Explanation took {:.3f} sec'.format(time() - start_time))
Time to find prototypes for each class: 10.619 sec
Explanation took 8.659 sec
Results:
[26]:
print('Counterfactual prediction: {}'.format(explanation.cf['class']))
print('Closest prototype class: {}'.format(explanation.id_proto))
plt.imshow(explanation.cf['X'].reshape(28, 28));
Counterfactual prediction: 9
Closest prototype class: 9
The counterfactual starting from a 7 moves towards its closest prototype class: a 9. The evolution of the counterfactual during the first iteration can be seen below:
[28]:
iter_cf = 0
print('iteration c {}'.format(iter_cf))
n = len(explanation['all'][iter_cf])
plt.figure(figsize=(20, 4))
for i in range(n):
ax = plt.subplot(1, n+1, i+1)
plt.imshow(explanation['all'][iter_cf][i].reshape(28, 28))
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
plt.show()
iteration c 0
Typically, the first few iterations already steer the 7 towards a 9, while the later iterations make the counterfactual more sparse.
Prototypes defined by the \(k\) nearest encoded instances¶
In the above example, the class prototypes are defined by the average encoding of all instances belonging to the specific class. Instead, we can also select only the \(k\) nearest encoded instances of a class to the encoded instance to be explained and use the average over those \(k\) encodings as the prototype.
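A minimal numpy sketch of this prototype definition, using toy encodings (the helper `class_prototype` is ours, written for illustration):

```python
import numpy as np

def class_prototype(enc_x, enc_class, k=5):
    """Prototype = mean of the k encoded class instances nearest to enc_x."""
    dists = np.linalg.norm(enc_class - enc_x, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    return enc_class[nearest].mean(axis=0)

rng = np.random.default_rng(0)
enc_class = rng.normal(size=(100, 2))   # toy encodings of one class
enc_x = np.array([10., 10.])            # encoded instance to explain
proto = class_prototype(enc_x, enc_class, k=3)
```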
[29]:
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(cnn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
explanation_k1 = cf.explain(X, k=1, k_type='mean')
explanation_k20 = cf.explain(X, k=20, k_type='mean')
Results for \(k\) equals 1:
[30]:
print('Counterfactual prediction: {}'.format(explanation_k1.cf['class']))
print('Closest prototype class: {}'.format(explanation.id_proto))
plt.imshow(explanation_k1.cf['X'].reshape(28, 28));
Counterfactual prediction: 9
Closest prototype class: 9
Results for \(k\) equals 20:
[31]:
print('Counterfactual prediction: {}'.format(explanation_k20.cf['class']))
print('Closest prototype class: {}'.format(explanation.id_proto))
plt.imshow(explanation_k20.cf['X'].reshape(28, 28));
Counterfactual prediction: 9
Closest prototype class: 9
A lower value of \(k\) typically leads to counterfactuals that look more like the original instance and less like an average instance of the counterfactual class.
Remove the autoencoder loss term \(L_{AE}\)¶
In the previous example, we used both an autoencoder loss term to penalize a counterfactual which falls outside of the training data distribution as well as an encoder loss term to guide the counterfactual to the nearest prototype class. In the next example we get rid of the autoencoder loss term to speed up the counterfactual search and still generate decent counterfactuals:
[32]:
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(cnn, shape, gamma=gamma, theta=theta,
enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
start_time = time()
explanation = cf.explain(X, k=1)
print('Explanation took {:.3f} sec'.format(time() - start_time))
Explanation took 6.284 sec
Results:
[33]:
print('Counterfactual prediction: {}'.format(explanation.cf['class']))
print('Closest prototype class: {}'.format(explanation.id_proto))
plt.imshow(explanation.cf['X'].reshape(28, 28));
Counterfactual prediction: 9
Closest prototype class: 9
Specify prototype classes¶
For multiclass predictions, we might be interested in generating counterfactuals for certain classes while avoiding others. The following example illustrates how to do this:
[34]:
X = x_test[12].reshape((1,) + x_test[1].shape)
plt.imshow(X.reshape(28, 28));
[35]:
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(cnn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
explanation_1 = cf.explain(X, k=5, k_type='mean')
proto_1 = explanation_1.id_proto
explanation_2 = cf.explain(X, k=5, k_type='mean', target_class=[7])
proto_2 = explanation_2.id_proto
The closest class to the 9 is 4. This is evident by looking at the first counterfactual below. For the second counterfactual, we specified that the prototype class used in the search should be a 7. As a result, a counterfactual 7 instead of a 4 is generated.
[36]:
print('Counterfactual prediction: {}'.format(explanation_1.cf['class']))
print('Closest prototype class: {}'.format(proto_1))
plt.imshow(explanation_1.cf['X'].reshape(28, 28));
Counterfactual prediction: 4
Closest prototype class: 4
[37]:
print('Counterfactual prediction: {}'.format(explanation_2.cf['class']))
print('Closest prototype class: {}'.format(proto_2))
plt.imshow(explanation_2.cf['X'].reshape(28, 28));
Counterfactual prediction: 7
Closest prototype class: 7
Speed up the counterfactual search by removing the predict function loss term¶
We can also remove the prediction loss term and still obtain an interpretable counterfactual. This is especially relevant for fully black box models. When we provide the counterfactual search method with a Keras or TensorFlow model, it is incorporated in the TensorFlow graph and evaluated using automatic differentiation. However, if we only have access to the model’s prediction function, the gradient updates are numerical and typically require a large number of prediction calls because of the prediction loss term \(L_{pred}\). These prediction calls can slow the search down significantly and become a bottleneck. We can represent the gradient of the loss term as follows:
\(\frac{\partial L_{pred}}{\partial x} = \frac{\partial L_{pred}}{\partial p} \frac{\partial p}{\partial x}\), where \(L_{pred}\) is the prediction loss term, \(p\) the prediction function and \(x\) the input features to optimize. For a 28 by 28 MNIST image, the \(\frac{\partial p}{\partial x}\) term alone would require a prediction call with batch size 28x28x2 = 1568. By using the prototypes to guide the search however, we can remove the prediction loss term and only make a single prediction at the end of each gradient update to check whether the predicted class on the proposed counterfactual is different from the original class. We do not necessarily need a Keras or TensorFlow autoencoder either and can use kd trees to find the nearest class prototypes. Please check out this notebook for a practical example.
The first example below removes \(L_{pred}\) from the loss function to bypass the bottleneck. It illustrates the drastic speed improvements over the black box alternative with numerical gradient evaluation while still producing interpretable counterfactual instances.
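To see where the batch size of 2x28x28 comes from, consider a central-difference gradient estimate: each input feature is perturbed up and down, so a single gradient evaluation needs one prediction call on 2*D instances. The sketch below uses a toy `predict_fn`; alibi's internal numerical-gradient code may differ in detail:

```python
import numpy as np

def num_grad_batch(predict_fn, x, eps=1e-3):
    """Central-difference gradient of predict_fn w.r.t. x, evaluated with a
    single batched call on 2*D perturbed copies, D = number of features."""
    d = x.size
    pert = eps * np.eye(d)
    batch = np.concatenate([x.reshape(1, -1) + pert,
                            x.reshape(1, -1) - pert])  # shape (2*D, D)
    preds = predict_fn(batch)                          # the expensive step
    return (preds[:d] - preds[d:]) / (2 * eps)

predict_fn = lambda x: (x ** 2).sum(axis=1)  # toy stand-in for a model
x = np.zeros(28 * 28)                        # flattened 28x28 image
print(num_grad_batch(predict_fn, x).shape)   # 1568 predictions for one gradient
```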
[38]:
plt.gray()
X = x_test[23].reshape(1, 28, 28, 1)
plt.imshow(X.reshape(28, 28));
[39]:
c_init = 0. # weight on prediction loss term set to 0
c_steps = 1 # no need to find optimal values for c
[40]:
# define a blackbox model
predict_fn = lambda x: cnn.predict(x)
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(predict_fn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
start_time = time()
explanation = cf.explain(X, k=1)
print('Explanation took {:.3f} sec'.format(time() - start_time))
Explanation took 6.271 sec
[41]:
print('Counterfactual prediction: {}'.format(explanation.cf['class']))
print('Closest prototype class: {}'.format(explanation.id_proto))
plt.imshow(explanation.cf['X'].reshape(28, 28));
Counterfactual prediction: 6
Closest prototype class: 6
Let us now add the \(L_{pred}\) loss term back into the objective function and observe how long it takes to generate a black box counterfactual:
[42]:
c_init = 1.
c_steps = 2
[43]:
# define a blackbox model
predict_fn = lambda x: cnn.predict(x)
# initialize explainer, fit and generate counterfactuals
cf = CounterFactualProto(predict_fn, shape, gamma=gamma, theta=theta,
ae_model=ae, enc_model=enc, max_iterations=max_iterations,
feature_range=feature_range, c_init=c_init, c_steps=c_steps)
cf.fit(x_train)
start_time = time()
explanation = cf.explain(X, k=1)
print('Explanation took {:.3f} sec'.format(time() - start_time))
Explanation took 545.836 sec
[44]:
print('Counterfactual prediction: {}'.format(explanation.cf['class']))
print('Closest prototype class: {}'.format(explanation.id_proto))
plt.imshow(explanation.cf['X'].reshape(28, 28));
Counterfactual prediction: 6
Closest prototype class: 6
Clean up:
[32]:
os.remove('mnist_cnn.h5')
os.remove('mnist_ae.h5')
os.remove('mnist_enc.h5')
This page was generated from examples/cfproto_housing.ipynb.
Counterfactuals guided by prototypes on Boston housing dataset¶
This notebook goes through an example of prototypical counterfactuals using kd trees to build the prototypes. Please check out this notebook for a more in-depth application of the method on MNIST using (auto)encoders and trust scores.
In this example, we will train a simple neural net to predict whether house prices in the Boston area are above the median value or not. We can then find a counterfactual to see which variables need to be changed to increase or decrease a house price above or below the median value.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import to_categorical
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn.datasets import load_boston
from alibi.explainers import CounterFactualProto
Load and prepare Boston housing dataset¶
[2]:
boston = load_boston()
data = boston.data
target = boston.target
feature_names = boston.feature_names
Transform into classification task: target becomes whether house price is above the overall median or not
[3]:
y = np.zeros((target.shape[0],))
y[np.where(target > np.median(target))[0]] = 1
Remove categorical feature
[4]:
data = np.delete(data, 3, 1)
feature_names = np.delete(feature_names, 3)
Explanation of remaining features:
CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
Standardize data
[5]:
mu = data.mean(axis=0)
sigma = data.std(axis=0)
data = (data - mu) / sigma
Define train and test set
[6]:
idx = 475
x_train,y_train = data[:idx,:], y[:idx]
x_test, y_test = data[idx:,:], y[idx:]
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
Train model¶
[7]:
np.random.seed(0)
tf.set_random_seed(0)
[8]:
def nn_model():
x_in = Input(shape=(12,))
x = Dense(40, activation='relu')(x_in)
x = Dense(40, activation='relu')(x)
x_out = Dense(2, activation='softmax')(x)
nn = Model(inputs=x_in, outputs=x_out)
nn.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
return nn
[9]:
nn = nn_model()
nn.summary()
nn.fit(x_train, y_train, batch_size=64, epochs=500, verbose=0)
nn.save('nn_boston.h5', save_format='h5')
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 12)] 0
_________________________________________________________________
dense (Dense) (None, 40) 520
_________________________________________________________________
dense_1 (Dense) (None, 40) 1640
_________________________________________________________________
dense_2 (Dense) (None, 2) 82
=================================================================
Total params: 2,242
Trainable params: 2,242
Nontrainable params: 0
_________________________________________________________________
[10]:
nn = load_model('nn_boston.h5')
score = nn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy: ', score[1])
Test accuracy: 0.83870965
Generate counterfactual guided by the nearest class prototype¶
Original instance:
[11]:
X = x_test[1].reshape((1,) + x_test[1].shape)
shape = X.shape
Run counterfactual:
[12]:
# define model
nn = load_model('nn_boston.h5')
# initialize explainer, fit and generate counterfactual
cf = CounterFactualProto(nn, shape, use_kdtree=True, theta=10., max_iterations=1000,
feature_range=(x_train.min(axis=0), x_train.max(axis=0)),
c_init=1., c_steps=10)
cf.fit(x_train)
explanation = cf.explain(X)
No encoder specified. Using kd trees to represent class prototypes.
The prediction flipped from 0 (value below the median) to 1 (above the median):
[13]:
print('Original prediction: {}'.format(explanation.orig_class))
print('Counterfactual prediction: {}'.format(explanation.cf['class']))
Original prediction: 0
Counterfactual prediction: 1
Let’s take a look at the counterfactual. To make the results more interpretable, we will first undo the preprocessing step and then check where the counterfactual differs from the original instance:
[14]:
orig = X * sigma + mu
counterfactual = explanation.cf['X'] * sigma + mu
delta = counterfactual - orig
for i, f in enumerate(feature_names):
if np.abs(delta[0][i]) > 1e-4:
print('{}: {}'.format(f, delta[0][i]))
AGE: -11.460830148972562
LSTAT: -5.1282056172858645
So in order to increase the house price, the proportion of owner-occupied units built prior to 1940 should decrease by ~11-12%. This is not surprising since the proportion for the observation is very high at 93.6%. Furthermore, the % of the population with “lower status” should decrease by ~5%.
[15]:
print('% owner-occupied units built prior to 1940: {}'.format(orig[0][5]))
print('% lower status of the population: {}'.format(orig[0][11]))
% owner-occupied units built prior to 1940: 93.6
% lower status of the population: 18.68
Clean up:
[16]:
os.remove('nn_boston.h5')
This page was generated from examples/cfproto_cat_adult_ohe.ipynb.
Counterfactual explanations with one-hot encoded categorical variables¶
Real world machine learning applications often handle data with categorical variables. Explanation methods which rely on perturbations of the input features need to make sure those perturbations are meaningful and capture the underlying structure of the data. This becomes tricky for categorical features. For instance random perturbations across possible categories or enforcing a ranking between categories based on frequency of occurrence in the training data do not capture this structure. Our method captures the relation between categories of a variable numerically through the context given by the other features in the data and/or the predictions made by the model. First it captures the pairwise distances between categories and then applies multidimensional scaling. More details about the method can be found in the documentation. The example notebook illustrates this approach on the adult dataset, which contains a mixture of categorical and numerical features used to predict whether a person’s income is above or below $50k.
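The two-step idea described above (pairwise distances between categories, then multidimensional scaling to map each category to a numerical value) can be sketched with scikit-learn. The toy distance matrix below is made up for illustration; in alibi these distances come from the ABDM/MVDM measures:

```python
import numpy as np
from sklearn.manifold import MDS

# Toy pairwise distance matrix between 4 categories of one variable.
d_pair = np.array([[0., 1., 2., 3.],
                   [1., 0., 1., 2.],
                   [2., 1., 0., 1.],
                   [3., 2., 1., 0.]])

# Embed the categories on a 1D numerical axis that respects the distances.
mds = MDS(n_components=1, dissimilarity='precomputed', random_state=0)
num_values = mds.fit_transform(d_pair)  # one numerical value per category
```

The resulting values can then be perturbed like ordinary numerical features during the counterfactual search.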
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn.preprocessing import OneHotEncoder
from time import time
from alibi.datasets import fetch_adult
from alibi.explainers import CounterFactualProto
from alibi.utils.mapping import ohe_to_ord, ord_to_ohe
Load adult dataset¶
The fetch_adult
function returns a Bunch
object containing the features, the targets, the feature names and a mapping of the categories in each categorical variable.
[2]:
adult = fetch_adult()
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map_tmp = adult.category_map
target_names = adult.target_names
Define shuffled training and test set:
[3]:
def set_seed(s=0):
np.random.seed(s)
tf.set_random_seed(s)
[4]:
set_seed()
data_perm = np.random.permutation(np.c_[data, target])
X = data_perm[:,:-1]
y = data_perm[:,-1]
[5]:
idx = 30000
y_train, y_test = y[:idx], y[idx+1:]
Reorganize data so categorical features come first:
[6]:
X = np.c_[X[:, 1:8], X[:, 11], X[:, 0], X[:, 8:11]]
Adjust feature_names
and category_map
as well:
[7]:
feature_names = feature_names[1:8] + feature_names[11:12] + feature_names[0:1] + feature_names[8:11]
print(feature_names)
['Workclass', 'Education', 'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Country', 'Age', 'Capital Gain', 'Capital Loss', 'Hours per week']
[8]:
category_map = {}
for i, (_, v) in enumerate(category_map_tmp.items()):
category_map[i] = v
Create a dictionary with as keys the categorical columns and values the number of categories for each variable in the dataset:
[9]:
cat_vars_ord = {}
n_categories = len(list(category_map.keys()))
for i in range(n_categories):
cat_vars_ord[i] = len(np.unique(X[:, i]))
print(cat_vars_ord)
{0: 9, 1: 7, 2: 4, 3: 9, 4: 6, 5: 5, 6: 2, 7: 11}
Since we will apply one-hot encoding (OHE) on the categorical variables, we convert cat_vars_ord
from the ordinal to OHE format. alibi.utils.mapping
contains utility functions to do this. The keys in cat_vars_ohe
now represent the first column index for each one-hot encoded categorical variable. This dictionary will later be used in the counterfactual explanation.
[10]:
cat_vars_ohe = ord_to_ohe(X, cat_vars_ord)[1]
print(cat_vars_ohe)
{0: 9, 9: 7, 16: 4, 20: 9, 29: 6, 35: 5, 40: 2, 42: 11}
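The keys of `cat_vars_ohe` are simply cumulative column offsets: each categorical variable's one-hot block starts where the previous block ends. The mapping above can be reproduced with a cumulative sum (a sketch, not alibi's internal code):

```python
import numpy as np

cat_vars_ord = {0: 9, 1: 7, 2: 4, 3: 9, 4: 6, 5: 5, 6: 2, 7: 11}
n_cat = np.array(list(cat_vars_ord.values()))
# Start column of each OHE block = cumulative sum of preceding block sizes.
first_cols = np.concatenate(([0], np.cumsum(n_cat)[:-1]))
cat_vars_ohe = dict(zip(first_cols.tolist(), n_cat.tolist()))
print(cat_vars_ohe)  # {0: 9, 9: 7, 16: 4, 20: 9, 29: 6, 35: 5, 40: 2, 42: 11}
```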
Preprocess data¶
Scale numerical features between -1 and 1:
[11]:
X_num = X[:, -4:].astype(np.float32, copy=False)
xmin, xmax = X_num.min(axis=0), X_num.max(axis=0)
rng = (-1., 1.)
X_num_scaled = (X_num - xmin) / (xmax - xmin) * (rng[1] - rng[0]) + rng[0]
X_num_scaled_train = X_num_scaled[:idx, :]
X_num_scaled_test = X_num_scaled[idx+1:, :]
Apply OHE to categorical variables:
[12]:
X_cat = X[:, :-4].copy()
ohe = OneHotEncoder(categories='auto')
ohe.fit(X_cat)
X_cat_ohe = ohe.transform(X_cat)
Combine numerical and categorical data:
[13]:
X = np.c_[X_cat_ohe.todense(), X_num_scaled].astype(np.float32, copy=False)
X_train, X_test = X[:idx, :], X[idx+1:, :]
print(X_train.shape, X_test.shape)
(30000, 57) (2560, 57)
Train neural net¶
[14]:
def nn_ohe():
x_in = Input(shape=(57,))
x = Dense(60, activation='relu')(x_in)
x = Dropout(.2)(x)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x_out = Dense(2, activation='softmax')(x)
nn = Model(inputs=x_in, outputs=x_out)
nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return nn
[15]:
set_seed()
nn = nn_ohe()
nn.summary()
nn.fit(X_train, to_categorical(y_train), batch_size=256, epochs=20, verbose=0)
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 57)] 0
_________________________________________________________________
dense (Dense) (None, 60) 3480
_________________________________________________________________
dropout (Dropout) (None, 60) 0
_________________________________________________________________
dense_1 (Dense) (None, 60) 3660
_________________________________________________________________
dropout_1 (Dropout) (None, 60) 0
_________________________________________________________________
dense_2 (Dense) (None, 60) 3660
_________________________________________________________________
dropout_2 (Dropout) (None, 60) 0
_________________________________________________________________
dense_3 (Dense) (None, 2) 122
=================================================================
Total params: 10,922
Trainable params: 10,922
Nontrainable params: 0
_________________________________________________________________
[15]:
<tensorflow.python.keras.callbacks.History at 0x7f30f89a0048>
Generate counterfactual¶
Original instance:
[16]:
X = X_test[0].reshape((1,) + X_test[0].shape)
Initialize counterfactual parameters. The feature perturbations are applied in the numerical feature space, after transforming the categorical variables to numerical features. As a result, the dimensionality and values of feature_range
are defined in the numerical space.
[17]:
shape = X.shape
beta = .01
c_init = 1.
c_steps = 5
max_iterations = 500
rng = (-1., 1.) # scale features between -1 and 1
rng_shape = (1,) + data.shape[1:]
feature_range = ((np.ones(rng_shape) * rng[0]).astype(np.float32),
(np.ones(rng_shape) * rng[1]).astype(np.float32))
Initialize explainer:
[18]:
def set_seed(s=0):
np.random.seed(s)
tf.set_random_seed(s)
[19]:
set_seed()
cf = CounterFactualProto(nn,
shape,
beta=beta,
cat_vars=cat_vars_ohe,
ohe=True, # OHE flag
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps
)
Fit explainer. d_type
refers to the distance metric used to convert the categorical to numerical values. Valid options are abdm
, mvdm
and abdm-mvdm
. abdm
infers the distance between categories of the same variable from the context provided by the other variables. This requires binning of the numerical features as well. mvdm
computes the distance using the model predictions, and abdm-mvdm
combines both methods. More info on both distance measures can be found in the
documentation.
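As a rough intuition for abdm (a simplified sketch, not alibi's exact implementation): the distance between two categories of a variable is taken as the difference between the empirical distributions of a binned context feature, conditioned on each category. Categories that induce similar context distributions end up close together:

```python
import numpy as np

def abdm_sketch(cat_col, context_col, n_bins=3):
    """Toy association-based distance: for each category, compute the
    empirical distribution of a binned context feature, then take pairwise
    L1 distances between those distributions."""
    edges = np.quantile(context_col, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(context_col, edges)          # bin the context feature
    cats = np.unique(cat_col)
    probs = np.stack([np.bincount(bins[cat_col == c], minlength=n_bins) /
                      max((cat_col == c).sum(), 1) for c in cats])
    # Pairwise L1 distance between the conditional distributions.
    return np.abs(probs[:, None, :] - probs[None, :, :]).sum(-1)

rng = np.random.default_rng(0)
cat = rng.integers(0, 3, size=500)     # categorical variable with 3 levels
ctx = rng.normal(loc=cat, size=500)    # context feature correlated with it
d = abdm_sketch(cat, ctx)              # (3, 3) category distance matrix
```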
[20]:
cf.fit(X_train, d_type='abdm', disc_perc=[25, 50, 75])
[20]:
CounterFactualProto(meta={
'name': 'CounterFactualProto',
'type': ['blackbox', 'tensorflow', 'keras'],
'explanations': ['local'],
'params': {
'kappa': 0.0,
'beta': 0.01,
'feature_range': (
array([[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.]],
dtype=float32),
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
),
'gamma': 0.0,
'theta': 0.0,
'cat_vars': {
0: 9,
9: 7,
16: 4,
20: 9,
29: 6,
35: 5,
40: 2,
42: 11
},
'ohe': True,
'use_kdtree': False,
'learning_rate_init': 0.01,
'max_iterations': 500,
'c_init': 1.0,
'c_steps': 5,
'eps': (0.001, 0.001),
'clip': (-1000.0, 1000.0),
'update_num_grad': 1,
'write_dir': None,
'shape': (1, 57),
'is_model': True,
'is_model_keras': False,
'is_ae': False,
'is_ae_keras': False,
'is_enc': False,
'is_enc_keras': False,
'enc_or_kdtree': False,
'is_cat': True,
'trustscore_kwargs': None,
'd_type': 'abdm',
'w': None,
'disc_perc': [25, 50, 75],
'standardize_cat_vars': False,
'smooth': 1.0,
'center': True,
'update_feature_range': True
}
})
We can now visualize the transformation from the categorical to numerical values for each category. The example below shows that the Education feature is ordered from High School Dropout to having obtained a Doctorate degree. As a result, if we perturb an instance representing a person that has obtained a Bachelors degree, the nearest perturbations will result in a counterfactual instance with either a Masters or an Associates degree.
[21]:
def plot_bar(dist, cols, figsize=(10,4)):
dist = dist.reshape(dist.shape[0])
idx = np.argsort(dist)
fig, ax = plt.subplots(figsize=figsize)
plt.bar(cols[idx], dist[idx])
print(cols[idx])
[22]:
cat = 'Education'
idx = feature_names.index(cat)
plot_bar(cf.d_abs[idx], np.array(category_map[idx]), figsize=(20,4))
['Dropout' 'High School grad' 'Associates' 'Bachelors' 'Masters'
'Prof-School' 'Doctorate']
Explain instance:
[23]:
explanation = cf.explain(X)
Helper function to more clearly describe explanations:
[24]:
def describe_instance(X, explanation, eps=1e-2):
print('Original instance: {} -- proba: {}'.format(target_names[explanation.orig_class],
explanation.orig_proba[0]))
print('Counterfactual instance: {} -- proba: {}'.format(target_names[explanation.cf['class']],
explanation.cf['proba'][0]))
print('\nCounterfactual perturbations...')
print('\nCategorical:')
X_orig_ord = ohe_to_ord(X, cat_vars_ohe)[0]
X_cf_ord = ohe_to_ord(explanation.cf['X'], cat_vars_ohe)[0]
delta_cat = {}
for i, (_, v) in enumerate(category_map.items()):
cat_orig = v[int(X_orig_ord[0, i])]
cat_cf = v[int(X_cf_ord[0, i])]
if cat_orig != cat_cf:
delta_cat[feature_names[i]] = [cat_orig, cat_cf]
if delta_cat:
for k, v in delta_cat.items():
print('{}: {} --> {}'.format(k, v[0], v[1]))
print('\nNumerical:')
delta_num = X_cf_ord[:, -4:] - X_orig_ord[:, -4:]
n_keys = len(list(cat_vars_ord.keys()))
for i in range(delta_num.shape[1]):
if np.abs(delta_num[0, i]) > eps:
print('{}: {:.2f} --> {:.2f}'.format(feature_names[i+n_keys],
X_orig_ord[0,i+n_keys],
X_cf_ord[0,i+n_keys]))
[25]:
describe_instance(X, explanation)
Original instance: <=50K -- proba: [0.8058248 0.19417518]
Counterfactual instance: >50K -- proba: [0.45429865 0.5457013 ]
Counterfactual perturbations...
Categorical:
Education: Associates --> Masters
Numerical:
Capital Gain: -1.00 --> -0.97
By obtaining a higher level of education and increasing capital gain slightly the income is predicted to be above $50k.
Change the categorical distance metric¶
Instead of abdm
, we now use mvdm
as our distance metric.
[26]:
set_seed()
cf.fit(X_train, d_type='mvdm')
explanation = cf.explain(X)
describe_instance(X, explanation)
Original instance: <=50K -- proba: [0.8058248 0.19417518]
Counterfactual instance: >50K -- proba: [0.34000358 0.65999645]
Counterfactual perturbations...
Categorical:
Education: Associates --> Masters
Numerical:
Capital Gain: -1.00 --> -0.93
The same conclusion (increase the level of education and capital gain) can be drawn with the different distance metric.
Use kd trees to build prototypes¶
We can also use kd trees to build class prototypes to guide the counterfactual to nearby instances in the counterfactual class as described in Interpretable Counterfactual Explanations Guided by Prototypes.
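The idea can be sketched with SciPy's `cKDTree` on toy data (CounterFactualProto builds and queries the trees internally when `use_kdtree=True`; the code below only illustrates the mechanism):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2., size=(50, 2))   # toy instances of class 0
X1 = rng.normal(loc=2., size=(50, 2))    # toy instances of class 1
trees = {0: cKDTree(X0), 1: cKDTree(X1)} # one kd tree per class

x = np.array([0., 0.])                   # instance to explain
dist, idx = trees[1].query(x)            # nearest class-1 instance
proto = X1[idx]                          # prototype guiding the search towards class 1
```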
[27]:
use_kdtree = True
theta = 10. # weight of prototype loss term
Initialize, fit and explain instance:
[28]:
set_seed()
X = X_test[7].reshape((1,) + X_test[0].shape)
cf = CounterFactualProto(nn,
shape,
beta=beta,
theta=theta,
cat_vars=cat_vars_ohe,
ohe=True,
use_kdtree=use_kdtree,
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps
)
cf.fit(X_train, d_type='abdm')
explanation = cf.explain(X)
describe_instance(X, explanation)
No encoder specified. Using kd trees to represent class prototypes.
Original instance: <=50K -- proba: [0.5573588 0.4426412]
Counterfactual instance: >50K -- proba: [0.49765858 0.50234145]
Counterfactual perturbations...
Categorical:
Numerical:
Age: -0.53 --> -0.49
By slightly increasing the age of the person the income would be predicted to be above $50k.
Use an autoencoder to build prototypes¶
Another option is to use an autoencoder to guide the perturbed instance to the counterfactual class. We define and train the autoencoder:
[29]:
def ae_model():
# encoder
x_in = Input(shape=(57,))
x = Dense(60, activation='relu')(x_in)
x = Dense(30, activation='relu')(x)
x = Dense(15, activation='relu')(x)
encoded = Dense(10, activation=None)(x)
encoder = Model(x_in, encoded)
# decoder
dec_in = Input(shape=(10,))
x = Dense(15, activation='relu')(dec_in)
x = Dense(30, activation='relu')(x)
x = Dense(60, activation='relu')(x)
decoded = Dense(57, activation=None)(x)
decoder = Model(dec_in, decoded)
# autoencoder = encoder + decoder
x_out = decoder(encoder(x_in))
autoencoder = Model(x_in, x_out)
autoencoder.compile(optimizer='adam', loss='mse')
return autoencoder, encoder, decoder
[30]:
set_seed()
ae, enc, dec = ae_model()
ae.summary()
ae.fit(X_train, X_train, batch_size=128, epochs=100, validation_data=(X_test, X_test), verbose=0)
Model: "model_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 57)] 0
_________________________________________________________________
model_1 (Model) (None, 10) 5935
_________________________________________________________________
model_2 (Model) (None, 57) 5982
=================================================================
Total params: 11,917
Trainable params: 11,917
Nontrainable params: 0
_________________________________________________________________
[30]:
<tensorflow.python.keras.callbacks.History at 0x7fa0bc3e47f0>
Weights for the autoencoder and prototype loss terms:
[31]:
beta = .1 # L1
gamma = 10. # autoencoder
theta = .1 # prototype
Initialize, fit and explain instance:
[32]:
set_seed()
X = X_test[19].reshape((1,) + X_test[0].shape)
cf = CounterFactualProto(nn,
shape,
beta=beta,
enc_model=enc,
ae_model=ae,
gamma=gamma,
theta=theta,
cat_vars=cat_vars_ohe,
ohe=True,
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps
)
cf.fit(X_train, d_type='abdm')
explanation = cf.explain(X)
describe_instance(X, explanation)
Original instance: >50K -- proba: [0.49551004 0.5044899 ]
Counterfactual instance: <=50K -- proba: [0.67236006 0.32763994]
Counterfactual perturbations...
Categorical:
Occupation: Other --> Blue-Collar
Numerical:
Black box model with kd trees¶
Now we assume that we only have access to the model’s prediction function and treat it as a black box. The kd trees are again used to define the prototypes.
[33]:
use_kdtree = True
theta = 10. # weight of prototype loss term
Initialize, fit and explain instance:
[34]:
set_seed()
X = X_test[24].reshape((1,) + X_test[0].shape)
# define predict function
predict_fn = lambda x: nn.predict(x)
cf = CounterFactualProto(predict_fn,
shape,
beta=beta,
theta=theta,
cat_vars=cat_vars_ohe,
ohe=True,
use_kdtree=use_kdtree,
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps
)
cf.fit(X_train, d_type='abdm')
explanation = cf.explain(X)
describe_instance(X, explanation)
No encoder specified. Using kd trees to represent class prototypes.
Original instance: >50K -- proba: [0.23185147 0.7681486 ]
Counterfactual instance: <=50K -- proba: [0.5012684 0.4987316]
Counterfactual perturbations...
Categorical:
Numerical:
Age: -0.15 --> -0.37
Hours per week: -0.20 --> -0.25
If the person was younger and worked less, he or she would have a predicted income below $50k.
This page was generated from examples/cfproto_cat_adult_ord.ipynb.
Counterfactual explanations with ordinally encoded categorical variables¶
This example notebook illustrates how to obtain counterfactual explanations for instances with a mixture of ordinally encoded categorical and numerical variables. A more elaborate notebook highlighting additional functionality can be found here. We generate counterfactuals for instances in the adult dataset where we predict whether a person’s income is above or below $50k.
[1]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress deprecation messages
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Input, Embedding, Concatenate, Reshape, Dropout, Lambda
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn.preprocessing import OneHotEncoder
from time import time
from alibi.datasets import fetch_adult
from alibi.explainers import CounterFactualProto
Load adult dataset¶
The fetch_adult
function returns a Bunch
object containing the features, the targets, the feature names and a mapping of the categories in each categorical variable.
[2]:
adult = fetch_adult()
data = adult.data
target = adult.target
feature_names = adult.feature_names
category_map_tmp = adult.category_map
target_names = adult.target_names
Define shuffled training and test set:
[3]:
def set_seed(s=0):
np.random.seed(s)
tf.set_random_seed(s)
[4]:
set_seed()
data_perm = np.random.permutation(np.c_[data, target])
X = data_perm[:,:-1]
y = data_perm[:,-1]
[5]:
idx = 30000
y_train, y_test = y[:idx], y[idx+1:]
Reorganize data so categorical features come first:
[6]:
X = np.c_[X[:, 1:8], X[:, 11], X[:, 0], X[:, 8:11]]
Adjust feature_names
and category_map
as well:
[7]:
feature_names = feature_names[1:8] + feature_names[11:12] + feature_names[0:1] + feature_names[8:11]
print(feature_names)
['Workclass', 'Education', 'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Country', 'Age', 'Capital Gain', 'Capital Loss', 'Hours per week']
[8]:
category_map = {}
for i, (_, v) in enumerate(category_map_tmp.items()):
category_map[i] = v
Create a dictionary whose keys are the categorical columns and whose values are the number of categories for each variable in the dataset. This dictionary will later be used by the counterfactual explainer.
[9]:
cat_vars_ord = {}
n_categories = len(list(category_map.keys()))
for i in range(n_categories):
cat_vars_ord[i] = len(np.unique(X[:, i]))
print(cat_vars_ord)
{0: 9, 1: 7, 2: 4, 3: 9, 4: 6, 5: 5, 6: 2, 7: 11}
Preprocess data¶
Scale numerical features between -1 and 1:
[10]:
X_num = X[:, -4:].astype(np.float32, copy=False)
xmin, xmax = X_num.min(axis=0), X_num.max(axis=0)
rng = (-1., 1.)
X_num_scaled = (X_num - xmin) / (xmax - xmin) * (rng[1] - rng[0]) + rng[0]
X_num_scaled_train = X_num_scaled[:idx, :]
X_num_scaled_test = X_num_scaled[idx+1:, :]
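As a quick sanity check on the min-max scaling above (shown here with random data standing in for the numerical adult features), each column is mapped onto the interval [-1, 1], with the column-wise minimum landing on -1 and the maximum on 1:

```python
import numpy as np

# hypothetical stand-in for the 4 numerical columns of the adult dataset
X_num = np.random.RandomState(0).rand(100, 4).astype(np.float32) * 10.
xmin, xmax = X_num.min(axis=0), X_num.max(axis=0)
rng = (-1., 1.)
X_num_scaled = (X_num - xmin) / (xmax - xmin) * (rng[1] - rng[0]) + rng[0]
# column-wise minima are now -1 and maxima are 1
```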
Combine numerical and categorical data:
[11]:
X = np.c_[X[:, :-4], X_num_scaled].astype(np.float32, copy=False)
X_train, X_test = X[:idx, :], X[idx+1:, :]
print(X_train.shape, X_test.shape)
(30000, 12) (2560, 12)
Train a neural net¶
The neural net will use entity embeddings for the categorical variables.
[12]:
def nn_ord():
x_in = Input(shape=(12,))
layers_in = []
# embedding layers
for i, (_, v) in enumerate(cat_vars_ord.items()):
emb_in = Lambda(lambda x: x[:, i:i+1])(x_in)
emb_dim = int(max(min(np.ceil(.5 * v), 50), 2))
emb_layer = Embedding(input_dim=v+1, output_dim=emb_dim, input_length=1)(emb_in)
emb_layer = Reshape(target_shape=(emb_dim,))(emb_layer)
layers_in.append(emb_layer)
# numerical layers
num_in = Lambda(lambda x: x[:, -4:])(x_in)
num_layer = Dense(16)(num_in)
layers_in.append(num_layer)
# combine
x = Concatenate()(layers_in)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x = Dense(60, activation='relu')(x)
x = Dropout(.2)(x)
x_out = Dense(2, activation='softmax')(x)
nn = Model(inputs=x_in, outputs=x_out)
nn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return nn
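The embedding dimension heuristic in `nn_ord` (half the number of categories, rounded up and clipped to the range [2, 50]) yields the following dimensions for the cardinalities in `cat_vars_ord`, which match the `Embedding` output shapes in the model summary below:

```python
import numpy as np

cat_vars_ord = {0: 9, 1: 7, 2: 4, 3: 9, 4: 6, 5: 5, 6: 2, 7: 11}
emb_dims = [int(max(min(np.ceil(.5 * v), 50), 2)) for v in cat_vars_ord.values()]
print(emb_dims)  # [5, 4, 2, 5, 3, 3, 2, 6]
```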
[13]:
set_seed()
nn = nn_ord()
nn.summary()
nn.fit(X_train, to_categorical(y_train), batch_size=128, epochs=20, verbose=0)
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 12)] 0
__________________________________________________________________________________________________
lambda (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_1 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_2 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_3 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_4 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_5 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_6 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
lambda_7 (Lambda) (None, 1) 0 input_1[0][0]
__________________________________________________________________________________________________
embedding (Embedding) (None, 1, 5) 50 lambda[0][0]
__________________________________________________________________________________________________
embedding_1 (Embedding) (None, 1, 4) 32 lambda_1[0][0]
__________________________________________________________________________________________________
embedding_2 (Embedding) (None, 1, 2) 10 lambda_2[0][0]
__________________________________________________________________________________________________
embedding_3 (Embedding) (None, 1, 5) 50 lambda_3[0][0]
__________________________________________________________________________________________________
embedding_4 (Embedding) (None, 1, 3) 21 lambda_4[0][0]
__________________________________________________________________________________________________
embedding_5 (Embedding) (None, 1, 3) 18 lambda_5[0][0]
__________________________________________________________________________________________________
embedding_6 (Embedding) (None, 1, 2) 6 lambda_6[0][0]
__________________________________________________________________________________________________
embedding_7 (Embedding) (None, 1, 6) 72 lambda_7[0][0]
__________________________________________________________________________________________________
lambda_8 (Lambda) (None, 4) 0 input_1[0][0]
__________________________________________________________________________________________________
reshape (Reshape) (None, 5) 0 embedding[0][0]
__________________________________________________________________________________________________
reshape_1 (Reshape) (None, 4) 0 embedding_1[0][0]
__________________________________________________________________________________________________
reshape_2 (Reshape) (None, 2) 0 embedding_2[0][0]
__________________________________________________________________________________________________
reshape_3 (Reshape) (None, 5) 0 embedding_3[0][0]
__________________________________________________________________________________________________
reshape_4 (Reshape) (None, 3) 0 embedding_4[0][0]
__________________________________________________________________________________________________
reshape_5 (Reshape) (None, 3) 0 embedding_5[0][0]
__________________________________________________________________________________________________
reshape_6 (Reshape) (None, 2) 0 embedding_6[0][0]
__________________________________________________________________________________________________
reshape_7 (Reshape) (None, 6) 0 embedding_7[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 16) 80 lambda_8[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate) (None, 46) 0 reshape[0][0]
reshape_1[0][0]
reshape_2[0][0]
reshape_3[0][0]
reshape_4[0][0]
reshape_5[0][0]
reshape_6[0][0]
reshape_7[0][0]
dense[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 60) 2820 concatenate[0][0]
__________________________________________________________________________________________________
dropout (Dropout) (None, 60) 0 dense_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 60) 3660 dropout[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout) (None, 60) 0 dense_2[0][0]
__________________________________________________________________________________________________
dense_3 (Dense) (None, 60) 3660 dropout_1[0][0]
__________________________________________________________________________________________________
dropout_2 (Dropout) (None, 60) 0 dense_3[0][0]
__________________________________________________________________________________________________
dense_4 (Dense) (None, 2) 122 dropout_2[0][0]
==================================================================================================
Total params: 10,601
Trainable params: 10,601
Non-trainable params: 0
__________________________________________________________________________________________________
[13]:
<tensorflow.python.keras.callbacks.History at 0x7fa7c32fb2e8>
Generate counterfactual¶
Original instance:
[14]:
X = X_test[0].reshape((1,) + X_test[0].shape)
Initialize counterfactual parameters:
[15]:
shape = X.shape
beta = .01
c_init = 1.
c_steps = 5
max_iterations = 500
rng = (-1., 1.) # scale features between -1 and 1
rng_shape = (1,) + data.shape[1:]
feature_range = ((np.ones(rng_shape) * rng[0]).astype(np.float32),
(np.ones(rng_shape) * rng[1]).astype(np.float32))
Initialize explainer. Since the Embedding
layers in tf.keras
do not let gradients propagate through, we will only make use of the model’s predict function, treat it as a black box and perform numerical gradient calculations.
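The numerical gradient estimation mentioned above amounts to a central-difference scheme controlled by the `eps` perturbation sizes. A minimal sketch of the idea (illustrative only; alibi's internals batch these evaluations and handle vector-valued outputs):

```python
import numpy as np

def numerical_gradient(f, x, eps=0.01):
    # central-difference estimate of the gradient of a scalar function f at x
    grad = np.zeros_like(x, dtype=np.float64)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=np.float64)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

g = numerical_gradient(lambda v: float((v ** 2).sum()), np.array([1.0, -2.0]))
# analytic gradient of sum(v^2) is 2v, so g is close to [2., -4.]
```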
[16]:
set_seed()
# define predict function
predict_fn = lambda x: nn.predict(x)
cf = CounterFactualProto(predict_fn,
shape,
beta=beta,
cat_vars=cat_vars_ord,
max_iterations=max_iterations,
feature_range=feature_range,
c_init=c_init,
c_steps=c_steps,
eps=(.01, .01) # perturbation size for numerical gradients
)
Fit explainer. Please check the documentation for more info about the optional arguments.
[17]:
cf.fit(X_train, d_type='abdm', disc_perc=[25, 50, 75])
[17]:
CounterFactualProto(meta={
'name': 'CounterFactualProto',
'type': ['blackbox', 'tensorflow', 'keras'],
'explanations': ['local'],
'params': {
'kappa': 0.0,
'beta': 0.01,
'feature_range': (
array([[-1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1., -1.]],
dtype=float32),
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)
),
'gamma': 0.0,
'theta': 0.0,
'cat_vars': {
0: 9,
1: 7,
2: 4,
3: 9,
4: 6,
5: 5,
6: 2,
7: 11
},
'ohe': False,
'use_kdtree': False,
'learning_rate_init': 0.01,
'max_iterations': 500,
'c_init': 1.0,
'c_steps': 5,
'eps': (0.01, 0.01),
'clip': (-1000.0, 1000.0),
'update_num_grad': 1,
'write_dir': None,
'shape': (1, 12),
'is_model': False,
'is_model_keras': False,
'is_ae': False,
'is_ae_keras': False,
'is_enc': False,
'is_enc_keras': False,
'enc_or_kdtree': False,
'is_cat': True,
'trustscore_kwargs': None,
'd_type': 'abdm',
'w': None,
'disc_perc': [25, 50, 75],
'standardize_cat_vars': False,
'smooth': 1.0,
'center': True,
'update_feature_range': True
}
})
Explain instance:
[18]:
set_seed()
explanation = cf.explain(X)
Helper function to more clearly describe explanations:
[19]:
def describe_instance(X, explanation, eps=1e-2):
print('Original instance: {} -- proba: {}'.format(target_names[explanation.orig_class],
explanation.orig_proba[0]))
print('Counterfactual instance: {} -- proba: {}'.format(target_names[explanation.cf['class']],
explanation.cf['proba'][0]))
print('\nCounterfactual perturbations...')
print('\nCategorical:')
X_orig_ord = X
X_cf_ord = explanation.cf['X']
delta_cat = {}
for i, (_, v) in enumerate(category_map.items()):
cat_orig = v[int(X_orig_ord[0, i])]
cat_cf = v[int(X_cf_ord[0, i])]
if cat_orig != cat_cf:
delta_cat[feature_names[i]] = [cat_orig, cat_cf]
if delta_cat:
for k, v in delta_cat.items():
print('{}: {} --> {}'.format(k, v[0], v[1]))
print('\nNumerical:')
delta_num = X_cf_ord[0, -4:] - X_orig_ord[0, -4:]
n_keys = len(list(cat_vars_ord.keys()))
for i in range(delta_num.shape[0]):
if np.abs(delta_num[i]) > eps:
print('{}: {:.2f} --> {:.2f}'.format(feature_names[i+n_keys],
X_orig_ord[0,i+n_keys],
X_cf_ord[0,i+n_keys]))
[20]:
describe_instance(X, explanation)
Original instance: <=50K -- proba: [0.8347201 0.16527994]
Counterfactual instance: >50K -- proba: [0.47529536 0.52470464]
Counterfactual perturbations...
Categorical:
Occupation: White-Collar --> Professional
Numerical:
Capital Gain: -1.00 --> -0.87
The person’s income is predicted to be above $50k by increasing his or her capital gain and changing occupation to Professional.
This page was generated from examples/kernel_shap_wine_intro.ipynb.
Kernel SHAP explanation for SVM models¶
Introduction¶
In this example, we show how to explain a multiclass classification model based on the SVM algorithm using the KernelSHAP
method. We show how to perform instance-level (or local) explanations on this model as well as how to draw insights about the model behaviour in general by aggregating information from explanations across many instances (that is, perform global explanations).
[1]:
import shap
shap.initjs()
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from alibi.explainers import KernelShap
from sklearn import svm
from sklearn.datasets import load_wine
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC