This page was generated from doc/source/methods/ALE.ipynb.

# Accumulated Local Effects¶

## Overview¶

Accumulated Local Effects (ALE) is a method for computing feature effects based on the paper Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models by Apley and Zhu. The algorithm provides model-agnostic (*black box*) global explanations for classification and regression models on tabular data.

ALE addresses some key shortcomings of Partial Dependence Plots (PDP), a popular method for estimating first order feature effects. We discuss these limitations and motivate ALE after presenting the method usage.

## Usage¶

Initialize the explainer by passing a black-box prediction function and optionally a list of feature names and target (class) names for interpretation:

```
from alibi.explainers import ALE
ale = ALE(predict_fn, feature_names=feature_names, target_names=target_names)
```

Following the initialization, we can immediately produce an explanation given a dataset of instances \(X\):

```
exp = ale.explain(X)
```

The `explain`

method has a default argument, `min_bin_points=4`

, which determines the number of bins the range of each feature is subdivided into so that the ALE estimate for each bin is made with at least `min_bin_points`

. Smaller values can result in less accurate local estimates while larger values can also result in less accurate estimates by averaging across large parts of the feature range.

Alternatively, we can run the explanation only on a subset of features:

```
exp = ale.explain(X, features=[0, 1])
```

This is useful if the number of total features is large and only small number is of interest. Also, it can be particularly useful to filter out categorical variable columns as there is no consistent ALE formulation and hence any results for categorical variables would be misleading.

The result `exp`

is an `Explanation`

object which contains the following data-related attributes:

`ale_values`

- a list of arrays of ALE values (one for each feature). Each array can have multiple columns (if the number of targets is >1 as in classification)`constant_value`

- the mean prediction over \(X\) (zeroth order effects)`ale0`

- a list of “centering” values (one for each feature) used by the algorithm to center the`ale_values`

around the expected effect for the feature (i.e. the sum of`ale_values`

and`ale0`

will be the uncentered ALE)`feature_values`

- a list of arrays (one for each feature) of feature values at which the ALE values were computed`feature_names`

- a list of feature names`target_names`

- a list of target names`feature_deciles`

- a list of arrays (one for each feature) of the feature deciles

Plotting `ale_values`

against `feature_values`

recovers the ALE curves. For convenience we include a plotting function `plot_ale`

which automatically produces ALE plots using `matplotlib`

:

```
from alibi.explainers import plot_ale
plot_ale(exp)
```

The following is an example ALE plot of a logistic regression model on the Iris dataset (see worked example):

## Motivation and definition¶

The following exposition largely follows Apley and Zhu (2016) and Molnar (2019).

Given a predictive model \(f(x)\) where \(x=(x_1,\dots, x_d)\) is a vector of \(d\) features, we are interested in computing the *feature effects* of each feature \(x_i\) on the model \(f(x)\). A feature effect of feature \(x_i\) is some function \(g(x_i)\) designed to disentangle the contribution of \(x_i\) to the response \(f(x)\). To simplify notation, in the following we condiser the \(d=2\) case and define the feature effect functions for the first
feature \(x_1\).

### Partial Dependence¶

Partial Dependence Plots (PDP) is a very common method for computing feature effects. It is defined as

where \(p(x_2)\) is the marginal distribution of \(X_2\). To estimate the expectation, we can take the training set \(X\) and average the predictions of instances where the first feature for all instances is replaced by \(x_1\):

The PD function attempts to calculate the effect of \(x_1\) by averaging the effects of the other feature \(x_2\) over it’s marginal distribution. This is problematic because by doing so we are averaging predictions of many *out of distribution* instances. For example, if \(x_1\) and \(x_2\) are a person’s height and weight and \(f\) predicts some other attribute of the person, then the PD function at a fixed height \(x_1\) would average predictions of persons with height
\(x_1\) *and all possible weights* \(x_2\) observed in the training set. Clearly, since height and weight are strongly correlated this would lead to many unrealistic data points. Since the predictor \(f\) has not been trained on such impossible data points, the predictions are no longer meaningful. We can say that an implicit assumption motivating the PD approach is that the features are uncorrelated, however this is rarely the case and severely limits the usage of PDP.

An attempt to fix the issue with the PD function is to average over the conditional distribution instead of the marginal which leads to the following feature effect function:

where \(p(x_2\vert x_1)\) is the conditional distribution of \(X_2\). To estimate this function from the training set \(X\) we can compute

where \(N(x_1)\) is a subset of indices \(j\) for which \(x_{1,j}\) falls into some small neighbourhood of \(x_1\) and \(n(x_1)\) is the number of such instances.

While this refinement addresses the issue of the PD function averaging over impossible data points, the use of the \(M(x_1)\) function as feature effects remains limited when the features are correlated. To go back to the example with people’s height and weight, if we fix the height to be some particular value \(x_1\) and calculate the effects according to \(M(x_1)\), because of the correlation of height and weight the function value mixes effects of *both* features and estimates the
**combined** effect. This is undesirable as we cannot attribute the value of \(M(x_1)\) purely to height. Furthermore, suppose height doesn’t actually have any effect on the prediction, only weight does. Because of the correlation between height and weight, \(M(x_1)\) would still show an effect which can be highly misleading. Concretely, for a model like \(f(x_1, x_2)=x_2\) it is possible that \(M(x_1)\neq 0\) if \(x_1, x_2\) are correlated.

The following plot summarizes the two approaches for estimating the effect of \(x_1\) at a particular value when \(x_2\) is strongly correlated with \(x_1\):

### ALE¶

ALE solves the problem of mixing effects from different features. As with the function \(M(x_1)\), ALE uses the conditional distribution to average over other features, but instead of averaging the predictions directly, it averages *differences in predictions* to block the effect of correlated features. The ALE function is defined as follows:

where the constant \(c_1\) is chosen such that the resulting ALE values are independent of the point \(\min(x_1)\) and have zero mean over the distribution \(p(x_1)\).

The term \(\dfrac{\partial f(x_1, x_2)}{\partial x_1}\) is called the *local effect* of \(x_1\) on \(f\). Averaging the local effect over the conditional distribution \(p(x_2\vert x_1)\) allows us to isolate the effect of \(x_1\) from the effects of other correlated features avoiding the issue of \(M\) plots which directly average the predictor \(f\). Finally, note that the local effects are integrated over the range of \(x_1\), this corresponds to the
*accumulated* in ALE. This is done as a means of visualizing the *global* effect of the feature by “piecing together” the calculated local effects.

In practice, we calculate the local effects by finite differences so the predictor \(f\) need not be differentiable. Thus, to estimate the ALE from data, we compute the following:

Here \(z_0,z_1,\dots\) is a sufficiently fine grid of the feature \(x_1\) (typically quantiles so that each resulting interval contains a similar number of points), \(N(k)\) denotes the interval \([z_{k-1}, z_k)\), \(n(k)\) denotes the number of points falling into interval \(N(k)\) and \(k(x_1)\) denotes the index of the interval into which \(x_1\) falls into, i.e. \(x_1\in [z_{k(x_1)-1}, z_{k(x_1)})\). Finally, the notation \(f(z_{k}, x^{(i)}_{\setminus{}1})\) means that for instance \(i\) we replace \(x_1\) with the value of the right interval end-point \(z_k\) (likewise for the left interval end-point using \(z_{k-1}\)), leaving the rest of the features unchanged, and evaluate the difference of predictions at these points.

The following plot illustrates the ALE estimation process. We have subdivided the feature range of \(x_1\) into \(5\) bins with roughly the same number of points indexed by \(N(k)\). Focusing on bin \(N(4)\), for each point falling into this bin, we replace their \(x_1\) feature value by the left and right end-points of the interval, \(z_3\) and \(z_4\). Then we evaluate the difference of the predictions of these points and calculate the average by dividing by the number of points in this interval \(n(4)\). We do this for every interval and sum up (accumulate) the results. Finally, to calculate the constant \(c_1\), we subtract the expectation over \(p(x_1)\) of the calculated uncentered ALE so that the resulting ALE values have mean zero over the distribution \(p(x_1)\).

We show the results of ALE calculation for a model \(f(x_1, x_2) = 3x_1 + 2x_2^2\). The resulting plots correctly recover the linear effect of \(x_1\) and the quadratic effect of \(x_2\) on \(f\). Note that the ALE is estimated for each interval edge and linearly interpolated in between, for real applications it is important to have a sufficiently fine grid but also one that has enough points into each interval for accurate estimates. The x-axis also shows feature deciles of the feature to help judge in which parts of the feature space the ALE plot is interpolating more and the estimate might be less trustworthy.

The value of \(\text{ALE}(x_i)\) is the main effect of feature \(x_i\) as compared to the average prediction for the data. For example, the value of \(\text{ALE}(x_1)=0.75\) at \(x_1=0.7\), if we sample data from the joint distribution \(p(x_1, x_2)\) (i.e. realistic data points) and \(x_1=0.7\), then we would expect the first order effect of feature \(x_1\) to be \(0.75\) higher than the *average* first order effect of this feature. Seeing that the
\(\text{ALE}(x_1)\) plot crosses zero at \(x_1\approx 0.45\), realistic data points with \(x_1\approx 0.45\) will have effect on \(f\) similar to the average first order effect of \(x_1\). For realistic data points with smaller \(x_1\), the effect will become negative with respect to the average effect.

Because the model \(f(x_1, x_2) = 3x_1 + 2x_2^2\) is explicit and differentiable, we can calculate the ALE functions analytically which gives us even more insight. The partial derivatives are given by \((3, 4x_2)\). Assuming that the conditional distributions \(p(x_2\vert x_1)\) and \(p(x_1\vert x_2)\) are uniform, the expectations over the conditional distributions are equal to the partial derivatives. Next, we integrate over the range of the features to obtain the *uncentered*
ALE functions:

Finally, to obtaine the ALE functions, we center by setting \(c_i = \mathbb{E}(\text{ALE}_u(x_i))\) where the expectation is over the marginal distribution \(p(x_i)\):

This calculation verifies that the ALE curves are the desired feature effects (linear for \(x_1\) and quadratic for \(x_2\)) relative to the mean feature effects across the dataset. In fact if \(f\) is additive in the individual features like our toy model, then the ALE main effects recover the correct additive components (Apley and Zhu (2016)). Furthermore, for additive models we have the decomposition \(f(x) = \mathbb{E}(f(x)) + \sum_{i=1}^{d}\text{ALE}(x_i)\), here the first term which is the average prediction across the dataset \(X\) can be thought of as zeroth order effects.