This page was generated from methods/ALE.ipynb.


Accumulated Local Effects

Overview

Accumulated Local Effects (ALE) is a method for computing feature effects based on the paper Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models by Apley and Zhu. The algorithm provides model-agnostic (black box) global explanations for classification and regression models on tabular data.

ALE addresses some key shortcomings of Partial Dependence Plots (PDP), a popular method for estimating first order feature effects. We discuss these limitations and motivate ALE after presenting the method usage.

Usage

Initialize the explainer by passing a black-box prediction function and optionally a list of feature names and target (class) names for interpretation:

from alibi.explainers import ALE
ale = ALE(predict_fn, feature_names=feature_names, target_names=target_names)

Following the initialization, we can immediately produce an explanation given a dataset of instances \(X\):

exp = ale.explain(X)

The explain method has a default argument, min_bin_points=4, which controls how the range of each feature is subdivided into bins so that the ALE estimate in each bin is calculated using at least min_bin_points data points. Smaller values can result in less accurate local estimates, while larger values can also result in less accurate estimates by averaging across large parts of the feature range.
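For example, to require at least 10 points per bin:

exp = ale.explain(X, min_bin_points=10)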

Alternatively, we can run the explanation only on a subset of features:

exp = ale.explain(X, features=[0, 1])

This is useful if the total number of features is large and only a small number is of interest. It can also be particularly useful for filtering out categorical variable columns, as there is no consistent ALE formulation for categorical data and hence any results for categorical variables would be misleading.

It is also possible to define custom grid points for each feature. The custom grid points are defined in a dictionary whose keys are the feature indices and whose values are numpy arrays containing the points. The sequence of points for each feature must be monotonically increasing:

grid_points = {0: np.array([0.1, 0.5, 1.0]),
               1: np.array([0.1, 0.5, 1.0])}
exp = ale.explain(X, features=[0, 1], grid_points=grid_points)

The result exp is an Explanation object which contains the following data-related attributes:

  • ale_values - a list of arrays of ALE values (one for each feature). Each array can have multiple columns (if the number of targets is >1 as in classification).

  • constant_value - the mean prediction over \(X\) (zeroth order effects).

  • ale0 - a list of arrays of “centering” values (one for each feature) used by the algorithm to center the ale_values around the expected effect for the feature (i.e. the sum of ale_values and ale0 will be the uncentered ALE).

  • feature_values - a list of arrays (one for each feature) of feature values at which the ALE values were computed.

  • feature_names - an array of feature names.

  • target_names - an array of target names.

  • feature_deciles - a list of arrays (one for each feature) of the feature deciles.

Plotting ale_values against feature_values recovers the ALE curves. For convenience we include a plotting function plot_ale which automatically produces ALE plots using matplotlib:

from alibi.explainers import plot_ale
plot_ale(exp)
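The curves can equally be drawn directly from the returned attributes; below is a minimal matplotlib sketch for the first explained feature and first target (the column index is an illustrative assumption for multi-target output):

import matplotlib.pyplot as plt

# ALE curve of the first explained feature; the column selects the first target
plt.plot(exp.feature_values[0], exp.ale_values[0][:, 0])
plt.xlabel(exp.feature_names[0])
plt.ylabel('ALE')
plt.show()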

The following is an example ALE plot of a logistic regression model on the Iris dataset (see worked example):

ALE-iris

Examples

ALE regression example (California house prices)

ALE classification example (Iris dataset)

Motivation and definition

The following exposition largely follows Apley and Zhu (2016) and Molnar (2019).

Given a predictive model \(f(x)\) where \(x=(x_1,\dots, x_d)\) is a vector of \(d\) features, we are interested in computing the feature effects of each feature \(x_i\) on the model \(f(x)\). A feature effect of feature \(x_i\) is some function \(g(x_i)\) designed to disentangle the contribution of \(x_i\) to the response \(f(x)\). To simplify notation, in the following we consider the \(d=2\) case and define the feature effect functions for the first feature \(x_1\).

Partial Dependence

Partial Dependence Plots (PDP) are a very common method for computing feature effects. The partial dependence of \(f\) on \(x_1\) is defined as

\[\text{PD}(x_1) = \mathbb{E}[f(x_1, X_2)] = \int p(x_2)f(x_1, x_2)dx_2,\]

where \(p(x_2)\) is the marginal distribution of \(X_2\). To estimate the expectation, we can take the training set \(X\) and average the predictions of instances where the first feature for all instances is replaced by \(x_1\):

\[\widehat{\text{PD}}(x_1) = \frac{1}{n}\sum_{j=1}^{n}f(x_1, x_{2, j}).\]
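In code, this estimator amounts to fixing the first feature of every instance and averaging the predictions; the following is a minimal numpy sketch, assuming a hypothetical 2D array X of instances and a vectorized predict_fn:

import numpy as np

def pd_estimate(predict_fn, X, x1):
    # replace the first feature of every instance with x1 and average the predictions
    X_mod = X.copy()
    X_mod[:, 0] = x1
    return predict_fn(X_mod).mean()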

The PD function attempts to calculate the effect of \(x_1\) by averaging the effects of the other feature \(x_2\) over its marginal distribution. This is problematic because by doing so we are averaging predictions of many out-of-distribution instances. For example, if \(x_1\) and \(x_2\) are a person's height and weight and \(f\) predicts some other attribute of the person, then the PD function at a fixed height \(x_1\) would average predictions of persons with height \(x_1\) and all possible weights \(x_2\) observed in the training set. Since height and weight are strongly correlated, this leads to many unrealistic data points, and because the predictor \(f\) has not been trained on such impossible data points, the predictions are no longer meaningful. An implicit assumption motivating the PD approach is therefore that the features are uncorrelated; this is rarely the case in practice and severely limits the usefulness of PDP.

An attempt to fix the issue with the PD function is to average over the conditional distribution instead of the marginal which leads to the following feature effect function:

\[M(x_1) = \mathbb{E}[f(X_1, X_2)\vert X_1=x_1] = \int p(x_2\vert x_1)f(x_1,x_2)dx_2,\]

where \(p(x_2\vert x_1)\) is the conditional distribution of \(X_2\). To estimate this function from the training set \(X\) we can compute

\[\widehat{M}(x_1) = \frac{1}{n(x_1)}\sum_{j\in N(x_1)}f(x_1,x_{2,j}),\]

where \(N(x_1)\) is a subset of indices \(j\) for which \(x_{1,j}\) falls into some small neighbourhood of \(x_1\) and \(n(x_1)\) is the number of such instances.
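A minimal numpy sketch of this estimator, assuming the same hypothetical X and predict_fn and a simple fixed-width neighbourhood:

import numpy as np

def m_estimate(predict_fn, X, x1, eps=0.05):
    # N(x1): instances whose first feature lies within eps of x1
    in_nbhd = np.abs(X[:, 0] - x1) <= eps
    X_nbhd = X[in_nbhd].copy()
    X_nbhd[:, 0] = x1                 # evaluate the model exactly at x1
    return predict_fn(X_nbhd).mean()  # average over the n(x1) neighbours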

While this refinement addresses the issue of the PD function averaging over impossible data points, the use of the \(M(x_1)\) function as feature effects remains limited when the features are correlated. To go back to the example with people’s height and weight, if we fix the height to be some particular value \(x_1\) and calculate the effects according to \(M(x_1)\), because of the correlation of height and weight the function value mixes effects of both features and estimates the combined effect. This is undesirable as we cannot attribute the value of \(M(x_1)\) purely to height. Furthermore, suppose height doesn’t actually have any effect on the prediction, only weight does. Because of the correlation between height and weight, \(M(x_1)\) would still show an effect which can be highly misleading. Concretely, for a model like \(f(x_1, x_2)=x_2\) it is possible that \(M(x_1)\neq 0\) if \(x_1, x_2\) are correlated.

The following plot summarizes the two approaches for estimating the effect of \(x_1\) at a particular value when \(x_2\) is strongly correlated with \(x_1\):

PDP-M-estimation

ALE

ALE solves the problem of mixing effects from different features. As with the function \(M(x_1)\), ALE uses the conditional distribution to average over other features, but instead of averaging the predictions directly, it averages differences in predictions to block the effect of correlated features. The ALE function is defined as follows:

\begin{align} \text{ALE}(x_1) &= \int_{\min(x_1)}^{x_1}\mathbb{E}\left[\frac{\partial f(X_1,X_2)}{\partial X_1}\Big\vert X_1=z_1\right]dz_1 - c_1 \\ &= \underbrace{\int_{\min(x_1)}^{x_1}\int p(x_2\vert z_1)\frac{\partial f(z_1, x_2)}{\partial z_1}dx_2dz_1}_{\text{uncentered ALE}} - c_1, \end{align}

where the constant \(c_1\) is chosen such that the resulting ALE values are independent of the point \(\min(x_1)\) and have zero mean over the distribution \(p(x_1)\).

The term \(\dfrac{\partial f(x_1, x_2)}{\partial x_1}\) is called the local effect of \(x_1\) on \(f\). Averaging the local effect over the conditional distribution \(p(x_2\vert x_1)\) allows us to isolate the effect of \(x_1\) from the effects of other correlated features, avoiding the issue of \(M\) plots which directly average the predictor \(f\). Finally, note that the local effects are integrated over the range of \(x_1\); this is the accumulation in ALE and is done as a means of visualizing the global effect of the feature by “piecing together” the calculated local effects.

In practice, we calculate the local effects by finite differences so the predictor \(f\) need not be differentiable. Thus, to estimate the ALE from data, we compute the following:

\[\widehat{\text{ALE}}(x_1)=\underbrace{\sum_{k=1}^{k(x_1)}\frac{1}{n(k)}\sum_{i:x_{1}^{(i)}\in{}N(k)}\left[f(z_{k},x^{(i)}_{\setminus{}1})-f(z_{k-1},x^{(i)}_{\setminus{}1})\right]}_{\text{uncentered ALE}} - c_1.\]

Here \(z_0,z_1,\dots\) is a sufficiently fine grid of the feature \(x_1\) (typically quantiles so that each resulting interval contains a similar number of points), \(N(k)\) denotes the interval \([z_{k-1}, z_k)\), \(n(k)\) denotes the number of points falling into interval \(N(k)\), and \(k(x_1)\) denotes the index of the interval into which \(x_1\) falls, i.e. \(x_1\in [z_{k(x_1)-1}, z_{k(x_1)})\). Finally, the notation \(f(z_{k}, x^{(i)}_{\setminus{}1})\) means that for instance \(i\) we replace \(x_1\) with the value of the right interval end-point \(z_k\) (likewise for the left interval end-point using \(z_{k-1}\)), leaving the rest of the features unchanged, and evaluate the difference of predictions at these points.
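Putting the pieces together, the following is a minimal numpy sketch of this estimator for a single feature; the quantile grid, the 2D array X and the vectorized predict_fn are illustrative assumptions rather than the library implementation:

import numpy as np

def ale_estimate(predict_fn, X, feature=0, n_bins=10):
    x = X[:, feature]
    # grid z_0, ..., z_K from quantiles so that bins contain similar numbers of points
    grid = np.unique(np.quantile(x, np.linspace(0., 1., n_bins + 1)))
    # k(x): index of the interval N(k) = [z_{k-1}, z_k) each point falls into
    bin_idx = np.clip(np.digitize(x, grid[1:-1]) + 1, 1, len(grid) - 1)

    ale = np.zeros(len(grid))
    for k in range(1, len(grid)):
        mask = bin_idx == k
        if not mask.any():
            continue
        X_lo, X_hi = X[mask].copy(), X[mask].copy()
        X_lo[:, feature] = grid[k - 1]  # left interval end-point z_{k-1}
        X_hi[:, feature] = grid[k]      # right interval end-point z_k
        # average finite-difference local effect over the n(k) points in N(k)
        ale[k] = (predict_fn(X_hi) - predict_fn(X_lo)).mean()
    ale = np.cumsum(ale)  # accumulate the local effects over the grid

    # centre so that the ALE has zero mean over the empirical distribution of the feature
    return grid, ale - ale[bin_idx].mean()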

The following plot illustrates the ALE estimation process. We have subdivided the feature range of \(x_1\) into \(5\) bins with roughly the same number of points indexed by \(N(k)\). Focusing on bin \(N(4)\), for each point falling into this bin, we replace their \(x_1\) feature value by the left and right end-points of the interval, \(z_3\) and \(z_4\). Then we evaluate the difference of the predictions of these points and calculate the average by dividing by the number of points in this interval \(n(4)\). We do this for every interval and sum up (accumulate) the results. Finally, to calculate the constant \(c_1\), we subtract the expectation over \(p(x_1)\) of the calculated uncentered ALE so that the resulting ALE values have mean zero over the distribution \(p(x_1)\).

ALE-estimation

We show the results of the ALE calculation for a model \(f(x_1, x_2) = 3x_1 + 2x_2^2\). The resulting plots correctly recover the linear effect of \(x_1\) and the quadratic effect of \(x_2\) on \(f\). Note that the ALE is estimated at each interval edge and linearly interpolated in between; for real applications it is important to have a sufficiently fine grid, but also one with enough points in each interval for accurate estimates. The x-axis also shows the deciles of the feature to help judge in which parts of the feature space the ALE plot is interpolating more and the estimate might be less trustworthy.
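For reference, plots like the ones below can be reproduced with the explainer from the usage section; in the following minimal sketch the uniform sampling and the sample size are illustrative assumptions:

import numpy as np
from alibi.explainers import ALE, plot_ale

def predict_fn(X):
    # toy additive model f(x1, x2) = 3*x1 + 2*x2^2
    return 3 * X[:, 0] + 2 * X[:, 1] ** 2

X = np.random.uniform(-1, 1, size=(2000, 2))  # illustrative sample
ale = ALE(predict_fn, feature_names=['x1', 'x2'], target_names=['f'])
exp = ale.explain(X)
plot_ale(exp)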

The value of \(\text{ALE}(x_i)\) is the main effect of feature \(x_i\) relative to the average effect of that feature. For example, the plot shows \(\text{ALE}(x_1)=0.75\) at \(x_1=0.7\): if we sample data from the joint distribution \(p(x_1, x_2)\) (i.e. realistic data points) with \(x_1=0.7\), then we would expect the first order effect of feature \(x_1\) to be \(0.75\) higher than the average first order effect of this feature. Seeing that the \(\text{ALE}(x_1)\) plot crosses zero at \(x_1\approx 0.45\), realistic data points with \(x_1\approx 0.45\) will have an effect on \(f\) similar to the average first order effect of \(x_1\). For realistic data points with smaller \(x_1\), the effect becomes negative with respect to the average effect.

ALE-plots

Note

The interpretation of ALE plots is mainly qualitative—the focus should be on the shape of the curve at different points in the feature range. The steepness of the tangent to the curve (or the slope of the linear interpolation) determines the size of the effect of that feature locally—the steeper the tangent, the bigger the effect.

Because the model \(f(x_1, x_2) = 3x_1 + 2x_2^2\) is explicit and differentiable, we can calculate the ALE functions analytically which gives us even more insight. The partial derivatives are given by \((3, 4x_2)\). Assuming that the conditional distributions \(p(x_2\vert x_1)\) and \(p(x_1\vert x_2)\) are uniform, the expectations over the conditional distributions are equal to the partial derivatives. Next, we integrate over the range of the features to obtain the uncentered ALE functions:

\begin{align} \text{ALE}_u(x_1) &= \int_{\min(x_1)}^{x_1}3dz_1 = 3x_1 - 3\min(x_1) \\ \text{ALE}_u(x_2) &= \int_{\min(x_2)}^{x_2}4z_2dz_2 = 2x_2^2 - 2\min(x_2)^2. \end{align}

Finally, to obtain the ALE functions, we center by setting \(c_i = \mathbb{E}(\text{ALE}_u(x_i))\) where the expectation is over the marginal distribution \(p(x_i)\):

\begin{align} \text{ALE}(x_1) &= 3x_1 - 3\min(x_1) - \mathbb{E}(3x_1 - 3\min(x_1)) = 3x_1 - 3\mathbb{E}(x_1) \\ \text{ALE}(x_2) &= 2x_2^2 - 2\min(x_2)^2 - \mathbb{E}(2x_2^2 -2\min(x_2)^2) = 2x_2^2 - 2\mathbb{E}(x_2^2). \end{align}

This calculation verifies that the ALE curves are the desired feature effects (linear for \(x_1\) and quadratic for \(x_2\)) relative to the mean feature effects across the dataset. In fact, if \(f\) is additive in the individual features, like our toy model, then the ALE main effects recover the correct additive components (Apley and Zhu (2016)). Furthermore, for additive models we have the decomposition \(f(x) = \mathbb{E}(f(x)) + \sum_{i=1}^{d}\text{ALE}(x_i)\), where the first term, the average prediction across the dataset \(X\), can be thought of as the zeroth order effect.
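Continuing the toy example above, this decomposition can be checked numerically from the explanation object; the following is a minimal sketch assuming a single-output regression model whose ALE arrays have a single target column:

import numpy as np

# approximate f(x) as the mean prediction plus the sum of per-feature ALE values,
# interpolating each ALE curve at the instance's feature value
recon = np.zeros(X.shape[0]) + exp.constant_value
for i in range(X.shape[1]):
    recon += np.interp(X[:, i], exp.feature_values[i], exp.ale_values[i][:, 0])

print(np.abs(recon - predict_fn(X)).max())  # small for additive models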