White-box and black-box models

Explainer algorithms can be categorised in many ways (see this table), but perhaps the most fundamental distinction is whether they work with white-box or black-box models.

White-box is a term used for any model that the explainer method can “look inside” and manipulate arbitrarily. In the context of alibi, this category corresponds to Python objects that represent models, for example instances of sklearn.base.BaseEstimator, tensorflow.keras.Model, torch.nn.Module etc. The exact type of the white-box model determines which white-box explainer methods can be applied. For example, tensorflow and torch models are differentiable, which enables gradient-based explainer methods such as Integrated Gradients [1], whilst various types of tree-based models are supported by TreeShap.

On the other hand, a black-box model is any model that the explainer method cannot inspect or modify. The only interaction with the model is via calling its predict function (or similar) on data and receiving predictions back. In the context of alibi, black-box models have a concrete definition: they are functions that take in a numpy array representing data and return a numpy array representing a prediction. Using type hints, we can define a general black-box model (also referred to as a prediction function) to be of type Callable[[np.ndarray], np.ndarray] [2]. Explainers that expect black-box models as input are very flexible, because any function that conforms to the expected type can be explained by black-box explainers.


In addition to the expected type, black-box models must be compatible with batch prediction. That is, alibi explainers assume that the first dimension of the input array is always the batch dimension.
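
As a minimal sketch of this contract, the following defines a toy prediction function in plain numpy and calls it on a batch and on a single instance. The weights `W` and the names `predictor` and `black_box` are illustrative assumptions, not part of alibi.

```python
from typing import Callable

import numpy as np

# Hypothetical weights of a toy linear classifier: 3 features, 2 classes.
W = np.array([[0.5, -0.5], [1.0, 0.0], [-1.0, 2.0]])


def predictor(X: np.ndarray) -> np.ndarray:
    """Map a batch of shape (n_samples, 3) to probabilities of shape (n_samples, 2)."""
    logits = X @ W
    # Numerically stable softmax over the class dimension.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


# The function conforms to the black-box type described above.
black_box: Callable[[np.ndarray], np.ndarray] = predictor

batch = np.random.default_rng(0).normal(size=(4, 3))
print(black_box(batch).shape)  # (4, 2)

# A single instance has shape (3,), so add a leading batch dimension first.
x = batch[0]
print(black_box(x[np.newaxis, :]).shape)  # (1, 2)
```

Note that this toy function would fail on an input of shape (3,), which is why the single instance is reshaped to (1, 3) before the call.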


There is currently one exception to the black-box interface: the AnchorText explainer expects the prediction function to be of type Callable[[List[str]], np.ndarray], i.e. the model is expected to work on batches of raw text (here List[str] indicates a batch of text strings). See this example for more information.
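
To illustrate the shape of such a text prediction function, here is a small sketch that uses a hypothetical keyword rule in place of a real text classifier. The names `POSITIVE_WORDS`, `text_predictor` and `predict_fn` are assumptions made for the example. Whether an explainer expects labels or probabilities as output should be checked against its own documentation.

```python
from typing import Callable, List

import numpy as np

# Hypothetical keyword list standing in for a real text classifier.
POSITIVE_WORDS = {"good", "great", "excellent"}


def text_predictor(texts: List[str]) -> np.ndarray:
    """Return one class label per input string (1 = positive, 0 = negative)."""
    labels = [
        int(any(word in POSITIVE_WORDS for word in text.lower().split()))
        for text in texts
    ]
    return np.array(labels)


# The function takes a batch of raw strings and returns a numpy array.
predict_fn: Callable[[List[str]], np.ndarray] = text_predictor

preds = predict_fn(["A great film", "Dull and far too long"])
print(preds.shape)  # (2,)
```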

Wrapping white-box models into black-box models

Models in Python all start out as white-box models (i.e. custom Python objects from some modelling library like sklearn or tensorflow). However, to be used with explainers that expect a black-box prediction function, the user has to define a prediction function that conforms to the black-box definition given above. Here we give a few common examples and some pointers about creating a black-box prediction function from a white-box model. In what follows we distinguish between the original white-box model and the wrapped black-box predictor function.

Scikit-learn models

All sklearn models expose a predict method that already conforms to the black-box function interface defined above, which makes it easy to create black-box predictors:

predictor = model.predict
explainer = SomeExplainer(predictor, **kwargs)

In some cases for classifiers it may be more appropriate to expose the predict_proba or decision_function method instead of predict, see an example on ALE for classifiers.

Tensorflow models

Tensorflow models (specifically instances of tensorflow.keras.Model) expose a predict method that takes in numpy arrays and returns predictions as numpy arrays [3]:

predictor = model.predict
explainer = SomeExplainer(predictor, **kwargs)

Pytorch models

Pytorch models (specifically instances of torch.nn.Module) expect and return instances of torch.Tensor from the forward method, thus we need to do a bit more work to define the predictor black-box function:

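def predictor(X: np.ndarray) -> np.ndarray:
    # dtype and device must match those of the model's parameters
    X = torch.as_tensor(X, dtype=dtype, device=device)
    # calling the module runs forward; move the result to cpu before converting to numpy
    return model(X).detach().cpu().numpy()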

Note that there are a few differences with tensorflow models:

  • Explicit conversion to a tensor with a specific dtype. Whilst tensorflow handles this internally when predict is called, for torch we need to do this manually.

  • Explicit device selection for the tensor. This is an important step as numpy arrays are limited to cpu and if your model is on a gpu it will expect its input tensors to be on a gpu.

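  • Explicit conversion of the prediction tensor to numpy. Here we detach the output from the gradient graph (as gradient information is not needed) and convert it to a numpy array. If the model is on a gpu, the output tensor must also be moved to the cpu (via .cpu()) before the conversion.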

If you are using Pytorch Lightning to create torch models, then the dtype and device can be retrieved as attributes of your LightningModule, see here.

General models

Given the above examples, the pattern for defining a black-box predictor from a white-box model is clear: define a predictor function that manipulates the inputs to and outputs from the underlying model in a way that conforms to the black-box model interface alibi expects:

def predictor(X: np.ndarray) -> np.ndarray:
    inp = transform_input(X)
    output = model(inp) # or call the model-specific prediction method
    output = transform_output(output)
    return output

explainer = SomeExplainer(predictor, **kwargs)

Here transform_input and transform_output are general user-defined functions that transform the input numpy arrays into a format that the model expects and transform the output predictions into a numpy array, so that predictor is an alibi-compatible black-box function.
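
As a concrete but purely illustrative instance of this pattern, the sketch below wraps a hypothetical model that works on lists of dictionaries rather than arrays. The `model` function, the `FEATURE_NAMES` order and the scoring formula are all assumptions made for the example.

```python
from typing import Dict, List

import numpy as np

# Hypothetical column order of the numpy input.
FEATURE_NAMES = ["age", "income"]


def model(records: List[Dict[str, float]]) -> List[float]:
    """Hypothetical stand-in model that works on lists of dicts, not arrays."""
    return [r["age"] / 10 + r["income"] / 1000 for r in records]


def transform_input(X: np.ndarray) -> List[Dict[str, float]]:
    # Build one dict per row, keyed by feature name.
    return [dict(zip(FEATURE_NAMES, row.tolist())) for row in X]


def transform_output(output: List[float]) -> np.ndarray:
    # Convert the list of scores to an array with one entry per instance.
    return np.asarray(output, dtype=float)


def predictor(X: np.ndarray) -> np.ndarray:
    return transform_output(model(transform_input(X)))


X = np.array([[30.0, 1000.0], [40.0, 2000.0]])
print(predictor(X))  # [4. 6.]
```

The wrapper keeps the batch dimension intact: two input rows produce two output entries.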


[1] At the time of writing IntegratedGradients only supports tensorflow models.


[2] Note that this definition limits black-box models to be single-input and single-output, which is what most black-box alibi explainers can handle. In the general case the definition can be extended to multi-input and multi-output models, i.e. taking in and/or returning multiple arrays.


[3] This is in contrast to the __call__ and call methods, which expect and return tensorflow.Tensor objects. However, using __call__ may be preferable for performance in some cases (this would require transforming inputs and outputs similar to the torch example).