# White-box and black-box models

Explainer algorithms can be categorised in many ways (see this
table), but perhaps the most
fundamental one is whether they work with **white-box** or **black-box**
models.

**White-box** is a term used for any model that the explainer method can
“look inside” and manipulate arbitrarily. In the context of `alibi`, this
category of models corresponds to Python objects that represent models,
for example instances of `sklearn.base.BaseEstimator`,
`tensorflow.keras.Model`, `torch.nn.Module` etc. The exact type of the
white-box model in question enables different white-box explainer
methods. For example, `tensorflow` and `torch` models are
*gradient-based*, which enables *gradient-based* explainer methods such
as Integrated Gradients [1], whilst various types of tree-based models
are supported by TreeShap.

On the other hand, a **black-box** model describes any model that the
explainer method may not inspect and modify arbitrarily. The only
interaction with the model is via calling its `predict` function (or
similar) on data and receiving predictions back. In the context of
`alibi`, black-box models have a concrete definition: they are functions
that take in a `numpy` array representing data and return a `numpy`
array representing a prediction. Using type hints, we can define a
general black-box model (also referred to as a *prediction function*) to
be of type `Callable[[np.ndarray], np.ndarray]` [2]. Explainers that
expect black-box models as input are very flexible, as *any* function
that conforms to the expected type can be explained by black-box
explainers.
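As a minimal illustration (with a made-up scoring function, not an `alibi` API), any plain Python function with this signature qualifies:

```python
import numpy as np

# Hypothetical prediction function: scores each instance by its feature mean.
# Any function of type Callable[[np.ndarray], np.ndarray] fits the black-box
# interface, regardless of what happens inside it.
def predictor(X: np.ndarray) -> np.ndarray:
    return X.mean(axis=1)

preds = predictor(np.array([[1.0, 2.0, 3.0],
                            [4.0, 5.0, 6.0]]))  # one score per instance
```

A black-box explainer receives only this callable; it never sees, and does not need, the object that produced it.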

**Note:** In addition to the expected type, black-box models **must** be
compatible with batch prediction, i.e. `alibi` explainers assume that
the first dimension of the input array is always the batch dimension.
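Concretely, a conforming function maps a `(batch, n_features)` array to one prediction per row. A sketch with a toy sum-based score:

```python
import numpy as np

def predictor(X: np.ndarray) -> np.ndarray:
    # X is assumed to have shape (batch, n_features)
    return X.sum(axis=1)

batch = np.ones((5, 3))   # a batch of 5 instances with 3 features each
out = predictor(batch)    # 5 predictions, one per instance in the batch
```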

**Warning:** There is currently one exception to the black-box
interface: the `AnchorText` explainer expects the prediction function to
be of type `Callable[[List[str]], np.ndarray]`, i.e. the model is
expected to work on batches of raw text (here `List[str]` indicates a
batch of text strings). See this example for more information.
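As a sketch of that interface (with a toy keyword-based classifier, purely illustrative), a conforming text prediction function might look like:

```python
from typing import List

import numpy as np

# Toy sentiment "model": the positive-class probability is high whenever the
# text contains the word "good". Only the signature matters to AnchorText.
def predictor(texts: List[str]) -> np.ndarray:
    return np.array([[0.1, 0.9] if "good" in t else [0.8, 0.2]
                     for t in texts])

probs = predictor(["a good film", "a dull film"])  # shape (2, 2)
```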

## Wrapping white-box models into black-box models

Models in Python all start out as white-box models (i.e. custom Python
objects from some modelling library like `sklearn` or `tensorflow`).
However, to be used with explainers that expect a black-box prediction
function, the user has to define a prediction function that conforms to
the black-box definition given above. Here we give a few common examples
and some pointers about creating a black-box prediction function from a
white-box model. In what follows we distinguish between the original
white-box `model` and the wrapped black-box `predictor` function.

### Scikit-learn models

All `sklearn` models expose a `predict` method that already conforms to
the black-box function interface defined above, which makes it easy to
create black-box predictors:

```
predictor = model.predict
explainer = SomeExplainer(predictor, **kwargs)
```

In some cases for classifiers it may be more appropriate to expose the
`predict_proba` or `decision_function` method instead of `predict`; see
an example on ALE for classifiers.
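For instance, a classifier's `predict` returns hard labels, whereas `predict_proba` returns class probabilities, which many explainers need. A small sketch with a logistic regression on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Expose class probabilities rather than hard labels
predictor = model.predict_proba
probs = predictor(X)   # shape (4, 2), each row sums to 1
```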

### Tensorflow models

Tensorflow models (specifically instances of `tensorflow.keras.Model`)
expose a `predict` method that takes in `numpy` arrays and returns
predictions as `numpy` arrays [3]:

```
predictor = model.predict
explainer = SomeExplainer(predictor, **kwargs)
```

### Pytorch models

Pytorch models (specifically instances of `torch.nn.Module`) expect and
return instances of `torch.Tensor` from the `forward` method, thus we
need to do a bit more work to define the black-box `predictor` function:

```
model.eval()

@torch.no_grad()
def predictor(X: np.ndarray) -> np.ndarray:
    # convert the input to a tensor with the dtype and device the model expects
    X = torch.as_tensor(X, dtype=dtype, device=device)
    # send the output back to the cpu and convert it to a numpy array
    return model.forward(X).cpu().numpy()
```

Note that there are a few differences with `tensorflow` models:

- Ensure the model is in evaluation mode (i.e., `model.eval()`) and that
  the mode does not change to training (i.e., `model.train()`) between
  consecutive calls to the explainer. Otherwise, consider including
  `model.eval()` inside the `predictor` function.
- Decorate the `predictor` with `@torch.no_grad()` to avoid the
  computation and storage of gradients, which are not needed.
- Explicit conversion to a tensor with a specific `dtype`. Whilst
  `tensorflow` handles this internally when `predict` is called, for
  `torch` we need to do this manually.
- Explicit device selection for the tensor. This is an important step as
  `numpy` arrays are limited to the cpu, and if your model is on a gpu
  it will expect its input tensors to be on a gpu.
- Explicit conversion of the prediction tensor to `numpy`. We first send
  the output to the cpu and then transform it into a `numpy` array.

If you are using Pytorch Lightning to create `torch` models, then the
`dtype` and `device` can be retrieved as attributes of your
`LightningModule`; see here.
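If you are not using Lightning, one common option (an illustrative sketch, using a stand-in `torch.nn.Linear` model) is to infer `dtype` and `device` from the model's own parameters:

```python
import numpy as np
import torch

model = torch.nn.Linear(3, 2)   # stand-in for any torch.nn.Module
model.eval()

# Infer dtype and device from the model's parameters so the input
# tensor always matches them
param = next(model.parameters())
dtype, device = param.dtype, param.device

@torch.no_grad()
def predictor(X: np.ndarray) -> np.ndarray:
    X = torch.as_tensor(X, dtype=dtype, device=device)
    return model(X).cpu().numpy()

out = predictor(np.random.rand(4, 3).astype(np.float32))  # shape (4, 2)
```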

### General models

Given the above examples, the pattern for defining a black-box predictor
from a white-box model is clear: define a `predictor` function that
manipulates the inputs to and outputs from the underlying model in a way
that conforms to the black-box model interface `alibi` expects:

```
def predictor(X: np.ndarray) -> np.ndarray:
    inp = transform_input(X)
    output = model(inp)  # or call the model-specific prediction method
    output = transform_output(output)
    return output

explainer = SomeExplainer(predictor, **kwargs)
```

Here `transform_input` and `transform_output` are general user-defined
functions that transform the input `numpy` arrays into the format the
model expects and transform the output predictions into a `numpy` array,
so that `predictor` is an `alibi`-compatible black-box function.
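As a concrete (hypothetical) instance of this pattern, suppose a model expects its inputs as a list of feature dicts and returns a plain Python list:

```python
import numpy as np

# Hypothetical white-box model with a non-numpy interface
class ToyModel:
    def predict(self, records):
        # expects a list of feature dicts, returns a list of scores
        return [sum(r.values()) for r in records]

model = ToyModel()
feature_names = ["a", "b"]

def transform_input(X: np.ndarray):
    # numpy rows -> list of {feature name: value} dicts
    return [dict(zip(feature_names, row)) for row in X]

def transform_output(output) -> np.ndarray:
    # plain Python list -> numpy array
    return np.asarray(output)

def predictor(X: np.ndarray) -> np.ndarray:
    return transform_output(model.predict(transform_input(X)))

preds = predictor(np.array([[1.0, 2.0], [3.0, 4.0]]))  # array([3., 7.])
```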