White-box and black-box models
Explainer algorithms can be categorised in many ways (see this table), but perhaps the most fundamental one is whether they work with white-box or black-box models.
White-box is a term used for any model that the explainer method can “look inside” and manipulate arbitrarily. In the context of alibi this category of models corresponds to Python objects that represent models, for example instances of sklearn.base.BaseEstimator, tensorflow.keras.Model, torch.nn.Module etc. The exact type of the white-box model in question enables different white-box explainer methods. For example, tensorflow and torch models support gradient computation, which enables gradient-based explainer methods such as Integrated Gradients [1], whilst various types of tree-based models are supported by TreeShap.
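As an illustration, a white-box tensorflow model can be handed directly to a gradient-based explainer (a minimal sketch, assuming model is a trained tensorflow.keras.Model and X, y are a batch of inputs and the targets to attribute to):

from alibi.explainers import IntegratedGradients

ig = IntegratedGradients(model)
explanation = ig.explain(X, target=y)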
On the other hand, a black-box model describes any model that the explainer method may not inspect and modify arbitrarily. The only interaction with the model is via calling its predict function (or similar) on data and receiving predictions back. In the context of alibi black-box models have a concrete definition: they are functions that take in a numpy array representing data and return a numpy array representing a prediction. Using type hints we can define a general black-box model (also referred to as a prediction function) to be of type Callable[[np.ndarray], np.ndarray] [2]. Explainers that expect black-box models as input are very flexible as any function that conforms to the expected type can be explained by black-box explainers.
Note
In addition to the expected type, black-box models must be compatible with batch prediction, i.e. alibi explainers assume that the first dimension of the input array is always the batch dimension.
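For illustration, a toy function that satisfies this interface and the batch convention (not a real model, just a sketch of the expected signature) could look like:

import numpy as np

def predictor(X: np.ndarray) -> np.ndarray:
    # X has shape (batch, n_features); one prediction is returned per instance
    return X.sum(axis=1, keepdims=True)

predictor(np.ones((4, 3)))  # returns an array of shape (4, 1)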
Warning
There is currently one exception to the black-box interface: the AnchorText explainer expects the prediction function to be of type Callable[[List[str]], np.ndarray], i.e. the model is expected to work on batches of raw text (here List[str] indicates a batch of text strings). See this example for more information.
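Such a prediction function might, for instance, wrap a text-classification pipeline (a minimal sketch, where vectorizer and clf are assumed to be a fitted text vectorizer and sklearn classifier, not part of alibi):

from typing import List

import numpy as np

def predictor(texts: List[str]) -> np.ndarray:
    # vectorize the batch of raw strings, then return class probabilities
    return clf.predict_proba(vectorizer.transform(texts))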
Wrapping white-box models into black-box models
Models in Python all start out as white-box models (i.e. custom Python objects from some modelling library like sklearn or tensorflow). However, to be used with explainers that expect a black-box prediction function, the user has to define a prediction function that conforms to the black-box definition given above. Here we give a few common examples and some pointers about creating a black-box prediction function from a white-box model. In what follows we distinguish between the original white-box model and the wrapped black-box predictor function.
Scikit-learn models
All sklearn models expose a predict method that already conforms to the black-box function interface defined above, which makes it easy to create black-box predictors:
predictor = model.predict
explainer = SomeExplainer(predictor, **kwargs)
In some cases for classifiers it may be more appropriate to expose the predict_proba or decision_function method instead of predict; see an example on ALE for classifiers.
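For example, for classifiers that implement predict_proba, exposing class probabilities instead of predicted labels is as simple as:

predictor = model.predict_proba
explainer = SomeExplainer(predictor, **kwargs)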
Tensorflow models
Tensorflow models (specifically instances of tensorflow.keras.Model) expose a predict method that takes in numpy arrays and returns predictions as numpy arrays [3]:
predictor = model.predict
explainer = SomeExplainer(predictor, **kwargs)
Pytorch models
Pytorch models (specifically instances of torch.nn.Module) expect and return instances of torch.Tensor from the forward method, thus we need to do a bit more work to define the predictor black-box function:
model.eval()

@torch.no_grad()
def predictor(X: np.ndarray) -> np.ndarray:
    X = torch.as_tensor(X, dtype=dtype, device=device)
    return model.forward(X).cpu().numpy()
Note that there are a few differences with tensorflow models:

- Ensure the model is in evaluation mode (i.e., model.eval()) and that the mode does not change to training (i.e., model.train()) between consecutive calls to the explainer. Otherwise consider including model.eval() inside the predictor function.
- Decorate the predictor with @torch.no_grad() to avoid the computation and storage of gradients, which are not needed.
- Explicit conversion to a tensor with a specific dtype. Whilst tensorflow handles this internally when predict is called, for torch we need to do this manually.
- Explicit device selection for the tensor. This is an important step as numpy arrays are limited to the cpu, and if your model is on a gpu it will expect its input tensors to be on a gpu.
- Explicit conversion of the prediction tensor to numpy. We first send the output to the cpu and then transform it into a numpy array.
If you are using Pytorch Lightning to create torch models, then the dtype and device can be retrieved as attributes of your LightningModule, see here.
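For example (a minimal sketch, where lightning_module stands for your LightningModule instance):

dtype = lightning_module.dtype
device = lightning_module.device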
General models
Given the above examples, the pattern for defining a black-box predictor from a white-box model is clear: define a predictor function that manipulates the inputs to and outputs from the underlying model in a way that conforms to the black-box model interface alibi expects:
def predictor(X: np.ndarray) -> np.ndarray:
    inp = transform_input(X)
    output = model(inp)  # or call the model-specific prediction method
    output = transform_output(output)
    return output

explainer = SomeExplainer(predictor, **kwargs)
Here transform_input and transform_output are general user-defined functions that appropriately transform the input numpy arrays into a format that the model expects and transform the output predictions into a numpy array, so that predictor is an alibi compatible black-box function.
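As a purely illustrative sketch (the logits-returning torch classifier and the softmax output transform below are assumptions, not part of alibi), this pattern could be instantiated as:

import numpy as np
import torch

model.eval()  # model is assumed to be a trained torch.nn.Module classifier

@torch.no_grad()
def predictor(X: np.ndarray) -> np.ndarray:
    inp = torch.as_tensor(X, dtype=torch.float32)        # transform_input
    logits = model(inp)
    return torch.softmax(logits, dim=-1).cpu().numpy()   # transform_output

explainer = SomeExplainer(predictor, **kwargs)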