# Model uncertainty based drift detection on CIFAR-10 and Wine-Quality datasets

## Method

Model-uncertainty drift detectors aim to directly detect drift that is likely to affect the performance of the model of interest. The approach is to test for change in the number of instances falling into regions of the input space on which the model is uncertain in its predictions. For each instance in the reference set the detector obtains the model’s prediction and some associated notion of uncertainty. For a classifier this may be the entropy of the predicted label probabilities; for a regressor with dropout layers, dropout Monte Carlo can be used to provide a notion of uncertainty. The same is done for the test set and if significant differences in uncertainty are detected (via a Kolmogorov-Smirnov test) then drift is flagged.
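The core idea can be sketched outside of Alibi Detect in a few lines: compute a scalar uncertainty per instance (here the entropy of predicted class probabilities) and compare reference and test uncertainties with a two-sample KS test. The `prediction_entropy` helper and the Dirichlet-sampled probabilities below are purely illustrative, not part of the library.

```python
import numpy as np
from scipy.stats import ks_2samp

def prediction_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of each row of predicted class probabilities."""
    eps = 1e-12  # avoid log(0)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

# Hypothetical predicted probabilities for a 3-class classifier:
# confident predictions on the reference set, more diffuse ones on the test set.
rng = np.random.default_rng(0)
probs_ref = rng.dirichlet(alpha=[5., 1., 1.], size=500)
probs_test = rng.dirichlet(alpha=[2., 2., 2.], size=500)

stat, p_val = ks_2samp(prediction_entropy(probs_ref), prediction_entropy(probs_test))
drift = p_val < 0.05
```

If the two entropy distributions differ significantly, `drift` is `True` and the shift is likely to matter for the model's predictions.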

It is important that the detector uses a reference set that is disjoint from the model’s training set (on which the model’s confidence may be higher).

## Backend

For models that require batch evaluation both PyTorch and TensorFlow frameworks are supported. Note that Alibi Detect does not install PyTorch for you; check the PyTorch docs for how to do this.

## Classifier uncertainty based drift detection

We start by demonstrating how to leverage model uncertainty to detect malicious drift when the model of interest is a classifier.

### Dataset

CIFAR-10 consists of 60,000 32×32 RGB images equally distributed over 10 classes. We evaluate the drift detector on the CIFAR-10-C dataset (Hendrycks & Dietterich, 2019), in which the instances have been corrupted and perturbed by various types of noise, blur, brightness etc. at different levels of severity, leading to a gradual decline in classification model performance. We also check that drift is not flagged on a held-out portion of the original test set.

[1]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import tensorflow as tf
import torch
from torch import nn

from alibi_detect.cd import ClassifierUncertaintyDrift, RegressorUncertaintyDrift
from alibi_detect.models.tensorflow.resnet import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model, fetch_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c
from alibi_detect.models.pytorch.trainer import trainer
from alibi_detect.cd.utils import encompass_batching


Original CIFAR-10 data:

[2]:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)


For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:

[3]:

corruptions = corruption_types_cifar10c()
print(corruptions)

['brightness', 'contrast', 'defocus_blur', 'elastic_transform', 'fog', 'frost', 'gaussian_blur', 'gaussian_noise', 'glass_blur', 'impulse_noise', 'jpeg_compression', 'motion_blur', 'pixelate', 'saturate', 'shot_noise', 'snow', 'spatter', 'speckle_noise', 'zoom_blur']


Let’s pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.

[4]:

corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255


We split the original test set into a reference dataset and a dataset which should not be flagged as drifted under the no-change null hypothesis H0. We also split the corrupted data by corruption type:

[5]:

np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref, y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)

(5000, 32, 32, 3) (5000, 32, 32, 3)

[6]:

# check that the classes are more or less balanced
classes, counts_ref = np.unique(y_ref, return_counts=True)
counts_h0 = np.unique(y_h0, return_counts=True)[1]
print('Class Ref H0')
for cl, cref, ch0 in zip(classes, counts_ref, counts_h0):
    assert cref + ch0 == n_test // 10
    print('{}     {} {}'.format(cl, cref, ch0))

Class Ref H0
0     472 528
1     510 490
2     498 502
3     492 508
4     501 499
5     495 505
6     493 507
7     501 499
8     516 484
9     522 478

[7]:

n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]


We can visualise the same instance for each corruption type:

[8]:

i = 1

n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for _ in range(len(corruption)):
    plt.title(corruption[_])
    plt.axis('off')
    plt.imshow(X_corr[n_test * _ + i])
    plt.show()


We can also verify that the performance of a classification model on CIFAR-10 drops significantly on this perturbed dataset:

[9]:

dataset = 'cifar10'
model = 'resnet32'
clf = fetch_tf_model(dataset, model)
acc = clf.evaluate(scale_by_instance(X_test), y_test, batch_size=128, verbose=0)[1]
print('Test set accuracy:')
print('Original {:.4f}'.format(acc))
clf_accuracy = {'original': acc}
for _ in range(len(corruption)):
    acc = clf.evaluate(scale_by_instance(X_c[_]), y_test, batch_size=128, verbose=0)[1]
    clf_accuracy[corruption[_]] = acc
    print('{} {:.4f}'.format(corruption[_], acc))

Test set accuracy:
Original 0.9278
gaussian_noise 0.2208
motion_blur 0.6339
brightness 0.8913
pixelate 0.3666


Given the drop in performance, it is important that we detect the harmful data drift!

### Detect drift

Unlike many other approaches we needn’t specify a dimension-reducing preprocessing step as the detector operates directly on the data as it is input to the model of interest. In fact, the two-stage projection input -> prediction -> uncertainty can be thought of as a projection from the input space onto the real line, ready to perform the test.

We simply pass the model to the detector and inform it that the predictions should be interpreted as ‘probs’ rather than ‘logits’ (i.e. a softmax has already been applied). By default uncertainty_type='entropy' is used as the notion of uncertainty for classifier predictions, however uncertainty_type='margin' can be specified to deem the classifier’s predictions uncertain if they fall within a margin (e.g. in [0.45,0.55] for binary classifier probabilities) (similar to Sethi and Kantardzic (2017)).
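To illustrate the ‘margin’ notion of uncertainty (this sketch is for intuition only and is not the library’s exact implementation), a prediction can be deemed uncertain when its top two class probabilities lie within a margin of each other. For binary probabilities with a margin of 0.1 this recovers the [0.45, 0.55] window mentioned above:

```python
import numpy as np

def margin_uncertain(probs: np.ndarray, margin: float = 0.1) -> np.ndarray:
    """Flag predictions whose top-two class probabilities are within `margin`."""
    top2 = np.sort(probs, axis=-1)[:, -2:]  # two largest probabilities per row
    return (top2[:, 1] - top2[:, 0]) < margin

probs = np.array([
    [0.52, 0.48],  # near the decision boundary -> uncertain
    [0.95, 0.05],  # confident -> not uncertain
])
margin_uncertain(probs)  # array([ True, False])
```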

[10]:

cd = ClassifierUncertaintyDrift(
    X_ref, model=clf, backend='tensorflow', p_val=0.05, preds_type='probs'
)


Let’s check whether the detector thinks drift occurred on the different test sets and time the prediction calls:

[11]:

from timeit import default_timer as timer

labels = ['No!', 'Yes!']

def make_predictions(cd, x_h0, x_corr, corruption):
    t = timer()
    preds = cd.predict(x_h0)
    dt = timer() - t
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('Feature-wise p-values:')
    print(preds['data']['p_val'])
    print(f'Time (s) {dt:.3f}')

    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            t = timer()
            preds = cd.predict(x)
            dt = timer() - t
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print('Feature-wise p-values:')
            print(preds['data']['p_val'])
            print(f'Time (s) {dt:.3f}')

[12]:

make_predictions(cd, X_h0, X_c, corruption)

No corruption
Drift? No!
Feature-wise p-values:
[0.7868902]
Time (s) 15.574

Corruption type: gaussian_noise
Drift? Yes!
Feature-wise p-values:
[0.]
Time (s) 33.066

Corruption type: motion_blur
Drift? Yes!
Feature-wise p-values:
[0.]
Time (s) 32.637

Corruption type: brightness
Drift? No!
Feature-wise p-values:
[0.1102559]
Time (s) 34.126

Corruption type: pixelate
Drift? Yes!
Feature-wise p-values:
[0.]
Time (s) 34.351


Note here how drift is only detected for the corrupted datasets on which the model’s performance is significantly degraded. For the ‘brightness’ corruption, for which the model maintains 89% classification accuracy, the change in model uncertainty is not deemed significant (p-value 0.11, above the 0.05 threshold). For the other corruptions, which significantly hamper model performance, the malicious drift is detected.

## Regressor uncertainty based drift detection

We now demonstrate how to leverage model uncertainty to detect malicious drift when the model of interest is a regressor. This is a less general approach as regressors often make point-predictions with no associated notion of uncertainty. However, if the model makes its predictions by ensembling the predictions of sub-models then we can consider the variation in the sub-model predictions as a notion of uncertainty. RegressorUncertaintyDrift facilitates models that output a vector of such sub-model predictions (uncertainty_type='ensemble') or deep learning models that include dropout layers and can therefore (as noted by Gal and Ghahramani 2016) be considered as an ensemble (uncertainty_type='mc_dropout', the default option).
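As a rough sketch of the MC dropout idea (illustrative only; this is not how alibi_detect implements it internally), one can keep the dropout layers active at inference time, run several stochastic forward passes and take the standard deviation of the predictions as the per-instance uncertainty:

```python
import torch
from torch import nn

def mc_dropout_uncertainty(model: nn.Module, x: torch.Tensor, n_evals: int = 100) -> torch.Tensor:
    """Std of predictions over stochastic forward passes (dropout kept active)."""
    model.train()  # train mode keeps dropout layers stochastic at inference time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_evals)])
    return preds.std(dim=0)

# Toy regressor with dropout, mirroring the structure trained later in this notebook.
toy = nn.Sequential(nn.Linear(11, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 1))
unc = mc_dropout_uncertainty(toy, torch.randn(8, 11))
unc.shape  # torch.Size([8, 1])
```

The distribution of these per-instance uncertainties is what the detector compares between reference and test sets.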

### Dataset

The Wine Quality Data Set consists of 1599 and 4898 samples of red and white wine respectively. Each sample has an associated quality (as determined by experts) and 11 numeric features indicating its acidity, density, pH etc. We consider the regression problem of trying to predict the quality of a red wine sample given these features. We will then consider whether the model remains suitable for predicting the quality of white wine samples or whether the associated change in the underlying distribution should be considered as malicious drift.

First we load in the data.

[13]:

red = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';'
)
white = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';'
)
red.describe()

[13]:

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 |
| 25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
| 50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
| 75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
| max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |

We can see that the data for both red and white wine samples take the same format.

[14]:

white.describe()

[14]:

|  | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 |
| mean | 6.854788 | 0.278241 | 0.334192 | 6.391415 | 0.045772 | 35.308085 | 138.360657 | 0.994027 | 3.188267 | 0.489847 | 10.514267 | 5.877909 |
| std | 0.843868 | 0.100795 | 0.121020 | 5.072058 | 0.021848 | 17.007137 | 42.498065 | 0.002991 | 0.151001 | 0.114126 | 1.230621 | 0.885639 |
| min | 3.800000 | 0.080000 | 0.000000 | 0.600000 | 0.009000 | 2.000000 | 9.000000 | 0.987110 | 2.720000 | 0.220000 | 8.000000 | 3.000000 |
| 25% | 6.300000 | 0.210000 | 0.270000 | 1.700000 | 0.036000 | 23.000000 | 108.000000 | 0.991723 | 3.090000 | 0.410000 | 9.500000 | 5.000000 |
| 50% | 6.800000 | 0.260000 | 0.320000 | 5.200000 | 0.043000 | 34.000000 | 134.000000 | 0.993740 | 3.180000 | 0.470000 | 10.400000 | 6.000000 |
| 75% | 7.300000 | 0.320000 | 0.390000 | 9.900000 | 0.050000 | 46.000000 | 167.000000 | 0.996100 | 3.280000 | 0.550000 | 11.400000 | 6.000000 |
| max | 14.200000 | 1.100000 | 1.660000 | 65.800000 | 0.346000 | 289.000000 | 440.000000 | 1.038980 | 3.820000 | 1.080000 | 14.200000 | 9.000000 |

We shuffle and normalise the data such that each feature takes a value in [0,1], as does the quality we seek to predict.

[15]:

red, white = np.asarray(red, np.float32), np.asarray(white, np.float32)
n_red, n_white = red.shape[0], white.shape[0]

col_maxes = red.max(axis=0)
red, white = red / col_maxes, white / col_maxes
red, white = red[np.random.permutation(n_red)], white[np.random.permutation(n_white)]
X, y = red[:, :-1], red[:, -1:]
X_corr, y_corr = white[:, :-1], white[:, -1:]


We split the red wine data into a set on which to train the model, a reference set with which to instantiate the detector, and a set on which the detector should not flag drift. We then instantiate a DataLoader to pass the training data to a PyTorch model in batches.

[16]:

X_train, y_train = X[:(n_red//2)], y[:(n_red//2)]
X_ref, y_ref = X[(n_red//2):(3*n_red//4)], y[(n_red//2):(3*n_red//4)]
X_h0, y_h0 = X[(3*n_red//4):], y[(3*n_red//4):]

X_train_ds = torch.utils.data.TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
X_train_dl = torch.utils.data.DataLoader(X_train_ds, batch_size=32, shuffle=True, drop_last=True)


### Regression model

We now define the regression model that we’ll train to predict the quality from the features. The exact details aren’t important other than the presence of at least one dropout layer. We then train the model for 30 epochs to optimise the mean square error on the training data.

[17]:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

reg = nn.Sequential(
    nn.Linear(11, 16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(32, 1)
).to(device)

trainer(reg, nn.MSELoss(), X_train_dl, device, torch.optim.Adam, learning_rate=0.001, epochs=30)

Epoch 1/30: 100%|██████████| 24/24 [00:00<00:00, 267.60it/s, loss=0.119]
Epoch 2/30: 100%|██████████| 24/24 [00:00<00:00, 267.14it/s, loss=0.0857]
Epoch 3/30: 100%|██████████| 24/24 [00:00<00:00, 266.90it/s, loss=0.043]
Epoch 4/30: 100%|██████████| 24/24 [00:00<00:00, 250.43it/s, loss=0.0553]
Epoch 5/30: 100%|██████████| 24/24 [00:00<00:00, 187.70it/s, loss=0.0365]
Epoch 6/30: 100%|██████████| 24/24 [00:00<00:00, 260.13it/s, loss=0.03]
Epoch 7/30: 100%|██████████| 24/24 [00:00<00:00, 245.26it/s, loss=0.0552]
Epoch 8/30: 100%|██████████| 24/24 [00:00<00:00, 241.64it/s, loss=0.0335]
Epoch 9/30: 100%|██████████| 24/24 [00:00<00:00, 229.60it/s, loss=0.0254]
Epoch 10/30: 100%|██████████| 24/24 [00:00<00:00, 244.06it/s, loss=0.0223]
Epoch 11/30: 100%|██████████| 24/24 [00:00<00:00, 225.92it/s, loss=0.0224]
Epoch 12/30: 100%|██████████| 24/24 [00:00<00:00, 204.65it/s, loss=0.0254]
Epoch 13/30: 100%|██████████| 24/24 [00:00<00:00, 226.55it/s, loss=0.0236]
Epoch 14/30: 100%|██████████| 24/24 [00:00<00:00, 226.15it/s, loss=0.0247]
Epoch 15/30: 100%|██████████| 24/24 [00:00<00:00, 250.90it/s, loss=0.0292]
Epoch 16/30: 100%|██████████| 24/24 [00:00<00:00, 208.73it/s, loss=0.0263]
Epoch 17/30: 100%|██████████| 24/24 [00:00<00:00, 294.98it/s, loss=0.0163]
Epoch 18/30: 100%|██████████| 24/24 [00:00<00:00, 173.03it/s, loss=0.0223]
Epoch 19/30: 100%|██████████| 24/24 [00:00<00:00, 186.12it/s, loss=0.0244]
Epoch 20/30: 100%|██████████| 24/24 [00:00<00:00, 228.75it/s, loss=0.0295]
Epoch 21/30: 100%|██████████| 24/24 [00:00<00:00, 240.64it/s, loss=0.0218]
Epoch 22/30: 100%|██████████| 24/24 [00:00<00:00, 242.33it/s, loss=0.019]
Epoch 23/30: 100%|██████████| 24/24 [00:00<00:00, 275.11it/s, loss=0.0257]
Epoch 24/30: 100%|██████████| 24/24 [00:00<00:00, 267.61it/s, loss=0.0165]
Epoch 25/30: 100%|██████████| 24/24 [00:00<00:00, 290.65it/s, loss=0.0192]
Epoch 26/30: 100%|██████████| 24/24 [00:00<00:00, 259.52it/s, loss=0.0224]
Epoch 27/30: 100%|██████████| 24/24 [00:00<00:00, 244.17it/s, loss=0.0173]
Epoch 28/30: 100%|██████████| 24/24 [00:00<00:00, 261.72it/s, loss=0.0159]
Epoch 29/30: 100%|██████████| 24/24 [00:00<00:00, 293.25it/s, loss=0.012]
Epoch 30/30: 100%|██████████| 24/24 [00:00<00:00, 298.45it/s, loss=0.022]


We now evaluate the trained model on both unseen samples of red wine and white wine. We see that, unsurprisingly, the model is better able to predict the quality of unseen red wine samples.

[18]:

reg = reg.eval()
reg_fn = encompass_batching(reg, backend='pytorch', batch_size=32)
preds_ref = reg_fn(X_ref)
preds_corr = reg_fn(X_corr)

ref_mse = np.square(preds_ref - y_ref).mean()
corr_mse = np.square(preds_corr - y_corr).mean()

print(f'MSE when predicting the quality of unseen red wine samples: {ref_mse}')
print(f'MSE when predicting the quality of unseen white wine samples: {corr_mse}')

MSE when predicting the quality of unseen red wine samples: 0.008570569567382336
MSE when predicting the quality of unseen white wine samples: 0.014613097533583641


### Detect drift

We now look at whether a regressor-uncertainty detector would have picked up on this malicious drift. We instantiate the detector and obtain drift predictions on both the held-out red-wine samples and the white-wine samples. We specify uncertainty_type='mc_dropout' in this case, but alternatively we could have trained an ensemble model that for each instance outputs a vector of multiple independent predictions and specified uncertainty_type='ensemble'.
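For the uncertainty_type='ensemble' alternative mentioned above, the model’s forward pass would need to return a vector of independent sub-model predictions per instance. A minimal sketch of such a model (illustrative only; the sub-model architecture and member count are arbitrary) might look like:

```python
import torch
from torch import nn

class Ensemble(nn.Module):
    """Regressor whose forward pass returns one prediction per sub-model."""
    def __init__(self, n_models: int = 5, n_features: int = 11):
        super().__init__()
        self.models = nn.ModuleList(
            nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, 1))
            for _ in range(n_models)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shape [batch, n_models]: the spread across columns is the uncertainty signal.
        return torch.cat([m(x) for m in self.models], dim=-1)

ens = Ensemble()
out = ens(torch.randn(4, 11))
out.shape  # torch.Size([4, 5])
```

Such a model could then be passed to RegressorUncertaintyDrift with uncertainty_type='ensemble' in place of the dropout-based approach used here.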

[19]:

cd = RegressorUncertaintyDrift(
    X_ref, model=reg, backend='pytorch', p_val=0.05, uncertainty_type='mc_dropout', n_evals=100
)
preds_h0 = cd.predict(X_h0)
preds_h1 = cd.predict(X_corr)

print(f"Drift detected on unseen red wine samples? {'yes' if preds_h0['data']['is_drift']==1 else 'no'}")
print(f"Drift detected on white wine samples? {'yes' if preds_h1['data']['is_drift']==1 else 'no'}")

print(f"p-value on unseen red wine samples? {preds_h0['data']['p_val']}")
print(f"p-value on white wine samples? {preds_h1['data']['p_val']}")

Drift detected on unseen red wine samples? no
Drift detected on white wine samples? yes
p-value on unseen red wine samples? [0.23237702]
p-value on white wine samples? [1.7934791e-10]