This page was generated from examples/cd_model_unc_cifar10_wine.ipynb.
Model uncertainty based drift detection on CIFAR-10 and Wine-Quality datasets
Method
Model-uncertainty drift detectors aim to directly detect drift that’s likely to affect the performance of a model of interest. The approach is to test for change in the number of instances falling into regions of the input space on which the model is uncertain in its predictions. For each instance in the reference set the detector obtains the model’s prediction and some associated notion of uncertainty. For a classifier this may be the entropy of the predicted label probabilities; for a regressor with dropout layers, Monte Carlo dropout can be used to provide a notion of uncertainty. The same is done for the test set and if significant differences in uncertainty are detected (via a Kolmogorov-Smirnov test) then drift is flagged.
It is important that the detector uses a reference set that is disjoint from the model’s training set (on which the model’s confidence may be higher).
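As a rough illustration of the classifier case, the two stages described above (predicted probabilities to entropy, then a two-sample comparison of the entropies) can be sketched in plain NumPy. The helper names `prediction_entropy` and `ks_statistic` are hypothetical and not part of Alibi Detect, the probabilities are made up, and the sketch computes only the Kolmogorov-Smirnov statistic; a detector would additionally convert such a statistic into a p-value.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of class probabilities (hypothetical helper)."""
    probs = np.asarray(probs, dtype=float)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])                       # evaluate both CDFs at every sample
    cdf_a = np.searchsorted(a, grid, side='right') / a.size
    cdf_b = np.searchsorted(b, grid, side='right') / b.size
    return float(np.abs(cdf_a - cdf_b).max())

# Made-up probabilities: confident on the reference set, uncertain on the test set.
probs_ref = np.tile([0.98, 0.01, 0.01], (5, 1))
probs_test = np.tile([1 / 3, 1 / 3, 1 / 3], (5, 1))

unc_ref = prediction_entropy(probs_ref)
unc_test = prediction_entropy(probs_test)
d_shift = ks_statistic(unc_ref, unc_test)   # entropies of the two sets do not overlap
d_same = ks_statistic(unc_ref, unc_ref)     # identical samples
print(d_shift, d_same)
```

A statistic close to 1 means the two sets of uncertainties barely overlap, while a statistic close to 0 means their empirical distributions are nearly indistinguishable.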
Backend
For models that require batch evaluation both PyTorch and TensorFlow frameworks are supported. Note, however, that Alibi Detect does not install PyTorch for you. Check the PyTorch docs for installation instructions.
Classifier uncertainty based drift detection
We start by demonstrating how to leverage model uncertainty to detect malicious drift when the model of interest is a classifier.
Dataset
CIFAR-10 consists of 60,000 32 by 32 RGB images equally distributed over 10 classes. We evaluate the drift detector on the CIFAR-10-C dataset (Hendrycks & Dietterich, 2019). The instances in CIFAR-10-C have been corrupted and perturbed by various types of noise, blur, brightness etc. at different levels of severity, leading to a gradual decline in the classification model performance. We also check for drift against the original test set with class imbalances.
[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import tensorflow as tf
import torch
from torch import nn
from alibi_detect.cd import ClassifierUncertaintyDrift, RegressorUncertaintyDrift
from alibi_detect.models.tensorflow import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model, fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c
from alibi_detect.models.pytorch import trainer
from alibi_detect.cd.utils import encompass_batching
Original CIFAR-10 data:
[2]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)
For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:
[3]:
corruptions = corruption_types_cifar10c()
print(corruptions)
['brightness', 'contrast', 'defocus_blur', 'elastic_transform', 'fog', 'frost', 'gaussian_blur', 'gaussian_noise', 'glass_blur', 'impulse_noise', 'jpeg_compression', 'motion_blur', 'pixelate', 'saturate', 'shot_noise', 'snow', 'spatter', 'speckle_noise', 'zoom_blur']
Let’s pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.
[4]:
corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255
We split the original test set into a reference dataset and a dataset which should not be rejected under the no-change null H0. We also split the corrupted data by corruption type:
[5]:
np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref,y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)
(5000, 32, 32, 3) (5000, 32, 32, 3)
[6]:
# check that the classes are more or less balanced
classes, counts_ref = np.unique(y_ref, return_counts=True)
counts_h0 = np.unique(y_h0, return_counts=True)[1]
print('Class Ref H0')
for cl, cref, ch0 in zip(classes, counts_ref, counts_h0):
    assert cref + ch0 == n_test // 10
    print('{} {} {}'.format(cl, cref, ch0))
Class Ref H0
0 472 528
1 510 490
2 498 502
3 492 508
4 501 499
5 495 505
6 493 507
7 501 499
8 516 484
9 522 478
[7]:
n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]
We can visualise the same instance for each corruption type:
[8]:
i = 1
n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for j in range(len(corruption)):
    plt.title(corruption[j])
    plt.axis('off')
    plt.imshow(X_corr[n_test * j + i])
    plt.show()
We can also verify that the performance of a classification model on CIFAR-10 drops significantly on this perturbed dataset:
[9]:
dataset = 'cifar10'
model = 'resnet32'
clf = fetch_tf_model(dataset, model)
acc = clf.evaluate(scale_by_instance(X_test), y_test, batch_size=128, verbose=0)[1]
print('Test set accuracy:')
print('Original {:.4f}'.format(acc))
clf_accuracy = {'original': acc}
for j in range(len(corruption)):
    acc = clf.evaluate(scale_by_instance(X_c[j]), y_test, batch_size=128, verbose=0)[1]
    clf_accuracy[corruption[j]] = acc
    print('{} {:.4f}'.format(corruption[j], acc))
Test set accuracy:
Original 0.9278
gaussian_noise 0.2208
motion_blur 0.6339
brightness 0.8913
pixelate 0.3666
Given the drop in performance, it is important that we detect the harmful data drift!
Detect drift
Unlike many other approaches we needn’t specify a dimension-reducing preprocessing step as the detector operates directly on the data as it is input to the model of interest. In fact, the two-stage projection input -> prediction -> uncertainty can be thought of as the projection from the input space onto the real line, ready to perform the test.
We simply pass the model to the detector and inform it that the predictions should be interpreted as ‘probs’ rather than ‘logits’ (i.e. a softmax has already been applied). By default uncertainty_type='entropy' is used as the notion of uncertainty for classifier predictions; however, uncertainty_type='margin' can be specified to deem the classifier’s predictions uncertain if they fall within a margin (e.g. in [0.45,0.55] for binary classifier probabilities), similar to Sethi and Kantardzic (2017).
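To make the margin idea concrete, here is a minimal NumPy sketch that marks a prediction as uncertain when its two largest class probabilities are closer than a chosen width. The helper `margin_uncertainty` and its `margin_width` argument are assumptions for illustration only; the library's internal computation may differ.

```python
import numpy as np

def margin_uncertainty(probs, margin_width=0.1):
    """Return 1 where the top two class probabilities differ by less than margin_width, else 0."""
    probs = np.asarray(probs, dtype=float)
    ordered = np.sort(probs, axis=-1)          # ascending along the class axis
    gap = ordered[:, -1] - ordered[:, -2]      # best minus second-best probability
    return (gap < margin_width).astype(int)

# Made-up binary classifier outputs.
probs = np.array([[0.52, 0.48],    # inside [0.45, 0.55]: uncertain
                  [0.90, 0.10]])   # far from the decision boundary: confident
flags = margin_uncertainty(probs)
print(flags)
```

For a binary classifier, a width of 0.1 on the gap between the two probabilities corresponds to the [0.45,0.55] band mentioned above.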
[10]:
cd = ClassifierUncertaintyDrift(
    X_ref, model=clf, backend='tensorflow', p_val=0.05, preds_type='probs'
)
Let’s check whether the detector thinks drift occurred on the different test sets and time the prediction calls:
[11]:
from timeit import default_timer as timer
labels = ['No!', 'Yes!']
def make_predictions(cd, x_h0, x_corr, corruption):
    # time the detector on the uncorrupted held-out split
    t = timer()
    preds = cd.predict(x_h0)
    dt = timer() - t
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('Feature-wise p-values:')
    print(preds['data']['p_val'])
    print(f'Time (s) {dt:.3f}')
    # repeat for each corrupted dataset
    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            t = timer()
            preds = cd.predict(x)
            dt = timer() - t
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print('Feature-wise p-values:')
            print(preds['data']['p_val'])
            print(f'Time (s) {dt:.3f}')
[12]:
make_predictions(cd, X_h0, X_c, corruption)
No corruption
Drift? No!
Feature-wise p-values:
[0.7868902]
Time (s) 15.574
Corruption type: gaussian_noise
Drift? Yes!
Feature-wise p-values:
[0.]
Time (s) 33.066
Corruption type: motion_blur
Drift? Yes!
Feature-wise p-values:
[0.]
Time (s) 32.637
Corruption type: brightness
Drift? No!
Feature-wise p-values:
[0.1102559]
Time (s) 34.126
Corruption type: pixelate
Drift? Yes!
Feature-wise p-values:
[0.]
Time (s) 34.351
Note here how drift is only detected for the corrupted datasets on which the model’s performance is significantly degraded. For the ‘brightness’ corruption, for which the model maintains 89% classification accuracy, the change in model uncertainty is not deemed significant (p-value 0.11, above the 0.05 threshold). For the other corruptions, which significantly hamper model performance, the malicious drift is detected.
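The decision rule behind these results is simply a comparison of each p-value with the chosen threshold. The short sketch below replays it on the accuracies and p-values printed above; the dictionary layout and variable names are illustrative choices, not part of the library.

```python
p_val_threshold = 0.05

# (accuracy, p-value) pairs copied from the outputs printed above
results = {
    'original': (0.9278, 0.7868902),
    'gaussian_noise': (0.2208, 0.0),
    'motion_blur': (0.6339, 0.0),
    'brightness': (0.8913, 0.1102559),
    'pixelate': (0.3666, 0.0),
}

# drift is flagged when the p-value falls below the threshold
flagged = {name: p_val < p_val_threshold for name, (acc, p_val) in results.items()}

for name, (acc, p_val) in results.items():
    print(f'{name:15s} accuracy={acc:.4f} p-value={p_val:.4f} drift={flagged[name]}')
```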
Regressor uncertainty based drift detection
We now demonstrate how to leverage model uncertainty to detect malicious drift when the model of interest is a regressor. This is a less general approach as regressors often make point predictions with no associated notion of uncertainty. However, if the model makes its predictions by ensembling the predictions of sub-models then we can consider the variation in the sub-model predictions as a notion of uncertainty. RegressorUncertaintyDrift supports models that output a vector of such sub-model predictions (uncertainty_type='ensemble') or deep learning models that include dropout layers and can therefore (as noted by Gal and Ghahramani 2016) be considered as an ensemble (uncertainty_type='mc_dropout', the default option).
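Both notions of regressor uncertainty reduce to measuring the spread of several predictions per instance. The NumPy sketch below is a toy illustration under stated assumptions: the helpers `ensemble_uncertainty` and `mc_dropout_predictions` are hypothetical, variance is chosen here as the spread measure, and the network uses random, untrained weights rather than a real model.

```python
import numpy as np

def ensemble_uncertainty(preds):
    """Per-instance variance across sub-model predictions; preds has shape (n, n_models)."""
    return np.var(np.asarray(preds, dtype=float), axis=-1)

def mc_dropout_predictions(x, w1, w2, p_drop=0.5, n_evals=100, seed=0):
    """Repeat the forward pass of a toy one-hidden-layer network with fresh dropout masks."""
    rng = np.random.default_rng(seed)
    hidden = np.maximum(x @ w1, 0.0)                   # ReLU hidden layer
    preds = []
    for _ in range(n_evals):
        keep = rng.random(hidden.shape) >= p_drop      # new dropout mask on every pass
        preds.append(((hidden * keep / (1.0 - p_drop)) @ w2)[:, 0])
    return np.stack(preds, axis=-1)                    # shape (n, n_evals)

rng = np.random.default_rng(0)
x = rng.random((4, 11))             # 4 made-up instances with 11 features
w1 = rng.normal(size=(11, 16))      # random, untrained weights
w2 = rng.normal(size=(16, 1))

mc_preds = mc_dropout_predictions(x, w1, w2)   # each column acts like one sub-model
unc = ensemble_uncertainty(mc_preds)           # one uncertainty value per instance
print(mc_preds.shape, unc.shape)
```

Treating each stochastic forward pass as a sub-model is what lets the dropout case reuse the same spread computation as the ensemble case.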
Dataset
The Wine Quality Data Set consists of 1599 and 4898 samples of red and white wine respectively. Each sample has an associated quality (as determined by experts) and 11 numeric features indicating its acidity, density, pH etc. We consider the regression problem of trying to predict the quality of a red wine sample given these features. We will then consider whether the model remains suitable for predicting the quality of white wine samples or whether the associated change in the underlying distribution should be considered as malicious drift.
First we load in the data.
[13]:
red = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';'
)
white = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';'
)
red.describe()
[13]:
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 | 1599.000000 |
mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
std | 1.741096 | 0.179060 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
min | 4.600000 | 0.120000 | 0.000000 | 0.900000 | 0.012000 | 1.000000 | 6.000000 | 0.990070 | 2.740000 | 0.330000 | 8.400000 | 3.000000 |
25% | 7.100000 | 0.390000 | 0.090000 | 1.900000 | 0.070000 | 7.000000 | 22.000000 | 0.995600 | 3.210000 | 0.550000 | 9.500000 | 5.000000 |
50% | 7.900000 | 0.520000 | 0.260000 | 2.200000 | 0.079000 | 14.000000 | 38.000000 | 0.996750 | 3.310000 | 0.620000 | 10.200000 | 6.000000 |
75% | 9.200000 | 0.640000 | 0.420000 | 2.600000 | 0.090000 | 21.000000 | 62.000000 | 0.997835 | 3.400000 | 0.730000 | 11.100000 | 6.000000 |
max | 15.900000 | 1.580000 | 1.000000 | 15.500000 | 0.611000 | 72.000000 | 289.000000 | 1.003690 | 4.010000 | 2.000000 | 14.900000 | 8.000000 |
We can see that the data for both red and white wine samples take the same format.
[14]:
white.describe()
[14]:
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 |
mean | 6.854788 | 0.278241 | 0.334192 | 6.391415 | 0.045772 | 35.308085 | 138.360657 | 0.994027 | 3.188267 | 0.489847 | 10.514267 | 5.877909 |
std | 0.843868 | 0.100795 | 0.121020 | 5.072058 | 0.021848 | 17.007137 | 42.498065 | 0.002991 | 0.151001 | 0.114126 | 1.230621 | 0.885639 |
min | 3.800000 | 0.080000 | 0.000000 | 0.600000 | 0.009000 | 2.000000 | 9.000000 | 0.987110 | 2.720000 | 0.220000 | 8.000000 | 3.000000 |
25% | 6.300000 | 0.210000 | 0.270000 | 1.700000 | 0.036000 | 23.000000 | 108.000000 | 0.991723 | 3.090000 | 0.410000 | 9.500000 | 5.000000 |
50% | 6.800000 | 0.260000 | 0.320000 | 5.200000 | 0.043000 | 34.000000 | 134.000000 | 0.993740 | 3.180000 | 0.470000 | 10.400000 | 6.000000 |
75% | 7.300000 | 0.320000 | 0.390000 | 9.900000 | 0.050000 | 46.000000 | 167.000000 | 0.996100 | 3.280000 | 0.550000 | 11.400000 | 6.000000 |
max | 14.200000 | 1.100000 | 1.660000 | 65.800000 | 0.346000 | 289.000000 | 440.000000 | 1.038980 | 3.820000 | 1.080000 | 14.200000 | 9.000000 |
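We shuffle the data and normalise each column by its maximum over the red wine samples, so that each red wine feature takes a value in [0,1], as does the red wine quality we seek to predict. The white wine data is scaled by the same red wine maxima.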
[15]:
red, white = np.asarray(red, np.float32), np.asarray(white, np.float32)
n_red, n_white = red.shape[0], white.shape[0]
col_maxes = red.max(axis=0)
red, white = red / col_maxes, white / col_maxes
red, white = red[np.random.permutation(n_red)], white[np.random.permutation(n_white)]
X, y = red[:, :-1], red[:, -1:]
X_corr, y_corr = white[:, :-1], white[:, -1:]
We split the red wine data into a set on which to train the model, a reference set with which to instantiate the detector and a set on which the detector should not flag drift. We then instantiate a DataLoader to pass the training data to a PyTorch model in batches.
[16]:
X_train, y_train = X[:(n_red//2)], y[:(n_red//2)]
X_ref, y_ref = X[(n_red//2):(3*n_red//4)], y[(n_red//2):(3*n_red//4)]
X_h0, y_h0 = X[(3*n_red//4):], y[(3*n_red//4):]
X_train_ds = torch.utils.data.TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
X_train_dl = torch.utils.data.DataLoader(X_train_ds, batch_size=32, shuffle=True, drop_last=True)
Regression model
We now define the regression model that we’ll train to predict the quality from the features. The exact details aren’t important other than the presence of at least one dropout layer. We then train the model for 30 epochs to optimise the mean squared error on the training data.
[17]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
reg = nn.Sequential(
    nn.Linear(11, 16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(32, 1)
).to(device)
trainer(reg, nn.MSELoss(), X_train_dl, device, torch.optim.Adam, learning_rate=0.001, epochs=30)
Epoch 1/30: 100%|██████████| 24/24 [00:00<00:00, 267.60it/s, loss=0.119]
Epoch 2/30: 100%|██████████| 24/24 [00:00<00:00, 267.14it/s, loss=0.0857]
Epoch 3/30: 100%|██████████| 24/24 [00:00<00:00, 266.90it/s, loss=0.043]
Epoch 4/30: 100%|██████████| 24/24 [00:00<00:00, 250.43it/s, loss=0.0553]
Epoch 5/30: 100%|██████████| 24/24 [00:00<00:00, 187.70it/s, loss=0.0365]
Epoch 6/30: 100%|██████████| 24/24 [00:00<00:00, 260.13it/s, loss=0.03]
Epoch 7/30: 100%|██████████| 24/24 [00:00<00:00, 245.26it/s, loss=0.0552]
Epoch 8/30: 100%|██████████| 24/24 [00:00<00:00, 241.64it/s, loss=0.0335]
Epoch 9/30: 100%|██████████| 24/24 [00:00<00:00, 229.60it/s, loss=0.0254]
Epoch 10/30: 100%|██████████| 24/24 [00:00<00:00, 244.06it/s, loss=0.0223]
Epoch 11/30: 100%|██████████| 24/24 [00:00<00:00, 225.92it/s, loss=0.0224]
Epoch 12/30: 100%|██████████| 24/24 [00:00<00:00, 204.65it/s, loss=0.0254]
Epoch 13/30: 100%|██████████| 24/24 [00:00<00:00, 226.55it/s, loss=0.0236]
Epoch 14/30: 100%|██████████| 24/24 [00:00<00:00, 226.15it/s, loss=0.0247]
Epoch 15/30: 100%|██████████| 24/24 [00:00<00:00, 250.90it/s, loss=0.0292]
Epoch 16/30: 100%|██████████| 24/24 [00:00<00:00, 208.73it/s, loss=0.0263]
Epoch 17/30: 100%|██████████| 24/24 [00:00<00:00, 294.98it/s, loss=0.0163]
Epoch 18/30: 100%|██████████| 24/24 [00:00<00:00, 173.03it/s, loss=0.0223]
Epoch 19/30: 100%|██████████| 24/24 [00:00<00:00, 186.12it/s, loss=0.0244]
Epoch 20/30: 100%|██████████| 24/24 [00:00<00:00, 228.75it/s, loss=0.0295]
Epoch 21/30: 100%|██████████| 24/24 [00:00<00:00, 240.64it/s, loss=0.0218]
Epoch 22/30: 100%|██████████| 24/24 [00:00<00:00, 242.33it/s, loss=0.019]
Epoch 23/30: 100%|██████████| 24/24 [00:00<00:00, 275.11it/s, loss=0.0257]
Epoch 24/30: 100%|██████████| 24/24 [00:00<00:00, 267.61it/s, loss=0.0165]
Epoch 25/30: 100%|██████████| 24/24 [00:00<00:00, 290.65it/s, loss=0.0192]
Epoch 26/30: 100%|██████████| 24/24 [00:00<00:00, 259.52it/s, loss=0.0224]
Epoch 27/30: 100%|██████████| 24/24 [00:00<00:00, 244.17it/s, loss=0.0173]
Epoch 28/30: 100%|██████████| 24/24 [00:00<00:00, 261.72it/s, loss=0.0159]
Epoch 29/30: 100%|██████████| 24/24 [00:00<00:00, 293.25it/s, loss=0.012]
Epoch 30/30: 100%|██████████| 24/24 [00:00<00:00, 298.45it/s, loss=0.022]
We now evaluate the trained model on both unseen samples of red wine and white wine. We see that, unsurprisingly, the model is better able to predict the quality of unseen red wine samples.
[18]:
reg = reg.eval()
reg_fn = encompass_batching(reg, backend='pytorch', batch_size=32)
preds_ref = reg_fn(X_ref)
preds_corr = reg_fn(X_corr)
ref_mse = np.square(preds_ref - y_ref).mean()
corr_mse = np.square(preds_corr - y_corr).mean()
print(f'MSE when predicting the quality of unseen red wine samples: {ref_mse}')
print(f'MSE when predicting the quality of unseen white wine samples: {corr_mse}')
MSE when predicting the quality of unseen red wine samples: 0.008570569567382336
MSE when predicting the quality of unseen white wine samples: 0.014613097533583641
Detect drift
We now look at whether a regressor-uncertainty detector would have picked up on this malicious drift. We instantiate the detector and obtain drift predictions on both the held-out red-wine samples and the white-wine samples. We specify uncertainty_type='mc_dropout' in this case, but alternatively we could have trained an ensemble model that for each instance outputs a vector of multiple independent predictions and specified uncertainty_type='ensemble'.
[19]:
cd = RegressorUncertaintyDrift(
    X_ref, model=reg, backend='pytorch', p_val=0.05, uncertainty_type='mc_dropout', n_evals=100
)
preds_h0 = cd.predict(X_h0)
preds_h1 = cd.predict(X_corr)
print(f"Drift detected on unseen red wine samples? {'yes' if preds_h0['data']['is_drift']==1 else 'no'}")
print(f"Drift detected on white wine samples? {'yes' if preds_h1['data']['is_drift']==1 else 'no'}")
print(f"p-value on unseen red wine samples? {preds_h0['data']['p_val']}")
print(f"p-value on white wine samples? {preds_h1['data']['p_val']}")
Drift detected on unseen red wine samples? no
Drift detected on white wine samples? yes
p-value on unseen red wine samples? [0.23237702]
p-value on white wine samples? [1.7934791e-10]