This page was generated from examples/cd_clf_adult.ipynb.

Learned drift detectors on Adult Census

Under the hood, drift detectors leverage a function of the data (also known as a test statistic) that is expected to take a large value if drift has occurred and a small value if not. The power of the detector is partly determined by how well the function satisfies this property. However, specifying such a function in advance can be very difficult.

Detecting drift with a learned classifier

The classifier-based drift detector simply tries to correctly distinguish instances from the reference data vs. the test set. The classifier is trained to output the probability that a given instance belongs to the test set. If the probabilities it assigns to unseen test instances are significantly higher (as determined by a Kolmogorov-Smirnov test) than those it assigns to unseen reference instances, then the test set must differ from the reference set and drift is flagged. To leverage all the available reference and test data, stratified cross-validation can be applied and the out-of-fold predictions are used for the significance test. Note that a new classifier is trained for each test set or even each fold within the test set.
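To make the idea concrete, the snippet below is a minimal, self-contained sketch of a classifier-based drift check on toy Gaussian data. It uses a single holdout split instead of the stratified cross-validation used by ClassifierDrift, and the data, LogisticRegression model and variable names are illustrative assumptions rather than part of this example:

import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# toy reference and (shifted) test data -- purely illustrative
rng = np.random.default_rng(0)
x_ref_toy = rng.normal(0, 1, size=(500, 5))
x_test_toy = rng.normal(0.5, 1, size=(500, 5))

# label reference instances 0 and test instances 1
x = np.concatenate([x_ref_toy, x_test_toy])
y = np.concatenate([np.zeros(len(x_ref_toy)), np.ones(len(x_test_toy))])
x_tr, x_ho, y_tr, y_ho = train_test_split(x, y, stratify=y, random_state=0)

# train a classifier to distinguish reference from test instances
clf = LogisticRegression().fit(x_tr, y_tr)
probs = clf.predict_proba(x_ho)[:, 1]  # P(instance belongs to the test set)

# K-S test: do the probabilities assigned to held-out test instances differ
# from those assigned to held-out reference instances?
stat, p_val = ks_2samp(probs[y_ho == 1], probs[y_ho == 0])
print(f'p-value: {p_val:.3f}')  # a small p-value flags drift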

Backend

The method works with the PyTorch, TensorFlow, and Sklearn frameworks. We will focus exclusively on the Sklearn backend in this notebook.

Dataset

The Adult dataset consists of 32,561 instances distributed over 2 classes based on whether the annual income is >50K. We evaluate drift on particular subsets of the data constructed based on the education level. As discussed below, our reference dataset will consist of people with a low education level, while our test dataset will consist of people with a high education level.

Note: we need to install alibi to fetch the adult dataset.

[ ]:
!pip install alibi
[1]:
import numpy as np
import pandas as pd
from typing import List, Tuple, Dict, Callable

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

from alibi.datasets import fetch_adult
from alibi_detect.cd import ClassifierDrift

Load Adult Census Dataset

[2]:
# fetch adult dataset
adult = fetch_adult()

# separate columns into numerical and categorical.
categorical_names = [adult.feature_names[i] for i in adult.category_map.keys()]
categorical_ids = list(adult.category_map.keys())

numerical_names = [name for i, name in enumerate(adult.feature_names) if i not in adult.category_map.keys()]
numerical_ids = [i for i in range(len(adult.feature_names)) if i not in adult.category_map.keys()]

X = adult.data

We split the dataset in two based on the education level. We define a low_education level consisting of 'Dropout', 'High School grad' and 'Bachelors', and a high_education level consisting of 'Bachelors', 'Masters' and 'Doctorate'. We intentionally included an overlap between the two distributions, consisting of people who hold a Bachelors degree. Our goal is to detect that the two distributions are different.

[3]:
education_col = adult.feature_names.index('Education')
education = adult.category_map[education_col]
print(education)
['Associates', 'Bachelors', 'Doctorate', 'Dropout', 'High School grad', 'Masters', 'Prof-School']
[4]:
# define low education
low_education = [
    education.index('Dropout'),
    education.index('High School grad'),
    education.index('Bachelors')
]
# define high education
high_education = [
    education.index('Bachelors'),
    education.index('Masters'),
    education.index('Doctorate')
]
print("Low education:", [education[i] for i in low_education])
print("High education:", [education[i] for i in high_education])
Low education: ['Dropout', 'High School grad', 'Bachelors']
High education: ['Bachelors', 'Masters', 'Doctorate']
[5]:
# select instances for low and high education
low_education_mask = pd.Series(X[:, education_col]).isin(low_education).to_numpy()
high_education_mask = pd.Series(X[:, education_col]).isin(high_education).to_numpy()
X_low, X_high = X[low_education_mask], X[high_education_mask]

We sample our reference dataset from the low_education level. In addition, we sample two other datasets:

  • x_h0 - sampled from the low_education level to support the null hypothesis (i.e., the two distributions are identical);

  • x_h1 - sampled from the high_education level to support the alternative hypothesis (i.e., the two distributions are different);

[6]:
size = 1000
np.random.seed(0)

# define reference and H0 dataset
idx_low = np.random.choice(np.arange(X_low.shape[0]), size=2*size, replace=False)
x_ref, x_h0 = train_test_split(X_low[idx_low], test_size=0.5, random_state=5, shuffle=True)

# define reference and H1 dataset
idx_high = np.random.choice(np.arange(X_high.shape[0]), size=size, replace=False)
x_h1 = X_high[idx_high]

Define dataset pre-processor

[7]:
# define numerical standard scaler.
num_transf = StandardScaler()

# define categorical one-hot encoder.
cat_transf = OneHotEncoder(
    categories=[range(len(x)) for x in adult.category_map.values()],
    handle_unknown="ignore"
)

# Define column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", cat_transf, categorical_ids),
        ("num", num_transf, numerical_ids),
    ],
    sparse_threshold=0
)

# fit preprocessor.
preprocessor = preprocessor.fit(np.concatenate([x_ref, x_h0, x_h1]))

Utils

[8]:
labels = ['No!', 'Yes!']

def print_preds(preds: dict, preds_name: str) -> None:
    print(preds_name)
    print(f"Drift? {labels[preds['data']['is_drift']]}")
    print(f"p-value: {preds['data']['p_val']:.3f}")
    print('')

Drift detection

We perform a binomial test using a RandomForestClassifier.
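Setting binarize_preds=True means the out-of-fold probabilities are binarized into 0/1 predictions and a binomial test checks whether the classifier separates reference from test instances better than chance. A rough standalone illustration with made-up counts (assuming scipy >= 1.7 for binomtest):

from scipy.stats import binomtest

# hypothetical counts: 540 of 1000 held-out instances correctly classified,
# with equally sized reference and test sets so chance accuracy is 0.5
result = binomtest(k=540, n=1000, p=0.5, alternative='greater')
print(f'p-value: {result.pvalue:.4f}')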

[9]:
# define classifier
model = RandomForestClassifier()

# define drift detector with binarize prediction
detector = ClassifierDrift(
    x_ref=x_ref,
    model=model,
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    binarize_preds=True,
    n_folds=2,
)

# print results
print_preds(detector.predict(x=x_h0), "H0")
print_preds(detector.predict(x=x_h1), "H1")
Both `n_folds` and `train_size` specified. By default `n_folds` is used.
`retrain_from_scratch=True` sets automatically the parameter `warm_start=False`.
`use_oob=False` sets automatically the classifier parameters `oob_score=False`.
H0
Drift? No!
p-value: 0.681

H1
Drift? Yes!
p-value: 0.000

As expected, when testing against x_h0 we fail to reject \(H_0\), while when testing against x_h1 there is enough evidence to reject \(H_0\) and flag that the data has drifted.

For classifiers that do not support predict_proba but do support decision_function, we can perform a K-S test on the scores by setting preds_type='scores'.

[10]:
# define model
model = GradientBoostingClassifier()


# define drift detector
detector = ClassifierDrift(
    x_ref=x_ref,
    model=model,
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    preds_type='scores',
    binarize_preds=False,
    n_folds=2,
)

# print results
print_preds(detector.predict(x=x_h0), "H0")
print_preds(detector.predict(x=x_h1), "H1")
Both `n_folds` and `train_size` specified. By default `n_folds` is used.
`retrain_from_scratch=True` sets automatically the parameter `warm_start=False`.
H0
Drift? No!
p-value: 0.294

H1
Drift? Yes!
p-value: 0.000

Some models can return a poor estimate of the class label probabilities, and some might not support probability predictions at all. We can add calibration on top of each classifier to obtain better probability estimates and perform a K-S test. For demonstration purposes, we will calibrate a LinearSVC, which does not support predict_proba, but any other classifier would work.
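The calibration step is conceptually similar to wrapping the model in scikit-learn's CalibratedClassifierCV, which maps decision scores to probabilities. The sketch below illustrates that idea on the preprocessed reference-vs-test task; it is a standalone illustration of the concept, not the detector's internal code path:

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# reference-vs-test classification task on the preprocessed features
x = preprocessor.transform(np.concatenate([x_ref, x_h1]))
y = np.concatenate([np.zeros(len(x_ref)), np.ones(len(x_h1))])

# LinearSVC only exposes decision_function; isotonic calibration maps its
# scores to probabilities via internal cross-validation
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=10000), method='isotonic', cv=3)
calibrated.fit(x, y)
probs = calibrated.predict_proba(x)[:, 1]  # calibrated P(instance from the test set)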

[11]:
# define model - does not support predict_proba
model = LinearSVC(max_iter=10000)

# define drift detector
detector = ClassifierDrift(
    x_ref=x_ref,
    model=model,
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    binarize_preds=False,
    n_folds=2,
    use_calibration=True,
    calibration_kwargs={'method': 'isotonic'}
)

# print results
print_preds(detector.predict(x=x_h0), "H0")
print_preds(detector.predict(x=x_h1), "H1")
Both `n_folds` and `train_size` specified. By default `n_folds` is used.
Using calibration to obtain the prediction probabilities.
H0
Drift? No!
p-value: 0.457

H1
Drift? Yes!
p-value: 0.000

Speeding things up

In order to use the entire dataset and obtain the unbiased predictions required to perform the statistical test, the ClassifierDrift detector has the option to perform an n_folds split. Although appealing due to its data efficiency, this method can be slow since it requires training n_folds classifiers.

For the RandomForestClassifier we can avoid retraining n_folds classifiers by using the out-of-bag predictions. In a RandomForestClassifier each tree is trained on a separate dataset obtained by sampling the original training set with replacement, a method known as bagging. On average, only about 63% of the unique samples from the original dataset are used to train each tree (Bostrom). Thus, for each tree, we can obtain predictions for the remaining out-of-bag samples (i.e., the other ~37%). By aggregating the out-of-bag predictions across all the trees we can eventually obtain a prediction for each sample in the original dataset. Note the word 'eventually': if the number of trees is too small, some samples may never fall out-of-bag and thus never receive a prediction.
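To see this mechanism outside the detector, scikit-learn exposes the aggregated out-of-bag predictions through the oob_decision_function_ attribute when oob_score=True. The sketch below illustrates it on the reference-vs-test task; it is a standalone illustration, not what ClassifierDrift runs internally:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# reference-vs-test task on the preprocessed features
x = preprocessor.transform(np.concatenate([x_ref, x_h1]))
y = np.concatenate([np.zeros(len(x_ref)), np.ones(len(x_h1))])

# with bootstrap=True and oob_score=True, each sample receives a prediction
# aggregated only over the trees that did not see it during training
rf = RandomForestClassifier(n_estimators=400, bootstrap=True, oob_score=True, random_state=0)
rf.fit(x, y)

# with too few trees, some rows of oob_decision_function_ can be NaN
# (samples that were never out-of-bag); with 400 trees this is very unlikely
oob_probs = rf.oob_decision_function_[:, 1]  # out-of-bag P(instance from the test set)
print(f'Out-of-bag accuracy: {rf.oob_score_:.3f}')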

For demonstration purposes, we will compare the running time of the ClassifierDrift detector when using a RandomForestClassifier in two setups: n_folds=5, use_oob=False and use_oob=True.

[12]:
n_estimators = 400
n_folds = 5
[13]:
%%time
# define drift detector
detector_rf = ClassifierDrift(
    x_ref=x_ref,
    model=RandomForestClassifier(n_estimators=n_estimators),
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    binarize_preds=False,
    n_folds=n_folds
)

# print results
print_preds(detector_rf.predict(x=x_h0), "H0")
print_preds(detector_rf.predict(x=x_h1), "H1")
Both `n_folds` and `train_size` specified. By default `n_folds` is used.
`retrain_from_scratch=True` sets automatically the parameter `warm_start=False`.
`use_oob=False` sets automatically the classifier parameters `oob_score=False`.
H0
Drift? No!
p-value: 0.670

H1
Drift? Yes!
p-value: 0.000

CPU times: user 5.13 s, sys: 4.92 ms, total: 5.14 s
Wall time: 5.13 s
[14]:
%%time
# define drift detector
detector_rf_oob = ClassifierDrift(
    x_ref=x_ref,
    model=RandomForestClassifier(n_estimators=n_estimators),
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    binarize_preds=False,
    use_oob=True
)

# print results
print_preds(detector_rf_oob.predict(x=x_h0), "H0")
print_preds(detector_rf_oob.predict(x=x_h1), "H1")
`retrain_from_scratch=True` sets automatically the parameter `warm_start=False`.
`use_oob=True` sets automatically the classifier parameters `boostrap=True` and `oob_score=True`. `train_size` and `n_folds` are ignored when `use_oob=True`.
H0
Drift? No!
p-value: 0.905

H1
Drift? Yes!
p-value: 0.000

CPU times: user 1.39 s, sys: 18.3 ms, total: 1.41 s
Wall time: 1.41 s

We can observe that in this particular setting, using the out-of-bag predictions speeds up the procedure by a factor of almost 4 (5.13 s vs. 1.41 s wall time).