# Chi-Squared

## Overview

The drift detector applies feature-wise Chi-Squared tests to the categorical features. For multivariate data, the p-values obtained for each feature are aggregated via either the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls the probability of at least one false positive. The FDR correction, on the other hand, allows an expected fraction of false positives to occur. As with the other drift detectors, a preprocessing step can be applied, but the output features need to be categorical.
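To make the difference between the two corrections concrete, the following is a minimal sketch in plain Python. It assumes that the FDR correction follows the Benjamini-Hochberg procedure; the helper names `bonferroni` and `fdr` and the example p-values are illustrative only and are not taken from the library.

```python
def bonferroni(p_vals, p_val=0.05):
    """Return (is_drift, threshold) using the Bonferroni correction."""
    # Divide the significance level by the number of features tested.
    threshold = p_val / len(p_vals)
    is_drift = int(min(p_vals) < threshold)
    return is_drift, threshold


def fdr(p_vals, q_val=0.05):
    """Return (is_drift, threshold) using the Benjamini-Hochberg procedure (assumed)."""
    n = len(p_vals)
    # Fall back to the smallest rank threshold if no p-value passes.
    is_drift, threshold = 0, q_val / n
    for rank, p in enumerate(sorted(p_vals), start=1):
        rank_threshold = q_val * rank / n
        if p < rank_threshold:
            # Keep the threshold of the largest rank that passes.
            is_drift, threshold = 1, rank_threshold
    return is_drift, threshold


# Hypothetical feature-wise p-values for 4 features.
p_vals = [0.02, 0.03, 0.035, 0.6]
bonf_drift, bonf_threshold = bonferroni(p_vals)
fdr_drift, fdr_threshold = fdr(p_vals)
print(bonf_drift, bonf_threshold)
print(fdr_drift, fdr_threshold)
```

With these example values, no single p-value falls below the Bonferroni threshold, while the third-smallest p-value falls below its rank threshold under the FDR correction, so only the latter flags drift. This illustrates why the Bonferroni correction is described as the more conservative option.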

## Usage

### Initialize

Parameters:

• p_val: p-value used for significance of the Chi-Squared test for each feature. If the FDR correction method is used, this corresponds to the acceptable q-value.

• X_ref: Data used as reference distribution.

• preprocess_X_ref: Whether to already count and store the number of instances for each possible category of each variable of the reference data X_ref when initializing the detector. If a preprocessing step is specified, the step will be applied first. Defaults to True. Set it to False if the preprocessing step requires statistics from both the reference and test data.

• categories_per_feature: Optional dictionary mapping each feature column index to the number of possible categorical values for that feature, e.g. {0: 5, 1: 9, 2: 7}. If it is not specified, categories_per_feature is inferred from X_ref.

• update_X_ref: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.

• preprocess_fn: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique. Needs to return categorical features for the Chi-Squared detector.

• preprocess_kwargs: Keyword arguments for preprocess_fn.

• correction: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).

• n_features: Number of features used in the Chi-Squared test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.

• n_infer: Number of instances used to infer the number of features after the preprocessing step, if it needs to be inferred. The number of instances required can depend on the specific preprocessing step.

• data_type: Optionally specifies the data type, which is added to the metadata. E.g. 'tabular'.
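The two update strategies named for update_X_ref can be sketched in plain Python as follows. This is an illustration of the general techniques (a sliding window and reservoir sampling, here the classic "Algorithm R"), not the library's implementation. The function names, the `n_seen` counter and the `rng` argument are hypothetical, and the sketch assumes the initial reference set holds at most N instances.

```python
import random


def update_last(x_ref, x_new, n):
    """Keep only the n most recent instances."""
    return (x_ref + x_new)[-n:]


def update_reservoir(x_ref, x_new, n, n_seen, rng):
    """Reservoir sampling: each instance seen so far is retained with equal probability."""
    reservoir = list(x_ref)
    for x in x_new:
        n_seen += 1
        if len(reservoir) < n:
            # Reservoir not full yet: always keep the new instance.
            reservoir.append(x)
        else:
            # Replace a random slot with probability n / n_seen.
            j = rng.randrange(n_seen)
            if j < n:
                reservoir[j] = x
    return reservoir, n_seen


x_ref = list(range(5))       # hypothetical reference instances 0..4
x_new = list(range(5, 12))   # hypothetical new batch 5..11

last = update_last(x_ref, x_new, n=5)
reservoir, n_seen = update_reservoir(
    x_ref, x_new, n=5, n_seen=len(x_ref), rng=random.Random(0)
)
print(last)
print(reservoir, n_seen)
```

The sliding window tracks only recent data, so the reference distribution follows gradual change. Reservoir sampling instead keeps a uniform sample over everything seen, which makes the reference set change more slowly.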

Initialized drift detector example:

```python
from alibi_detect.cd import ChiSquareDrift

cd = ChiSquareDrift(p_val=0.05, X_ref=X_ref)
```
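As a rough illustration of what the categories_per_feature and preprocess_X_ref parameters refer to, the sketch below infers the categories of each column from a small reference set and counts the instances per category. It uses plain Python lists and hypothetical helper names (`infer_categories`, `count_categories`); the data is made up, and the detector's internal representation may differ.

```python
from collections import Counter

# Hypothetical reference data: 6 instances, 2 integer-encoded categorical features.
X_ref = [
    [0, 2],
    [1, 0],
    [0, 1],
    [2, 2],
    [1, 2],
    [0, 0],
]


def infer_categories(X):
    """Map each column index to the sorted list of values observed in that column."""
    n_features = len(X[0])
    return {f: sorted({row[f] for row in X}) for f in range(n_features)}


def count_categories(X, categories):
    """Count instances per category for every feature, including zero counts."""
    counts = {}
    for f, cats in categories.items():
        observed = Counter(row[f] for row in X)
        counts[f] = [observed.get(c, 0) for c in cats]
    return counts


categories = infer_categories(X_ref)
# Same shape as the categories_per_feature argument: column index -> number of categories.
categories_per_feature = {f: len(cats) for f, cats in categories.items()}
ref_counts = count_categories(X_ref, categories)
print(categories_per_feature)
print(ref_counts)
```

Inferring categories from the reference data only finds values that actually occur in it. If a category can appear in production but is absent from X_ref, passing categories_per_feature explicitly avoids missing it.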


### Detect Drift

We detect data drift by calling predict on a batch of instances X. Setting return_p_val to True returns the feature-wise p-values before the multivariate correction, together with the threshold used by the detector (either for the univariate case or after the multivariate correction). Drift can also be detected at the feature level by setting drift_type to 'feature'. In that case no multivariate correction takes place, since the output consists of n_features univariate tests. For drift detection on all the features combined, with the correction applied, use 'batch'.

The prediction takes the form of a dictionary with meta and data keys. meta contains the detector’s metadata while data is also a dictionary which contains the actual predictions stored in the following keys:

• is_drift: 1 if the sample tested has drifted from the reference data and 0 otherwise.

• p_val: contains feature-level p-values if return_p_val equals True.

• threshold: for feature-level drift detection the threshold equals the p-value used for the significance of the Chi-Squared test. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.

• distance: feature-wise Chi-Squared test statistics between the reference data and the new batch if return_distance equals True.

```python
preds_drift = cd.predict(X, drift_type='batch', return_p_val=True, return_distance=True)
```
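The feature-wise test statistic reported under distance can be illustrated with a short sketch of the Pearson Chi-Squared statistic for a single feature, computed from the category counts in the reference data and in the new batch. The function name `chi2_statistic` and the counts are hypothetical. In practice a library routine such as `scipy.stats.chi2_contingency` would typically be used, since it also returns the p-value.

```python
def chi2_statistic(ref_counts, test_counts):
    """Pearson Chi-Squared statistic for a 2 x K contingency table."""
    n_ref, n_test = sum(ref_counts), sum(test_counts)
    total = n_ref + n_test
    stat = 0.0
    for r, t in zip(ref_counts, test_counts):
        col_total = r + t
        if col_total == 0:
            # A category absent from both samples contributes nothing.
            continue
        # Expected counts under the hypothesis that both samples share one distribution.
        exp_ref = n_ref * col_total / total
        exp_test = n_test * col_total / total
        stat += (r - exp_ref) ** 2 / exp_ref + (t - exp_test) ** 2 / exp_test
    return stat


# Hypothetical per-category counts for one feature with 3 categories.
ref_counts = [50, 30, 20]
test_counts = [30, 30, 40]

stat = chi2_statistic(ref_counts, test_counts)
dof = len(ref_counts) - 1  # (2 - 1) * (K - 1) degrees of freedom
print(stat, dof)
```

A larger statistic means the category frequencies in the batch deviate more from those in the reference data. The p-value follows from the Chi-Squared distribution with the given degrees of freedom.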


The drift detectors can be saved and loaded in the same way as other detectors when using the built-in preprocessing steps (alibi_detect.cd.preprocess.UAE and alibi_detect.cd.preprocess.HiddenOutput) or no preprocessing at all:

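```python
from alibi_detect.utils.saving import save_detector, load_detector

filepath = 'my_path'
save_detector(cd, filepath)

# preprocess_kwargs must already be defined (the keyword arguments for preprocess_fn)
cd = load_detector(filepath, **{'preprocess_kwargs': preprocess_kwargs})
```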