alibi_detect.od.mahalanobis module
- class alibi_detect.od.mahalanobis.Mahalanobis(threshold=None, n_components=3, std_clip=3, start_clip=100, max_n=None, cat_vars=None, ohe=False, data_type='tabular')
Bases: BaseDetector, FitMixin, ThresholdMixin
- __init__(threshold=None, n_components=3, std_clip=3, start_clip=100, max_n=None, cat_vars=None, ohe=False, data_type='tabular')
Outlier detector for tabular data using the Mahalanobis distance.
- Parameters:
threshold (Optional[float]) – Mahalanobis distance threshold used to classify outliers.
n_components (int) – Number of principal components used.
std_clip (int) – Feature-wise standard deviation used to clip the observations before updating the mean and covariance.
start_clip (int) – Number of observations before clipping is applied.
max_n (Optional[int]) – Algorithm behaves as if it has seen at most max_n points.
cat_vars (Optional[dict]) – Dict whose keys are the categorical columns and whose values are the number of categories per categorical variable.
ohe (bool) – Whether the categorical variables are one-hot encoded (OHE) or not. If not OHE, they are assumed to have ordinal encodings.
data_type (str) – Optionally specify the data type (tabular, image or time-series). Added to metadata.
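To illustrate what the threshold parameter acts on, here is a minimal sketch of batch Mahalanobis scoring with NumPy. It is not the library's implementation: it omits the online mean and covariance updates, the clipping controlled by std_clip and start_clip, and the projection onto n_components principal components. The helper name mahalanobis_scores, the toy data and the threshold value are all assumptions made for illustration, and the exact score used by the detector may differ.

```python
import numpy as np


def mahalanobis_scores(X, mean, cov):
    """Mahalanobis distance of each row of X from `mean` under covariance `cov`.

    Hypothetical helper for illustration only; not part of alibi_detect.
    """
    diff = X - mean
    # Pseudo-inverse keeps the sketch usable when cov is singular.
    cov_inv = np.linalg.pinv(cov)
    # Squared distance per row: diff_i^T @ cov_inv @ diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    # Guard against tiny negative values caused by round-off.
    return np.sqrt(np.clip(d2, 0.0, None))


# Toy data (assumed): zero mean, feature 0 has variance 4, feature 1 has variance 1.
mean = np.zeros(2)
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
X = np.array([[2.0, 0.0], [0.0, 3.0]])

scores = mahalanobis_scores(X, mean, cov)

# Instances whose distance exceeds the threshold are flagged as outliers.
threshold = 2.0  # assumed value
is_outlier = (scores > threshold).astype(int)
```

The first instance lies one standard deviation from the mean along feature 0 and stays below the threshold; the second lies three standard deviations away along feature 1 and is flagged.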
- cat2num(X)
Convert categorical variables to numerical values.
- Parameters:
X (ndarray) – Batch of instances to analyze.
- Return type:
ndarray
- Returns:
Batch of instances where categorical variables are converted to numerical values.
- fit(X, y=None, d_type='abdm', w=None, disc_perc=[25, 50, 75], standardize_cat_vars=True, feature_range=(-10000000000.0, 10000000000.0), smooth=1.0, center=True)
If categorical variables are present, then transform those to numerical values. This step is not necessary in the absence of categorical variables.
- Parameters:
X (ndarray) – Batch of instances from which the distances between categories are inferred.
y (Optional[ndarray]) – Model class predictions or ground truth labels for X. Used for the ‘mvdm’ and ‘abdm-mvdm’ pairwise distance metrics. Note that this is only compatible with classification problems. For regression problems, use the ‘abdm’ distance metric.
d_type (str) – Pairwise distance metric used for categorical variables. Currently, ‘abdm’, ‘mvdm’ and ‘abdm-mvdm’ are supported. ‘abdm’ infers context from the other variables while ‘mvdm’ uses the model predictions. ‘abdm-mvdm’ is a weighted combination of the two metrics.
w (Optional[float]) – Weight on the ‘abdm’ distance (between 0. and 1.) if d_type equals ‘abdm-mvdm’.
disc_perc (list) – List with percentiles used in binning of numerical features used for the ‘abdm’ and ‘abdm-mvdm’ pairwise distance measures.
standardize_cat_vars (bool) – Standardize numerical values of categorical variables if True.
feature_range (tuple) – Tuple with min and max ranges to allow for perturbed instances. Min and max ranges can be floats or numpy arrays with dimension (1 x number of features) for feature-wise ranges.
smooth (float) – Smoothing exponent between 0 and 1 for the distances. Lower values smooth the difference in distance metric between different features.
center (bool) – Whether to center the scaled distance measures. If False, the min distance for each feature except for the feature with the highest raw max distance will be the lower bound of the feature range, but the upper bound will be below the max feature range.
- Return type:
- infer_threshold(X, threshold_perc=95.0)
Update threshold by a value inferred from the percentage of instances considered to be outliers in a sample of the dataset.
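The idea of inferring a threshold from a percentage can be sketched as follows: score a sample of instances, then take the threshold_perc-th percentile of those scores as the new threshold. This is a minimal standard-library sketch under that assumption, not the library's code. The percentile helper, the linear interpolation rule and the toy scores are all hypothetical.

```python
def percentile(values, perc):
    """Linearly interpolated percentile (0 <= perc <= 100) of a non-empty sequence.

    Hypothetical helper for illustration only; not part of alibi_detect.
    """
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    rank = (len(ordered) - 1) * perc / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Toy outlier scores (assumed): 1.0, 2.0, ..., 100.0
scores = [float(i) for i in range(1, 101)]

# With threshold_perc=95.0, roughly the top 5% of scores exceed the threshold.
threshold = percentile(scores, 95.0)
n_flagged = sum(s > threshold for s in scores)
```

On these toy scores, 5 of the 100 instances end up above the inferred threshold.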
- predict(X, return_instance_score=True)
Compute outlier scores and transform into outlier predictions.
- Parameters:
X (ndarray) – Batch of instances.
return_instance_score (bool) – Whether to return instance level outlier scores.
- Return type:
- Returns:
Dictionary containing 'meta' and 'data' dictionaries. 'meta' has the model’s metadata. 'data' contains the outlier predictions and instance level outlier scores.
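As a rough illustration of consuming such a result, the sketch below builds a dictionary with the 'meta' and 'data' layout described above and extracts the indices of the flagged instances. The inner key names 'is_outlier' and 'instance_score', the metadata fields and all values are assumptions made for this example; check the actual keys returned by predict before relying on them.

```python
# Hypothetical result shaped like the dictionary described above.
# The 'data' keys and every value here are assumptions for illustration.
preds = {
    "meta": {"name": "Mahalanobis", "data_type": "tabular"},
    "data": {
        "is_outlier": [0, 1, 0],            # 1 marks a predicted outlier
        "instance_score": [0.8, 7.3, 1.1],  # per-instance outlier scores
    },
}

# Collect the positions of the instances predicted to be outliers.
outlier_idx = [
    i for i, flag in enumerate(preds["data"]["is_outlier"]) if flag == 1
]
```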