alibi.utils package

class alibi.utils.BertBaseUncased(preloading=True)[source]

Bases: LanguageModel

SUBWORD_PREFIX = '##'

Language model subword prefix.

__init__(preloading=True)[source]

Initialize BertBaseUncased.

Parameters:

preloading (bool) – See alibi.utils.lang_model.LanguageModel.__init__().

is_subword_prefix(token)[source]

Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has its own convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special set of characters '##'. On the other hand, for RobertaBase only the head token is prefixed with the special character 'Ġ', so the absence of the prefix identifies the tail tokens. We call these special characters SUBWORD_PREFIX. Because the conventions differ, this method has to be implemented for each language model. See the module docstring for naming conventions.

Parameters:

token (str) – Token to be checked if it is a subword.

Return type:

bool

Returns:

True if the given token is a subword (i.e., a tail token). False otherwise.

property mask: str

Returns the mask token.

Return type:

str
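
A minimal usage sketch of the subword convention (assumes the bert-base-uncased weights can be fetched at construction time; the tokenization shown is indicative):

from alibi.utils import BertBaseUncased

lm = BertBaseUncased()                           # preloading=True fetches the model
tokens = lm.tokenizer.tokenize("tokenization")   # e.g. ['token', '##ization']
tails = [t for t in tokens if lm.is_subword_prefix(t)]  # tokens carrying '##'
print(lm.mask)                                   # the model's mask token, '[MASK]'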

class alibi.utils.DistilbertBaseUncased(preloading=True)[source]

Bases: LanguageModel

SUBWORD_PREFIX = '##'

Language model subword prefix.

__init__(preloading=True)[source]

Initialize DistilbertBaseUncased.

Parameters:

preloading (bool) – See alibi.utils.lang_model.LanguageModel.__init__().

is_subword_prefix(token)[source]

Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has its own convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special set of characters '##'. On the other hand, for RobertaBase only the head token is prefixed with the special character 'Ġ', so the absence of the prefix identifies the tail tokens. We call these special characters SUBWORD_PREFIX. Because the conventions differ, this method has to be implemented for each language model. See the module docstring for naming conventions.

Parameters:

token (str) – Token to be checked if it is a subword.

Return type:

bool

Returns:

True if the given token is a subword (i.e., a tail token). False otherwise.

property mask: str

Returns the mask token.

Return type:

str

class alibi.utils.DistributedExplainer(distributed_opts, explainer_type, explainer_init_args, explainer_init_kwargs, concatenate_results=True, return_generator=False)[source]

Bases: object

A class that orchestrates the execution of a batch of explanations in parallel.

__getattr__(item)[source]

Accesses actor attributes. Use sparingly, as this involves a remote call (that is, these attributes belong to an object in a different process). The intended use is to retrieve any common state across the actors at the end of the computation in order to form the response (see notes 2 & 3).

Parameters:

item (str) – The explainer attribute to be returned.

Return type:

Any

Returns:

The value of the attribute specified by item.

Raises:

ValueError – If the actor index is invalid.

Notes

  1. This method assumes that the actor implements a return_attribute method.

  2. Note that we are indexing the idle actors. This means that if a pool was initialised with 5 actors and 3 are busy, indexing with index 2 will raise an IndexError.

  3. The order of _idle_actors constantly changes - an actor is removed from it if there is a task to execute and appended back when the task is complete. Therefore, indexing at the same position as computation proceeds will result in retrieving state from different processes.

__init__(distributed_opts, explainer_type, explainer_init_args, explainer_init_kwargs, concatenate_results=True, return_generator=False)[source]

Creates a pool of actors (i.e., replicas of an instantiated explainer_type, each in a separate process) which can explain batches of instances in parallel via calls to get_explanation.

Parameters:
  • distributed_opts (Dict[str, Any]) –

    A dictionary with the following type (minimal signature):

from typing import Optional, TypedDict

class DistributedOpts(TypedDict):
    n_cpus: Optional[int]
    batch_size: Optional[int]
    

    The dictionary may contain two additional keys:

    • 'actor_cpu_frac' : (float, <= 1.0, > 0.0) - This is used to create more than one process on one CPU/GPU. This may not speed up CPU-intensive tasks but is worth experimenting with when few physical cores are available. In particular, it is highly useful when the user wants to share a GPU between multiple tasks, with the caveat that the machine learning framework itself needs to support running multiple replicas on the same GPU. See the ray documentation for details.

    • 'algorithm' : str - This is specified internally by the caller. It is used to register target function callbacks for the parallel pool. These should be implemented in the global scope. If not specified, its value will be 'default', which selects a default target function that expects the actor to have a get_explanation method.

  • explainer_type (Any) – Explainer class.

  • explainer_init_args (Tuple) – Positional arguments to explainer constructor.

  • explainer_init_kwargs (dict) – Keyword arguments to explainer constructor.

  • concatenate_results (bool) – If True, concatenates the results. See alibi.utils.distributed.concatenate_minibatches() for more details.

  • return_generator (bool) – If True, a generator that returns the results in the order the computation finishes is returned when get_explanation is called. Otherwise, the order of the results is the same as the order of the minibatches.

Notes

When return_generator=True, the caller has to take elements from the generator (e.g., by calling next) in order to start computing the results (because the ray pool is implemented as a generator).
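
For orientation, a hedged sketch of the intended workflow. MyExplainer is a hypothetical stand-in whose get_explanation method matches the default target function's expectation (see the 'algorithm' note above); ray must be installed:

import numpy as np
from alibi.utils import DistributedExplainer

class MyExplainer:
    # Hypothetical explainer; the default target function calls get_explanation.
    def get_explanation(self, X, **kwargs):
        return X.mean(axis=1)  # stand-in "explanation" for a minibatch

dist_explainer = DistributedExplainer(
    distributed_opts={'n_cpus': 2, 'batch_size': 10},
    explainer_type=MyExplainer,
    explainer_init_args=(),
    explainer_init_kwargs={},
    concatenate_results=False,
)
X = np.random.rand(100, 5)
results = dist_explainer.get_explanation(X)  # list of minibatch results, in order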

property actor_index: int

Returns the index of the actor for which state is returned.

Return type:

int

concatenate: Callable

create_parallel_pool(explainer_type, explainer_init_args, explainer_init_kwargs)[source]

Creates a pool of actors that can explain the rows of a dataset in parallel.

Parameters:

See constructor documentation.

get_explanation(X, **kwargs)[source]

Performs distributed explanations of instances in X.

Parameters:
  • X (ndarray) – A batch of instances to be explained. Split into batches according to the settings passed to the constructor.

  • **kwargs – Any keyword-arguments for the explainer explain method.

Return type:

Union[Generator[Tuple[int, Any], None, None], List[Any], Any]

Returns:

The explanations are returned as

  • a generator, if the return_generator option is specified. This is used so that the caller can access the results as they are computed. This is the only case when this method is non-blocking and the caller needs to call next on the generator to trigger the parallel computation.

  • a list of objects, whose type depends on the return type of the explainer. This is returned if no custom preprocessing function is specified.

  • an object, whose type depends on the return type of the concatenation function when called with a list of minibatch results ordered in the same way as the minibatches.

return_attribute(name)[source]

Returns an attribute specified by its name. Used in a distributed context where the properties cannot be accessed using the dot syntax.

Return type:

Any

set_actor_index(value)[source]

Sets the actor index. This is used when the DistributedExplainer is in a separate process, because ray does not support calling property setters remotely.

class alibi.utils.LanguageModel(model_path, preloading=True)[source]

Bases: ABC

SUBWORD_PREFIX = ''

Language model subword prefix.

__init__(model_path, preloading=True)[source]

Initialize the language model.

Parameters:
  • model_path (str) – transformers package model path.

  • preloading (bool) – Whether to preload the online version of the transformer. If False, a call to from_disk method is expected.

caller: Callable

from_disk(path)[source]

Loads a model from disk.

Parameters:

path (Union[str, Path]) – Path to the checkpoint.

head_tail_split(text)[source]

Splits the text into a head and a tail. Some language models support only a limited number of tokens, so the text must be split to meet this constraint. After the split, only the head is used in subsequent operations; the tail remains unchanged.

Parameters:

text (str) – Text to be split in head and tail.

Return type:

Tuple[str, str, List[str], List[str]]

Returns:

Tuple consisting of the head, the tail, and their corresponding lists of tokens.
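
For illustration, a sketch with an over-long input (assumes a preloaded model; the split point follows max_num_tokens):

from alibi.utils import DistilbertBaseUncased

lm = DistilbertBaseUncased()
long_text = "word " * 1000    # exceeds the model's token limit
head, tail, head_tokens, tail_tokens = lm.head_tail_split(long_text)
# Only `head` is used in subsequent operations; `tail` is left unchanged.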

is_punctuation(token, punctuation)[source]

Checks if the given token is punctuation.

Parameters:
  • token (str) – Token to be checked if it is punctuation.

  • punctuation (str) – String containing all punctuation to be considered.

Return type:

bool

Returns:

True if the token is punctuation. False otherwise.

is_stop_word(tokenized_text, start_idx, punctuation, stopwords)[source]

Checks whether the word starting at the given index is in the list of stopwords.

Parameters:
  • tokenized_text (List[str]) – Tokenized text.

  • start_idx (int) – Starting index of a word.

  • stopwords (Optional[List[str]]) – List of stop words. The words in this list should be lowercase.

  • punctuation (str) – Punctuation to be considered. See alibi.utils.lang_model.LanguageModel.select_word().

Return type:

bool

Returns:

True if the token is in the stopwords list. False otherwise.

abstract is_subword_prefix(token)[source]

Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has its own convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special set of characters '##'. On the other hand, for RobertaBase only the head token is prefixed with the special character 'Ġ', so the absence of the prefix identifies the tail tokens. We call these special characters SUBWORD_PREFIX. Because the conventions differ, this method has to be implemented for each language model. See the module docstring for naming conventions.

Parameters:

token (str) – Token to be checked if it is a subword.

Return type:

bool

Returns:

True if the given token is a subword (i.e., a tail token). False otherwise.

abstract property mask: str

Returns the mask token.

Return type:

str

property mask_id: int

Returns the mask token id.

Return type:

int

property max_num_tokens: int

Returns the maximum number of tokens allowed by the model.

Return type:

int

model: Any

predict_batch_lm(x, vocab_size, batch_size)[source]

TensorFlow language model batch predictions for AnchorText.

Parameters:
  • x (BatchEncoding) – Batch of instances.

  • vocab_size (int) – Vocabulary size of language model.

  • batch_size (int) – Batch size used for predictions.

Return type:

ndarray

Returns:

y – Array with model predictions.

select_word(tokenized_text, start_idx, punctuation)[source]

Given a tokenized text and the starting index of a word, selects the entire word. Note that a word can be composed of multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). The tail tokens can be identified based on the presence/absence of SUBWORD_PREFIX. See alibi.utils.lang_model.LanguageModel.is_subword_prefix() for more details.

Parameters:
  • tokenized_text (List[str]) – Tokenized text.

  • start_idx (int) – Starting index of a word.

  • punctuation (str) – String of punctuation characters to be considered. If a token composed only of characters in punctuation is encountered, the search terminates.

Return type:

str

Returns:

The word obtained by concatenating [head_token tail_token_1 tail_token_2 ... tail_token_k].
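
A brief sketch of reassembling a word from its tokens (assumes a preloaded model; the tokenization shown is indicative):

import string
from alibi.utils import BertBaseUncased

lm = BertBaseUncased()
tokens = lm.tokenizer.tokenize("an unbelievable result")
# Recover the full word whose head token sits at index 1,
# consuming any '##'-prefixed tail tokens that follow it.
word = lm.select_word(tokens, start_idx=1, punctuation=string.punctuation)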

to_disk(path)[source]

Saves a model to disk.

Parameters:

path (Union[str, Path]) – Path to the checkpoint.
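
Together with from_disk, this supports an offline workflow; a sketch, with './bert-checkpoint' as a hypothetical path:

from alibi.utils import BertBaseUncased

lm = BertBaseUncased()             # fetches the online model
lm.to_disk('./bert-checkpoint')    # hypothetical checkpoint directory

lm_offline = BertBaseUncased(preloading=False)  # skip the online fetch
lm_offline.from_disk('./bert-checkpoint')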

tokenizer: Any

class alibi.utils.RobertaBase(preloading=True)[source]

Bases: LanguageModel

SUBWORD_PREFIX = 'Ġ'

Language model subword prefix.

__init__(preloading=True)[source]

Initialize RobertaBase.

Parameters:

preloading (bool) – See alibi.utils.lang_model.LanguageModel.__init__().

is_subword_prefix(token)[source]

Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has its own convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special set of characters '##'. On the other hand, for RobertaBase only the head token is prefixed with the special character 'Ġ', so the absence of the prefix identifies the tail tokens. We call these special characters SUBWORD_PREFIX. Because the conventions differ, this method has to be implemented for each language model. See the module docstring for naming conventions.

Parameters:

token (str) – Token to be checked if it is a subword.

Return type:

bool

Returns:

True if the given token is a subword (i.e., a tail token). False otherwise.

property mask: str

Returns the mask token.

Return type:

str

alibi.utils.gen_category_map(data, categorical_columns=None)[source]

Builds a category map linking the indices of the categorical columns in the data to their lists of categories.

Parameters:
  • data (Union[DataFrame, ndarray]) – 2-dimensional pandas dataframe or numpy array.

  • categorical_columns (Union[List[int], List[str], None]) – A list of columns indicating categorical variables. Optional when passing a pandas dataframe, in which case categorical columns are inferred from dtype 'O'. Compulsory when passing a numpy array.

Return type:

Dict[int, list]

Returns:

category_map – A dictionary with keys being the indices of the categorical columns and values being lists of categories for that column. Implicitly each category is mapped to the index of its position in the list.
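
For example, a sketch with a small dataframe (the 'color' column is inferred as categorical from its dtype 'O'; the exact category ordering shown is indicative):

import pandas as pd
from alibi.utils import gen_category_map

df = pd.DataFrame({'age': [25, 32, 41],
                   'color': ['red', 'blue', 'red']})
category_map = gen_category_map(df)
# e.g. {1: ['blue', 'red']} - column 1 is categorical with two categories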

alibi.utils.ohe_to_ord(X_ohe, cat_vars_ohe)[source]

Convert one-hot encoded variables to ordinal encodings.

Parameters:
  • X_ohe (ndarray) – Data with mixture of one-hot encoded and numerical variables.

  • cat_vars_ohe (dict) – Dict whose keys are the first column index of each one-hot encoded categorical variable and whose values are the number of categories per categorical variable.

Return type:

Tuple[ndarray, dict]

Returns:

Ordinal equivalent of one-hot encoded data and dict with categorical columns and number of categories.

alibi.utils.ord_to_ohe(X_ord, cat_vars_ord)[source]

Convert ordinal to one-hot encoded variables.

Parameters:
  • X_ord (ndarray) – Data with mixture of ordinal encoded and numerical variables.

  • cat_vars_ord (dict) – Dict whose keys are the categorical columns and whose values are the number of categories per categorical variable.

Return type:

Tuple[ndarray, dict]

Returns:

One-hot equivalent of ordinal encoded data and dict with categorical columns and number of categories.
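
A round-trip sketch under the conventions above: one numerical column followed by a 3-category one-hot block starting at column 1 (the exact ordinal values shown are indicative):

import numpy as np
from alibi.utils import ohe_to_ord, ord_to_ohe

X_ohe = np.array([[0.5, 1., 0., 0.],
                  [1.2, 0., 0., 1.]])
cat_vars_ohe = {1: 3}   # one-hot block starts at column 1, 3 categories

X_ord, cat_vars_ord = ohe_to_ord(X_ohe, cat_vars_ohe)
# e.g. X_ord = [[0.5, 0.], [1.2, 2.]] with cat_vars_ord = {1: 3}
X_back, _ = ord_to_ohe(X_ord, cat_vars_ord)   # recovers X_ohe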

alibi.utils.spacy_model(model='en_core_web_md')[source]

Download spaCy model.

Parameters:

model (str) – Model to be downloaded.

Return type:

None
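
A short sketch pairing the download with spacy.load:

import spacy
from alibi.utils import spacy_model

spacy_model(model='en_core_web_md')   # fetches the spaCy model
nlp = spacy.load('en_core_web_md')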

alibi.utils.visualize_image_attr(attr, original_image=None, method='heat_map', sign='absolute_value', plt_fig_axis=None, outlier_perc=2, cmap=None, alpha_overlay=0.5, show_colorbar=False, title=None, fig_size=(6, 6), use_pyplot=True)[source]

Visualizes attribution for a given image by normalizing attribution values of the desired sign ('positive' | 'negative' | 'absolute_value' | 'all') and displaying them using the desired mode in a matplotlib figure.

Parameters:
  • attr (ndarray) – Numpy array corresponding to attributions to be visualized. Shape must be in the form (H, W, C), with channels as last dimension. Shape must also match that of the original image if provided.

  • original_image (Optional[ndarray]) – Numpy array corresponding to original image. Shape must be in the form (H, W, C), with channels as the last dimension. Image can be provided either with float values in range 0-1 or int values between 0-255. This is a necessary argument for any visualization method which utilizes the original image.

  • method (str) –

    Chosen method for visualizing attribution. Supported options are:

    • 'heat_map' - Display heat map of chosen attributions

    • 'blended_heat_map' - Overlay heat map over greyscale version of original image. Parameter alpha_overlay corresponds to alpha of heat map.

    • 'original_image' - Only display original image.

    • 'masked_image' - Mask image (pixel-wise multiply) by normalized attribution values.

    • 'alpha_scaling' - Sets alpha channel of each pixel to be equal to normalized attribution value.

    Default: 'heat_map'.

  • sign (str) –

    Chosen sign of attributions to visualize. Supported options are:

    • 'positive' - Displays only positive pixel attributions.

    • 'absolute_value' - Displays absolute value of attributions.

    • 'negative' - Displays only negative pixel attributions.

    • 'all' - Displays both positive and negative attribution values. This is not supported for 'masked_image' or 'alpha_scaling' modes, since signed information cannot be represented in these modes.

  • plt_fig_axis (Optional[Tuple[Figure, Axes]]) – Tuple of matplotlib.pyplot.figure and axis on which to visualize. If None is provided, then a new figure and axis are created.

  • outlier_perc (Union[int, float]) – Top attribution values which correspond to a total of outlier_perc percentage of the total attribution are set to 1 and scaling is performed using the minimum of these values. For sign='all', outliers and scale value are computed using absolute value of attributions.

  • cmap (Optional[str]) – String corresponding to desired colormap for heatmap visualization. This defaults to 'Reds' for negative sign, 'Blues' for absolute value, 'Greens' for positive sign, and a spectrum from red to green for all. Note that this argument is only used for visualizations displaying heatmaps.

  • alpha_overlay (float) – Alpha to set for the heat map when using the 'blended_heat_map' visualization mode, which overlays the heat map over the original image.

  • show_colorbar (bool) – Displays colorbar for heatmap below the visualization. If given method does not use a heatmap, then a colormap axis is created and hidden. This is necessary for appropriate alignment when visualizing multiple plots, some with colorbars and some without.

  • title (Optional[str]) – The title for the plot. If None, no title is set.

  • fig_size (Tuple[int, int]) – Size of figure created.

  • use_pyplot (bool) – If True, uses pyplot to create and show figure and displays the figure after creating. If False, uses matplotlib object-oriented API and simply returns a figure object without showing.

Return type:

Tuple[Figure, Axes]

Returns:

2-element tuple consisting of

  • figure : matplotlib.pyplot.Figure - Figure object on which visualization is created. If plt_fig_axis argument is given, this is the same figure provided.

  • axis : matplotlib.pyplot.Axes - Axes object on which visualization is created. If plt_fig_axis argument is given, this is the same axis provided.
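
A minimal sketch with a toy attribution map (random values stand in for real attributions; 'attributions.png' is a hypothetical output path):

import numpy as np
from alibi.utils import visualize_image_attr

attr = np.random.randn(32, 32, 3)   # toy (H, W, C) attribution map
fig, ax = visualize_image_attr(attr, method='heat_map', sign='absolute_value',
                               show_colorbar=True, title='attributions',
                               use_pyplot=False)  # returns without displaying
fig.savefig('attributions.png')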

Submodules