alibi.utils.lang_model module

This module defines a wrapper for transformer-based masked language models used in AnchorText as a perturbation strategy. The LanguageModel base class defines basic functionality such as loading, saving, and predicting.

Language models' tokenizers usually work at the subword level, so a word can be split into multiple subwords. For example, a word can be decomposed as: word = [head_token tail_token_1 tail_token_2 ... tail_token_k]. For language models such as DistilbertBaseUncased and BertBaseUncased, the tail tokens can be identified by the special prefix '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so the tail tokens can be identified by the absence of this prefix. In this module, we refer to a tail token as a subword prefix, and we use the term subword to refer to either a head or a tail token.
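
For illustration, the two conventions can be inspected directly through the wrappers defined below. This is a minimal sketch, assuming the transformers dependency is installed and the pretrained weights can be fetched:

    from alibi.utils.lang_model import BertBaseUncased, RobertaBase

    bert = BertBaseUncased()  # fetches the pretrained model on first use
    roberta = RobertaBase()

    tokens_bert = bert.tokenizer.tokenize("a word can be decomposed into subwords")
    tokens_roberta = roberta.tokenizer.tokenize("a word can be decomposed into subwords")

    # BERT marks tail tokens with '##'; RoBERTa marks head tokens with 'Ġ',
    # so its tail tokens are the ones *without* the prefix.
    bert_tails = [t for t in tokens_bert if bert.is_subword_prefix(t)]
    roberta_tails = [t for t in tokens_roberta if roberta.is_subword_prefix(t)]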

To generate interpretable perturbed instances, we do not mask subwords but entire words. Note that this operation is equivalent to replacing the head token with the special mask token and removing the tail tokens, if any exist. Thus, the LanguageModel class offers additional functionality such as checking whether a token is a subword prefix, selecting an entire word (the head_token along with its tail_tokens), etc.
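
Conceptually, whole-word masking looks as follows (a sketch with a hand-written token list following the BERT convention):

    from alibi.utils.lang_model import BertBaseUncased

    lm = BertBaseUncased()
    tokens = ['an', 'un', '##believ', '##able', 'story']  # illustrative split
    # mask the word whose head token sits at index 1: skip its tail tokens,
    # then replace the whole word with the single mask token
    head_idx = 1
    k = head_idx + 1
    while k < len(tokens) and lm.is_subword_prefix(tokens[k]):
        k += 1
    masked = tokens[:head_idx] + [lm.mask] + tokens[k:]
    # -> ['an', '[MASK]', 'story']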

Some language models can only process a limited number of tokens, so the input text has to be split. The text is split into a head and a tail, where the number of tokens in the head is less than or equal to the maximum number of tokens the language model can process. In AnchorText, only the head is perturbed. To keep the results interpretable, we ensure that the head does not end with a partial word and thus contains only full words.
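
A minimal sketch of the split, using the head_tail_split() method documented below (the input text is assumed to exceed the model's token limit):

    from alibi.utils.lang_model import DistilbertBaseUncased

    lm = DistilbertBaseUncased()
    long_text = "..."  # any text longer than lm.max_num_tokens tokens
    head, tail, head_tokens, tail_tokens = lm.head_tail_split(long_text)
    # The head fits within lm.max_num_tokens and ends on a full word;
    # AnchorText perturbs only the head and leaves the tail unchanged.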

class alibi.utils.lang_model.BertBaseUncased(preloading=True)[source]

Bases: LanguageModel

SUBWORD_PREFIX = '##'

Language model subword prefix.

__init__(preloading=True)[source]

Initialize BertBaseUncased.

Parameters:

preloading (bool) – See alibi.utils.lang_model.LanguageModel.__init__().

caller: Callable
is_subword_prefix(token)[source]

Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has its own convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special character sequence '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so the absence of the prefix identifies the tail tokens. We call these special characters SUBWORD_PREFIX. Due to the different conventions, this method has to be implemented for each language model. See the module docstring for the naming conventions.

Parameters:

token (str) – Token to be checked if it is a subword.

Return type:

bool

Returns:

True if the given token is a subword prefix. False otherwise.

property mask: str

Returns the mask token.

Return type:

str

model: Any
tokenizer: Any
class alibi.utils.lang_model.DistilbertBaseUncased(preloading=True)[source]

Bases: LanguageModel

SUBWORD_PREFIX = '##'

Language model subword prefix.

__init__(preloading=True)[source]

Initialize DistilbertBaseUncased.

Parameters:

preloading (bool) – See alibi.utils.lang_model.LanguageModel.__init__().

caller: Callable
is_subword_prefix(token)[source]

Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has its own convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special character sequence '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so the absence of the prefix identifies the tail tokens. We call these special characters SUBWORD_PREFIX. Due to the different conventions, this method has to be implemented for each language model. See the module docstring for the naming conventions.

Parameters:

token (str) – Token to be checked if it is a subword.

Return type:

bool

Returns:

True if the given token is a subword prefix. False otherwise.

property mask: str

Returns the mask token.

Return type:

str

model: Any
tokenizer: Any
class alibi.utils.lang_model.LanguageModel(model_path, preloading=True)[source]

Bases: ABC

SUBWORD_PREFIX = ''

Language model subword prefix.

__init__(model_path, preloading=True)[source]

Initialize the language model.

Parameters:
  • model_path (str) – transformers package model path.

  • preloading (bool) – Whether to preload the online version of the transformer. If False, a call to the from_disk method is expected.

caller: Callable
from_disk(path)[source]

Loads a model from disk.

Parameters:

path (Union[str, Path]) – Path to the checkpoint.

head_tail_split(text)[source]

Splits the text into a head and a tail. Some language models support only a limited number of tokens, so it is necessary to split the text to meet this constraint. After the split, only the head is considered for further operations; the tail remains unchanged.

Parameters:

text (str) – Text to be split in head and tail.

Return type:

Tuple[str, str, List[str], List[str]]

Returns:

Tuple consisting of the head, the tail, and their corresponding lists of tokens.

is_punctuation(token, punctuation)[source]

Checks if the given token is punctuation.

Parameters:
  • token (str) – Token to be checked if it is punctuation.

  • punctuation (str) – String containing all punctuation to be considered.

Return type:

bool

Returns:

True if the token is punctuation. False otherwise.
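
For example, using the standard library's punctuation set (a minimal sketch; lm can be any of the concrete subclasses):

    import string
    from alibi.utils.lang_model import BertBaseUncased

    lm = BertBaseUncased()
    lm.is_punctuation('!', string.punctuation)     # -> True
    lm.is_punctuation('word', string.punctuation)  # -> False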

is_stop_word(tokenized_text, start_idx, punctuation, stopwords)[source]

Checks whether the word starting at the given index is in the list of stopwords.

Parameters:
  • tokenized_text (List[str]) – Tokenized text.

  • start_idx (int) – Starting index of a word.

  • stopwords (Optional[List[str]]) – List of stop words. The words in this list should be lowercase.

  • punctuation (str) – Punctuation to be considered. See alibi.utils.lang_model.LanguageModel.select_word().

Return type:

bool

Returns:

True if the word is in the stopwords list. False otherwise.
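
A minimal usage sketch (the stop word list here is illustrative):

    import string
    from alibi.utils.lang_model import BertBaseUncased

    lm = BertBaseUncased()
    tokens = lm.tokenizer.tokenize("the unbelievable story")
    # checks the full word starting at token index 0, i.e. 'the'
    lm.is_stop_word(tokens, start_idx=0, punctuation=string.punctuation,
                    stopwords=['the', 'a', 'an'])  # -> True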

abstract is_subword_prefix(token)[source]

Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has its own convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special character sequence '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so the absence of the prefix identifies the tail tokens. We call these special characters SUBWORD_PREFIX. Due to the different conventions, this method has to be implemented for each language model. See the module docstring for the naming conventions.

Parameters:

token (str) – Token to be checked if it is a subword.

Return type:

bool

Returns:

True if the given token is a subword prefix. False otherwise.
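
A sketch of how a hypothetical subclass wrapping another WordPiece-style model might implement this method, assuming only the documented LanguageModel API (the class and model path below are illustrative, not part of alibi):

    from alibi.utils.lang_model import LanguageModel

    class CustomWordPieceModel(LanguageModel):  # hypothetical subclass
        SUBWORD_PREFIX = '##'

        def __init__(self, preloading: bool = True):
            # any transformers model path following the WordPiece convention
            super().__init__('bert-base-cased', preloading)

        def is_subword_prefix(self, token: str) -> bool:
            # WordPiece convention: tail tokens are prefixed with '##'
            return token.startswith(self.SUBWORD_PREFIX)

        @property
        def mask(self) -> str:
            # delegate to the underlying tokenizer's mask token
            return self.tokenizer.mask_token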

abstract property mask: str

Returns the mask token.

Return type:

str

property mask_id: int

Returns the mask token id.

Return type:

int
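
For the concrete wrappers in this module, the values come from the underlying tokenizers, e.g. (the exact tokens below are the usual ones for these checkpoints):

    from alibi.utils.lang_model import BertBaseUncased, RobertaBase

    bert = BertBaseUncased()
    bert.mask      # -> '[MASK]'
    bert.mask_id   # -> vocabulary id of '[MASK]'

    roberta = RobertaBase()
    roberta.mask   # -> '<mask>'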

property max_num_tokens: int

Returns the maximum number of tokens allowed by the model.

Return type:

int

model: Any
predict_batch_lm(x, vocab_size, batch_size)[source]

TensorFlow language model batch predictions for AnchorText.

Parameters:
  • x (BatchEncoding) – Batch of instances.

  • vocab_size (int) – Vocabulary size of language model.

  • batch_size (int) – Batch size used for predictions.

Return type:

ndarray

Returns:

y – Array with model predictions.
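
A minimal sketch of a direct call; AnchorText normally drives this method internally. The tokenizer call assumes TensorFlow tensors, consistent with the TensorFlow-based predictions:

    from alibi.utils.lang_model import DistilbertBaseUncased

    lm = DistilbertBaseUncased()
    # BatchEncoding with TF tensors, padded to a common length
    x = lm.tokenizer([f"the {lm.mask} sat on the mat"],
                     padding=True, return_tensors='tf')
    y = lm.predict_batch_lm(x, vocab_size=lm.tokenizer.vocab_size, batch_size=32)
    # y holds, for each token position, a score over the whole vocabulary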

select_word(tokenized_text, start_idx, punctuation)[source]

Given a tokenized text and the starting index of a word, selects the entire word. Note that a word can be composed of multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). The tail tokens can be identified based on the presence/absence of SUBWORD_PREFIX. See alibi.utils.lang_model.LanguageModel.is_subword_prefix() for more details.

Parameters:
  • tokenized_text (List[str]) – Tokenized text.

  • start_idx (int) – Starting index of a word.

  • punctuation (str) – String of punctuation to be considered. If a token composed only of characters in punctuation is encountered, the search is terminated.

Return type:

str

Returns:

The word obtained by concatenating [head_token tail_token_1 tail_token_2 ... tail_token_k].
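
A minimal sketch (the concrete tokenization is model-dependent):

    import string
    from alibi.utils.lang_model import BertBaseUncased

    lm = BertBaseUncased()
    tokens = lm.tokenizer.tokenize("an unbelievable story")
    # if 'unbelievable' is split into a head token plus '##'-prefixed tails,
    # this reassembles the full word starting at token index 1
    word = lm.select_word(tokens, start_idx=1, punctuation=string.punctuation)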

to_disk(path)[source]

Saves a model to disk.

Parameters:

path (Union[str, Path]) – Path to the checkpoint.
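
Together with from_disk(), this allows a save/load round trip, e.g. to avoid repeated downloads (a sketch; the checkpoint path is illustrative):

    from alibi.utils.lang_model import BertBaseUncased

    lm = BertBaseUncased()            # fetches the pretrained model online
    lm.to_disk('./bert-checkpoint')   # save the checkpoint locally

    lm2 = BertBaseUncased(preloading=False)  # skip the online fetch
    lm2.from_disk('./bert-checkpoint')       # load from the local checkpoint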

tokenizer: Any
class alibi.utils.lang_model.RobertaBase(preloading=True)[source]

Bases: LanguageModel

SUBWORD_PREFIX = 'Ġ'

Language model subword prefix.

__init__(preloading=True)[source]

Initialize RobertaBase.

Parameters:

preloading (bool) – See alibi.utils.lang_model.LanguageModel.__init__().

caller: Callable
is_subword_prefix(token)[source]

Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has its own convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special character sequence '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so the absence of the prefix identifies the tail tokens. We call these special characters SUBWORD_PREFIX. Due to the different conventions, this method has to be implemented for each language model. See the module docstring for the naming conventions.

Parameters:

token (str) – Token to be checked if it is a subword.

Return type:

bool

Returns:

True if the given token is a subword prefix. False otherwise.

property mask: str

Returns the mask token.

Return type:

str

model: Any
tokenizer: Any