alibi.utils.lang_model module
This module defines a wrapper for transformer-based masked language models used in AnchorText as a perturbation strategy. The LanguageModel base class provides basic functionality such as loading, storing, and predicting.
Language models' tokenizers usually work at a subword level, so a word can be split into subwords. For example, a word can be decomposed as: word = [head_token tail_token_1 tail_token_2 ... tail_token_k]. For language models such as DistilbertBaseUncased and BertBaseUncased, the tail tokens can be identified by the special prefix '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so the tail tokens can be identified by the absence of that prefix. In this module, we refer to a tail token as a subword prefix, and we use the notion of a subword to refer to either a head or a tail token.
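The two conventions can be illustrated with a small self-contained sketch. The helper functions and example tokenizations below are illustrative only (actual tokenizers may split words differently) and are not part of alibi:

```python
def is_tail_bert(token: str) -> bool:
    """BERT-style convention: tail tokens carry the '##' prefix."""
    return token.startswith("##")

def is_tail_roberta(token: str) -> bool:
    """RoBERTa-style convention: only head tokens carry the 'Ġ' prefix,
    so a tail token is identified by the absence of the prefix."""
    return not token.startswith("\u0120")  # 'Ġ'

# A BERT-style tokenization of "unbelievable" might look like:
bert_tokens = ["un", "##believ", "##able"]
print([is_tail_bert(t) for t in bert_tokens])      # [False, True, True]

# A RoBERTa-style tokenization of " unbelievable" might look like:
roberta_tokens = ["\u0120un", "believ", "able"]
print([is_tail_roberta(t) for t in roberta_tokens])  # [False, True, True]
```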
To generate interpretable perturbed instances, we do not mask subwords, but entire words. Note that this operation is equivalent to replacing the head token with the special mask token and removing the tail tokens if they exist. Thus, the LanguageModel class offers additional functionality such as checking whether a token is a subword prefix and selecting an entire word (the head_token along with its tail_tokens).
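A rough sketch of the idea (illustrative only, not the actual alibi implementation, and assuming the BERT-style '##' convention): masking a whole word amounts to replacing its head token with the mask token and dropping its tail tokens.

```python
def mask_word(tokens, start_idx, mask_token="[MASK]",
              is_tail=lambda t: t.startswith("##")):
    """Replace the word starting at `start_idx` with a single mask token.

    Equivalent to replacing the head token with the mask token and
    removing the tail tokens, so the perturbation stays at word level.
    """
    end_idx = start_idx + 1
    while end_idx < len(tokens) and is_tail(tokens[end_idx]):
        end_idx += 1
    return tokens[:start_idx] + [mask_token] + tokens[end_idx:]

tokens = ["the", "un", "##believ", "##able", "story"]
print(mask_word(tokens, 1))  # ['the', '[MASK]', 'story']
```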
Some language models can only process a limited number of tokens, so the input text has to be split. The text is split into a head and a tail, where the number of tokens in the head is at most the maximum number of tokens the language model can process. In AnchorText only the head is perturbed. To keep the results interpretable, we ensure that the head does not end with a subword and contains only full words.
- class alibi.utils.lang_model.BertBaseUncased(preloading=True)[source]
Bases:
LanguageModel
- SUBWORD_PREFIX = '##'
Language model subword prefix.
- __init__(preloading=True)[source]
Initialize BertBaseUncased.
- Parameters:
preloading (bool) – See alibi.utils.lang_model.LanguageModel.__init__().
- is_subword_prefix(token)[source]
Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has a convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special set of characters '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so we check for the absence of that prefix to identify the tail tokens. We call these special characters SUBWORD_PREFIX. Because the conventions differ, this method has to be implemented for each language model. See the module docstring for naming conventions.
- class alibi.utils.lang_model.DistilbertBaseUncased(preloading=True)[source]
Bases:
LanguageModel
- SUBWORD_PREFIX = '##'
Language model subword prefix.
- __init__(preloading=True)[source]
Initialize DistilbertBaseUncased.
- Parameters:
preloading (bool) – See alibi.utils.lang_model.LanguageModel.__init__().
- is_subword_prefix(token)[source]
Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has a convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special set of characters '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so we check for the absence of that prefix to identify the tail tokens. We call these special characters SUBWORD_PREFIX. Because the conventions differ, this method has to be implemented for each language model. See the module docstring for naming conventions.
- class alibi.utils.lang_model.LanguageModel(model_path, preloading=True)[source]
Bases:
ABC
- SUBWORD_PREFIX = ''
Language model subword prefix.
- head_tail_split(text)[source]
Split the text into a head and a tail. Some language models support only a maximum number of tokens, so it is necessary to split the text to meet this constraint. After the split, only the head is considered for further operations; the tail remains unchanged.
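A minimal sketch of such a split (illustrative only, assuming the BERT-style '##' convention; alibi's implementation operates on the tokenizer's actual output): truncate at the token limit, then back off until the head ends on a full word.

```python
def head_tail_split(tokens, max_tokens,
                    is_tail=lambda t: t.startswith("##")):
    """Split `tokens` into head and tail so that len(head) <= max_tokens
    and the head never ends in the middle of a word."""
    if len(tokens) <= max_tokens:
        return tokens, []
    split = max_tokens
    # Back off while the token at the split point is a tail token,
    # so the head contains only full words.
    while split > 0 and is_tail(tokens[split]):
        split -= 1
    return tokens[:split], tokens[split:]

tokens = ["a", "truly", "un", "##believ", "##able", "story"]
head, tail = head_tail_split(tokens, max_tokens=4)
print(head)  # ['a', 'truly']
print(tail)  # ['un', '##believ', '##able', 'story']
```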
- is_stop_word(tokenized_text, start_idx, punctuation, stopwords)[source]
Checks whether the word starting at the given index is in the list of stopwords.
- Parameters:
- Return type:
- Returns:
True if the token is in the stopwords list, False otherwise.
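The check can be sketched as follows (illustrative only, assuming the BERT-style '##' convention; alibi's method also receives punctuation to delimit words): reassemble the word starting at start_idx from its subword pieces, then look it up in the stopword set.

```python
def is_stop_word(tokens, start_idx, stopwords,
                 is_tail=lambda t: t.startswith("##")):
    """Reassemble the word starting at `start_idx` and check membership
    in `stopwords` (case-insensitive)."""
    word = tokens[start_idx]
    i = start_idx + 1
    while i < len(tokens) and is_tail(tokens[i]):
        word += tokens[i][2:]  # strip the '##' prefix before concatenating
        i += 1
    return word.lower() in stopwords

tokens = ["the", "un", "##believ", "##able", "story"]
print(is_stop_word(tokens, 0, {"the", "a", "an"}))  # True
print(is_stop_word(tokens, 1, {"the", "a", "an"}))  # False
```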
- abstract is_subword_prefix(token)[source]
Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has a convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special set of characters '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so we check for the absence of that prefix to identify the tail tokens. We call these special characters SUBWORD_PREFIX. Because the conventions differ, this method has to be implemented for each language model. See the module docstring for naming conventions.
- property max_num_tokens: int
Returns the maximum number of tokens allowed by the model.
- Return type:
- predict_batch_lm(x, vocab_size, batch_size)[source]
TensorFlow language model batch predictions for AnchorText.
- select_word(tokenized_text, start_idx, punctuation)[source]
Given a tokenized text and the starting index of a word, the function selects the entire word. Note that a word can be composed of multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). The tail tokens can be identified based on the presence/absence of SUBWORD_PREFIX. See alibi.utils.lang_model.LanguageModel.is_subword_prefix() for more details.
- Parameters:
- Return type:
- Returns:
The word obtained by concatenating [head_token tail_token_1 tail_token_2 ... tail_token_k].
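The selection logic can be sketched like this (illustrative only, assuming the BERT-style '##' convention; alibi's method additionally uses punctuation to stop the selection): take the head token, then append every consecutive tail token with its prefix stripped.

```python
def select_word(tokenized_text, start_idx,
                is_tail=lambda t: t.startswith("##")):
    """Select the full word starting at `start_idx`: the head token plus
    every consecutive tail token that follows it."""
    pieces = [tokenized_text[start_idx]]
    i = start_idx + 1
    while i < len(tokenized_text) and is_tail(tokenized_text[i]):
        pieces.append(tokenized_text[i][2:])  # drop the '##' marker
        i += 1
    return "".join(pieces)

tokens = ["the", "un", "##believ", "##able", "story"]
print(select_word(tokens, 1))  # unbelievable
```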
- class alibi.utils.lang_model.RobertaBase(preloading=True)[source]
Bases:
LanguageModel
- SUBWORD_PREFIX = 'Ġ'
Language model subword prefix.
- __init__(preloading=True)[source]
Initialize RobertaBase.
- Parameters:
preloading (bool) – See alibi.utils.lang_model.LanguageModel.__init__().
- is_subword_prefix(token)[source]
Checks whether the given token is part of the tail of a word. Note that a word can be split into multiple tokens (e.g., word = [head_token tail_token_1 tail_token_2 ... tail_token_k]). Each language model has a convention for marking a tail token. For example, DistilbertBaseUncased and BertBaseUncased prefix the tail tokens with the special set of characters '##'. For RobertaBase, on the other hand, only the head token is prefixed with the special character 'Ġ', so we check for the absence of that prefix to identify the tail tokens. We call these special characters SUBWORD_PREFIX. Because the conventions differ, this method has to be implemented for each language model. See the module docstring for naming conventions.