Input sequences
These types represent all the different kinds of sequence that can be used as input of a Tokenizer.
Globally, any sequence can be either a string or a list of strings, according to the operating mode of the tokenizer: raw text vs pre-tokenized.
- tokenizers.TextInputSequence = <class 'str'>
A str that represents an input sequence
- tokenizers.PreTokenizedInputSequence
A pre-tokenized input sequence. Can be one of:
A List of str
A Tuple of str
alias of Union[List[str], Tuple[str]]
- tokenizers.InputSequence
Represents all the possible types of input sequences for encoding. Can be:
When is_pretokenized=False: TextInputSequence
When is_pretokenized=True: PreTokenizedInputSequence
alias of Union[str, List[str], Tuple[str]]
Encode inputs
These types represent all the different kinds of input that a Tokenizer accepts when using encode_batch().
- tokenizers.TextEncodeInput
Represents a textual input for encoding. Can be either:
A single sequence: TextInputSequence
A pair of sequences:
A Tuple of TextInputSequence
Or a List of TextInputSequence of size 2
alias of Union[str, Tuple[str, str], List[str]]
- tokenizers.PreTokenizedEncodeInput
Represents a pre-tokenized input for encoding. Can be either:
A single sequence: PreTokenizedInputSequence
A pair of sequences:
A Tuple of PreTokenizedInputSequence
Or a List of PreTokenizedInputSequence of size 2
alias of Union[List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]
- tokenizers.EncodeInput
Represents all the possible types of input for encoding. Can be:
When is_pretokenized=False: TextEncodeInput
When is_pretokenized=True: PreTokenizedEncodeInput
alias of Union[str, Tuple[str, str], List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]
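As a quick illustration of these input types, here is a minimal sketch assuming a tokenizer pulled from the Hub ("bert-base-uncased" is just an arbitrary example of a model that ships a tokenizer.json):

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

    # TextEncodeInput: raw text, as a single sequence or a pair
    tokenizer.encode("A single sequence")
    tokenizer.encode("A sequence", "And its pair")

    # PreTokenizedEncodeInput: lists (or tuples) of words, with is_pretokenized=True
    tokenizer.encode(["A", "pre", "tokenized", "sequence"], is_pretokenized=True)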
Tokenizer
- class tokenizers.Tokenizer(self, model)
A Tokenizer works as a pipeline. It processes some raw text as input and outputs an Encoding.
- add_special_tokens(tokens)
Add the given special tokens to the Tokenizer.
If these tokens are already part of the vocabulary, it just lets the Tokenizer know about them. If they don't exist, the Tokenizer creates them, giving them a new id.
These special tokens will never be processed by the model (i.e. won't be split into multiple tokens), and they can be removed from the output when decoding.
- Parameters
tokens (A List of AddedToken or str) – The list of special tokens we want to add to the vocabulary. Each token can either be a string or an instance of AddedToken for more customization.
- Returns
The number of tokens that were created in the vocabulary
- Return type
int
- add_tokens(tokens)
Add the given tokens to the vocabulary
The given tokens are added only if they don't already exist in the vocabulary. Each token is then assigned a new id.
- Parameters
tokens (A List of AddedToken or str) – The list of tokens we want to add to the vocabulary. Each token can be either a string or an instance of AddedToken for more customization.
- Returns
The number of tokens that were created in the vocabulary
- Return type
int
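A minimal sketch of both methods, assuming tokenizer is an existing Tokenizer instance (the token strings are arbitrary examples):

    from tokenizers import AddedToken

    num_added = tokenizer.add_tokens(["my_new_token", AddedToken("another_one", single_word=True)])
    num_special = tokenizer.add_special_tokens(["[CUSTOM]"])
    print(num_added, num_special)  # how many entries were actually created in the vocabulary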
- decode(ids, skip_special_tokens=True)
Decode the given list of ids back to a string
This is used to decode anything coming back from a Language Model
- Parameters
ids (A List/Tuple of int) – The list of ids that we want to decode
skip_special_tokens (bool, defaults to True) – Whether the special tokens should be removed from the decoded string
- Returns
The decoded string
- Return type
str
- decode_batch(sequences, skip_special_tokens=True)
Decode a batch of ids back to their corresponding string
- Parameters
sequences (List of List[int]) – The batch of sequences we want to decode
skip_special_tokens (bool, defaults to True) – Whether the special tokens should be removed from the decoded strings
- Returns
A list of decoded strings
- Return type
List[str]
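A short round-trip sketch, assuming tokenizer is an existing Tokenizer instance:

    # encode, then decode the resulting ids back to text
    output = tokenizer.encode("Hello, how are you?")
    print(tokenizer.decode(output.ids, skip_special_tokens=True))
    # decode_batch works the same way on a list of id lists
    print(tokenizer.decode_batch([output.ids, output.ids]))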
- enable_padding(direction='right', pad_id=0, pad_type_id=0, pad_token='[PAD]', length=None, pad_to_multiple_of=None)
Enable padding
- Parameters
direction (str, optional, defaults to right) – The direction in which to pad. Can be either right or left
pad_to_multiple_of (int, optional) – If specified, the padding length should always snap to the next multiple of the given value. For example, if we were going to pad with a length of 250 but pad_to_multiple_of=8, then we will pad to 256.
pad_id (int, defaults to 0) – The id to be used when padding
pad_type_id (int, defaults to 0) – The type id to be used when padding
pad_token (str, defaults to [PAD]) – The pad token to be used when padding
length (int, optional) – If specified, the length at which to pad. If not specified, we pad using the size of the longest sequence in a batch.
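A minimal sketch, assuming tokenizer is an existing Tokenizer whose vocabulary contains a "[PAD]" token:

    pad_id = tokenizer.token_to_id("[PAD]")
    tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]", pad_to_multiple_of=8)
    batch = tokenizer.encode_batch(["short", "a slightly longer sequence"])
    print([len(e.ids) for e in batch])  # all sequences padded to the same length (a multiple of 8)
    tokenizer.no_padding()              # disable padding again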
- enable_truncation(max_length, stride=0, strategy='longest_first', direction='right')
Enable truncation
- Parameters
max_length (int) – The max length at which to truncate
stride (int, optional) – The length of the previous first sequence to be included in the overflowing sequence
strategy (str, optional, defaults to longest_first) – The strategy used for truncation. Can be one of longest_first, only_first or only_second.
direction (str, defaults to right) – Truncate direction
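A minimal sketch, assuming tokenizer is an existing Tokenizer instance:

    tokenizer.enable_truncation(max_length=8, stride=2, strategy="longest_first")
    output = tokenizer.encode("a fairly long sentence that will not fit into eight tokens")
    print(len(output.ids))          # at most 8
    print(len(output.overflowing))  # the remaining pieces, each overlapping the previous one by `stride`
    tokenizer.no_truncation()       # disable truncation again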
- encode(sequence, pair=None, is_pretokenized=False, add_special_tokens=True)
Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.
Example
Here are some examples of the inputs that are accepted:
    encode("A single sequence")
    encode("A sequence", "And its pair")
    encode([ "A", "pre", "tokenized", "sequence" ], is_pretokenized=True)
    encode(
        [ "A", "pre", "tokenized", "sequence" ],
        [ "And", "its", "pair" ],
        is_pretokenized=True
    )
- Parameters
sequence (InputSequence) – The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument:
If is_pretokenized=False: TextInputSequence
If is_pretokenized=True: PreTokenizedInputSequence
pair (InputSequence, optional) – An optional input sequence. The expected format is the same as for sequence.
is_pretokenized (bool, defaults to False) – Whether the input is already pre-tokenized
add_special_tokens (bool, defaults to True) – Whether to add the special tokens
- Returns
The encoded result
- Return type
Encoding
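A runnable sketch of inspecting the returned Encoding, assuming tokenizer is an existing Tokenizer instance:

    output = tokenizer.encode("Hello, y'all!", "How are you?")
    print(output.tokens)    # string representation of each token
    print(output.ids)       # the corresponding ids
    print(output.type_ids)  # 0 for the first sequence, 1 for the pair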
- encode_batch(input, is_pretokenized=False, add_special_tokens=True)
Encode the given batch of inputs. This method accepts both raw text sequences and already pre-tokenized sequences.
Example
Here are some examples of the inputs that are accepted:
    encode_batch([
        "A single sequence",
        ("A tuple with a sequence", "And its pair"),
        [ "A", "pre", "tokenized", "sequence" ],
        ([ "A", "pre", "tokenized", "sequence" ], "And its pair")
    ])
- Parameters
input (A List/Tuple of EncodeInput) – A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument:
If is_pretokenized=False: TextEncodeInput
If is_pretokenized=True: PreTokenizedEncodeInput
is_pretokenized (bool, defaults to False) – Whether the input is already pre-tokenized
add_special_tokens (bool, defaults to True) – Whether to add the special tokens
- Returns
The encoded batch
- Return type
A List of Encoding
- static from_pretrained(identifier, revision='main', auth_token=None)
Instantiate a new Tokenizer from an existing file on the Hugging Face Hub.
- Parameters
identifier (str) – The identifier of a Model on the Hugging Face Hub, that contains a tokenizer.json file
revision (str, defaults to main) – A branch or commit id
auth_token (str, optional, defaults to None) – An optional auth token used to access private repositories on the Hugging Face Hub
- Returns
The new tokenizer
- Return type
Tokenizer
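For example, loading a tokenizer published on the Hub (the identifier below is just an example of a model that ships a tokenizer.json file):

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    print(tokenizer.get_vocab_size())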
- get_vocab(with_added_tokens=True)
Get the underlying vocabulary
- Parameters
with_added_tokens (bool, defaults to True) – Whether to include the added tokens
- Returns
The vocabulary
- Return type
Dict[str, int]
- get_vocab_size(with_added_tokens=True)
Get the size of the underlying vocabulary
- Parameters
with_added_tokens (bool, defaults to True) – Whether to include the added tokens
- Returns
The size of the vocabulary
- Return type
int
- id_to_token(id)
Convert the given id to its corresponding token if it exists
- Parameters
id (int) – The id to convert
- Returns
An optional token, None if out of vocabulary
- Return type
Optional[str]
- no_padding()
Disable padding
- no_truncation()
Disable truncation
- normalizer
The optional Normalizer in use by the Tokenizer
- num_special_tokens_to_add(is_pair)
Return the number of special tokens that would be added for single/pair sentences.
- Parameters
is_pair (bool) – Whether the input would be a single sentence or a pair
- Returns
The number of special tokens that would be added
- padding
Get the current padding parameters
Cannot be set, use enable_padding() instead
- Returns
A dict with the current padding parameters if padding is enabled
- Return type
(dict, optional)
- post_process(encoding, pair=None, add_special_tokens=True)
Apply all the post-processing steps to the given encodings.
The various steps are:
Truncate according to the set truncation params (provided with enable_truncation())
Apply the PostProcessor
Pad according to the set padding params (provided with enable_padding())
- post_processor
The optional PostProcessor in use by the Tokenizer
- pre_tokenizer
The optional PreTokenizer in use by the Tokenizer
- save(path, pretty=True)
Save the Tokenizer to the file at the given path.
- Parameters
path (str) – A path to a file in which to save the serialized tokenizer.
pretty (bool, defaults to True) – Whether the JSON file should be pretty formatted.
- to_str(pretty=False)
Gets a serialized string representing this Tokenizer.
- Parameters
pretty (bool, defaults to False) – Whether the JSON string should be pretty formatted.
- Returns
A string representing the serialized Tokenizer
- Return type
str
- token_to_id(token)
Convert the given token to its corresponding id if it exists
- Parameters
token (str) – The token to convert
- Returns
An optional id, None if out of vocabulary
- Return type
Optional[int]
- train(files, trainer=None)
Train the Tokenizer using the given files.
Reads the files line by line, while keeping all the whitespace, even new lines. If you want to train from data stored in memory, you can check train_from_iterator().
- Parameters
files (List[str]) – A list of paths to the files that we should use for training
trainer (Trainer, optional) – An optional trainer that should be used to train our Model
- train_from_iterator(iterator, trainer=None, length=None)
Train the Tokenizer using the provided iterator.
You can provide anything that is a Python Iterator:
A list of sequences List[str]
A generator that yields str or List[str]
A Numpy array of strings
…
- Parameters
iterator (Iterator) – Any iterator over strings or list of strings
trainer (Trainer, optional) – An optional trainer that should be used to train our Model
length (int, optional) – The total number of sequences in the iterator. This is used to provide meaningful progress tracking
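A minimal training sketch over in-memory data, shown here with a BPE model and trainer (the corpus is a placeholder):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer

    data = ["first example sentence", "second example sentence"]  # placeholder corpus

    def batch_iterator(batch_size=1000):
        for i in range(0, len(data), batch_size):
            yield data[i : i + batch_size]

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(data))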
- truncation
Get the currently set truncation parameters
Cannot be set, use enable_truncation() instead
- Returns
A dict with the current truncation parameters if truncation is enabled
- Return type
(dict, optional)
Encoding
- class tokenizers.Encoding
The Encoding represents the output of a Tokenizer.
- attention_mask
The attention mask
This indicates to the LM which tokens should be attended to, and which should not. This is especially important when batching sequences, where we need to apply padding.
- Returns
The attention mask
- Return type
List[int]
- char_to_token(char_pos, sequence_index=0)
Get the token that contains the char at the given position in the input sequence.
- Parameters
char_pos (int) – The position of a char in the input string
sequence_index (int, defaults to 0) – The index of the sequence that contains the target char
- Returns
The index of the token that contains this char in the encoded sequence
- Return type
int
- char_to_word(char_pos, sequence_index=0)
Get the word that contains the char at the given position in the input sequence.
- Parameters
char_pos (int) – The position of a char in the input string
sequence_index (int, defaults to 0) – The index of the sequence that contains the target char
- Returns
The index of the word that contains this char in the input sequence
- Return type
int
- ids
The generated IDs
The IDs are the main input to a Language Model. They are the token indices, the numerical representations that a LM understands.
- Returns
The list of IDs
- Return type
List[int]
- static merge(encodings, growing_offsets=True)
Merge the list of encodings into one final Encoding
- n_sequences
The number of sequences represented
- Returns
The number of sequences in this Encoding
- Return type
int
- offsets
The offsets associated to each token
These offsets let you slice the input string, and thus retrieve the original part that led to producing the corresponding token.
- Returns
The list of offsets
- Return type
A List of Tuple[int, int]
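A short sketch showing how the offsets tie tokens back to the original text, assuming tokenizer is an existing Tokenizer instance:

    text = "Hello world"
    encoding = tokenizer.encode(text)
    for token, (start, end) in zip(encoding.tokens, encoding.offsets):
        print(token, "->", repr(text[start:end]))  # the input span behind each token
    print(encoding.char_to_token(6))               # index of the token covering the char at position 6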
- overflowing
A List of overflowing Encoding
When using truncation, the Tokenizer takes care of splitting the output into as many pieces as required to match the specified maximum length. This field lets you retrieve all the subsequent pieces.
When you use pairs of sequences, the overflowing pieces will contain enough variations to cover all the possible combinations, while respecting the provided maximum length.
- pad(length, direction='right', pad_id=0, pad_type_id=0, pad_token='[PAD]')
Pad the Encoding at the given length
- Parameters
length (int) – The desired length
direction (str, defaults to right) – The expected padding direction. Can be either right or left
pad_id (int, defaults to 0) – The ID corresponding to the padding token
pad_type_id (int, defaults to 0) – The type ID corresponding to the padding token
pad_token (str, defaults to [PAD]) – The pad token to use
- sequence_ids
The generated sequence indices.
They represent the index of the input sequence associated to each token. The sequence id can be None if the token is not related to any input sequence, like for example with special tokens.
- Returns
A list of optional sequence indices.
- Return type
A List of Optional[int]
- set_sequence_id(sequence_id)
Set the given sequence index
Set the given sequence index for the whole range of tokens contained in this Encoding.
- special_tokens_mask
The special token mask
This indicates which tokens are special tokens, and which are not.
- Returns
The special tokens mask
- Return type
List[int]
- token_to_chars(token_index)
Get the offsets of the token at the given index.
The returned offsets are related to the input sequence that contains the token. In order to determine to which input sequence it belongs, you must call token_to_sequence().
- Parameters
token_index (int) – The index of a token in the encoded sequence.
- Returns
The token offsets (first, last + 1)
- Return type
Tuple[int, int]
- token_to_sequence(token_index)
Get the index of the sequence represented by the given token.
In the general use case, this method returns 0 for a single sequence or the first sequence of a pair, and 1 for the second sequence of a pair
- Parameters
token_index (int) – The index of a token in the encoded sequence.
- Returns
The sequence id of the given token
- Return type
int
- token_to_word(token_index)
Get the index of the word that contains the token in one of the input sequences.
The returned word index is related to the input sequence that contains the token. In order to determine to which input sequence it belongs, you must call token_to_sequence().
- Parameters
token_index (int) – The index of a token in the encoded sequence.
- Returns
The index of the word in the relevant input sequence.
- Return type
int
- tokens
The generated tokens
They are the string representation of the IDs.
- Returns
The list of tokens
- Return type
List[str]
- truncate(max_length, stride=0, direction='right')
Truncate the Encoding at the given length
If this Encoding represents multiple sequences, when truncating this information is lost. It will be considered as representing a single sequence.
- Parameters
max_length (int) – The desired length
stride (int, defaults to 0) – The length of previous content to be included in each overflowing piece
direction (str, defaults to right) – Truncate direction
- type_ids
The generated type IDs
Generally used for tasks like sequence classification or question answering, these tokens let the LM know which input sequence corresponds to each token.
- Returns
The list of type ids
- Return type
List[int]
- word_ids
The generated word indices.
They represent the index of the word associated to each token. When the input is pre-tokenized, they correspond to the ID of the given input label, otherwise they correspond to the word indices as defined by the PreTokenizer that was used.
For special tokens and such (any token that was generated from something that was not part of the input), the output is None
- Returns
A list of optional word indices.
- Return type
A List of Optional[int]
- word_to_chars(word_index, sequence_index=0)
Get the offsets of the word at the given index in one of the input sequences.
- Parameters
word_index (int) – The index of a word in one of the input sequences.
sequence_index (int, defaults to 0) – The index of the sequence that contains the target word
- Returns
The range of characters (span) (first, last + 1)
- Return type
Tuple[int, int]
- word_to_tokens(word_index, sequence_index=0)
Get the encoded tokens corresponding to the word at the given index in one of the input sequences.
- Parameters
word_index (int) – The index of a word in one of the input sequences.
sequence_index (int, defaults to 0) – The index of the sequence that contains the target word
- Returns
The range of tokens: (first, last + 1)
- Return type
Tuple[int, int]
- words
The generated word indices.
Warning
This is deprecated and will be removed in a future version. Please use word_ids instead.
They represent the index of the word associated to each token. When the input is pre-tokenized, they correspond to the ID of the given input label, otherwise they correspond to the word indices as defined by the PreTokenizer that was used.
For special tokens and such (any token that was generated from something that was not part of the input), the output is None
- Returns
A list of optional word indices.
- Return type
A List of Optional[int]
Added Tokens
- class tokenizers.AddedToken(self, content, single_word=False, lstrip=False, rstrip=False, normalized=True)
Represents a token that can be added to a Tokenizer. It can have special options that define the way it should behave.
- Parameters
content (str) – The content of the token
single_word (bool, defaults to False) – Defines whether this token should only match single words. If True, this token will never match inside of a word. For example the token ing would match on tokenizing if this option is False, but not if it is True. The notion of "inside of a word" is defined by the word boundaries pattern in regular expressions (i.e. the token should start and end with word boundaries).
lstrip (bool, defaults to False) – Defines whether this token should strip all potential whitespaces on its left side. If True, this token will greedily match any whitespace on its left. For example if we try to match the token [MASK] with lstrip=True, in the text "I saw a [MASK]", we would match on " [MASK]". (Note the space on the left).
rstrip (bool, defaults to False) – Defines whether this token should strip all potential whitespaces on its right side. If True, this token will greedily match any whitespace on its right. It works just like lstrip but on the right.
normalized (bool, defaults to True with add_tokens() and False with add_special_tokens()) – Defines whether this token should match against the normalized version of the input text. For example, with the added token "yesterday", and a normalizer in charge of lowercasing the text, the token could be extracted from the input "I saw a lion Yesterday".
- content
Get the content of this AddedToken
- normalized
Get the value of the normalized option
- single_word
Get the value of the single_word option
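A minimal sketch of these options, assuming tokenizer is an existing Tokenizer instance:

    from tokenizers import AddedToken

    mask = AddedToken("[MASK]", lstrip=True)   # greedily matches the whitespace on its left
    ing = AddedToken("ing", single_word=True)  # will not match inside "tokenizing"
    tokenizer.add_special_tokens([mask])
    tokenizer.add_tokens([ing])
    print(mask.content, ing.single_word)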
Models
- class tokenizers.models.BPE(self, vocab=None, merges=None, cache_capacity=None, dropout=None, unk_token=None, continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=None)
An implementation of the BPE (Byte-Pair Encoding) algorithm
- Parameters
vocab (Dict[str, int], optional) – A dictionary of string keys and their ids {"am": 0,...}
merges (List[Tuple[str, str]], optional) – A list of pairs of tokens (Tuple[str, str]) [("a", "b"),...]
cache_capacity (int, optional) – The number of words that the BPE cache can contain. The cache speeds up the process by keeping the result of the merge operations for a number of words.
dropout (float, optional) – A float between 0 and 1 that represents the BPE dropout to use.
unk_token (str, optional) – The unknown token to be used by the model.
continuing_subword_prefix (str, optional) – The prefix to attach to subword units that don't represent a beginning of word.
end_of_word_suffix (str, optional) – The suffix to attach to subword units that represent an end of word.
fuse_unk (bool, optional) – Whether to fuse any subsequent unknown tokens into a single one
- from_file(vocab, merge, **kwargs)
Instantiate a BPE model from the given files.
This method is roughly equivalent to doing:
    vocab, merges = BPE.read_file(vocab_filename, merges_filename)
    bpe = BPE(vocab, merges)
If you don't need to keep the vocab, merges values lying around, this method is more optimized than manually calling read_file() to initialize a BPE
- Parameters
vocab (str) – The path to a vocab.json file
merges (str) – The path to a merges.txt file
- Returns
An instance of BPE loaded from these files
- Return type
BPE
- static read_file(self, vocab, merges)
Read a vocab.json and a merges.txt file
This method provides a way to read and parse the content of these files, returning the relevant data structures. If you want to instantiate some BPE models from memory, this method gives you the expected input from the standard files.
- Parameters
vocab (str) – The path to a vocab.json file
merges (str) – The path to a merges.txt file
- Returns
The vocabulary and merges loaded into memory
- Return type
A Tuple with the vocab and the merges
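A minimal sketch of loading a BPE model from files and wrapping it in a Tokenizer (the file paths are placeholders for previously saved BPE files):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    bpe = BPE.from_file("vocab.json", "merges.txt", unk_token="[UNK]")
    tokenizer = Tokenizer(bpe)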
- class tokenizers.models.Model
Base class for all models
The model represents the actual tokenization algorithm. This is the part that will contain and manage the learned vocabulary.
This class cannot be constructed directly. Please use one of the concrete models.
- get_trainer()
Get the associated Trainer
Retrieve the Trainer associated to this Model.
- Returns
The Trainer used to train this model
- Return type
Trainer
- id_to_token(id)
Get the token associated to an ID
- Parameters
id (int) – An ID to convert to a token
- Returns
The token associated to the ID
- Return type
str
- save(folder, prefix)
Save the current model
Save the current model in the given folder, using the given prefix for the various files that will get created. Any file with the same name that already exists in this folder will be overwritten.
- Parameters
folder (str) – The path to the target folder in which to save the various files
prefix (str, optional) – An optional prefix, used to prefix each file name
- Returns
The list of saved files
- Return type
List[str]
- token_to_id(token)
Get the ID associated to a token
- Parameters
token (str) – A token to convert to an ID
- Returns
The ID associated to the token
- Return type
int
- tokenize(sequence)
Tokenize a sequence
- Parameters
sequence (str) – A sequence to tokenize
- Returns
The generated tokens
- Return type
A List of Token
- class tokenizers.models.Unigram(self, vocab)
An implementation of the Unigram algorithm
- Parameters
vocab (List[Tuple[str, float]], optional) – A list of vocabulary items and their relative score [("am", -0.2442), ...]
- class tokenizers.models.WordLevel(self, vocab, unk_token)
An implementation of the WordLevel algorithm
The most simple tokenizer model, based on mapping tokens to their corresponding id.
- Parameters
vocab (Dict[str, int], optional) – A dictionary of string keys and their ids {"am": 0,...}
unk_token (str, optional) – The unknown token to be used by the model.
- from_file(vocab, unk_token)
Instantiate a WordLevel model from the given file
This method is roughly equivalent to doing:
    vocab = WordLevel.read_file(vocab_filename)
    wordlevel = WordLevel(vocab)
If you don't need to keep the vocab values lying around, this method is more optimized than manually calling read_file() to initialize a WordLevel
- Parameters
vocab (str) – The path to a vocab.json file
- Returns
An instance of WordLevel loaded from file
- Return type
WordLevel
- static read_file(vocab)
Read a vocab.json
This method provides a way to read and parse the content of a vocabulary file, returning the relevant data structures. If you want to instantiate some WordLevel models from memory, this method gives you the expected input from the standard files.
- Parameters
vocab (str) – The path to a vocab.json file
- Returns
The vocabulary as a dict
- Return type
Dict[str, int]
- class tokenizers.models.WordPiece(self, vocab, unk_token, max_input_chars_per_word)
An implementation of the WordPiece algorithm
- Parameters
vocab (Dict[str, int], optional) – A dictionary of string keys and their ids {"am": 0,...}
unk_token (str, optional) – The unknown token to be used by the model.
max_input_chars_per_word (int, optional) – The maximum number of characters to authorize in a single word.
- from_file(vocab, **kwargs)
Instantiate a WordPiece model from the given file
This method is roughly equivalent to doing:
    vocab = WordPiece.read_file(vocab_filename)
    wordpiece = WordPiece(vocab)
If you don't need to keep the vocab values lying around, this method is more optimized than manually calling read_file() to initialize a WordPiece
- Parameters
vocab (str) – The path to a vocab.txt file
- Returns
An instance of WordPiece loaded from file
- Return type
WordPiece
- static read_file(vocab)
Read a vocab.txt file
This method provides a way to read and parse the content of a standard vocab.txt file as used by the WordPiece Model, returning the relevant data structures. If you want to instantiate some WordPiece models from memory, this method gives you the expected input from the standard files.
- Parameters
vocab (str) – The path to a vocab.txt file
- Returns
The vocabulary as a dict
- Return type
Dict[str, int]
Normalizers
- class tokenizers.normalizers.BertNormalizer(self, clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True)
Takes care of normalizing raw text before giving it to a Bert model. This includes cleaning the text, handling accents, Chinese chars and lowercasing.
- Parameters
clean_text (bool, optional, defaults to True) – Whether to clean the text, by removing any control characters and replacing all whitespaces by the classic one.
handle_chinese_chars (bool, optional, defaults to True) – Whether to handle Chinese chars by putting spaces around them.
strip_accents (bool, optional) – Whether to strip all accents. If this option is not specified (i.e. == None), then it will be determined by the value for lowercase (as in the original Bert).
lowercase (bool, optional, defaults to True) – Whether to lowercase.
- class tokenizers.normalizers.Lowercase(self)
Lowercase Normalizer
- class tokenizers.normalizers.NFC(self)
NFC Unicode Normalizer
- class tokenizers.normalizers.NFD(self)
NFD Unicode Normalizer
- class tokenizers.normalizers.NFKC(self)
NFKC Unicode Normalizer
- class tokenizers.normalizers.NFKD(self)
NFKD Unicode Normalizer
- class tokenizers.normalizers.Nmt(self)
Nmt normalizer
- class tokenizers.normalizers.Normalizer
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.
- normalize(normalized)
Normalize a NormalizedString in-place
This method allows you to modify a NormalizedString while keeping track of the alignment information. If you just want to see the result of the normalization on a raw string, you can use normalize_str()
- Parameters
normalized (NormalizedString) – The normalized string on which to apply this Normalizer
- normalize_str(sequence)
Normalize the given string
This method provides a way to visualize the effect of a Normalizer, but it does not keep track of the alignment information. If you need to get/convert offsets, you can use normalize()
- Parameters
sequence (str) – A string to normalize
- Returns
A string after normalization
- Return type
str
- class tokenizers.normalizers.Precompiled(self, precompiled_charsmap)
Precompiled normalizer. Don't use it manually; it is used for compatibility with SentencePiece.
- class tokenizers.normalizers.Replace(self, pattern, content)
Replace normalizer
- class tokenizers.normalizers.Sequence
Allows concatenating multiple other Normalizers as a Sequence. All the normalizers run in sequence in the given order
- Parameters
normalizers (List[Normalizer]) – A list of Normalizers to be run as a sequence
- class tokenizers.normalizers.Strip(self, left=True, right=True)
Strip normalizer
- class tokenizers.normalizers.StripAccents(self)
StripAccents normalizer
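A small example of composing normalizers and previewing their effect with normalize_str():

    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, Lowercase, StripAccents

    normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
    print(normalizer.normalize_str("Héllò HOW are ü?"))
    # "hello how are u?"
    # To use it inside a pipeline, attach it to a tokenizer: tokenizer.normalizer = normalizer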
Pre-tokenizers
- class tokenizers.pre_tokenizers.BertPreTokenizer(self)
This pre-tokenizer splits tokens on spaces, and also on punctuation. Each occurrence of a punctuation character will be treated separately.
- class tokenizers.pre_tokenizers.ByteLevel(self, add_prefix_space=True, use_regex=True)
ByteLevel PreTokenizer
This pre-tokenizer takes care of replacing all bytes of the given string with a corresponding representation, as well as splitting into words.
- Parameters
add_prefix_space (bool, optional, defaults to True) – Whether to add a space to the first word if there isn't already one. This lets us treat hello exactly like say hello.
- static alphabet()
Returns the alphabet used by this PreTokenizer.
Since the ByteLevel works as its name suggests, at the byte level, it encodes each byte value to a unique visible character. This means that there is a total of 256 different characters composing this alphabet.
- Returns
A list of characters that compose the alphabet
- Return type
List[str]
- class tokenizers.pre_tokenizers.CharDelimiterSplit
This pre-tokenizer simply splits on the provided char. Works like .split(delimiter)
- Parameters
delimiter (str) – The delimiter char that will be used to split input
- class tokenizers.pre_tokenizers.Digits(self, individual_digits=False)
This pre-tokenizer simply splits on digits, placing them in separate tokens
- Parameters
individual_digits (bool, optional, defaults to False) – If set to True, digits will each be separated as follows:
    "Call 123 please" -> "Call ", "1", "2", "3", " please"
If set to False, digits will be grouped as follows:
    "Call 123 please" -> "Call ", "123", " please"
- class tokenizers.pre_tokenizers.Metaspace(self, replacement='▁', add_prefix_space=True)
Metaspace pre-tokenizer
This pre-tokenizer replaces any whitespace by the provided replacement character. It then tries to split on these spaces.
- Parameters
replacement (str, optional, defaults to ▁) – The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).
add_prefix_space (bool, optional, defaults to True) – Whether to add a space to the first word if there isn't already one. This lets us treat hello exactly like say hello.
- class tokenizers.pre_tokenizers.PreTokenizer
Base class for all pre-tokenizers
This class is not supposed to be instantiated directly. Instead, any implementation of a PreTokenizer will return an instance of this class when instantiated.
- pre_tokenize(pretok)
Pre-tokenize a PyPreTokenizedString in-place
This method allows you to modify a PreTokenizedString to keep track of the pre-tokenization, and leverage the capabilities of the PreTokenizedString. If you just want to see the result of the pre-tokenization of a raw string, you can use pre_tokenize_str()
- Parameters
pretok (PreTokenizedString) – The pre-tokenized string on which to apply this PreTokenizer
- pre_tokenize_str(sequence)
Pre-tokenize the given string
This method provides a way to visualize the effect of a PreTokenizer, but it does not keep track of the alignment, nor does it provide all the capabilities of the PreTokenizedString. If you need some of these, you can use pre_tokenize()
- Parameters
sequence (str) – A string to pre-tokenize
- Returns
A list of tuples with the pre-tokenized parts and their offsets
- Return type
List[Tuple[str, Offsets]]
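For example, previewing the effect of the Whitespace pre-tokenizer:

    from tokenizers.pre_tokenizers import Whitespace

    pre_tokenizer = Whitespace()
    print(pre_tokenizer.pre_tokenize_str("Hello! How are you?"))
    # [('Hello', (0, 5)), ('!', (5, 6)), ('How', (7, 10)), ('are', (11, 14)), ('you', (15, 18)), ('?', (18, 19))]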
- class tokenizers.pre_tokenizers.Punctuation(self, behavior='isolated')
This pre-tokenizer simply splits on punctuation as individual characters.
- Parameters
behavior (SplitDelimiterBehavior) – The behavior to use when splitting. Choices: "removed", "isolated" (default), "merged_with_previous", "merged_with_next", "contiguous"
- class tokenizers.pre_tokenizers.Sequence(self, pretokenizers)
This pre-tokenizer composes other pre_tokenizers and applies them in sequence
- class tokenizers.pre_tokenizers.Split(self, pattern, behavior, invert=False)
Split PreTokenizer
This versatile pre-tokenizer splits using the provided pattern and according to the provided behavior. The pattern can be inverted by making use of the invert flag.
- Parameters
pattern (str or Regex) – A pattern used to split the string. Usually a string or a Regex
behavior (SplitDelimiterBehavior) – The behavior to use when splitting. Choices: "removed", "isolated", "merged_with_previous", "merged_with_next", "contiguous"
invert (bool, optional, defaults to False) – Whether to invert the pattern.
- class tokenizers.pre_tokenizers.UnicodeScripts(self)
This pre-tokenizer splits on characters that belong to different language families. It roughly follows https://github.com/google/sentencepiece/blob/master/data/Scripts.txt. Actually Hiragana and Katakana are fused with Han, and 0x30FC is Han too. This mimics the SentencePiece Unigram implementation.
- class tokenizers.pre_tokenizers.Whitespace(self)
This pre-tokenizer simply splits using the following regex: \w+|[^\w\s]+
- class tokenizers.pre_tokenizers.WhitespaceSplit(self)
This pre-tokenizer simply splits on whitespace. Works like .split()
Post-processor
- class tokenizers.processors.BertProcessing(self, sep, cls)
This post-processor takes care of adding the special tokens needed by a Bert model:
a SEP token
a CLS token
- Parameters
sep (Tuple[str, int]) – A tuple with the string representation of the SEP token, and its id
cls (Tuple[str, int]) – A tuple with the string representation of the CLS token, and its id
- class tokenizers.processors.ByteLevel(self, trim_offsets=True)
This post-processor takes care of trimming the offsets.
By default, the ByteLevel BPE might include whitespaces in the produced tokens. If you don't want the offsets to include these whitespaces, then this PostProcessor must be used.
- Parameters
trim_offsets (bool) – Whether to trim the whitespaces from the produced offsets.
- class tokenizers.processors.PostProcessor
Base class for all post-processors
This class is not supposed to be instantiated directly. Instead, any implementation of a PostProcessor will return an instance of this class when instantiated.
- num_special_tokens_to_add(is_pair)
Return the number of special tokens that would be added for single/pair sentences.
- Parameters
is_pair (bool) – Whether the input would be a pair of sequences
- Returns
The number of tokens to add
- Return type
int
- process(encoding, pair=None, add_special_tokens=True)
Post-process the given encodings, generating the final one
- class tokenizers.processors.RobertaProcessing(self, sep, cls, trim_offsets=True, add_prefix_space=True)
This post-processor takes care of adding the special tokens needed by a Roberta model:
a SEP token
a CLS token
It also takes care of trimming the offsets. By default, the ByteLevel BPE might include whitespaces in the produced tokens. If you don't want the offsets to include these whitespaces, then this PostProcessor should be initialized with trim_offsets=True
- Parameters
sep (Tuple[str, int]) – A tuple with the string representation of the SEP token, and its id
cls (Tuple[str, int]) – A tuple with the string representation of the CLS token, and its id
trim_offsets (bool, optional, defaults to True) – Whether to trim the whitespaces from the produced offsets.
add_prefix_space (bool, optional, defaults to True) – Whether the add_prefix_space option was enabled during pre-tokenization. This is relevant because it defines the way the offsets are trimmed out.
- class tokenizers.processors.TemplateProcessing(self, single, pair, special_tokens)
Provides a way to specify templates in order to add the special tokens to each input sequence as relevant.
Let's take BERT tokenizer as an example. It uses two special tokens, used to delimit each sequence. [CLS] is always used at the beginning of the first sequence, and [SEP] is added at the end of both the first, and the pair sequences. The final result looks like this:
Single sequence: [CLS] Hello there [SEP]
Pair sequences: [CLS] My name is Anthony [SEP] What is my name? [SEP]
With the type ids as following:
    [CLS]  ...  [SEP]  ...  [SEP]
      0     0     0     1     1
You can achieve such behavior using a TemplateProcessing:
    TemplateProcessing(
        single="[CLS] $0 [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[("[CLS]", 1), ("[SEP]", 0)],
    )
In this example, each input sequence is identified using a $ construct. This identifier lets us specify each input sequence, and the type_id to use. When nothing is specified, it uses the default values. Here are the different ways to specify it:
Specifying the sequence, with default type_id == 0: $A or $B
Specifying the type_id with default sequence == A: $0, $1, $2, …
Specifying both: $A:0, $B:1, …
The same construct is used for special tokens: <identifier>(:<type_id>)?
Warning: You must ensure that you are giving the correct tokens/ids as these will be added to the Encoding without any further check. If the given ids correspond to something totally different in a Tokenizer using this PostProcessor, it might lead to unexpected results.
- Parameters
single (Template) – The template used for single sequences
pair (Template) – The template used when both sequences are specified
special_tokens (Tokens) – The list of special tokens used in each sequence
Types:
- Template (str or List):
If a str is provided, the whitespace is used as delimiter between tokens
If a List[str] is provided, a list of tokens
- Tokens (List[Union[Tuple[int, str], Tuple[str, int], dict]]):
A Tuple with both a token and its associated ID, in any order
A dict with the following keys:
"id": str => The special token id, as specified in the Template
"ids": List[int] => The associated IDs
"tokens": List[str] => The associated tokens
The given dict expects the provided ids and tokens lists to have the same length.
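A usage sketch, assuming tokenizer is a Tokenizer whose vocabulary already contains "[CLS]" and "[SEP]" (the ids passed in special_tokens must match the tokenizer's own ids):

    from tokenizers.processors import TemplateProcessing

    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )
    output = tokenizer.encode("My name is Anthony", "What is my name?")
    print(output.tokens)
    print(output.type_ids)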
Trainers
- class tokenizers.trainers.BpeTrainer
Trainer capable of training a BPE model
- Parameters
vocab_size (int, optional) – The size of the final vocabulary, including all tokens and alphabet.
min_frequency (int, optional) – The minimum frequency a pair should have in order to be merged.
show_progress (bool, optional) – Whether to show progress bars while training.
special_tokens (List[Union[str, AddedToken]], optional) – A list of special tokens the model should know of.
limit_alphabet (int, optional) – The maximum different characters to keep in the alphabet.
initial_alphabet (List[str], optional) – A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
continuing_subword_prefix (str, optional) – A prefix to be used for every subword that is not a beginning-of-word.
end_of_word_suffix (str, optional) – A suffix to be used for every subword that is an end-of-word.
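A typical end-to-end training sketch using this trainer (the corpus path is a placeholder):

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=30000,
        min_frequency=2,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    )
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder training file
    tokenizer.save("tokenizer.json")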
- class tokenizers.trainers.Trainer
Base class for all trainers
This class is not supposed to be instantiated directly. Instead, any implementation of a Trainer will return an instance of this class when instantiated.
- class tokenizers.trainers.UnigramTrainer(self, vocab_size=8000, show_progress=True, special_tokens=[], shrinking_factor=0.75, unk_token=None, max_piece_length=16, n_sub_iterations=2)
Trainer capable of training a Unigram model
- Parameters
vocab_size (int) – The size of the final vocabulary, including all tokens and alphabet.
show_progress (bool) – Whether to show progress bars while training.
special_tokens (List[Union[str, AddedToken]]) – A list of special tokens the model should know of.
initial_alphabet (List[str]) – A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
shrinking_factor (float) – The shrinking factor used at each step of the training to prune the vocabulary.
unk_token (str) – The token used for out-of-vocabulary tokens.
max_piece_length (int) – The maximum length of a given token.
n_sub_iterations (int) – The number of iterations of the EM algorithm to perform before pruning the vocabulary.
- class tokenizers.trainers.WordLevelTrainer
Trainer capable of training a WordLevel model
- Parameters
vocab_size (int, optional) – The size of the final vocabulary, including all tokens and alphabet.
min_frequency (int, optional) – The minimum frequency a pair should have in order to be merged.
show_progress (bool, optional) – Whether to show progress bars while training.
special_tokens (List[Union[str, AddedToken]]) – A list of special tokens the model should know of.
- class tokenizers.trainers.WordPieceTrainer(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix='##', end_of_word_suffix=None)
Trainer capable of training a WordPiece model
- Parameters
vocab_size (int, optional) – The size of the final vocabulary, including all tokens and alphabet.
min_frequency (int, optional) – The minimum frequency a pair should have in order to be merged.
show_progress (bool, optional) – Whether to show progress bars while training.
special_tokens (List[Union[str, AddedToken]], optional) – A list of special tokens the model should know of.
limit_alphabet (int, optional) – The maximum different characters to keep in the alphabet.
initial_alphabet (List[str], optional) – A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
continuing_subword_prefix (str, optional) – A prefix to be used for every subword that is not a beginning-of-word.
end_of_word_suffix (str, optional) – A suffix to be used for every subword that is an end-of-word.
Decoders
- class tokenizers.decoders.BPEDecoder(self, suffix='</w>')
BPEDecoder Decoder
- Parameters
suffix (str, optional, defaults to </w>) – The suffix that was used to characterize an end-of-word. This suffix will be replaced by whitespaces during the decoding
- class tokenizers.decoders.ByteLevel(self)
ByteLevel Decoder
This decoder is to be used in tandem with the ByteLevel PreTokenizer.
- class tokenizers.decoders.CTC(self, pad_token='<pad>', word_delimiter_token='|', cleanup=True)
CTC Decoder
- Parameters
pad_token (str, optional, defaults to <pad>) – The pad token used by CTC to delimit a new token.
word_delimiter_token (str, optional, defaults to |) – The word delimiter token. It will be replaced by a <space>
cleanup (bool, optional, defaults to True) – Whether to cleanup some tokenization artifacts. Mainly spaces before punctuation, and some abbreviated english forms.
- class tokenizers.decoders.Decoder
Base class for all decoders
This class is not supposed to be instantiated directly. Instead, any implementation of a Decoder will return an instance of this class when instantiated.
- decode(tokens)
Decode the given list of tokens to a final string
- Parameters
tokens (List[str]) – The list of tokens to decode
- Returns
The decoded string
- Return type
str
- class tokenizers.decoders.Metaspace
Metaspace Decoder
- Parameters
replacement (str, optional, defaults to ▁) – The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).
add_prefix_space (bool, optional, defaults to True) – Whether to add a space to the first word if there isn't already one. This lets us treat hello exactly like say hello.
- class tokenizers.decoders.WordPiece(self, prefix='##', cleanup=True)
WordPiece Decoder
- Parameters
prefix (str, optional, defaults to ##) – The prefix to use for subwords that are not a beginning-of-word
cleanup (bool, optional, defaults to True) – Whether to cleanup some tokenization artifacts. Mainly spaces before punctuation, and some abbreviated english forms.
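A short sketch of plugging a decoder into a tokenizer, assuming tokenizer is a WordPiece-based Tokenizer (for example a BERT one); the exact subwords shown depend on its vocabulary:

    from tokenizers import decoders

    tokenizer.decoder = decoders.WordPiece(prefix="##")
    output = tokenizer.encode("unaffable")
    print(output.tokens)                 # e.g. ['un', '##aff', '##able'], depending on the vocabulary
    print(tokenizer.decode(output.ids))  # the subwords are merged back into "unaffable"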
Visualizer
- class tokenizers.tools.Annotation(start: int, end: int, label: str)
- class tokenizers.tools.EncodingVisualizer(tokenizer: tokenizers.Tokenizer, default_to_notebook: bool = True, annotation_converter: Optional[Callable[[Any], tokenizers.tools.visualizer.Annotation]] = None)
Build an EncodingVisualizer
- Parameters
tokenizer (Tokenizer) – A tokenizer instance
default_to_notebook (bool) – Whether to render html output in a notebook by default
annotation_converter (Callable, optional) – An optional (lambda) function that takes an annotation in any format and returns an Annotation object
- __call__(text: str, annotations: List[tokenizers.tools.visualizer.Annotation] = [], default_to_notebook: Optional[bool] = None) -> Optional[str]
Build a visualization of the given text
- Parameters
text (str) – The text to tokenize
annotations (List[Annotation], optional) – An optional list of annotations of the text. They can either be an annotation class or anything else if you instantiated the visualizer with a converter function
default_to_notebook (bool, optional, defaults to False) – If True, will render the html in a notebook. Otherwise returns an html string.
- Returns
The HTML string if default_to_notebook is False, otherwise (default) returns None and renders the HTML in the notebook
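A notebook usage sketch (the model identifier and the character spans in the annotations are illustrative):

    from tokenizers import Tokenizer
    from tokenizers.tools import Annotation, EncodingVisualizer

    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    visualizer = EncodingVisualizer(tokenizer)

    text = "Hugging Face is based in New York City"
    annotations = [
        Annotation(start=0, end=12, label="ORG"),
        Annotation(start=25, end=38, label="LOC"),
    ]
    visualizer(text, annotations=annotations)  # renders HTML inline when run in a notebook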