Input sequences

These types represent all the different kinds of sequences that can be used as input to a Tokenizer. In general, any sequence can be either a string or a list of strings, depending on the operating mode of the tokenizer: raw text vs. pre-tokenized.

tokenizers.TextInputSequence = <class 'str'>

A str that represents an input sequence

tokenizers.PreTokenizedInputSequence

A pre-tokenized input sequence. Can be one of:

  • A List of str

  • A Tuple of str

alias of Union[List[str], Tuple[str]]

tokenizers.InputSequence

Represents all the possible types of input sequences for encoding. Can be either:

  • A TextInputSequence

  • A PreTokenizedInputSequence

alias of Union[str, List[str], Tuple[str]]

Encode inputs

These types represent all the different kinds of input that a Tokenizer accepts when using encode_batch().

tokenizers.TextEncodeInput

Represents a textual input for encoding. Can be either:

  • A single TextInputSequence

  • A pair (Tuple or List) of TextInputSequence

alias of Union[str, Tuple[str, str], List[str]]

tokenizers.PreTokenizedEncodeInput

Represents a pre-tokenized input for encoding. Can be either:

  • A single PreTokenizedInputSequence

  • A pair (Tuple or List) of PreTokenizedInputSequence

alias of Union[List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]

tokenizers.EncodeInput

Represents all the possible types of input for encoding. Can be either:

  • A TextEncodeInput

  • A PreTokenizedEncodeInput

alias of Union[str, Tuple[str, str], List[str], Tuple[str], Tuple[Union[List[str], Tuple[str]], Union[List[str], Tuple[str]]], List[Union[List[str], Tuple[str]]]]

Tokenizer

class tokenizers.Tokenizer(self, model)

A Tokenizer works as a pipeline. It processes some raw text as input and outputs an Encoding.

Parameters

model (Model) – The core algorithm that this Tokenizer should be using.
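
Example

As a minimal sketch (assuming a BPE model; the vocabulary is empty until training), a Tokenizer can be built like this:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

# Wrap an (untrained) BPE model in a Tokenizer and pick a pre-tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Once the model is trained (see train() / train_from_iterator()),
# the pipeline turns raw text into an Encoding:
# encoding = tokenizer.encode("Hello, world!")
# print(encoding.tokens, encoding.ids)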

add_special_tokens(tokens)

Add the given special tokens to the Tokenizer.

If these tokens are already part of the vocabulary, it just lets the Tokenizer know about them. If they don’t exist, the Tokenizer creates them, giving them a new id.

These special tokens will never be processed by the model (i.e. they won’t be split into multiple tokens), and they can be removed from the output when decoding.

Parameters

tokens (A List of AddedToken or str) – The list of special tokens we want to add to the vocabulary. Each token can either be a string or an instance of AddedToken for more customization.

Returns

The number of tokens that were created in the vocabulary

Return type

int

add_tokens(tokens)

Add the given tokens to the vocabulary

The given tokens are added only if they don’t already exist in the vocabulary. Each token is then assigned a new id.

Parameters

tokens (A List of AddedToken or str) – The list of tokens we want to add to the vocabulary. Each token can be either a string or an instance of AddedToken for more customization.

Returns

The number of tokens that were created in the vocabulary

Return type

int
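
Example

A small sketch of adding tokens; the tokenizer and the token contents used here are arbitrary placeholders:

from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Regular added tokens are matched against the input text
num_added = tokenizer.add_tokens(["my_new_token", AddedToken("ing", single_word=True)])

# Special tokens are never split by the model and can be skipped when decoding
num_special = tokenizer.add_special_tokens(["[CLS]", "[SEP]", AddedToken("[MASK]", lstrip=True)])

print(num_added, num_special)  # how many entries were actually created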

decode(ids, skip_special_tokens=True)

Decode the given list of ids back to a string

This is used to decode anything coming back from a Language Model

Parameters
  • ids (A List/Tuple of int) – The list of ids that we want to decode

  • skip_special_tokens (bool, defaults to True) – Whether the special tokens should be removed from the decoded string

Returns

The decoded string

Return type

str

decode_batch(sequences, skip_special_tokens=True)

Decode a batch of ids back to their corresponding string

Parameters
  • sequences (List of List[int]) – The batch of sequences we want to decode

  • skip_special_tokens (bool, defaults to True) – Whether the special tokens should be removed from the decoded strings

Returns

A list of decoded strings

Return type

List[str]
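
Example

A sketch of decoding, assuming network access to the Hugging Face Hub and using bert-base-uncased as an example tokenizer:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer.encode("Hello, y'all!").ids
print(tokenizer.decode(ids))                             # special tokens removed
print(tokenizer.decode(ids, skip_special_tokens=False))  # keeps [CLS]/[SEP]

batch = tokenizer.encode_batch(["Hello, y'all!", "How are you?"])
print(tokenizer.decode_batch([encoding.ids for encoding in batch]))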

decoder

The optional Decoder in use by the Tokenizer

enable_padding(direction='right', pad_id=0, pad_type_id=0, pad_token='[PAD]', length=None, pad_to_multiple_of=None)

Enable the padding

Parameters
  • direction (str, optional, defaults to right) – The direction in which to pad. Can be either right or left

  • pad_to_multiple_of (int, optional) – If specified, the padding length should always snap to the next multiple of the given value. For example, if we were going to pad with a length of 250 but pad_to_multiple_of=8, then we will pad to 256.

  • pad_id (int, defaults to 0) – The id to be used when padding

  • pad_type_id (int, defaults to 0) – The type id to be used when padding

  • pad_token (str, defaults to [PAD]) – The pad token to be used when padding

  • length (int, optional) – If specified, the length at which to pad. If not specified we pad using the size of the longest sequence in a batch.
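
Example

A sketch of enabling padding, assuming a tokenizer loaded from the Hugging Face Hub (bert-base-uncased uses id 0 for its [PAD] token):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", pad_to_multiple_of=8)

encodings = tokenizer.encode_batch(["A short sequence", "A slightly longer sequence"])
print([len(e.ids) for e in encodings])  # all sequences padded to the same length
print(encodings[0].attention_mask)      # 0s mark the padding positions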

enable_truncation(max_length, stride=0, strategy='longest_first', direction='right')

Enable truncation

Parameters
  • max_length (int) – The max length at which to truncate

  • stride (int, optional) – The length of the previous first sequence to be included in the overflowing sequence

  • strategy (str, optional, defaults to longest_first) – The strategy to use for truncation. Can be one of longest_first, only_first or only_second.

  • direction (str, defaults to right) – Truncate direction
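
Example

A sketch of enabling truncation, again assuming the bert-base-uncased tokenizer from the Hub:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
tokenizer.enable_truncation(max_length=8, stride=2)

encoding = tokenizer.encode("This sentence is long enough to be truncated into several pieces")
print(len(encoding.ids))          # at most 8
print(len(encoding.overflowing))  # remaining pieces, each overlapping the previous one by 2 tokens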

encode(sequence, pair=None, is_pretokenized=False, add_special_tokens=True)

Encode the given sequence and pair. This method can process raw text sequences as well as already pre-tokenized sequences.

Example

Here are some examples of the inputs that are accepted:

encode("A single sequence")`
encode("A sequence", "And its pair")`
encode([ "A", "pre", "tokenized", "sequence" ], is_pretokenized=True)`
encode(
    [ "A", "pre", "tokenized", "sequence" ], [ "And", "its", "pair" ],
    is_pretokenized=True
)
Parameters
  • sequence (InputSequence) –

    The main input sequence we want to encode. This sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument: a TextInputSequence when is_pretokenized=False, or a PreTokenizedInputSequence when is_pretokenized=True.

  • pair (InputSequence, optional) – An optional input sequence. The expected format is the same as for sequence.

  • is_pretokenized (bool, defaults to False) – Whether the input is already pre-tokenized

  • add_special_tokens (bool, defaults to True) – Whether to add the special tokens

Returns

The encoded result

Return type

Encoding

encode_batch(input, is_pretokenized=False, add_special_tokens=True)

Encode the given batch of inputs. This method accepts raw text sequences as well as already pre-tokenized sequences.

Example

Here are some examples of the inputs that are accepted:

encode_batch([
    "A single sequence",
    ("A tuple with a sequence", "And its pair"),
    [ "A", "pre", "tokenized", "sequence" ],
    ([ "A", "pre", "tokenized", "sequence" ], "And its pair")
])
Parameters
  • input (A List/Tuple of EncodeInput) –

    A list of single sequences or pair sequences to encode. Each sequence can be either raw text or pre-tokenized, according to the is_pretokenized argument: a TextEncodeInput when is_pretokenized=False, or a PreTokenizedEncodeInput when is_pretokenized=True.

  • is_pretokenized (bool, defaults to False) – Whether the input is already pre-tokenized

  • add_special_tokens (bool, defaults to True) – Whether to add the special tokens

Returns

The encoded batch

Return type

A List of Encoding

static from_buffer(buffer)

Instantiate a new Tokenizer from the given buffer.

Parameters

buffer (bytes) – A buffer containing a previously serialized Tokenizer

Returns

The new tokenizer

Return type

Tokenizer

static from_file(path)

Instantiate a new Tokenizer from the file at the given path.

Parameters

path (str) – A path to a local JSON file representing a previously serialized Tokenizer

Returns

The new tokenizer

Return type

Tokenizer

static from_pretrained(identifier, revision='main', auth_token=None)

Instantiate a new Tokenizer from an existing file on the Hugging Face Hub.

Parameters
  • identifier (str) – The identifier of a Model on the Hugging Face Hub that contains a tokenizer.json file

  • revision (str, defaults to main) – A branch or commit id

  • auth_token (str, optional, defaults to None) – An optional auth token used to access private repositories on the Hugging Face Hub

Returns

The new tokenizer

Return type

Tokenizer

static from_str(json)

Instantiate a new Tokenizer from the given JSON string.

Parameters

json (str) – A valid JSON string representing a previously serialized Tokenizer

Returns

The new tokenizer

Return type

Tokenizer

get_vocab(with_added_tokens=True)

Get the underlying vocabulary

Parameters

with_added_tokens (bool, defaults to True) – Whether to include the added tokens

Returns

The vocabulary

Return type

Dict[str, int]

get_vocab_size(with_added_tokens=True)

Get the size of the underlying vocabulary

Parameters

with_added_tokens (bool, defaults to True) – Whether to include the added tokens

Returns

The size of the vocabulary

Return type

int

id_to_token(id)

Convert the given id to its corresponding token if it exists

Parameters

id (int) – The id to convert

Returns

An optional token, None if out of vocabulary

Return type

Optional[str]

model

The Model in use by the Tokenizer

no_padding()

Disable padding

no_truncation()

Disable truncation

normalizer

The optional Normalizer in use by the Tokenizer

num_special_tokens_to_add(is_pair)

Return the number of special tokens that would be added for single/pair sentences.

Parameters

is_pair (bool) – Whether the input would be a single sentence or a pair

Returns

The number of special tokens that would be added

Return type

int

padding

Get the current padding parameters

Cannot be set, use enable_padding() instead

Returns

A dict with the current padding parameters if padding is enabled

Return type

(dict, optional)

post_process(encoding, pair=None, add_special_tokens=True)

Apply all the post-processing steps to the given encodings.

The various steps are:

  1. Truncate according to the set truncation params (provided with enable_truncation())

  2. Apply the PostProcessor

  3. Pad according to the set padding params (provided with enable_padding())

Parameters
  • encoding (Encoding) – The Encoding corresponding to the main sequence.

  • pair (Encoding, optional) – An optional Encoding corresponding to the pair sequence.

  • add_special_tokens (bool) – Whether to add the special tokens

Returns

The final post-processed encoding

Return type

Encoding

post_processor

The optional PostProcessor in use by the Tokenizer

pre_tokenizer

The optional PreTokenizer in use by the Tokenizer

save(path, pretty=True)

Save the Tokenizer to the file at the given path.

Parameters
  • path (str) – A path to a file in which to save the serialized tokenizer.

  • pretty (bool, defaults to True) – Whether the JSON file should be pretty formatted.

to_str(pretty=False)

Gets a serialized string representing this Tokenizer.

Parameters

pretty (bool, defaults to False) – Whether the JSON string should be pretty formatted.

Returns

A string representing the serialized Tokenizer

Return type

str
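
Example

A sketch of the serialization round trip (save(), from_file(), to_str(), from_str()); the file name is just a placeholder:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # assumes Hub access

# To/from a JSON string
json_str = tokenizer.to_str(pretty=True)
restored = Tokenizer.from_str(json_str)

# To/from a file
tokenizer.save("tokenizer.json")
restored_from_file = Tokenizer.from_file("tokenizer.json")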

token_to_id(token)

Convert the given token to its corresponding id if it exists

Parameters

token (str) – The token to convert

Returns

An optional id, None if out of vocabulary

Return type

Optional[int]

train(files, trainer=None)

Train the Tokenizer using the given files.

Reads the files line by line, while keeping all the whitespace, even new lines. If you want to train from data stored in memory, you can check train_from_iterator()

Parameters
  • files (List[str]) – A list of paths to the files that we should use for training

  • trainer (Trainer, optional) – An optional trainer that should be used to train our Model

train_from_iterator(iterator, trainer=None, length=None)

Train the Tokenizer using the provided iterator.

You can provide anything that is a Python Iterator:

  • A list of sequences List[str]

  • A generator that yields str or List[str]

  • A Numpy array of strings

  • …

Parameters
  • iterator (Iterator) – Any iterator over strings or list of strings

  • trainer (Trainer, optional) – An optional trainer that should be used to train our Model

  • length (int, optional) – The total number of sequences in the iterator. This is used to provide meaningful progress tracking
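
Example

A sketch of training from in-memory data, assuming a BPE model and a BpeTrainer; the corpus and vocabulary size are only illustrative:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = ["First training sentence.", "Second training sentence.", "And so on."]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])

tokenizer.train_from_iterator(corpus, trainer=trainer, length=len(corpus))
print(tokenizer.encode("A second training sentence.").tokens)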

truncation

Get the currently set truncation parameters

Cannot be set, use enable_truncation() instead

Returns

A dict with the current truncation parameters if truncation is enabled

Return type

(dict, optional)

Encoding

class tokenizers.Encoding

The Encoding represents the output of a Tokenizer.

attention_mask

The attention mask

This indicates to the LM which tokens should be attended to, and which should not. This is especially important when batching sequences, where we need to apply padding.

Returns

The attention mask

Return type

List[int]

char_to_token(char_pos, sequence_index=0)

Get the token that contains the char at the given position in the input sequence.

Parameters
  • char_pos (int) – The position of a char in the input string

  • sequence_index (int, defaults to 0) – The index of the sequence that contains the target char

Returns

The index of the token that contains this char in the encoded sequence

Return type

int

char_to_word(char_pos, sequence_index=0)

Get the word that contains the char at the given position in the input sequence.

Parameters
  • char_pos (int) – The position of a char in the input string

  • sequence_index (int, defaults to 0) – The index of the sequence that contains the target char

Returns

The index of the word that contains this char in the input sequence

Return type

int
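
Example

A sketch of mapping between characters, words and tokens, assuming the bert-base-uncased tokenizer from the Hub:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers are great"
encoding = tokenizer.encode(text)

char_pos = text.index("great")
token_index = encoding.char_to_token(char_pos)
print(encoding.tokens[token_index])          # the token covering that character
print(encoding.char_to_word(char_pos))       # 2: "great" is the third word
print(encoding.token_to_chars(token_index))  # its (start, end) offsets in the input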

ids

The generated IDs

The IDs are the main input to a Language Model. They are the token indices, the numerical representations that a LM understands.

Returns

The list of IDs

Return type

List[int]

static merge(encodings, growing_offsets=True)

Merge the list of encodings into one final Encoding

Parameters
  • encodings (A List of Encoding) – The list of encodings that should be merged in one

  • growing_offsets (bool, defaults to True) – Whether the offsets should accumulate while merging

Returns

The resulting Encoding

Return type

Encoding

n_sequences

The number of sequences represented

Returns

The number of sequences in this Encoding

Return type

int

offsets

The offsets associated to each token

These offsets let you slice the input string, and thus retrieve the part of the original text that led to producing the corresponding token.

Returns

The list of offsets

Return type

A List of Tuple[int, int]

overflowing

A List of overflowing Encoding

When using truncation, the Tokenizer takes care of splitting the output into as many pieces as required to match the specified maximum length. This field lets you retrieve all the subsequent pieces.

When you use pairs of sequences, the overflowing pieces will contain enough variations to cover all the possible combinations, while respecting the provided maximum length.

pad(length, direction='right', pad_id=0, pad_type_id=0, pad_token='[PAD]')

Pad the Encoding at the given length

Parameters
  • length (int) – The desired length

  • direction – (str, defaults to right): The expected padding direction. Can be either right or left

  • pad_id (int, defaults to 0) – The ID corresponding to the padding token

  • pad_type_id (int, defaults to 0) – The type ID corresponding to the padding token

  • pad_token (str, defaults to [PAD]) – The pad token to use

sequence_ids

The generated sequence indices.

They represent the index of the input sequence associated to each token. The sequence id can be None if the token is not related to any input sequence, like for example with special tokens.

Returns

A list of optional sequence indices.

Return type

A List of Optional[int]

set_sequence_id(sequence_id)

Set the given sequence index

Set the given sequence index for the whole range of tokens contained in this Encoding.

special_tokens_mask

The special token mask

This indicates which tokens are special tokens, and which are not.

Returns

The special tokens mask

Return type

List[int]

token_to_chars(token_index)

Get the offsets of the token at the given index.

The returned offsets are related to the input sequence that contains the token. In order to determine in which input sequence it belongs, you must call token_to_sequence().

Parameters

token_index (int) – The index of a token in the encoded sequence.

Returns

The token offsets (first, last + 1)

Return type

Tuple[int, int]

token_to_sequence(token_index)

Get the index of the sequence represented by the given token.

In the general use case, this method returns 0 for a single sequence or the first sequence of a pair, and 1 for the second sequence of a pair

Parameters

token_index (int) – The index of a token in the encoded sequence.

Returns

The sequence id of the given token

Return type

int

token_to_word(token_index)

Get the index of the word that contains the token in one of the input sequences.

The returned word index is related to the input sequence that contains the token. In order to determine in which input sequence it belongs, you must call token_to_sequence().

Parameters

token_index (int) – The index of a token in the encoded sequence.

Returns

The index of the word in the relevant input sequence.

Return type

int

tokens

The generated tokens

They are the string representation of the IDs.

Returns

The list of tokens

Return type

List[str]

truncate(max_length, stride=0, direction='right')

Truncate the Encoding at the given length

If this Encoding represents multiple sequences, this information is lost when truncating: the result will be considered as representing a single sequence.

Parameters
  • max_length (int) – The desired length

  • stride (int, defaults to 0) – The length of previous content to be included in each overflowing piece

  • direction (str, defaults to right) – Truncate direction

type_ids

The generated type IDs

Generally used for tasks like sequence classification or question answering, these type IDs let the LM know which input sequence each token corresponds to.

Returns

The list of type ids

Return type

List[int]

word_ids

The generated word indices.

They represent the index of the word associated to each token. When the input is pre-tokenized, they correspond to the ID of the given input label, otherwise they correspond to the word indices as defined by the PreTokenizer that was used.

For special tokens and such (any token that was generated from something that was not part of the input), the output is None

Returns

A list of optional word indices.

Return type

A List of Optional[int]

word_to_chars(word_index, sequence_index=0)

Get the offsets of the word at the given index in one of the input sequences.

Parameters
  • word_index (int) – The index of a word in one of the input sequences.

  • sequence_index (int, defaults to 0) – The index of the sequence that contains the target word

Returns

The range of characters (span) (first, last + 1)

Return type

Tuple[int, int]

word_to_tokens(word_index, sequence_index=0)

Get the encoded tokens corresponding to the word at the given index in one of the input sequences.

Parameters
  • word_index (int) – The index of a word in one of the input sequences.

  • sequence_index (int, defaults to 0) – The index of the sequence that contains the target word

Returns

The range of tokens: (first, last + 1)

Return type

Tuple[int, int]

words

The generated word indices.

Warning

This is deprecated and will be removed in a future version. Please use word_ids instead.

They represent the index of the word associated to each token. When the input is pre-tokenized, they correspond to the ID of the given input label, otherwise they correspond to the word indices as defined by the PreTokenizer that was used.

For special tokens and such (any token that was generated from something that was not part of the input), the output is None

Returns

A list of optional word indices.

Return type

A List of Optional[int]

Added Tokens

class tokenizers.AddedToken(self, content, single_word=False, lstrip=False, rstrip=False, normalized=True)

Represents a token that can be added to a Tokenizer. It can have special options that define the way it should behave.

Parameters
  • content (str) – The content of the token

  • single_word (bool, defaults to False) – Defines whether this token should only match single words. If True, this token will never match inside of a word. For example, the token ing would match on tokenizing if this option is False, but not if it is True. The notion of "inside of a word" is defined by the word boundaries pattern in regular expressions (i.e. the token should start and end with word boundaries).

  • lstrip (bool, defaults to False) – Defines whether this token should strip all potential whitespaces on its left side. If True, this token will greedily match any whitespace on its left. For example if we try to match the token [MASK] with lstrip=True, in the text "I saw a [MASK]", we would match on " [MASK]". (Note the space on the left).

  • rstrip (bool, defaults to False) – Defines whether this token should strip all potential whitespaces on its right side. If True, this token will greedily match any whitespace on its right. It works just like lstrip but on the right.

  • normalized (bool, defaults to True with add_tokens() and False with add_special_tokens()) – Defines whether this token should match against the normalized version of the input text. For example, with the added token "yesterday", and a normalizer in charge of lowercasing the text, the token could be extracted from the input "I saw a lion Yesterday".

content

Get the content of this AddedToken

lstrip

Get the value of the lstrip option

normalized

Get the value of the normalized option

rstrip

Get the value of the rstrip option

single_word

Get the value of the single_word option

Models

class tokenizers.models.BPE(self, vocab=None, merges=None, cache_capacity=None, dropout=None, unk_token=None, continuing_subword_prefix=None, end_of_word_suffix=None, fuse_unk=None)

An implementation of the BPE (Byte-Pair Encoding) algorithm

Parameters
  • vocab (Dict[str, int], optional) – A dictionary of string keys and their ids {"am": 0,...}

  • merges (List[Tuple[str, str]], optional) – A list of pairs of tokens (Tuple[str, str]) [("a", "b"),...]

  • cache_capacity (int, optional) – The number of words that the BPE cache can contain. The cache speeds up the process by keeping the result of the merge operations for a number of words.

  • dropout (float, optional) – A float between 0 and 1 that represents the BPE dropout to use.

  • unk_token (str, optional) – The unknown token to be used by the model.

  • continuing_subword_prefix (str, optional) – The prefix to attach to subword units that don’t represent a beginning of word.

  • end_of_word_suffix (str, optional) – The suffix to attach to subword units that represent an end of word.

  • fuse_unk (bool, optional) – Whether to fuse any subsequent unknown tokens into a single one

from_file(vocab, merges, **kwargs)

Instantiate a BPE model from the given files.

This method is roughly equivalent to doing:

vocab, merges = BPE.read_file(vocab_filename, merges_filename)
bpe = BPE(vocab, merges)

If you don’t need to keep the vocab and merges values lying around, this method is more optimized than manually calling read_file() to initialize a BPE

Parameters
  • vocab (str) – The path to a vocab.json file

  • merges (str) – The path to a merges.txt file

Returns

An instance of BPE loaded from these files

Return type

BPE

static read_file(vocab, merges)

Read a vocab.json and a merges.txt file

This method provides a way to read and parse the content of these files, returning the relevant data structures. If you want to instantiate some BPE models from memory, this method gives you the expected input from the standard files.

Parameters
  • vocab (str) – The path to a vocab.json file

  • merges (str) – The path to a merges.txt file

Returns

The vocabulary and merges loaded into memory

Return type

A Tuple with the vocab and the merges

class tokenizers.models.Model

Base class for all models

The model represents the actual tokenization algorithm. This is the part that will contain and manage the learned vocabulary.

This class cannot be constructed directly. Please use one of the concrete models.

get_trainer()

Get the associated Trainer

Retrieve the Trainer associated to this Model.

Returns

The Trainer used to train this model

Return type

Trainer

id_to_token(id)

Get the token associated to an ID

Parameters

id (int) – An ID to convert to a token

Returns

The token associated to the ID

Return type

str

save(folder, prefix)

Save the current model

Save the current model in the given folder, using the given prefix for the various files that will get created. Any file with the same name that already exists in this folder will be overwritten.

Parameters
  • folder (str) – The path to the target folder in which to save the various files

  • prefix (str, optional) – An optional prefix, used to prefix each file name

Returns

The list of saved files

Return type

List[str]

token_to_id(token)

Get the ID associated to a token

Parameters

token (str) – A token to convert to an ID

Returns

The ID associated to the token

Return type

int

tokenize(sequence)

Tokenize a sequence

Parameters

sequence (str) – A sequence to tokenize

Returns

The generated tokens

Return type

A List of Token

class tokenizers.models.Unigram(self, vocab)

An implementation of the Unigram algorithm

Parameters

vocab (List[Tuple[str, float]], optional) – A list of vocabulary items and their relative score [("am", -0.2442),…]

class tokenizers.models.WordLevel(self, vocab, unk_token)

An implementation of the WordLevel algorithm

The most simple tokenizer model, based on mapping tokens to their corresponding id.

Parameters
  • vocab (Dict[str, int], optional) – A dictionary of string keys and their ids {"am": 0,...}

  • unk_token (str, optional) – The unknown token to be used by the model.

from_file(vocab, unk_token)

Instantiate a WordLevel model from the given file

This method is roughly equivalent to doing:

vocab = WordLevel.read_file(vocab_filename)
wordlevel = WordLevel(vocab)

If you don’t need to keep the vocab values lying around, this method is more optimized than manually calling read_file() to initialize a WordLevel

Parameters

vocab (str) – The path to a vocab.json file

Returns

An instance of WordLevel loaded from file

Return type

WordLevel

static read_file(vocab)

Read a vocab.json

This method provides a way to read and parse the content of a vocabulary file, returning the relevant data structures. If you want to instantiate some WordLevel models from memory, this method gives you the expected input from the standard files.

Parameters

vocab (str) – The path to a vocab.json file

Returns

The vocabulary as a dict

Return type

Dict[str, int]

class tokenizers.models.WordPiece(self, vocab, unk_token, max_input_chars_per_word)

An implementation of the WordPiece algorithm

Parameters
  • vocab (Dict[str, int], optional) – A dictionary of string keys and their ids {"am": 0,...}

  • unk_token (str, optional) – The unknown token to be used by the model.

  • max_input_chars_per_word (int, optional) – The maximum number of characters to allow in a single word.

from_file(vocab, **kwargs)

Instantiate a WordPiece model from the given file

This method is roughly equivalent to doing:

vocab = WordPiece.read_file(vocab_filename)
wordpiece = WordPiece(vocab)

If you don’t need to keep the vocab values lying around, this method is more optimized than manually calling read_file() to initialize a WordPiece

Parameters

vocab (str) – The path to a vocab.txt file

Returns

An instance of WordPiece loaded from file

Return type

WordPiece

static read_file(vocab)

Read a vocab.txt file

This method provides a way to read and parse the content of a standard vocab.txt file as used by the WordPiece Model, returning the relevant data structures. If you want to instantiate some WordPiece models from memory, this method gives you the expected input from the standard files.

Parameters

vocab (str) – The path to a vocab.txt file

Returns

The vocabulary as a dict

Return type

Dict[str, int]

Normalizers

class tokenizers.normalizers.BertNormalizer(self, clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True)

Takes care of normalizing raw text before giving it to a Bert model. This includes cleaning the text, handling accents and Chinese chars, and lowercasing

Parameters
  • clean_text (bool, optional, defaults to True) – Whether to clean the text, by removing any control characters and replacing all whitespace characters with the classic one.

  • handle_chinese_chars (bool, optional, defaults to True) – Whether to handle Chinese chars by putting spaces around them.

  • strip_accents (bool, optional) – Whether to strip all accents. If this option is not specified (i.e. is None), then it will be determined by the value for lowercase (as in the original Bert).

  • lowercase (bool, optional, defaults to True) – Whether to lowercase.

class tokenizers.normalizers.Lowercase(self)

Lowercase Normalizer

class tokenizers.normalizers.NFC(self)

NFC Unicode Normalizer

class tokenizers.normalizers.NFD(self)

NFD Unicode Normalizer

class tokenizers.normalizers.NFKC(self)

NFKC Unicode Normalizer

class tokenizers.normalizers.NFKD(self)

NFKD Unicode Normalizer

class tokenizers.normalizers.Nmt(self)

Nmt normalizer

class tokenizers.normalizers.Normalizer

Base class for all normalizers

This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.

normalize(normalized)

Normalize a NormalizedString in-place

This method allows you to modify a NormalizedString while keeping track of the alignment information. If you just want to see the result of the normalization on a raw string, you can use normalize_str()

Parameters

normalized (NormalizedString) – The normalized string on which to apply this Normalizer

normalize_str(sequence)

Normalize the given string

This method provides a way to visualize the effect of a Normalizer but it does not keep track of the alignment information. If you need to get/convert offsets, you can use normalize()

Parameters

sequence (str) – A string to normalize

Returns

A string after normalization

Return type

str
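
Example

A small sketch of normalize_str(), composing several of the normalizers documented below:

from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents, Lowercase

normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])
print(normalizer.normalize_str("Héllò hôw are ü?"))  # "hello how are u?"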

class tokenizers.normalizers.Precompiled(self, precompiled_charsmap)

Precompiled normalizer. Don’t use this manually; it is used for compatibility with SentencePiece.

class tokenizers.normalizers.Replace(self, pattern, content)

Replace normalizer

class tokenizers.normalizers.Sequence

Allows concatenating multiple other Normalizer as a Sequence. All the normalizers run in sequence in the given order

Parameters

normalizers (List[Normalizer]) – A list of Normalizer to be run as a sequence

class tokenizers.normalizers.Strip(self, left=True, right=True)

Strip normalizer

class tokenizers.normalizers.StripAccents(self)

StripAccents normalizer

Pre-tokenizers

class tokenizers.pre_tokenizers.BertPreTokenizer(self)

This pre-tokenizer splits tokens on spaces, and also on punctuation. Each occurrence of a punctuation character will be treated separately.

class tokenizers.pre_tokenizers.ByteLevel(self, add_prefix_space=True, use_regex=True)

ByteLevel PreTokenizer

This pre-tokenizer takes care of replacing all bytes of the given string with a corresponding representation, as well as splitting into words.

Parameters

add_prefix_space (bool, optional, defaults to True) – Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello.

static alphabet()

Returns the alphabet used by this PreTokenizer.

Since the ByteLevel works as its name suggests, at the byte level, it encodes each byte value to a unique visible character. This means that there is a total of 256 different characters composing this alphabet.

Returns

A list of characters that compose the alphabet

Return type

List[str]

class tokenizers.pre_tokenizers.CharDelimiterSplit

This pre-tokenizer simply splits on the provided char. Works like .split(delimiter)

Parameters

delimiter (str) – The delimiter char that will be used to split the input

class tokenizers.pre_tokenizers.Digits(self, individual_digits=False)

This pre-tokenizer simply splits on digits, placing them in separate tokens

Parameters

individual_digits (bool, optional, defaults to False) –

If set to True, digits will each be separated as follows:

"Call 123 please" -> "Call ", "1", "2", "3", " please"

If set to False, digits will be grouped as follows:

"Call 123 please" -> "Call ", "123", " please"

class tokenizers.pre_tokenizers.Metaspace(self, replacement='▁', add_prefix_space=True)

Metaspace pre-tokenizer

This pre-tokenizer replaces any whitespace by the provided replacement character. It then tries to split on these spaces.

Parameters
  • replacement (str, optional, defaults to ▁) – The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (Same as in SentencePiece).

  • add_prefix_space (bool, optional, defaults to True) – Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello.

class tokenizers.pre_tokenizers.PreTokenizer

Base class for all pre-tokenizers

This class is not supposed to be instantiated directly. Instead, any implementation of a PreTokenizer will return an instance of this class when instantiated.

pre_tokenize(pretok)

Pre-tokenize a PyPreTokenizedString in-place

This method allows you to modify a PreTokenizedString to keep track of the pre-tokenization, and leverage the capabilities of the PreTokenizedString. If you just want to see the result of the pre-tokenization of a raw string, you can use pre_tokenize_str()

Parameters

pretok (PreTokenizedString) – The pre-tokenized string on which to apply this PreTokenizer

pre_tokenize_str(sequence)

Pre tokenize the given string

This method provides a way to visualize the effect of a PreTokenizer but it does not keep track of the alignment, nor does it provide all the capabilities of the PreTokenizedString. If you need some of these, you can use pre_tokenize()

Parameters

sequence (str) – A string to pre-tokenize

Returns

A list of tuple with the pre-tokenized parts and their offsets

Return type

List[Tuple[str, Offsets]]
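
Example

A small sketch of pre_tokenize_str(), using the Whitespace pre-tokenizer documented below:

from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Hello! How are you?"))
# [('Hello', (0, 5)), ('!', (5, 6)), ('How', (7, 10)), ('are', (11, 14)), ('you', (15, 18)), ('?', (18, 19))]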

class tokenizers.pre_tokenizers.Punctuation(self, behavior='isolated')

This pre-tokenizer simply splits on punctuation as individual characters.

Parameters

behavior (SplitDelimiterBehavior) – The behavior to use when splitting. Choices: "removed", "isolated" (default), "merged_with_previous", "merged_with_next", "contiguous"

class tokenizers.pre_tokenizers.Sequence(self, pretokenizers)

This pre-tokenizer composes other pre_tokenizers and applies them in sequence

class tokenizers.pre_tokenizers.Split(self, pattern, behavior, invert=False)

Split PreTokenizer

This versatile pre-tokenizer splits using the provided pattern and according to the provided behavior. The pattern can be inverted by making use of the invert flag.

Parameters
  • pattern (str or Regex) – A pattern used to split the string. Usually a string or a Regex

  • behavior (SplitDelimiterBehavior) – The behavior to use when splitting. Choices: "removed", "isolated", "merged_with_previous", "merged_with_next", "contiguous"

  • invert (bool, optional, defaults to False) – Whether to invert the pattern.

class tokenizers.pre_tokenizers.UnicodeScripts(self)

This pre-tokenizer splits on characters that belong to different language families. It roughly follows https://github.com/google/sentencepiece/blob/master/data/Scripts.txt. Actually, Hiragana and Katakana are fused with Han, and 0x30FC is Han too. This mimics the SentencePiece Unigram implementation.

class tokenizers.pre_tokenizers.Whitespace(self)

This pre-tokenizer simply splits using the following regex: \w+|[^\w\s]+

class tokenizers.pre_tokenizers.WhitespaceSplit(self)

This pre-tokenizer simply splits on the whitespace. Works like .split()

Post-processor

class tokenizers.processors.BertProcessing(self, sep, cls)

This post-processor takes care of adding the special tokens needed by a Bert model:

  • a SEP token

  • a CLS token

Parameters
  • sep (Tuple[str, int]) – A tuple with the string representation of the SEP token, and its id

  • cls (Tuple[str, int]) – A tuple with the string representation of the CLS token, and its id

class tokenizers.processors.ByteLevel(self, trim_offsets=True)

This post-processor takes care of trimming the offsets.

By default, the ByteLevel BPE might include whitespaces in the produced tokens. If you don’t want the offsets to include these whitespaces, then this PostProcessor must be used.

Parameters

trim_offsets (bool) – Whether to trim the whitespaces from the produced offsets.

class tokenizers.processors.PostProcessor

Base class for all post-processors

This class is not supposed to be instantiated directly. Instead, any implementation of a PostProcessor will return an instance of this class when instantiated.

num_special_tokens_to_add(is_pair)

Return the number of special tokens that would be added for single/pair sentences.

Parameters

is_pair (bool) – Whether the input would be a pair of sequences

Returns

The number of tokens to add

Return type

int

process(encoding, pair=None, add_special_tokens=True)

Post-process the given encodings, generating the final one

Parameters
  • encoding (Encoding) – The encoding for the first sequence

  • pair (Encoding, optional) – The encoding for the pair sequence

  • add_special_tokens (bool) – Whether to add the special tokens

Returns

The final encoding

Return type

Encoding

class tokenizers.processors.RobertaProcessing(self, sep, cls, trim_offsets=True, add_prefix_space=True)

This post-processor takes care of adding the special tokens needed by a Roberta model:

  • a SEP token

  • a CLS token

It also takes care of trimming the offsets. By default, the ByteLevel BPE might include whitespaces in the produced tokens. If you don’t want the offsets to include these whitespaces, then this PostProcessor should be initialized with trim_offsets=True

Parameters
  • sep (Tuple[str, int]) – A tuple with the string representation of the SEP token, and its id

  • cls (Tuple[str, int]) – A tuple with the string representation of the CLS token, and its id

  • trim_offsets (bool, optional, defaults to True) – Whether to trim the whitespaces from the produced offsets.

  • add_prefix_space (bool, optional, defaults to True) – Whether the add_prefix_space option was enabled during pre-tokenization. This is relevant because it defines the way the offsets are trimmed out.

class tokenizers.processors.TemplateProcessing(self, single, pair, special_tokens)

Provides a way to specify templates in order to add the special tokens to each input sequence as relevant.

Let’s take the BERT tokenizer as an example. It uses two special tokens, used to delimit each sequence. [CLS] is always used at the beginning of the first sequence, and [SEP] is added at the end of both the first, and the pair sequences. The final result looks like this:

  • Single sequence: [CLS] Hello there [SEP]

  • Pair sequences: [CLS] My name is Anthony [SEP] What is my name? [SEP]

With the type ids as follows:

[CLS]   ...   [SEP]   ...   [SEP]
  0      0      0      1      1

You can achieve such behavior using a TemplateProcessing:

TemplateProcessing(
    single="[CLS] $0 [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 0)],
)

In this example, each input sequence is identified using a $ construct. This identifier lets us specify each input sequence, and the type_id to use. When nothing is specified, it uses the default values. Here are the different ways to specify it:

  • Specifying the sequence, with default type_id == 0: $A or $B

  • Specifying the type_id with default sequence == A: $0, $1, $2, …

  • Specifying both: $A:0, $B:1, …

The same construct is used for special tokens: <identifier>(:<type_id>)?.

Warning: You must ensure that you are giving the correct tokens/ids as these will be added to the Encoding without any further check. If the given ids correspond to something totally different in a Tokenizer using this PostProcessor, it might lead to unexpected results.

Parameters
  • single (Template) – The template used for single sequences

  • pair (Template) – The template used when both sequences are specified

  • special_tokens (Tokens) – The list of special tokens used in each sequence

Types:

Template (str or List):
  • If a str is provided, whitespace is used as the delimiter between tokens

  • If a List[str] is provided, a list of tokens

Tokens (List[Union[Tuple[int, str], Tuple[str, int], dict]]):
  • A Tuple with both a token and its associated ID, in any order

  • A dict with the following keys:
    • β€œid”: str => The special token id, as specified in the Template

    • β€œids”: List[int] => The associated IDs

    • β€œtokens”: List[str] => The associated tokens

The given dict expects the provided ids and tokens lists to have the same length.
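
Example

A sketch of attaching a TemplateProcessing to a tokenizer, assuming the bert-base-uncased tokenizer from the Hub (its real special token ids are looked up rather than hard-coded):

from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
cls_id = tokenizer.token_to_id("[CLS]")
sep_id = tokenizer.token_to_id("[SEP]")

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_id), ("[SEP]", sep_id)],
)

encoding = tokenizer.encode("Hello", "there")
print(encoding.tokens)    # ['[CLS]', 'hello', '[SEP]', 'there', '[SEP]']
print(encoding.type_ids)  # [0, 0, 0, 1, 1]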

Trainers

class tokenizers.trainers.BpeTrainer

Trainer capable of training a BPE model

Parameters
  • vocab_size (int, optional) – The size of the final vocabulary, including all tokens and alphabet.

  • min_frequency (int, optional) – The minimum frequency a pair should have in order to be merged.

  • show_progress (bool, optional) – Whether to show progress bars while training.

  • special_tokens (List[Union[str, AddedToken]], optional) – A list of special tokens the model should know of.

  • limit_alphabet (int, optional) – The maximum different characters to keep in the alphabet.

  • initial_alphabet (List[str], optional) – A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.

  • continuing_subword_prefix (str, optional) – A prefix to be used for every subword that is not a beginning-of-word.

  • end_of_word_suffix (str, optional) – A suffix to be used for every subword that is a end-of-word.

class tokenizers.trainers.Trainer

Base class for all trainers

This class is not supposed to be instantiated directly. Instead, any implementation of a Trainer will return an instance of this class when instantiated.

class tokenizers.trainers.UnigramTrainer(self, vocab_size=8000, show_progress=True, special_tokens=[], shrinking_factor=0.75, unk_token=None, max_piece_length=16, n_sub_iterations=2)

Trainer capable of training a Unigram model

Parameters
  • vocab_size (int) – The size of the final vocabulary, including all tokens and alphabet.

  • show_progress (bool) – Whether to show progress bars while training.

  • special_tokens (List[Union[str, AddedToken]]) – A list of special tokens the model should know of.

  • initial_alphabet (List[str]) – A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.

  • shrinking_factor (float) – The shrinking factor used at each step of the training to prune the vocabulary.

  • unk_token (str) – The token used for out-of-vocabulary tokens.

  • max_piece_length (int) – The maximum length of a given token.

  • n_sub_iterations (int) – The number of iterations of the EM algorithm to perform before pruning the vocabulary.

class tokenizers.trainers.WordLevelTrainer

Trainer capable of training a WordLevel model

Parameters
  • vocab_size (int, optional) – The size of the final vocabulary, including all tokens and alphabet.

  • min_frequency (int, optional) – The minimum frequency a pair should have in order to be merged.

  • show_progress (bool, optional) – Whether to show progress bars while training.

  • special_tokens (List[Union[str, AddedToken]]) – A list of special tokens the model should know of.

class tokenizers.trainers.WordPieceTrainer(self, vocab_size=30000, min_frequency=0, show_progress=True, special_tokens=[], limit_alphabet=None, initial_alphabet=[], continuing_subword_prefix='##', end_of_word_suffix=None)

Trainer capable of training a WordPiece model

Parameters
  • vocab_size (int, optional) – The size of the final vocabulary, including all tokens and alphabet.

  • min_frequency (int, optional) – The minimum frequency a pair should have in order to be merged.

  • show_progress (bool, optional) – Whether to show progress bars while training.

  • special_tokens (List[Union[str, AddedToken]], optional) – A list of special tokens the model should know of.

  • limit_alphabet (int, optional) – The maximum different characters to keep in the alphabet.

  • initial_alphabet (List[str], optional) – A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.

  • continuing_subword_prefix (str, optional) – A prefix to be used for every subword that is not a beginning-of-word.

  • end_of_word_suffix (str, optional) – A suffix to be used for every subword that is a end-of-word.
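
Example

A sketch of using a trainer, here a WordPieceTrainer with a WordPiece model; the training file path is a placeholder to replace with your own data:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(["data/corpus.txt"], trainer=trainer)  # placeholder path
print(tokenizer.get_vocab_size())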

Decoders

class tokenizers.decoders.BPEDecoder(self, suffix='</w>')

BPEDecoder Decoder

Parameters

suffix (str, optional, defaults to </w>) – The suffix that was used to characterize an end-of-word. This suffix will be replaced by whitespaces during the decoding

class tokenizers.decoders.ByteLevel(self)

ByteLevel Decoder

This decoder is to be used in tandem with the ByteLevel PreTokenizer.

class tokenizers.decoders.CTC(self, pad_token='<pad>', word_delimiter_token='|', cleanup=True)

CTC Decoder

Parameters
  • pad_token (str, optional, defaults to <pad>) – The pad token used by CTC to delimit a new token.

  • word_delimiter_token (str, optional, defaults to |) – The word delimiter token. It will be replaced by a <space>

  • cleanup (bool, optional, defaults to True) – Whether to cleanup some tokenization artifacts. Mainly spaces before punctuation, and some abbreviated English forms.

class tokenizers.decoders.Decoder

Base class for all decoders

This class is not supposed to be instantiated directly. Instead, any implementation of a Decoder will return an instance of this class when instantiated.

decode(tokens)

Decode the given list of tokens to a final string

Parameters

tokens (List[str]) – The list of tokens to decode

Returns

The decoded string

Return type

str

class tokenizers.decoders.Metaspace

Metaspace Decoder

Parameters
  • replacement (str, optional, defaults to ▁) – The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (Same as in SentencePiece).

  • add_prefix_space (bool, optional, defaults to True) – Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello.

class tokenizers.decoders.WordPiece(self, prefix='##', cleanup=True)

WordPiece Decoder

Parameters
  • prefix (str, optional, defaults to ##) – The prefix to use for subwords that are not a beginning-of-word

  • cleanup (bool, optional, defaults to True) – Whether to cleanup some tokenization artifacts. Mainly spaces before punctuation, and some abbreviated English forms.
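
Example

A small sketch of plugging a Decoder into a tokenizer, assuming bert-base-uncased from the Hub (which uses WordPiece subwords):

from tokenizers import Tokenizer
from tokenizers.decoders import WordPiece

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
tokenizer.decoder = WordPiece(prefix="##")

ids = tokenizer.encode("tokenization").ids
print(tokenizer.decode(ids))  # subwords are glued back together: "tokenization"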

Visualizer

class tokenizers.tools.Annotation(start: int, end: int, label: str)
class tokenizers.tools.EncodingVisualizer(tokenizer: tokenizers.Tokenizer, default_to_notebook: bool = True, annotation_converter: Optional[Callable[[Any], tokenizers.tools.visualizer.Annotation]] = None)

Build an EncodingVisualizer

Parameters
  • tokenizer (Tokenizer) – A tokenizer instance

  • default_to_notebook (bool) – Whether to render html output in a notebook by default

  • annotation_converter (Callable, optional) – An optional (lambda) function that takes an annotation in any format and returns an Annotation object

__call__(text: str, annotations: List[tokenizers.tools.visualizer.Annotation] = [], default_to_notebook: Optional[bool] = None) → Optional[str]

Build a visualization of the given text

Parameters
  • text (str) – The text to tokenize

  • annotations (List[Annotation], optional) – An optional list of annotations of the text. They can either be an Annotation class or anything else if you instantiated the visualizer with a converter function

  • default_to_notebook (bool, optional, defaults to False) – If True, will render the html in a notebook. Otherwise returns an html string.

Returns

The HTML string if default_to_notebook is False, otherwise (default) returns None and renders the HTML in the notebook
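
Example

A sketch of building a visualization, assuming the bert-base-uncased tokenizer from the Hub and an arbitrary annotation:

from tokenizers import Tokenizer
from tokenizers.tools import EncodingVisualizer, Annotation

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
visualizer = EncodingVisualizer(tokenizer, default_to_notebook=False)

text = "Tokenizers are great"
annotations = [Annotation(start=0, end=10, label="subject")]
html = visualizer(text, annotations=annotations)  # returns an HTML string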