Input Sequences

These types represent all the different kinds of sequences that can be used as input to a Tokenizer. In general, any sequence can be either a string or a list of strings, depending on the operating mode of the tokenizer: raw text vs pre-tokenized.

TextInputSequence

tokenizers.TextInputSequence

A str that represents an input sequence
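
For illustration, a minimal sketch of passing a TextInputSequence to Tokenizer.encode. The "bert-base-uncased" checkpoint is only an example (any trained tokenizer would do) and downloading it requires network access.

```python
from tokenizers import Tokenizer

# Any trained tokenizer works here; "bert-base-uncased" is just an example
# checkpoint, downloaded from the Hub on first use.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# A TextInputSequence is simply a raw `str` (the default operating mode).
text = "Hello, how are you?"
encoding = tokenizer.encode(text)
print(encoding.tokens)
```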

PreTokenizedInputSequence

tokenizers.PreTokenizedInputSequence

A pre-tokenized input sequence. Can be one of:

  • A List of str
  • A Tuple of str

alias of Union[List[str], Tuple[str]].
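
A minimal sketch of encoding a PreTokenizedInputSequence, assuming the same example checkpoint as above; the pre-tokenized mode is selected with the is_pretokenized flag of Tokenizer.encode.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

# A PreTokenizedInputSequence is text already split into words,
# given either as a list or as a tuple of `str`.
words_list = ["Hello", ",", "how", "are", "you", "?"]
words_tuple = ("Hello", ",", "how", "are", "you", "?")

# `is_pretokenized=True` skips the tokenizer's own pre-tokenization step
# and tokenizes each provided word directly.
encoding = tokenizer.encode(words_list, is_pretokenized=True)
encoding_from_tuple = tokenizer.encode(words_tuple, is_pretokenized=True)
print(encoding.tokens)
```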

InputSequence

tokenizers.InputSequence

Represents all the possible types of input sequences for encoding. Can be one of:

  • A TextInputSequence, when the tokenizer operates on raw text
  • A PreTokenizedInputSequence, when the input is already split into words

alias of Union[str, List[str], Tuple[str]].
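
As a hypothetical illustration, the alias can be used as a type hint for a helper that accepts either kind of sequence; the encode_any function below is not part of the library, and the checkpoint name is again only an example.

```python
from tokenizers import Encoding, InputSequence, Tokenizer

def encode_any(tokenizer: Tokenizer, sequence: InputSequence) -> Encoding:
    """Hypothetical helper: encode raw text or an already pre-tokenized sequence."""
    # A plain `str` is a TextInputSequence; a list or tuple of `str`
    # is a PreTokenizedInputSequence.
    is_pretokenized = not isinstance(sequence, str)
    return tokenizer.encode(sequence, is_pretokenized=is_pretokenized)

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
print(encode_any(tokenizer, "Hello world").tokens)
print(encode_any(tokenizer, ["Hello", "world"]).tokens)
```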
