Tokenizers documentation

Normalizers

This page documents version v0.13.4.rc2. A newer version, v0.20.3, is available.

BertNormalizer

class tokenizers.normalizers.BertNormalizer

( clean_text = True, handle_chinese_chars = True, strip_accents = None, lowercase = True )

Parameters

  • clean_text (bool, optional, defaults to True) — Whether to clean the text by removing control characters and replacing every whitespace character with a plain space.
  • handle_chinese_chars (bool, optional, defaults to True) — Whether to handle Chinese characters by putting spaces around them.
  • strip_accents (bool, optional) — Whether to strip all accents. If this option is not specified (i.e. it is None), its value is taken from lowercase (as in the original BERT).
  • lowercase (bool, optional, defaults to True) — Whether to lowercase the text.

BertNormalizer

Takes care of normalizing raw text before giving it to a BERT model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
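As a minimal sketch of how the parameters interact, the snippet below runs the same input through a default BertNormalizer and through one built with lowercase = False. The sample sentence and the variable names are illustrative assumptions, not part of the reference.

```python
from tokenizers.normalizers import BertNormalizer

text = "Héllò hôw are ü?"  # illustrative input

# Defaults: lowercase=True, and the unspecified strip_accents follows lowercase,
# so the text is lowercased and accents are stripped.
uncased_out = BertNormalizer().normalize_str(text)
print(uncased_out)  # hello how are u?

# With lowercase=False, the unspecified strip_accents also resolves to False,
# so both case and accents are kept.
cased_out = BertNormalizer(lowercase=False).normalize_str(text)
print(cased_out)
```

Passing strip_accents explicitly (True or False) decouples accent stripping from lowercasing.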

Lowercase

class tokenizers.normalizers.Lowercase

( )

Lowercase normalizer: converts the input text to lowercase.

NFC

class tokenizers.normalizers.NFC

( )

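NFC Unicode normalizer: applies Unicode Normalization Form C (canonical decomposition, followed by canonical composition).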

NFD

class tokenizers.normalizers.NFD

( )

NFD Unicode normalizer: applies Unicode Normalization Form D (canonical decomposition).

NFKC

class tokenizers.normalizers.NFKC

( )

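NFKC Unicode normalizer: applies Unicode Normalization Form KC (compatibility decomposition, followed by canonical composition).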

NFKD

class tokenizers.normalizers.NFKD

( )

NFKD Unicode normalizer: applies Unicode Normalization Form KD (compatibility decomposition).
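The four Unicode normalizers differ only in which normalization form they apply. The sketch below, using hand-picked sample characters (an assumption for illustration), shows the difference between composing and decomposing forms, and between canonical and compatibility forms.

```python
from tokenizers.normalizers import NFC, NFD, NFKC, NFKD

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent
ligature = "\ufb01"     # the "fi" ligature, a compatibility character

# Canonical forms convert between composed and decomposed spellings.
nfd_out = NFD().normalize_str(composed)    # splits "é" into "e" + accent
nfc_out = NFC().normalize_str(decomposed)  # merges "e" + accent into "é"

# Compatibility forms additionally replace compatibility characters.
nfkc_out = NFKC().normalize_str(ligature)  # becomes the two letters "fi"
nfkd_out = NFKD().normalize_str(ligature)  # becomes the two letters "fi"

# A canonical form leaves the ligature untouched.
nfc_ligature = NFC().normalize_str(ligature)

print([len(s) for s in (nfd_out, nfc_out, nfkc_out, nfkd_out, nfc_ligature)])
```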

Nmt

class tokenizers.normalizers.Nmt

( )

Nmt normalizer

Normalizer

class tokenizers.normalizers.Normalizer

( )

Base class for all normalizers

This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.

normalize

( normalized )

Parameters

  • normalized (NormalizedString) — The normalized string on which to apply this Normalizer

Normalize a NormalizedString in-place

This method modifies a NormalizedString in place, so that the alignment information is kept. If you only want to see the result of normalizing a raw string, use normalize_str() instead.
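A rough sketch of in-place normalization follows. It assumes that the Python bindings expose NormalizedString at the top level of the tokenizers package, that it can be constructed from a plain str, and that it offers original and normalized attributes; check these against your installed version.

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import NFD, Sequence, StripAccents

normalizer = Sequence([NFD(), StripAccents()])

# "Héllo" written with a precomposed "é" (assumed sample input).
wrapped = NormalizedString("H\u00e9llo")

# normalize() changes `wrapped` itself rather than returning a new string.
normalizer.normalize(wrapped)

print(wrapped.original)    # the input text, unchanged
print(wrapped.normalized)  # the text after normalization
```

Because the object keeps both versions of the text, it can map positions in the normalized text back to the original, which a plain str result cannot do.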

normalize_str

( sequence ) → str

Parameters

  • sequence (str) — A string to normalize

Returns

str

A string after normalization

Normalize the given string

This method provides a way to visualize the effect of a Normalizer, but it does not keep track of the alignment information. If you need to get or convert offsets, use normalize() instead.
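For a quick check of what a normalizer does, normalize_str() can be called on any normalizer instance. A minimal sketch, using Lowercase and an arbitrary sample string:

```python
from tokenizers.normalizers import Lowercase

# normalize_str() takes a plain str and returns a plain str.
lowered = Lowercase().normalize_str("Hello WORLD")
print(lowered)  # hello world
```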

Precompiled

class tokenizers.normalizers.Precompiled

( precompiled_charsmap )

Precompiled normalizer. Do not use it manually; it exists for compatibility with SentencePiece.

Replace

class tokenizers.normalizers.Replace

( pattern, content )

Replace normalizer: replaces every occurrence of pattern with content.
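The sketch below shows Replace with a literal string pattern and with a regular expression. It assumes the Regex helper exported by the tokenizers package is used to mark a pattern as a regular expression; the sample strings are illustrative.

```python
from tokenizers import Regex
from tokenizers.normalizers import Replace

# Literal pattern: every pair of backticks becomes a double quote.
quotes = Replace("``", '"')
quoted = quotes.normalize_str("``hello``")
print(quoted)  # "hello"

# Regex pattern: runs of whitespace collapse to a single space.
spaces = Replace(Regex(r"\s+"), " ")
collapsed = spaces.normalize_str("a \t  b")
print(collapsed)  # a b
```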

Sequence

class tokenizers.normalizers.Sequence

( normalizers )

Parameters

  • normalizers (List[Normalizer]) — A list of Normalizer instances to run in sequence

Allows chaining multiple Normalizer instances into a single one. The normalizers run one after another, in the given order.
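As an illustration, the following sketch chains three normalizers from this page. The particular combination and the sample sentence are assumptions chosen for the example.

```python
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents

# Order matters: NFD first splits accented letters into base letter + mark,
# which lets StripAccents remove the marks afterwards.
normalizer = Sequence([NFD(), Lowercase(), StripAccents()])

chained_out = normalizer.normalize_str("Héllò hôw are ü?")
print(chained_out)  # hello how are u?
```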

Strip

class tokenizers.normalizers.Strip

( left = True, right = True )

Strip normalizer: removes whitespace from the left and/or right end of the input, as selected by left and right.
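A small sketch of how the two flags select which side is stripped (the padded sample string is an assumption):

```python
from tokenizers.normalizers import Strip

text = "  hello  "

both = Strip().normalize_str(text)                  # strips both ends
right_only = Strip(left=False).normalize_str(text)  # keeps leading spaces
left_only = Strip(right=False).normalize_str(text)  # keeps trailing spaces

print(repr(both), repr(right_only), repr(left_only))
```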

StripAccents

class tokenizers.normalizers.StripAccents

( )

StripAccents normalizer: removes accents (combining marks) from the input.
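The sketch below applies StripAccents to text whose accent is already a separate combining character. The input is an assumption chosen for illustration; for text with precomposed accented letters, a common pattern is to run NFD before StripAccents inside a Sequence.

```python
from tokenizers.normalizers import StripAccents

# "café" spelled as "cafe" followed by a combining acute accent (U+0301).
decomposed = "cafe\u0301"

stripped = StripAccents().normalize_str(decomposed)
print(stripped)  # cafe
```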