Normalizers

class tokenizers.normalizers.BertNormalizer

( clean_text = True, handle_chinese_chars = True, strip_accents = None, lowercase = True )

Parameters

clean_text (bool, optional, defaults to True) — Whether to clean the text by removing control characters and replacing every whitespace character with a plain space.
handle_chinese_chars (bool, optional, defaults to True) — Whether to handle Chinese characters by putting spaces around them.
strip_accents (bool, optional) — Whether to strip all accents. If this option is not specified (i.e. left as None), it takes the value of lowercase (as in the original BERT).
lowercase (bool, optional, defaults to True) — Whether to lowercase the text.

BertNormalizer

Takes care of normalizing raw text before giving it to a BERT model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
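As a minimal sketch of how the strip_accents default follows lowercase, the example below runs two BertNormalizer configurations on the same string. The sample sentence and variable names are illustrative assumptions, not part of the reference.

```python
from tokenizers.normalizers import BertNormalizer

text = "Héllò hôw are ü?"

# Defaults: lowercase=True, and strip_accents=None follows lowercase,
# so accents are stripped as well.
default_result = BertNormalizer().normalize_str(text)

# With lowercase=False and strip_accents left unset, accents are kept.
cased_result = BertNormalizer(lowercase=False).normalize_str(text)

print(default_result)  # hello how are u?
print(cased_result)    # Héllò hôw are ü?
```

Passing strip_accents explicitly decouples the two options, for example to lowercase while keeping accents.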

class tokenizers.normalizers.Lowercase

( )

Lowercase Normalizer

class tokenizers.normalizers.NFC

( )

NFC Unicode Normalizer

class tokenizers.normalizers.NFD

( )

NFD Unicode Normalizer

class tokenizers.normalizers.NFKC

( )

NFKC Unicode Normalizer

class tokenizers.normalizers.NFKD

( )

NFKD Unicode Normalizer
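The four Unicode normalizers differ in whether they compose or decompose characters (C vs. D) and whether they also apply compatibility mappings (the K forms). The sketch below illustrates this on a few hand-picked code points; the sample inputs are assumptions chosen for illustration.

```python
from tokenizers.normalizers import NFC, NFD, NFKC, NFKD

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent
ligature = "\ufb01"     # the "ﬁ" ligature, a compatibility character

# Canonical forms: NFD decomposes, NFC composes.
nfd_result = NFD().normalize_str(composed)
nfc_result = NFC().normalize_str(decomposed)

# Compatibility forms additionally replace the ligature with "f" + "i".
nfkc_result = NFKC().normalize_str(ligature)
nfkd_result = NFKD().normalize_str(ligature)

# Canonical forms leave compatibility characters untouched.
nfc_ligature = NFC().normalize_str(ligature)

print(len(nfd_result), len(nfc_result), nfkc_result)  # 2 1 fi
```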

class tokenizers.normalizers.Nmt

( )

Nmt normalizer

class tokenizers.normalizers.Normalizer

( )

Base class for all normalizers

This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.

normalize

( normalized )

Parameters

normalized (NormalizedString) — The normalized string on which to apply this Normalizer

Normalize a NormalizedString in-place

This method modifies a NormalizedString in place while keeping track of the alignment information. If you just want to see the result of the normalization on a raw string, you can use normalize_str().
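A minimal sketch of in-place normalization, assuming NormalizedString can be built directly from a str and exposes the original and normalized properties of the Python bindings. The Lowercase normalizer and the sample text are illustrative choices.

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

normalized = NormalizedString("Hello WORLD")

# normalize() returns nothing; it mutates the NormalizedString it is given.
Lowercase().normalize(normalized)

original_text = normalized.original      # the untouched input
normalized_text = normalized.normalized  # the text after normalization

print(original_text)    # Hello WORLD
print(normalized_text)  # hello world
```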

normalize_str

( sequence ) → str

Parameters

sequence (str) — A string to normalize

Returns

str

A string after normalization

Normalize the given string

This method provides a way to visualize the effect of a Normalizer, but it does not keep track of the alignment information. If you need to get or convert offsets, use normalize() instead.
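For quick inspection, normalize_str() can be called on any normalizer instance. The sketch below uses Lowercase on an arbitrary sample string.

```python
from tokenizers.normalizers import Lowercase

normalizer = Lowercase()

# Returns a plain str; no offsets or alignment information are kept.
result = normalizer.normalize_str("Hello WORLD")

print(result)  # hello world
```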

class tokenizers.normalizers.Precompiled

( precompiled_charsmap )

Precompiled normalizer. Do not use it manually; it exists for compatibility with SentencePiece.

class tokenizers.normalizers.Replace

( pattern, content )

Replace normalizer
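A short sketch of Replace with both a literal string pattern and a tokenizers.Regex pattern. The patterns and inputs are illustrative assumptions.

```python
from tokenizers import Regex
from tokenizers.normalizers import Replace

# A plain string pattern is matched literally.
literal = Replace("a", "b")
literal_result = literal.normalize_str("banana")

# Wrapping the pattern in Regex enables regular-expression matching;
# here every run of whitespace is collapsed into a single space.
collapse_spaces = Replace(Regex(r"\s+"), " ")
regex_result = collapse_spaces.normalize_str("a  \t b")

print(literal_result)  # bbnbnb
print(regex_result)    # a b
```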

class tokenizers.normalizers.Sequence

( normalizers )

Parameters

normalizers (List[Normalizer]) — A list of Normalizer to be run as a sequence

Allows chaining multiple Normalizer instances as a Sequence. The normalizers run one after another, in the given order.
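As a sketch of how a Sequence applies its members in order, the example below chains three normalizers documented on this page. The particular combination and the input are illustrative.

```python
from tokenizers.normalizers import Lowercase, Replace, Sequence, Strip

pipeline = Sequence([
    Replace("-", " "),  # 1. turn hyphens into spaces
    Lowercase(),        # 2. lowercase everything
    Strip(),            # 3. trim leading and trailing whitespace
])

result = pipeline.normalize_str("  Hello-World  ")

print(result)  # hello world
```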

class tokenizers.normalizers.Strip

( left = True, right = True )

Strip normalizer
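The left and right flags control which side of the string is trimmed. A minimal sketch with an arbitrary padded input:

```python
from tokenizers.normalizers import Strip

text = "  hello  "

# Default: strip whitespace on both sides.
both = Strip().normalize_str(text)

# Only strip leading whitespace; trailing whitespace is kept.
left_only = Strip(left=True, right=False).normalize_str(text)

print(repr(both))       # 'hello'
print(repr(left_only))  # 'hello  '
```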

class tokenizers.normalizers.StripAccents

( )

StripAccents normalizer
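StripAccents removes combining marks, so a precomposed accented character has to be decomposed first for the accent to be removed. The sketch below shows this by running StripAccents alone and after NFD; the sample word is an illustrative assumption.

```python
from tokenizers.normalizers import NFD, Sequence, StripAccents

text = "caf\u00e9"  # "café" with a precomposed "é"

# On its own, StripAccents finds no combining mark to remove.
alone = StripAccents().normalize_str(text)

# After NFD, "é" becomes "e" + combining accent, which is then stripped.
combined = Sequence([NFD(), StripAccents()]).normalize_str(text)

print(alone)     # café
print(combined)  # cafe
```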
