Normalizers
BertNormalizer
class tokenizers.normalizers.BertNormalizer
( clean_text = True, handle_chinese_chars = True, strip_accents = None, lowercase = True )
Parameters
- clean_text (bool, optional, defaults to True): Whether to clean the text by removing any control characters and replacing all whitespace characters with a standard space.
- handle_chinese_chars (bool, optional, defaults to True): Whether to handle Chinese characters by putting spaces around them.
- strip_accents (bool, optional): Whether to strip all accents. If this option is not specified (i.e. left as None), it is determined by the value of lowercase (as in the original BERT).
- lowercase (bool, optional, defaults to True): Whether to lowercase.
Takes care of normalizing raw text before giving it to a BERT model. This includes cleaning the text, handling accents and Chinese characters, and lowercasing.
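The interaction between lowercase and strip_accents described above can be checked with a short sketch. It assumes the tokenizers package is installed; the sample string and variable names are illustrative only.

```python
from tokenizers.normalizers import BertNormalizer

text = "Héllo WORLD"

# Defaults: lowercase=True and strip_accents=None, so accents are stripped too.
default_out = BertNormalizer().normalize_str(text)

# Explicitly keep accents while still lowercasing.
accented_out = BertNormalizer(strip_accents=False).normalize_str(text)

# lowercase=False with strip_accents left unset: accent stripping follows
# lowercase, so this plain ASCII/Latin input comes back unchanged.
cased_out = BertNormalizer(lowercase=False).normalize_str(text)

print(default_out)   # hello world
print(accented_out)  # héllo world
print(cased_out)     # Héllo WORLD
```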
Lowercase
NFC
NFD
NFKC
NFKD
Nmt
Normalizer
Base class for all normalizers
This class is not supposed to be instantiated directly. Instead, any implementation of a Normalizer will return an instance of this class when instantiated.
normalize
( normalized )
Parameters
- normalized (NormalizedString): The normalized string on which to apply this Normalizer
Normalize a NormalizedString in-place.
This method allows you to modify a NormalizedString while keeping track of the alignment information. If you just want to see the result of the normalization on a raw string, you can use normalize_str().
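As a rough sketch of in-place normalization, the snippet below applies a Lowercase normalizer to a NormalizedString. It assumes that the Python bindings let you build a NormalizedString directly from a str and expose its normalized and original attributes; in practice this object is usually handed to you by the library (for example inside a custom normalizer) rather than built by hand.

```python
from tokenizers import NormalizedString
from tokenizers.normalizers import Lowercase

# Assumption: NormalizedString can be constructed from a plain str.
ns = NormalizedString("HELLO")

# normalize() mutates ns in place and returns nothing.
Lowercase().normalize(ns)

print(ns.normalized)  # the text after normalization
print(ns.original)    # the untouched input, kept for alignment
```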
normalize_str
( sequence )
Normalize the given string.
This method provides a way to visualize the effect of a Normalizer, but it does not keep track of the alignment information. If you need to get or convert offsets, you can use normalize().
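A minimal sketch of inspecting a normalizer's effect on a raw string, assuming the tokenizers package is installed. The inputs are illustrative; the NFD case uses the precomposed character U+00E9, which NFD decomposes into a base letter plus a combining accent.

```python
from tokenizers.normalizers import NFD, Lowercase

# normalize_str takes a plain str and returns a new str.
lowered = Lowercase().normalize_str("Hello World")

# "\u00e9" is a single precomposed code point; NFD splits it in two.
decomposed = NFD().normalize_str("\u00e9")

print(lowered)          # hello world
print(len(decomposed))  # 2
```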
Precompiled
Precompiled normalizer. Don't use it manually; it exists for compatibility with SentencePiece.
Replace
Sequence
Allows concatenating multiple other Normalizers as a Sequence. The normalizers run one after another, in the given order.
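The chaining behaviour can be sketched as follows, assuming the tokenizers package is installed. StripAccents is another normalizer from the same module that is not described in this section; it is used here only to make the effect of ordering visible, since it relies on NFD having decomposed the accented characters first.

```python
from tokenizers.normalizers import NFD, Lowercase, Sequence, StripAccents

# Order matters: decompose first, then drop the combining accents, then lowercase.
pipeline = Sequence([NFD(), StripAccents(), Lowercase()])

result = pipeline.normalize_str("Héllò hôw are ü?")
print(result)  # hello how are u?
```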