# WikiText-WordLevel
This is a simple word-level tokenizer created using the Tokenizers library. It was trained for educational purposes on the combined train, validation, and test splits of the WikiText-103 corpus.
- Tokenizer Type: Word-Level
- Vocabulary Size: 75K
- Special Tokens: `<s>` (start of sequence), `</s>` (end of sequence), `<unk>` (unknown token)
- Normalization: NFC (Normalization Form Canonical Composition), Strip, Lowercase
- Pre-tokenization: Whitespace
- Code: wikitext-wordlevel.py (a simplified sketch is shown below)
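The exact training code is in wikitext-wordlevel.py; what follows is only a minimal sketch of how a tokenizer with the configuration above could be built with the Tokenizers library. The input file name is a placeholder, assuming the three WikiText-103 splits have been concatenated into one plain-text file.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Word-level model; out-of-vocabulary words map to <unk>
tokenizer = Tokenizer(models.WordLevel(unk_token='<unk>'))

# Normalization: NFC, then strip surrounding whitespace, then lowercase
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),
    normalizers.Strip(),
    normalizers.Lowercase(),
])

# Pre-tokenization: split on whitespace and punctuation boundaries
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordLevelTrainer(
    vocab_size=75_000,
    special_tokens=['<s>', '</s>', '<unk>'],
)

# 'wikitext-103.txt' is a placeholder for the concatenated splits
tokenizer.train(['wikitext-103.txt'], trainer)
tokenizer.save('tokenizer.json')
```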
The tokenizer can be used as simply as follows:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('dustalov/wikitext-wordlevel')

tokenizer.encode("I'll see you soon").ids     # => [68, 14, 2746, 577, 184, 595]
tokenizer.encode("I'll see you soon").tokens  # => ['i', "'", 'll', 'see', 'you', 'soon']
tokenizer.decode([68, 14, 2746, 577, 184, 595])  # => "i ' ll see you soon"
```
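Note that the round trip does not reproduce the original string exactly: the Lowercase normalizer lowercases the input, the Whitespace pre-tokenizer splits punctuation into separate tokens, and decoding joins tokens with single spaces, hence `"i ' ll see you soon"`.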