The Historical Irish byte-level BPE tokenizer was trained on Old, Middle, Early Modern, Classical Modern and pre-reform Modern Irish texts from the St. Gall Glosses, the Würzburg Glosses, CELT and the book subcorpus of the Historical Irish Corpus. The training data spans ca. 550 to 1926 and covers a wide variety of genres, such as bardic poetry, native Irish stories, translations and adaptations of continental epic and romance, annals, genealogies, grammatical and medical tracts, diaries, and religious writing. Because some texts switch between Irish and Latin, the vocabulary also contains some Latin tokens.
Byte-Pair Encoding (BPE) was introduced in "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2015). BPE relies on a pre-tokenizer that splits the training data into words. After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size, which is a hyperparameter. This tokenizer was trained with vocab_size=25000 and min_frequency=2.
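The merge-learning loop described above can be illustrated with a small pure-Python sketch. It is not the code used to train this tokenizer: the function name learn_bpe_merges, the toy word counts, and the use of single characters (rather than bytes) as base symbols are assumptions made for illustration only.

```python
from collections import Counter


def learn_bpe_merges(word_freqs, num_merges, min_frequency=2):
    """Learn up to `num_merges` merge rules from a {word: count} mapping.

    Illustrative only: base symbols here are characters, not bytes.
    """
    # Start with every word split into single-character symbols.
    splits = {word: tuple(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best, count = pair_counts.most_common(1)[0]
        if count < min_frequency:
            break  # no remaining pair is frequent enough to merge
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one new symbol.
        new_splits = {}
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_splits[word] = tuple(merged)
        splits = new_splits
    return merges, splits


# Hypothetical toy counts, not taken from the training corpus.
toy_counts = {"inna": 5, "in": 10, "ingin": 3}
merges, splits = learn_bpe_merges(toy_counts, num_merges=1)
print(merges)
print(splits)
```

In a real run the loop stops once the base vocabulary plus the learned merges reaches the requested vocabulary size; here the number of merges is passed directly to keep the sketch short.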
To keep the base vocabulary small, GPT-2 uses bytes as the base symbols, which is a clever trick to force the base vocabulary to be of size 256 while ensuring that every base character is included in the vocabulary. With some additional rules to deal with punctuation, GPT-2's tokenizer can tokenize every text without the need for the <unk> symbol.
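A short sketch of why a byte-level base vocabulary never needs an unknown token: any string, including accented Irish vowels, becomes a sequence of UTF-8 byte values between 0 and 255. The helper name to_byte_symbols is hypothetical, and the sketch leaves out the extra step in which GPT-2-style tokenizers map each byte to a printable character before learning merges.

```python
def to_byte_symbols(text):
    """Return the UTF-8 byte values of `text` as a list of integers."""
    return list(text.encode("utf-8"))


sample = "Boí Óengus"
symbols = to_byte_symbols(sample)

# Every symbol fits in a base vocabulary of 256 entries.
assert all(0 <= value < 256 for value in symbols)

# Accented vowels take two bytes each, so there are more bytes than characters.
print(len(sample), len(symbols))
print(to_byte_symbols("í"))

# The byte sequence decodes back to the original text without loss.
assert bytes(symbols).decode("utf-8") == sample
```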
Use
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ancatmara/historical-irish-tokenizer-bytelevel-bpe")
texts = ['Boí Óengus in n-aidchi n-aili inna chotlud.', 'Co n-accae ní, in n-ingin cucci for crunn síuil dó.']
tokenizer(texts, max_length=128, truncation=True)
Out:
>>> {'input_ids': [[0, 19093, 5413, 323, 272, 19, 4672, 272, 19, 935, 1940, 13074, 20, 2], [0, 1936, 272, 19, 8716, 75, 439, 18, 323, 272, 19, 3886, 833, 7328, 382, 553, 1097, 685, 466, 507, 20, 2]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
tokenizer.decode([0, 19093, 5413, 323, 272, 19, 4672, 272, 19, 935, 1940, 13074, 20, 2])
Out:
>>> '<s>Boí Óengus in n-aidchi n-aili inna chotlud.</s>'