Greek Tokenizer
A tokenizer trained from scratch on a Greek corpus using the BPE algorithm.
Usage:
To use this tokenizer, you can load it from the Hugging Face Hub:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
```
Example:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Tokenize the input text
input_text = "Αυτό είναι ένα παράδειγμα."
inputs = tokenizer(input_text, return_tensors="pt")

# Print the tokenized input (IDs)
print("Token IDs:", inputs["input_ids"].tolist())

# Convert token IDs back to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)

# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
```
It can also be used as a head start for pretraining a GPT-2 base model for the Greek language.
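For example, a randomly initialized GPT-2 model could be paired with this tokenizer before pretraining. The sketch below is only illustrative; the special-token mapping and configuration defaults are assumptions, not a prescribed setup:

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Illustrative configuration: match the model's vocabulary to the tokenizer
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,      # 52000, matching the tokenizer
    bos_token_id=tokenizer.cls_token_id,  # assumed mapping of special tokens
    eos_token_id=tokenizer.sep_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Randomly initialized model, ready for pretraining on Greek text
model = GPT2LMHeadModel(config)
print("Model parameters:", model.num_parameters())
```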
Training Details
Vocabulary Size: 52000
Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
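These details can be verified after loading the tokenizer; a quick sanity check:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Inspect the vocabulary size and the registered special tokens
print("Vocabulary size:", tokenizer.vocab_size)         # expected: 52000
print("Special tokens:", tokenizer.all_special_tokens)  # [PAD], [UNK], [CLS], [SEP], [MASK]
```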
Benefits and Why to use:
Many generic tokenizers split words into multiple tokens.
This tokenizer is more efficient and only splits words it was not trained on.
In the example above, the output of this tokenizer is only five tokens, while another tokenizer, e.g. Llama-3, produces nine or more tokens.
This can have an impact on inference costs and downstream applications.
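Such a comparison can be reproduced by counting the tokens each tokenizer produces for the same sentence. The snippet below is a sketch; the Llama-3 repository named here is gated, so any other tokenizer you have access to can be substituted:

```python
from transformers import AutoTokenizer

text = "Αυτό είναι ένα παράδειγμα."

# Count tokens produced by this tokenizer
greek_tok = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
print("Greek tokenizer:", len(greek_tok(text)["input_ids"]), "tokens")

# For comparison, any other tokenizer can be counted the same way.
# "meta-llama/Meta-Llama-3-8B" is gated and requires accepted access terms.
other_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print("Other tokenizer:", len(other_tok(text)["input_ids"]), "tokens")
```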
Update
July 2024:
A new version is imminent that further decreases the fertility for Greek and English.
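Fertility here refers to the average number of tokens produced per word. A rough estimate on a sample sentence, using naive whitespace word splitting as a simplification:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

sample = "Αυτό είναι ένα παράδειγμα."
n_tokens = len(tokenizer(sample, add_special_tokens=False)["input_ids"])
n_words = len(sample.split())            # naive whitespace word count
print("Fertility:", n_tokens / n_words)  # lower is better
```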