Greek Tokenizer
A tokenizer trained from scratch on a Greek corpus using the BPE algorithm.
Usage:
To use this tokenizer, you can load it from the Hugging Face Hub:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
```
Example:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Tokenize the input text
input_text = "Αυτό είναι ένα παράδειγμα."
inputs = tokenizer(input_text, return_tensors="pt")

# Print the tokenized input (IDs)
print("Token IDs:", inputs["input_ids"].tolist())

# Convert token IDs back to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)

# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
```
It can also be used as a head start for pretraining a GPT-2 base model for the Greek language.
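For example, a randomly initialized GPT-2 model could be paired with this tokenizer before pretraining. The sketch below is only illustrative; the special-token mapping and configuration defaults are assumptions, not a prescribed setup:

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Illustrative configuration: match the model's vocabulary to the tokenizer
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,      # 52000, matching the tokenizer
    bos_token_id=tokenizer.cls_token_id,  # assumed mapping of special tokens
    eos_token_id=tokenizer.sep_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Randomly initialized model, ready for pretraining on Greek text
model = GPT2LMHeadModel(config)
print("Model parameters:", model.num_parameters())
```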
Training Details
Vocabulary Size: 52000
Special Tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
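These details can be verified after loading the tokenizer; a quick sanity check:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

# Inspect the vocabulary size and the registered special tokens
print("Vocabulary size:", tokenizer.vocab_size)         # expected: 52000
print("Special tokens:", tokenizer.all_special_tokens)  # [PAD], [UNK], [CLS], [SEP], [MASK]
```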
Benefits and Why to use:
Many generic tokenizers split words into multiple tokens.
This tokenizer is more efficient and only splits words it was not trained on.
In the example above, the output of this tokenizer is only five tokens, while another tokenizer, e.g. Llama-3, produces nine or more tokens.
This can have an impact on inference costs and downstream applications.
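Such a comparison can be reproduced by counting the tokens each tokenizer produces for the same sentence. The snippet below is a sketch; the Llama-3 repository named here is gated, so any other tokenizer you have access to can be substituted:

```python
from transformers import AutoTokenizer

text = "Αυτό είναι ένα παράδειγμα."

# Count tokens produced by this tokenizer
greek_tok = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")
print("Greek tokenizer:", len(greek_tok(text)["input_ids"]), "tokens")

# For comparison, any other tokenizer can be counted the same way.
# "meta-llama/Meta-Llama-3-8B" is gated and requires accepted access terms.
other_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print("Other tokenizer:", len(other_tok(text)["input_ids"]), "tokens")
```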
Update
July 2024:
A new version is imminent that further decreases the fertility for Greek and English.
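Fertility here refers to the average number of tokens produced per word. A rough estimate on a sample sentence, using naive whitespace word splitting as a simplification:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/Greek_Tokenizer")

sample = "Αυτό είναι ένα παράδειγμα."
n_tokens = len(tokenizer(sample, add_special_tokens=False)["input_ids"])
n_words = len(sample.split())            # naive whitespace word count
print("Fertility:", n_tokens / n_words)  # lower is better
```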