The T5 Base Model for the Czech Language
This is a T5 base model for the Czech language, created as a smaller version of the google/mt5-base model (https://huggingface.co/google/mt5-base). To make this model, I retained only the Czech and some of the English embeddings from the original multilingual model.
Modifications to the original multilingual T5 base model:
1- The number of parameters was reduced from 582M to 244M.
2- The SentencePiece vocabulary was shrunk from 250K to 30K tokens by keeping only the top 20K Czech and the top 10K English tokens (see the sketch after this list).
3- The model size on disk was reduced from 2.2GB to 0.9GB.
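The embedding and vocabulary reduction above follows the approach described in the referenced post by David Dale. The following is a minimal sketch of the idea only, not the exact script used to build this model: the toy corpus, the `n_kept` value, and the output path are illustrative placeholders, and the SentencePiece tokenizer itself still has to be filtered to the same pieces separately (as covered in the post).

```python
from collections import Counter

from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

# 1. Count how often each vocabulary id occurs in a (here: toy) Czech + English corpus.
corpus = ["Toto je ukázková česká věta.", "This is a sample English sentence."]
counter = Counter()
for text in corpus:
    counter.update(tokenizer(text).input_ids)

# 2. Keep the most frequent pieces plus the special tokens.
#    The released model keeps roughly 20K Czech + 10K English tokens; n_kept here is illustrative.
n_kept = 1000
kept_ids = sorted(
    {i for i, _ in counter.most_common(n_kept)}
    | {tokenizer.pad_token_id, tokenizer.eos_token_id, tokenizer.unk_token_id}
)

# 3. Copy the selected rows of the input embeddings and of the LM head
#    (mT5 does not tie these two matrices, so both must be pruned).
old_in = model.get_input_embeddings().weight.data.clone()
old_out = model.lm_head.weight.data.clone()
model.resize_token_embeddings(len(kept_ids))
model.get_input_embeddings().weight.data = old_in[kept_ids]
model.lm_head.weight.data = old_out[kept_ids]

# After the SentencePiece vocabulary is filtered to the same pieces,
# the smaller model can be saved.
model.save_pretrained("mt5-base-czech-pruned")
```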
Notes:
Since this is a base T5 model for Czech, it must be fine-tuned on an appropriate dataset before it can be used for any downstream task.
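As a usage note, the checkpoint can be loaded with the standard Transformers `Auto*` classes for fine-tuning. This is a generic sketch only: the repository id is left as a placeholder, and the summarization example texts are purely illustrative.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "path-or-id-of-this-repository"  # placeholder: local path or Hub id of this model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# T5-style models are fine-tuned in a text-to-text format, e.g. for summarization:
inputs = tokenizer("summarize: Toto je příklad vstupního textu.", return_tensors="pt")
labels = tokenizer("Příklad souhrnu.", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # use this loss in a training loop or with the Trainer API
```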
References:
The work to create this model is largely based on the post by David Dale, "How to adapt a multilingual T5 model for a single language" (https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90).