metadata

base_model: mistralai/Mistral-7B-Instruct-v0.2
license: apache-2.0
language:
  - tt
tags:
  - tweety
datasets:
  - oscar-corpus/OSCAR-2301

Tweety-Tatar-7B: A Tatar Large Language Model

Tweety Tatar / Base 7b / 2024-v1

Model description

This model is our trans-tokenized LLM for the Tatar language, converted from the Mistral-7B-Instruct-v0.2 model trained by MistralAI. Trans-tokenized LLMs are language models finetuned to produce output in a particular language, using a novel tokenizer native to that language.

Developed by: François Remy (UGent), Alfiya Khabibullina (BeCode), et al.
Funded by: IDLab / GPULab (UGent)
Model type: Foundation model using the Mistral architecture
Language(s) (NLP): Tatar
License: Apache 2.0

In-scope usage

This model can be used as-is to perform basic language modeling operations in Tatar, or finetuned to perform more complex operations. It has not undergone instruction- or chat-based finetuning, so it works best in few-shot settings, where the prompt itself contains one or more worked examples of the task.
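As a rough illustration of what a few-shot prompt looks like for a base model, the sketch below assembles an instruction, a handful of worked examples, and an open-ended query into one string. The helper name `build_few_shot_prompt` and its `input_label` / `output_label` parameters are hypothetical and not part of this model or of any library; the layout is only one plausible choice, loosely following the instruction / long text / short text pattern used in the summarization example of this card.

```python
def build_few_shot_prompt(instruction, examples, query,
                          input_label="Input", output_label="Output"):
    """Assemble a few-shot prompt (hypothetical helper, not a library API).

    Each worked example repeats the instruction, then shows an input and
    its expected output. The final block repeats the instruction and the
    query but leaves the output empty for the model to complete.
    """
    blocks = []
    for source, target in examples:
        blocks.append(
            f"{instruction}\n\n{input_label}:\n{source}\n\n{output_label}:\n{target}"
        )
    # The open block ends right after the output label, so the model's
    # continuation is the answer to the query.
    blocks.append(f"{instruction}\n\n{input_label}:\n{query}\n\n{output_label}:\n")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demo = build_few_shot_prompt(
        "Summarize the following text:",
        [("A long example text.", "A short summary.")],
        "Another long text to summarize.",
    )
    print(demo)
```

With a base model, the instruction and labels would normally be written in Tatar, and generation would be stopped at the first newline so that only the completion of the final block is returned.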

Usage instructions

This model can be used just like any LLM in the HuggingFace framework:

import transformers

MODEL_NAME = "Tweeties/tweety-tatar-base-7b-2024-v1"
generate = transformers.pipeline("text-generation", model=MODEL_NAME)

Word Analogies

# Tatar prompt, in English: "Fill in this table of analogies:"
ANALOGY_PROMPT = """Бу аналоглар таблицасын тутырыгыз:
* {x1} : {y1}
* {x2} :"""

def score_analogy(x1, y1, x2, y2):
    # Returns 1 if the greedy completion of "x1 : y1, x2 : ?" equals y2, else 0.
    prompt = ANALOGY_PROMPT.replace('{x1}', x1).replace('{y1}', y1).replace('{x2}', x2)
    # Stop at the first newline byte token ('<0x0A>') or at end-of-sequence.
    stop_token_ids = generate.tokenizer.convert_tokens_to_ids(['<0x0A>', '</s>'])
    answer = generate(
        prompt, use_cache=True, do_sample=False, max_new_tokens=10, return_full_text=False,
        pad_token_id=generate.tokenizer.eos_token_id, eos_token_id=stop_token_ids,
    )[0]['generated_text'].strip()
    return 1 if answer == y2 else 0

score_analogy('Мәскәү', 'Русия', 'Әнкара', 'Төркия') # 1 (Moscow : Russia, Ankara : Turkey)

Summarization


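import torch

# The code below assumes the tokenizer and model are the ones already loaded
# by the `generate` pipeline from the usage instructions.
tokenizer = generate.tokenizer
model = generate.model
cuda_device = model.device  # device the model weights were loaded on

SUMMARIZE = "Түбәндәге текстка йомгак ясагыз:\n"  # "Summarize the following text:"
LONG_TEXT = "\n\nОзын текст:\n"  # "Long text:"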
# English: The human body is a complex organism that requires a specific balance of nutrients. If the human body's diet consists mainly of cooked foods, its body adapts to this type of nutrition. However, if the same person suddenly switches to a raw diet, his body cannot adapt to this change, which can be harmful.
LONG_TEXT_DEMO = "Кеше организмы катлаулы организм, аның өчен кирәкле туклыклы матдәләрнең аерым баланс таләп итә. Кеше организмының туклану рационы нигездә пешекләнгән ризыклардан тора икән, аның организмы бу ысул белән туклануга җайлаша. Әмма, шул ук кеше кинәт чимал диетасына күчә икән, аның организмы әлеге үзгәрешне кабул итә алмый, бу мөмкин кадәр зыян китерергә мөмкин."
SHORT_TEXT = "\n\nКыска текст:\n"  # "Short text:"
SHORT_TEXT_DEMO = "Әмма пешкән ризык ашауга гына күнгән организмга кинәт чи ризык белән туклануга күчүнең зарарлы нәтиҗәсе дә булырга мөмкин." # However, a body accustomed to eating only cooked food can have harmful consequences when suddenly switching to eating raw food.

def generate_tatar_summary(tatar_text_to_summarize: str) -> str:

    # craft the 1-shot example
    input_ids = torch.concat([
        tokenizer.encode(SUMMARIZE, return_tensors='pt'),
        tokenizer.encode(LONG_TEXT, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(LONG_TEXT_DEMO, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(SHORT_TEXT, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(SHORT_TEXT_DEMO, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode("\n\n", add_special_tokens=False, return_tensors='pt')
    ], axis=1)
    
    # craft the input
    input_ids = torch.concat([
        input_ids,
        tokenizer.encode(SUMMARIZE, return_tensors='pt'),
        tokenizer.encode(LONG_TEXT, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(tatar_text_to_summarize, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(SHORT_TEXT, add_special_tokens=False, return_tensors='pt'),
    ], axis=1)

    # generate the output
    model_inputs = {'input_ids':input_ids.to(cuda_device)}
    model_outputs = model.generate(
        **model_inputs,
        max_new_tokens=80,
        num_beams=8,
        no_repeat_ngram_size=6,
        early_stopping=False,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids(['<0x0A>','</s>']),
    )

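    # decode only the newly generated tokens, dropping special tokens such as </s>
    new_tokens = model_outputs[0][input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).rstrip()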

generate_tatar_summary("Зур шартлау (ингл. Big Bang) – Галәмнең башлангыч, сингуляр халәттә торган чорын тасвирлаучы космологик модель. Әле ХХ гасырда да без яшәгән Галәм статик структуралы, дигән фикер яшәгән. Ягъни, Галәмнең башы һәм ахыры юк, имеш, ул һәрвакыт булган һәм булачак. Бу фикер фән дөньясында бик озак, астрономия фәненең бөтен нигезләрен җимереп яңа теория барлыкка килгәнче яшәгән. Бу теориянең исеме – «Зур шартлау» теориясе.")

Citation

If you use this model, please cite our work as:

@article{tweeties2024,
    title = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP},
    author = {François Remy and Pieter Delobelle and Hayastan Avetisyan and Alfiya Khabibullina and Miryam de Lhoneux and Thomas Demeester},
    url = {https://raw.githubusercontent.com/LAGoM-NLP/transtokenizer/paper/Trans-Tokenization.pdf},
    year = {2024},
    note = {Under review at COLM 2024}
}