---
license: apache-2.0
language:
- bg
datasets:
- chitanka
tags:
- torch
inference: false
---

# Bulgarian language poetry generation

Pretrained model using a causal language modeling (CLM) objective, based on [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
Developed by [Radostin Cholakov](https://www.linkedin.com/in/radostin-cholakov-bb4422146/) as part of the [AzBuki.ML](https://azbuki-ml.com) initiatives.

# How to use?

```python
>>> from transformers import AutoModel, AutoTokenizer
>>>
>>> model_id = "radi-cho/poetry-bg"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
>>>
>>> input_ids = tokenizer.encode(
>>>     "[HED]Суетата на живота[NEL][BDY]",
>>>     add_special_tokens=False,
>>>     return_tensors='pt')
>>>
>>> output_ids = model.generate(
>>>     input_ids,
>>>     do_sample=True,
>>>     max_length=250,
>>>     top_p=0.98,
>>>     top_k=0,
>>>     pad_token_id=2,
>>>     eos_token_id=50258)
>>>
>>> output = tokenizer.decode(output_ids[0])
>>>
>>> output = output.replace('[NEL]', '\n')
>>> output = output.replace('[BDY]', '\n')
>>> output = output.replace('[HED]', '')
>>> output = output.replace('[SEP]', '')
>>>
>>> print(output)
Суетата на живота
Да страдам ли? Да страдам ли за това?
Не, не за това, че умирам...
Но само за това, че миговете ми са рани.
Аз съм сам и търся утеха.
```

# Custom Tokens

We introduced 3 custom tokens in the tokenizer - `[NEL]`, `[BDY]`, `[HED]`:

- `[HED]` denotes where the title of the poem begins;
- `[BDY]` denotes where the body of the poem begins;
- `[NEL]` marks the end of a verse and should be decoded as a new line.

`[SEP]` (with id 50258) is the *end of sequence* token.

# Credits

- Inspired by [rmihaylov/gpt2-medium-bg](https://huggingface.co/rmihaylov/gpt2-medium-bg).
- Data: [https://chitanka.info/texts/type/poetry](https://chitanka.info/texts/type/poetry).
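The chain of `replace` calls in the usage example above can be wrapped into a small helper that splits the raw model output into a title and a body. This is a minimal sketch, not part of the model repository: the `parse_poem` name is hypothetical, and it assumes the output follows the `[HED]…[NEL][BDY]…[SEP]` layout described in the Custom Tokens section.

```python
# Hypothetical helper: splits raw model output into (title, body)
# using the card's custom tokens. Assumes the [HED]...[BDY]...[SEP] layout.
def parse_poem(raw: str) -> tuple[str, str]:
    # Drop the end-of-sequence marker first.
    raw = raw.replace("[SEP]", "")
    # Everything before [BDY] is the title block, everything after is the body.
    head, _, body = raw.partition("[BDY]")
    # [NEL] marks the end of a verse, so decode it as a newline.
    title = head.replace("[HED]", "").replace("[NEL]", "\n").strip()
    verses = body.replace("[NEL]", "\n").strip()
    return title, verses


title, body = parse_poem(
    "[HED]Суетата на живота[NEL][BDY]Аз съм сам.[NEL]Търся утеха.[SEP]")
print(title)  # Суетата на живота
print(body)   # two verses, one per line
```

Keeping the title separate makes it easy to, for example, use it as a filename or display heading while formatting the verses independently.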