---
license: apache-2.0
language:
- bg
datasets:
- chitanka
tags:
- torch
inference: false
---

# Bulgarian language poetry generation

Pretrained model using a causal language modeling (CLM) objective, based on [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
Developed by [Radostin Cholakov](https://www.linkedin.com/in/radostin-cholakov-bb4422146/) as part of the [AzBuki.ML](https://azbuki-ml.com) initiatives.

# How to use?

```python
>>> from transformers import AutoModel, AutoTokenizer
>>>
>>> model_id = "radi-cho/poetry-bg"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
>>>
>>> input_ids = tokenizer.encode(
>>>     "[HED]Суетата на живота[NEL][BDY]",
>>>     add_special_tokens=False,
>>>     return_tensors='pt')
>>>
>>> output_ids = model.generate(
>>>     input_ids,
>>>     do_sample=True,
>>>     max_length=250,
>>>     top_p=0.98,
>>>     top_k=0,
>>>     pad_token_id=2,
>>>     eos_token_id=50258)
>>>
>>> output = tokenizer.decode(output_ids[0])
>>>
>>> output = output.replace('[NEL]', '\n')
>>> output = output.replace('[BDY]', '\n')
>>> output = output.replace('[HED]', '')
>>> output = output.replace('[SEP]', '')
>>>
>>> print(output)
Суетата на живота
Да страдам ли? Да страдам ли за това?
Не, не за това, че умирам...
Но само за това, че миговете ми са рани.
Аз съм сам и търся утеха.
```

# Custom Tokens

We introduced 3 custom tokens in the tokenizer - `[NEL]`, `[BDY]`, `[HED]`:

- `[HED]` denotes where the title of the poem begins;
- `[BDY]` denotes where the body of the poem begins;
- `[NEL]` marks the end of a verse and should be decoded as a new line.

`[SEP]` (with id 50258) is the *end of sequence* token.

# Credits

- Inspired by [rmihaylov/gpt2-medium-bg](https://huggingface.co/rmihaylov/gpt2-medium-bg).
- Data: [https://chitanka.info/texts/type/poetry](https://chitanka.info/texts/type/poetry).
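The chain of `replace` calls in the usage example above can be wrapped into a small helper that splits the raw model output into a title and a body. This is a minimal sketch, not part of the model repository: the `parse_poem` name is hypothetical, and it assumes the output follows the `[HED]…[NEL][BDY]…[SEP]` layout described in the Custom Tokens section.

```python
# Hypothetical helper: splits raw model output into (title, body)
# using the card's custom tokens. Assumes the [HED]...[BDY]...[SEP] layout.
def parse_poem(raw: str) -> tuple[str, str]:
    # Drop the end-of-sequence marker first.
    raw = raw.replace("[SEP]", "")
    # Everything before [BDY] is the title block, everything after is the body.
    head, _, body = raw.partition("[BDY]")
    # [NEL] marks the end of a verse, so decode it as a newline.
    title = head.replace("[HED]", "").replace("[NEL]", "\n").strip()
    verses = body.replace("[NEL]", "\n").strip()
    return title, verses


title, body = parse_poem(
    "[HED]Суетата на живота[NEL][BDY]Аз съм сам.[NEL]Търся утеха.[SEP]")
print(title)  # Суетата на живота
print(body)   # two verses, one per line
```

Keeping the title separate makes it easy to, for example, use it as a filename or display heading while formatting the verses independently.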