RoBERTa Language Model pre-trained on German Business Registry Publications

Released in January 2023, this is a German RoBERTa language model trained on 250,000 CD files ("Chronologische Abdrücke") provided by the German Business Registry ("Deutsches Handelsregister").

The model can be considered a "base" model, similar to the original RoBERTa "base" model (https://huggingface.co/roberta-base).

Parameters: ~110M

Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. Common downstream tasks are named entity recognition (NER) and relation extraction (RE). Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at generative models.
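As a rough illustration of masked language modeling with the raw model, the sketch below uses the `fill-mask` pipeline from the Hugging Face `transformers` library. The identifier `MODEL_ID` is a placeholder and not this model's actual repository name, the helper names are made up for this example, and the code assumes the tokenizer uses RoBERTa's usual `<mask>` token.

```python
# Placeholder (assumption): replace with the actual repository name of this model.
MODEL_ID = "your-org/your-german-registry-roberta"


def mask_word(text: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `word` in `text` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the text")
    return text.replace(word, mask_token, 1)


def predict_masked(text: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs for a text containing exactly one mask token."""
    # Imported lazily so that mask_word() works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # For a single mask, the pipeline returns a list of dicts with
    # "token_str" and "score" keys, ordered by descending score.
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=top_k)]
```

With a real model identifier filled in, a call such as `predict_masked(mask_word("Die Gesellschaft hat ihren Sitz in Berlin.", "Berlin"))` would download the model and list candidate tokens for the masked position; the sentence is an invented example.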

A use case for this model would be to fine-tune it on an NER task and use it to structure company data published by the German Business Registry.
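One possible starting point for such an NER fine-tuning setup is sketched below, again with the `transformers` library. It is not the authors' training code: the label set, the helper names, and the model identifier passed by the caller are assumptions for illustration. The first helper shows the usual step of aligning word-level labels with subword tokens; the second loads the model with a token classification head.

```python
from typing import List, Optional, Sequence

IGNORE_INDEX = -100  # label id that PyTorch's cross-entropy loss ignores by default

# Hypothetical label set; a real project would define its own entity types.
EXAMPLE_LABELS = ["O", "B-ORG", "I-ORG", "B-PER", "I-PER", "B-LOC", "I-LOC"]


def align_labels(
    word_ids: Sequence[Optional[int]],
    word_labels: Sequence[int],
    label_all_subwords: bool = False,
) -> List[int]:
    """Map word-level label ids onto subword tokens.

    `word_ids` is what a fast tokenizer's `word_ids()` returns: the index of the
    source word for each token, or None for special tokens.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)      # special token, no label
        elif wid != previous or label_all_subwords:
            aligned.append(word_labels[wid])  # first subword (or every subword)
        else:
            aligned.append(IGNORE_INDEX)      # continuation subword, skipped in the loss
        previous = wid
    return aligned


def load_token_classifier(model_id: str, labels: Sequence[str] = EXAMPLE_LABELS):
    """Load tokenizer and model with a freshly initialised classification head."""
    # Imported lazily so that align_labels() works without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    # Assumption: the tokenizer is byte-level BPE like the original RoBERTa, which
    # needs add_prefix_space=True when the input is already split into words.
    tokenizer = AutoTokenizer.from_pretrained(model_id, add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id, num_labels=len(labels), id2label=id2label, label2id=label2id
    )
    return tokenizer, model
```

Labelling only the first subword of each word keeps the loss at word level; setting `label_all_subwords=True` is an alternative that some setups prefer.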

Questions

If you have any questions, feel free to drop a message to [email protected]. If you are interested in structured company data and/or publications, let us know as well!