---
license: cc-by-4.0
base_model: ltg/norbert3-large
tags:
- generated_from_trainer
model-index:
- name: norbert3-large-ner
  results: []
datasets:
- wikiann
- norne
language:
- 'no'
- nb
pipeline_tag: token-classification
---

# Model Card: norbert3-large-ner (Fine-Tuned with WikiANN & norne)

## Overview

- **Model Name:** Kushtrim/norbert3-large-ner
- **Model Type:** Named Entity Recognition (NER)
- **Language:** Multilingual, with a focus on Norwegian (Norsk)
- **Fine-Tuned with:** [WikiANN](https://huggingface.co/datasets/wikiann) & [norne](https://huggingface.co/datasets/norne) datasets

## Description

`Kushtrim/norbert3-large-ner` is a BERT (Bidirectional Encoder Representations from Transformers) model obtained by fine-tuning the pre-trained ltg/norbert3-large[^1] checkpoint for Named Entity Recognition (NER) in Norwegian (Norsk). Fine-tuning used the WikiANN and norne datasets: WikiANN provides annotated named entities for many languages, including Norwegian, and norne is a Norwegian NER corpus.

Named Entity Recognition is the task of identifying and classifying named entities in text, such as persons, organizations, locations, and dates. This model can be used to extract such entities from Norwegian text.

## Intended Use

The `Kushtrim/norbert3-large-ner` model is designed for Named Entity Recognition (NER) in Norwegian text. It identifies and classifies named entities, marking both the beginning and the inside of each entity name, in categories including the following:

- **Persons (PER):** Names of individuals.
- **Organizations (ORG):** Names of organizations.
- **Locations (LOC):** Names of locations.
- **Miscellaneous (MISC):** Named entities that do not fit the other categories.
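
The "beginning" and "inside" distinction mentioned above can be illustrated with a minimal sketch that merges per-token tags into entity spans. It assumes tags of the form `B-<TYPE>`, `I-<TYPE>`, and `O`; the function name `bio_to_spans` and the hand-tagged example sentence are hypothetical and are not output from this model.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_tokens, current_type
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A "beginning" tag always starts a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An "inside" tag extends the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (this sketch simply drops such stray tokens).
            flush()
    flush()
    return spans


# Hand-tagged illustration (assumed tags, not model predictions).
tokens = ["Jonas", "Gahr", "Støre", "besøkte", "Universitetet", "i", "Oslo", "."]
tags = ["B-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
print(bio_to_spans(tokens, tags))
```

When the model is run through the `transformers` pipeline with an aggregation strategy, this merging is handled by the pipeline itself; the sketch only shows the idea.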

## Labels

| Label | Description |
|---------------------|-------------------------------------------------------------------|
| Person (PER) | Real or fictional characters and animals. |
| Organization (ORG) | Any collection of people, such as firms, institutions, organizations, music groups, sports teams, unions, political parties, etc. |
| Location (LOC) | Geographical places, buildings, and facilities. |
| Geo-political entity (GPE) | Geographical regions defined by political and/or social groups. A GPE entity subsumes and does not distinguish between a nation, its region, its government, or its people. |
| Product (PROD) | Artificially produced entities are regarded as products. This may include more abstract entities, such as speeches, radio shows, programming languages, contracts, laws, and ideas. |
| Event (EVT) | Festivals, cultural events, sports events, weather phenomena, wars, etc. Events are bounded in time and space. |
| Derived (DRV) | Words (and possibly phrases) that are derived from a name but are not names themselves. They typically contain a full name and are capitalized, but are not proper nouns. Illustrative examples are "Brann-treneren" ("the Brann coach") and "Oslo-mannen" ("the man from Oslo"). |
| Miscellaneous (MISC) | Names that do not belong in the other categories. Examples are animal species and names of medical conditions. Entities that are manufactured or produced are of type Product, whereas things that occur naturally or spontaneously are of type Miscellaneous. |

*Source of label information: [norne](https://huggingface.co/datasets/norne)*
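
As a rough illustration of how entity types like those in the table are usually turned into token-level tags, the sketch below expands each type into a "beginning" and an "inside" tag plus a single `O` tag for non-entity tokens. The tag strings, their ordering, and the helper name `build_bio_tags` are assumptions made for this example; the mapping actually used by the model is stored in `model.config.id2label` and should be taken from there.

```python
# Entity type abbreviations copied from the label table above.
ENTITY_TYPES = ["PER", "ORG", "LOC", "GPE", "PROD", "EVT", "DRV", "MISC"]


def build_bio_tags(entity_types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types (hypothetical order)."""
    tags = ["O"]
    for entity_type in entity_types:
        tags += [f"B-{entity_type}", f"I-{entity_type}"]
    return tags


bio_tags = build_bio_tags(ENTITY_TYPES)

# Hypothetical id <-> label mappings, in the shape transformers configs use.
id2label = dict(enumerate(bio_tags))
label2id = {label: idx for idx, label in id2label.items()}

print(len(bio_tags))  # 8 types * 2 tags + "O" = 17
```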

## Usage

```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
import pandas as pd

# trust_remote_code=True lets transformers load the custom model code shipped with the repository.
tokenizer = AutoTokenizer.from_pretrained("Kushtrim/norbert3-large-ner", trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained("Kushtrim/norbert3-large-ner", trust_remote_code=True)

# aggregation_strategy="first" groups sub-word tokens into words, labeling each word by its first token.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

text = "Sett inn tekst her"  # Norwegian for "Insert text here"

results = ner(text)

# One row per detected entity.
pd.DataFrame.from_records(results)
```
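
The pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. As a minimal sketch of one way to post-process that list, the helper below groups entity strings by type and drops low-confidence predictions. The function name `group_entities`, the `min_score` threshold of 0.5, and the mock results are all assumptions for illustration; the mock values are invented and are not real model output.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group entity strings by type, keeping predictions with score >= min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Invented results in the shape produced by an aggregated "ner" pipeline.
mock_results = [
    {"entity_group": "PER", "score": 0.99, "word": "Henrik Ibsen", "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.97, "word": "Skien", "start": 24, "end": 29},
    {"entity_group": "ORG", "score": 0.41, "word": "Teater", "start": 40, "end": 46},
]

print(group_entities(mock_results))
```

With the real pipeline, `group_entities(ner(text))` would be the corresponding call; a suitable threshold depends on the application.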

[^1]: Samuel, D., Kutuzov, A., Touileb, S., Velldal, E., Øvrelid, L., Rønningstad, E., Sigdel, E., & Palatkina, A. (2023). *NorBench – A Benchmark for Norwegian Language Models.* In *Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)*, 618–633. University of Tartu Library. [URL](https://aclanthology.org/2023.nodalida-1.61)