Heavily update README
README.md
---
license: cc-by-nc-sa-4.0
library_name: span-marker
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
pipeline_tag: token-classification
widget:
- text: >-
    Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic
    to Paris.
  example_title: Amelia Earhart
- text: >-
    Leonardo di ser Piero da Vinci painted the Mona Lisa based on Italian
    noblewoman Lisa del Giocondo.
  example_title: Leonardo da Vinci
model-index:
- name: SpanMarker w. roberta-large on finegrained, supervised FewNERD by Tom Aarsen
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      type: DFKI-SLT/few-nerd
      name: finegrained, supervised FewNERD
      config: supervised
      split: test
      revision: 2e3e727c63604fbfa2ff4cc5055359c84fe5ef2c
    metrics:
    - type: f1
      value: 0.7103
      name: F1
    - type: precision
      value: 0.7136
      name: Precision
    - type: recall
      value: 0.707
      name: Recall
datasets:
- DFKI-SLT/few-nerd
language:
- en
metrics:
- f1
- recall
- precision
---

# SpanMarker with roberta-large on FewNERD

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. It uses [roberta-large](https://huggingface.co/models/roberta-large) as its underlying encoder; see [train.py](train.py) for the training script.

## Model Details

### Model Description

- **Model Type:** SpanMarker
- **Encoder:** [roberta-large](https://huggingface.co/models/roberta-large)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 8 words
- **Training Dataset:** [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
- **Language:** en
- **License:** cc-by-nc-sa-4.0

### Model Sources

- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels
| Label | Examples |
|:------|:---------|
| art-broadcastprogram | "Street Cents", "The Gale Storm Show : Oh , Susanna", "Corazones" |
| art-film | "Shawshank Redemption", "Bosch", "L'Atlantide" |
| art-music | "Hollywood Studio Symphony", "Champion Lover", "Atkinson , Danko and Ford ( with Brockie and Hilton )" |
| art-other | "Aphrodite of Milos", "Venus de Milo", "The Today Show" |
| art-painting | "Production/Reproduction", "Cofiwch Dryweryn", "Touit" |
| art-writtenart | "Imelda de ' Lambertazzi", "Time", "The Seven Year Itch" |
| building-airport | "Sheremetyevo International Airport", "Newark Liberty International Airport", "Luton Airport" |
| building-hospital | "Memorial Sloan-Kettering Cancer Center", "Hokkaido University Hospital", "Yeungnam University Hospital" |
| building-hotel | "Flamingo Hotel", "The Standard Hotel", "Radisson Blu Sea Plaza Hotel" |
| building-library | "British Library", "Berlin State Library", "Bayerische Staatsbibliothek" |
| building-other | "Alpha Recording Studios", "Henry Ford Museum", "Communiplex" |
| building-restaurant | "Fatburger", "Carnegie Deli", "Trumbull" |
| building-sportsfacility | "Sports Center", "Glenn Warner Soccer Facility", "Boston Garden" |
| building-theater | "Pittsburgh Civic Light Opera", "National Paris Opera", "Sanders Theatre" |
| event-attack/battle/war/militaryconflict | "Jurist", "Vietnam War", "Easter Offensive" |
| event-disaster | "the 1912 North Mount Lyell Disaster", "1990s North Korean famine", "1693 Sicily earthquake" |
| event-election | "March 1898 elections", "Elections to the European Parliament", "1982 Mitcham and Morden by-election" |
| event-other | "Eastwood Scoring Stage", "Union for a Popular Movement", "Masaryk Democratic Movement" |
| event-protest | "Russian Revolution", "French Revolution", "Iranian Constitutional Revolution" |
| event-sportsevent | "World Cup", "Stanley Cup", "National Champions" |
| location-GPE | "Croatian", "the Republic of Croatia", "Mediterranean Basin" |
| location-bodiesofwater | "Arthur Kill", "Norfolk coast", "Atatürk Dam Lake" |
| location-island | "new Samsat district", "Staten Island", "Laccadives" |
| location-mountain | "Ruweisat Ridge", "Salamander Glacier", "Miteirya Ridge" |
| location-other | "Northern City Line", "Victoria line", "Cartuther" |
| location-park | "Gramercy Park", "Shenandoah National Park", "Painted Desert Community Complex Historic District" |
| location-road/railway/highway/transit | "NJT", "Friern Barnet Road", "Newark-Elizabeth Rail Link" |
| organization-company | "Church 's Chicken", "Dixy Chicken", "Texas Chicken" |
| organization-education | "MIT", "Barnard College", "Belfast Royal Academy and the Ulster College of Physical Education" |
| organization-government/governmentagency | "Supreme Court", "Congregazione dei Nobili", "Diet" |
| organization-media/newspaper | "Al Jazeera", "Clash", "TimeOut Melbourne" |
| organization-other | "IAEA", "4th Army", "Defence Sector C" |
| organization-politicalparty | "Al Wafa ' Islamic", "Kenseitō", "Shimpotō" |
| organization-religion | "Jewish", "UPCUSA", "Christian" |
| organization-showorganization | "Mr. Mister", "Lizzy", "Bochumer Symphoniker" |
| organization-sportsleague | "China League One", "NHL", "First Division" |
| organization-sportsteam | "Arsenal", "Luc Alphand Aventures", "Tottenham" |
| other-astronomything | "Algol", "`` Caput Larvae ''", "Zodiac" |
| other-award | "GCON", "Grand Commander of the Order of the Niger", "Order of the Republic of Guinea and Nigeria" |
| other-biologything | "BAR", "N-terminal lipid", "Amphiphysin" |
| other-chemicalthing | "carbon dioxide", "sulfur", "uranium" |
| other-currency | "$", "Travancore Rupee", "lac crore" |
| other-disease | "bladder cancer", "French Dysentery Epidemic of 1779", "hypothyroidism" |
| other-educationaldegree | "Bachelor", "Master", "BSc ( Hons ) in physics" |
| other-god | "El", "Fujin", "Raijin" |
| other-language | "Latin", "Breton-speaking", "English" |
| other-law | "Leahy–Smith America Invents Act ( AIA", "Thirty Years ' Peace", "United States Freedom Support Act" |
| other-livingthing | "monkeys", "patchouli", "insects" |
| other-medical | "Pediatrics", "pediatrician", "amitriptyline" |
| person-actor | "Tchéky Karyo", "Ellaline Terriss", "Edmund Payne" |
| person-artist/author | "George Axelrod", "Gaetano Donizett", "Hicks" |
| person-athlete | "Jaguar", "Tozawa", "Neville" |
| person-director | "Bob Swaim", "Frank Darabont", "Richard Quine" |
| person-other | "Richard Benson", "Holden", "Campbell" |
| person-politician | "Emeric", "Rivière", "William" |
| person-scholar | "Stalmine", "Stedman", "Wurdack" |
| person-soldier | "Helmuth Weidling", "Joachim Ziegler", "Krukenberg" |
| product-airplane | "Luton", "Spey-equipped FGR.2s", "EC135T2 CPDS" |
| product-car | "100EX", "Phantom", "Corvettes - GT1 C6R" |
| product-food | "red grape", "yakiniku", "V. labrusca" |
| product-game | "Airforce Delta", "Splinter Cell", "Hardcore RPG" |
| product-other | "Fairbottom Bobs", "X11", "PDP-1" |
| product-ship | "HMS `` Chinkara ''", "Congress", "Essex" |
| product-software | "Wikipedia", "Apdf", "AmiPDF" |
| product-train | "Royal Scots Grey", "High Speed Trains", "55022" |
| product-weapon | "AR-15 's", "ZU-23-2M Wróbel", "ZU-23-2MR Wróbel II" |
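
Each fine-grained FewNERD label embeds its coarse type before the first hyphen, so the 8 coarse types can be recovered from the 66 fine labels with a one-line split. A minimal sketch (the helper name is my own):

```python
def coarse_label(fine_label: str) -> str:
    # FewNERD fine labels are formatted "<coarse>-<fine>", e.g. "person-actor";
    # splitting on the first hyphen yields the coarse entity type.
    return fine_label.split("-", 1)[0]

print(coarse_label("person-actor"))                              # person
print(coarse_label("event-attack/battle/war/militaryconflict"))  # event
```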

## Uses

### Direct Use

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
# Run inference
entities = model.predict("Most of the Steven Seagal movie `` Under Siege `` ( co-starring Tommy Lee Jones ) was filmed on the , which is docked on Mobile Bay at Battleship Memorial Park and open to the public .")
```
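
`predict` returns one dict per detected entity; the snippet below assumes those dicts carry `"span"`, `"label"`, and `"score"` keys (check your installed SpanMarker version) and formats them for display:

```python
def format_entities(entities):
    # Assumes each entity dict carries "span", "label" and "score" keys,
    # as returned by SpanMarkerModel.predict for a single sentence.
    return [f'{e["span"]!r} -> {e["label"]} ({e["score"]:.2f})' for e in entities]

# Hypothetical predict output, for illustration only:
example = [{"span": "Tommy Lee Jones", "label": "person-actor", "score": 0.98}]
print(format_entities(example))
```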

### Downstream Use
You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-roberta-large-fewnerd-fine-super-finetuned")
```
</details>

### ⚠️ Tokenizer Warning
The [roberta-large](https://huggingface.co/models/roberta-large) tokenizer distinguishes between punctuation attached directly to a word and punctuation separated from it by a space: for example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model was only exposed to the latter style, i.e. with all words separated by a space, so it may perform worse when the inference text uses the former style.

In short, it is recommended to preprocess your inference text so that all words and punctuation are separated by a space. Some potential approaches: run the text through NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or iterate over a spaCy [`Doc`](https://spacy.io/api/doc#iter), then join the resulting words with a space.
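
If you prefer to avoid an NLTK or spaCy dependency, a rough regex-based sketch of this preprocessing (the function name and the abbreviation handling are my own; `word_tokenize` is more robust):

```python
import re

def space_punctuation(text: str) -> str:
    # Detach commas, quotes, brackets, etc. from the neighbouring words...
    text = re.sub(r'([,;:!?()"])', r" \1 ", text)
    # ...and detach only a sentence-final period, so abbreviations like "J." stay intact.
    text = re.sub(r"\.\s*$", " .", text)
    # Collapse the doubled-up whitespace introduced above.
    return re.sub(r"\s+", " ", text).strip()

print(space_punctuation("He plays J. Robert Oppenheimer, an American theoretical physicist."))
# He plays J. Robert Oppenheimer , an American theoretical physicist .
```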

## Training Details

### Training Set Metrics
| Training set          | Min | Median  | Max |
|:----------------------|:----|:--------|:----|
| Sentence length       | 1   | 24.4945 | 267 |
| Entities per sentence | 0   | 2.5832  | 88  |

### Training Hyperparameters
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3
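
For reference, the linear schedule with 10% warmup listed above can be sketched as a plain function of the training step (an illustrative sketch, not the actual Trainer internals):

```python
def linear_warmup_lr(step: int, total_steps: int,
                     base_lr: float = 1e-5, warmup_ratio: float = 0.1) -> float:
    # Ramp linearly from 0 up to base_lr over the first 10% of steps...
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # ...then decay linearly back down to 0 by the final step.
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

print(linear_warmup_lr(50, 1000))    # mid-warmup: half of base_lr
print(linear_warmup_lr(1000, 1000))  # 0.0 at the final step
```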

### Training Hardware
- **On Cloud**: No
- **GPU Model**: 1 x NVIDIA GeForce RTX 3090
- **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
- **RAM Size**: 31.78 GB

### Framework Versions
- Python: 3.9.16
- SpanMarker: 1.3.1.dev
- Transformers: 4.29.2
- PyTorch: 2.0.1+cu118
- Datasets: 2.14.3
- Tokenizers: 0.13.2