tomaarsen (HF staff) committed
Commit 08484ef
1 Parent(s): f3a722a

Heavily update README

Files changed (1):
  1. README.md +188 -55

README.md CHANGED
@@ -1,63 +1,149 @@
-
  ---
- license: apache-2.0
  library_name: span-marker
  tags:
- - span-marker
- - token-classification
- - ner
- - named-entity-recognition
  pipeline_tag: token-classification
  widget:
- - text: "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."
-   example_title: "Amelia Earhart"
- - text: "Leonardo di ser Piero da Vinci painted the Mona Lisa based on Italian noblewoman Lisa del Giocondo."
-   example_title: "Leonardo da Vinci"
  model-index:
- - name: SpanMarker w. roberta-large on finegrained, supervised FewNERD by Tom Aarsen
-   results:
-   - task:
-       type: token-classification
-       name: Named Entity Recognition
-     dataset:
-       type: DFKI-SLT/few-nerd
-       name: finegrained, supervised FewNERD
-       config: supervised
-       split: test
-       revision: 2e3e727c63604fbfa2ff4cc5055359c84fe5ef2c
-     metrics:
-     - type: f1
-       value: 0.7103
-       name: F1
-     - type: precision
-       value: 0.7136
-       name: Precision
-     - type: recall
-       value: 0.7070
-       name: Recall
  datasets:
- - DFKI-SLT/few-nerd
  language:
- - en
  metrics:
- - f1
- - recall
- - precision
  ---

- # SpanMarker for Named Entity Recognition

- This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition. In particular, this SpanMarker model uses [roberta-large](https://huggingface.co/roberta-large) as the underlying encoder. See [train.py](train.py) for the training script.

- ## Usage

- To use this model for inference, first install the `span_marker` library:

- ```bash
- pip install span_marker
- ```

- You can then run inference with this model like so:

  ```python
  from span_marker import SpanMarkerModel
@@ -65,21 +151,68 @@ from span_marker import SpanMarkerModel
  # Download from the 🤗 Hub
  model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
  # Run inference
- entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
  ```

- ### Limitations

- **Warning**: This model works best when punctuation is separated from the prior words, so
  ```python
- # ✅
- model.predict("He plays J. Robert Oppenheimer , an American theoretical physicist .")
- # ❌
- model.predict("He plays J. Robert Oppenheimer, an American theoretical physicist.")

- # You can also supply a list of words directly: ✅
- model.predict(["He", "plays", "J.", "Robert", "Oppenheimer", ",", "an", "American", "theoretical", "physicist", "."])
  ```
- The same may be beneficial for some languages, such as splitting `"l'ocean Atlantique"` into `"l' ocean Atlantique"`.

- See the [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) repository for documentation and additional information on this library.
  ---
+ license: cc-by-nc-sa-4.0
  library_name: span-marker
  tags:
+ - span-marker
+ - token-classification
+ - ner
+ - named-entity-recognition
+ - generated_from_span_marker_trainer
  pipeline_tag: token-classification
  widget:
+ - text: >-
+     Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic
+     to Paris.
+   example_title: Amelia Earhart
+ - text: >-
+     Leonardo di ser Piero da Vinci painted the Mona Lisa based on Italian
+     noblewoman Lisa del Giocondo.
+   example_title: Leonardo da Vinci
  model-index:
+ - name: SpanMarker w. roberta-large on finegrained, supervised FewNERD by Tom Aarsen
+   results:
+   - task:
+       type: token-classification
+       name: Named Entity Recognition
+     dataset:
+       type: DFKI-SLT/few-nerd
+       name: finegrained, supervised FewNERD
+       config: supervised
+       split: test
+       revision: 2e3e727c63604fbfa2ff4cc5055359c84fe5ef2c
+     metrics:
+     - type: f1
+       value: 0.7103
+       name: F1
+     - type: precision
+       value: 0.7136
+       name: Precision
+     - type: recall
+       value: 0.707
+       name: Recall
  datasets:
+ - DFKI-SLT/few-nerd
  language:
+ - en
  metrics:
+ - f1
+ - recall
+ - precision
  ---

+ # SpanMarker with roberta-large on FewNERD

+ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [roberta-large](https://huggingface.co/models/roberta-large) as the underlying encoder. See [train.py](train.py) for the training script.

+ ## Model Details

+ ### Model Description

+ - **Model Type:** SpanMarker
+ - **Encoder:** [roberta-large](https://huggingface.co/models/roberta-large)
+ - **Maximum Sequence Length:** 256 tokens
+ - **Maximum Entity Length:** 8 words
+ - **Training Dataset:** [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
+ - **Language:** en
+ - **License:** cc-by-nc-sa-4.0
+
+ ### Model Sources
+
+ - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
+ - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

+ ### Model Labels
+ | Label | Examples |
+ |:-----------------------------------------|:---------------------------------------------------------------------------------------------------------|
+ | art-broadcastprogram | "Street Cents", "The Gale Storm Show : Oh , Susanna", "Corazones" |
+ | art-film | "Shawshank Redemption", "Bosch", "L'Atlantide" |
+ | art-music | "Hollywood Studio Symphony", "Champion Lover", "Atkinson , Danko and Ford ( with Brockie and Hilton )" |
+ | art-other | "Aphrodite of Milos", "Venus de Milo", "The Today Show" |
+ | art-painting | "Production/Reproduction", "Cofiwch Dryweryn", "Touit" |
+ | art-writtenart | "Imelda de ' Lambertazzi", "Time", "The Seven Year Itch" |
+ | building-airport | "Sheremetyevo International Airport", "Newark Liberty International Airport", "Luton Airport" |
+ | building-hospital | "Memorial Sloan-Kettering Cancer Center", "Hokkaido University Hospital", "Yeungnam University Hospital" |
+ | building-hotel | "Flamingo Hotel", "The Standard Hotel", "Radisson Blu Sea Plaza Hotel" |
+ | building-library | "British Library", "Berlin State Library", "Bayerische Staatsbibliothek" |
+ | building-other | "Alpha Recording Studios", "Henry Ford Museum", "Communiplex" |
+ | building-restaurant | "Fatburger", "Carnegie Deli", "Trumbull" |
+ | building-sportsfacility | "Sports Center", "Glenn Warner Soccer Facility", "Boston Garden" |
+ | building-theater | "Pittsburgh Civic Light Opera", "National Paris Opera", "Sanders Theatre" |
+ | event-attack/battle/war/militaryconflict | "Jurist", "Vietnam War", "Easter Offensive" |
+ | event-disaster | "the 1912 North Mount Lyell Disaster", "1990s North Korean famine", "1693 Sicily earthquake" |
+ | event-election | "March 1898 elections", "Elections to the European Parliament", "1982 Mitcham and Morden by-election" |
+ | event-other | "Eastwood Scoring Stage", "Union for a Popular Movement", "Masaryk Democratic Movement" |
+ | event-protest | "Russian Revolution", "French Revolution", "Iranian Constitutional Revolution" |
+ | event-sportsevent | "World Cup", "Stanley Cup", "National Champions" |
+ | location-GPE | "Croatian", "the Republic of Croatia", "Mediterranean Basin" |
+ | location-bodiesofwater | "Arthur Kill", "Norfolk coast", "Atatürk Dam Lake" |
+ | location-island | "new Samsat district", "Staten Island", "Laccadives" |
+ | location-mountain | "Ruweisat Ridge", "Salamander Glacier", "Miteirya Ridge" |
+ | location-other | "Northern City Line", "Victoria line", "Cartuther" |
+ | location-park | "Gramercy Park", "Shenandoah National Park", "Painted Desert Community Complex Historic District" |
+ | location-road/railway/highway/transit | "NJT", "Friern Barnet Road", "Newark-Elizabeth Rail Link" |
+ | organization-company | "Church 's Chicken", "Dixy Chicken", "Texas Chicken" |
+ | organization-education | "MIT", "Barnard College", "Belfast Royal Academy and the Ulster College of Physical Education" |
+ | organization-government/governmentagency | "Supreme Court", "Congregazione dei Nobili", "Diet" |
+ | organization-media/newspaper | "Al Jazeera", "Clash", "TimeOut Melbourne" |
+ | organization-other | "IAEA", "4th Army", "Defence Sector C" |
+ | organization-politicalparty | "Al Wafa ' Islamic", "Kenseitō", "Shimpotō" |
+ | organization-religion | "Jewish", "UPCUSA", "Christian" |
+ | organization-showorganization | "Mr. Mister", "Lizzy", "Bochumer Symphoniker" |
+ | organization-sportsleague | "China League One", "NHL", "First Division" |
+ | organization-sportsteam | "Arsenal", "Luc Alphand Aventures", "Tottenham" |
+ | other-astronomything | "Algol", "`` Caput Larvae ''", "Zodiac" |
+ | other-award | "GCON", "Grand Commander of the Order of the Niger", "Order of the Republic of Guinea and Nigeria" |
+ | other-biologything | "BAR", "N-terminal lipid", "Amphiphysin" |
+ | other-chemicalthing | "carbon dioxide", "sulfur", "uranium" |
+ | other-currency | "$", "Travancore Rupee", "lac crore" |
+ | other-disease | "bladder cancer", "French Dysentery Epidemic of 1779", "hypothyroidism" |
+ | other-educationaldegree | "Bachelor", "Master", "BSc ( Hons ) in physics" |
+ | other-god | "El", "Fujin", "Raijin" |
+ | other-language | "Latin", "Breton-speaking", "English" |
+ | other-law | "Leahy–Smith America Invents Act ( AIA", "Thirty Years ' Peace", "United States Freedom Support Act" |
+ | other-livingthing | "monkeys", "patchouli", "insects" |
+ | other-medical | "Pediatrics", "pediatrician", "amitriptyline" |
+ | person-actor | "Tchéky Karyo", "Ellaline Terriss", "Edmund Payne" |
+ | person-artist/author | "George Axelrod", "Gaetano Donizett", "Hicks" |
+ | person-athlete | "Jaguar", "Tozawa", "Neville" |
+ | person-director | "Bob Swaim", "Frank Darabont", "Richard Quine" |
+ | person-other | "Richard Benson", "Holden", "Campbell" |
+ | person-politician | "Emeric", "Rivière", "William" |
+ | person-scholar | "Stalmine", "Stedman", "Wurdack" |
+ | person-soldier | "Helmuth Weidling", "Joachim Ziegler", "Krukenberg" |
+ | product-airplane | "Luton", "Spey-equipped FGR.2s", "EC135T2 CPDS" |
+ | product-car | "100EX", "Phantom", "Corvettes - GT1 C6R" |
+ | product-food | "red grape", "yakiniku", "V. labrusca" |
+ | product-game | "Airforce Delta", "Splinter Cell", "Hardcore RPG" |
+ | product-other | "Fairbottom Bobs", "X11", "PDP-1" |
+ | product-ship | "HMS `` Chinkara ''", "Congress", "Essex" |
+ | product-software | "Wikipedia", "Apdf", "AmiPDF" |
+ | product-train | "Royal Scots Grey", "High Speed Trains", "55022" |
+ | product-weapon | "AR-15 's", "ZU-23-2M Wróbel", "ZU-23-2MR Wróbel II" |
+
+ ## Uses
+
+ ### Direct Use

  ```python
  from span_marker import SpanMarkerModel

  # Download from the 🤗 Hub
  model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
  # Run inference
+ entities = model.predict("Most of the Steven Seagal movie `` Under Siege `` ( co-starring Tommy Lee Jones ) was filmed on the , which is docked on Mobile Bay at Battleship Memorial Park and open to the public .")
  ```
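As an illustration of how the predictions above can be consumed: per the span_marker documentation, `model.predict` returns a list of dicts with `"span"`, `"label"`, `"score"`, `"char_start_index"` and `"char_end_index"` keys. The sample values and the `group_by_coarse_label` helper below are illustrative, not part of the card.

```python
# Hypothetical predictions in the shape returned by SpanMarkerModel.predict;
# the spans, scores, and offsets below are made up for illustration.
entities = [
    {"span": "Steven Seagal", "label": "person-actor", "score": 0.99,
     "char_start_index": 12, "char_end_index": 25},
    {"span": "Under Siege", "label": "art-film", "score": 0.97,
     "char_start_index": 36, "char_end_index": 47},
    {"span": "Tommy Lee Jones", "label": "person-actor", "score": 0.98,
     "char_start_index": 65, "char_end_index": 80},
]

def group_by_coarse_label(entities):
    """Group fine-grained FewNERD labels such as 'person-actor'
    under their coarse prefix such as 'person'."""
    groups = {}
    for entity in entities:
        coarse = entity["label"].split("-", 1)[0]
        groups.setdefault(coarse, []).append(entity["span"])
    return groups

print(group_by_coarse_label(entities))
# {'person': ['Steven Seagal', 'Tommy Lee Jones'], 'art': ['Under Siege']}
```

Grouping by the coarse prefix is handy because FewNERD's 66 fine-grained labels all encode their coarse type before the hyphen.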
 
+ ### Downstream Use
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>

  ```python
+ from span_marker import SpanMarkerModel, Trainer
+ from datasets import load_dataset

+ # Download from the 🤗 Hub
+ model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
+
+ # Specify a Dataset with "tokens" and "ner_tags" columns
+ dataset = load_dataset("conll2003")  # For example CoNLL2003
+
+ # Initialize a Trainer using the pretrained model & dataset
+ trainer = Trainer(
+     model=model,
+     train_dataset=dataset["train"],
+     eval_dataset=dataset["validation"],
+ )
+ trainer.train()
+ trainer.save_model("tomaarsen/span-marker-roberta-large-fewnerd-fine-super-finetuned")
  ```
+ </details>
+
+ ### ⚠️ Tokenizer Warning
+ The [roberta-large](https://huggingface.co/models/roberta-large) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
+
+ In short, it is recommended to preprocess your inference text such that all words and punctuation are separated by a space. Some potential approaches to convert regular text into this format are applying NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or iterating over a spaCy [`Doc`](https://spacy.io/api/doc#iter), then joining the resulting words with a space.
+
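A minimal sketch of the preprocessing described above, using only a regular expression (NLTK's `word_tokenize` or spaCy, as recommended in the card, handle edge cases like abbreviations better):

```python
import re

def separate_punctuation(text: str) -> str:
    """Crude sketch: put spaces around punctuation so 'Paris.' becomes
    'Paris .'. Note this also splits abbreviations such as 'J.' into
    'J .', which a proper tokenizer (NLTK, spaCy) would keep intact."""
    spaced = re.sub(r"([.,!?;:()\"])", r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()

print(separate_punctuation("He plays J. Robert Oppenheimer, an American theoretical physicist."))
# He plays J . Robert Oppenheimer , an American theoretical physicist .
```

The output style matches what the model saw during training: every word and punctuation mark separated by a single space.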
+ ## Training Details
+
+ ### Training Set Metrics
+ | Training set          | Min | Median  | Max |
+ |:----------------------|:----|:--------|:----|
+ | Sentence length       | 1   | 24.4945 | 267 |
+ | Entities per sentence | 0   | 2.5832  | 88  |
+
+ ### Training Hyperparameters
+ - learning_rate: 1e-05
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 3
+
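The scheduler settings listed above (linear schedule, warmup ratio 0.1) imply the following learning-rate curve. This is an illustrative sketch with a hypothetical helper name; the actual run used the scheduler built into `transformers`:

```python
def linear_lr_with_warmup(step: int, total_steps: int,
                          base_lr: float = 1e-5,
                          warmup_ratio: float = 0.1) -> float:
    """Linear warmup over the first `warmup_ratio` of steps up to
    `base_lr`, followed by linear decay to zero at `total_steps`."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(linear_lr_with_warmup(50, 1000))    # halfway through warmup: 5e-06
print(linear_lr_with_warmup(1000, 1000))  # end of training: 0.0
```

The peak learning rate of 1e-05 is reached after 10% of the total steps, matching `lr_scheduler_warmup_ratio: 0.1`.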
+ ### Training Hardware
+ - **On Cloud**: No
+ - **GPU Model**: 1 x NVIDIA GeForce RTX 3090
+ - **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
+ - **RAM Size**: 31.78 GB
+
+ ### Framework Versions

+ - Python: 3.9.16
+ - SpanMarker: 1.3.1.dev
+ - Transformers: 4.29.2
+ - PyTorch: 2.0.1+cu118
+ - Datasets: 2.14.3
+ - Tokenizers: 0.13.2