metadata

library_name: span-marker
tags:
  - span-marker
  - token-classification
  - ner
  - named-entity-recognition
  - generated_from_span_marker_trainer
metrics:
  - precision
  - recall
  - f1
widget:
  - text: >-
      The Bengal tiger is the most common subspecies of tiger, constituting
      approximately 80% of the entire tiger population, and is found in
      Bangladesh, Bhutan, Myanmar, Nepal, and India.
  - text: >-
      In other countries, it is a non-commissioned rank (e.g. Spain, Italy,
      France, the Netherlands and the Indonesian Police ranks).
  - text: >-
      The filling consists of fish, pork and bacon, and is seasoned with salt
      (unless the pork is already salted).
  - text: >-
      This stood until August 20, 1993 when it was beaten by one 1 / 100th of a
      second by Colin Jackson of Great Britain in Stuttgart, Germany, a
      subsequent record that stood for 13 years.
  - text: >-
      Ann Patchett ’s novel " Bel Canto ", was another creative influence that
      helped her manage a plentiful cast of characters.
pipeline_tag: token-classification
model-index:
  - name: SpanMarker
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: Unknown
          type: unknown
          split: eval
        metrics:
          - type: f1
            value: 0.9130661114003124
            name: F1
          - type: precision
            value: 0.9148758606300855
            name: Precision
          - type: recall
            value: 0.9112635078969243
            name: Recall

SpanMarker

This is a SpanMarker model that can be used for Named Entity Recognition.

Model Details

Model Description

Model Type: SpanMarker
Maximum Sequence Length: 256 tokens
Maximum Entity Length: 6 words

Model Sources

Repository: SpanMarker on GitHub
Thesis: SpanMarker For Named Entity Recognition

Model Labels

Label	Examples
ANIM	"vertebrate", "moth", "G. firmus"
BIO	"Aspergillus", "Cladophora", "Zythiostroma"
CEL	"pulsar", "celestial bodies", "neutron star"
DIS	"social anxiety disorder", "insulin resistance", "Asperger syndrome"
EVE	"Spanish Civil War", "National Junior Angus Show", "French Revolution"
FOOD	"Neera", "Bellini ( cocktail )", "soju"
INST	"Apple II", "Encyclopaedia of Chess Openings", "Android"
LOC	"Kīlauea", "Hungary", "Vienna"
MEDIA	"CSI : Crime Scene Investigation", "Big Comic Spirits", "American Idol"
MYTH	"Priam", "Oźwiena", "Odysseus"
ORG	"San Francisco Giants", "Arm Holdings", "RTÉ One"
PER	"Amelia Bence", "Tito Lusiardo", "James Cameron"
PLANT	"vernal squill", "Sarracenia purpurea", "Drosera rotundifolia"
TIME	"prehistory", "Age of Enlightenment", "annual paid holiday"
VEHI	"Short 360", "Ferrari 355 Challenge", "Solution F / Chretien Helicopter"

Uses

Direct Use for Inference

from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")
# Run inference
entities = model.predict("Ann Patchett ’s novel \" Bel Canto \", was another creative influence that helped her manage a plentiful cast of characters.")
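
The predictions can then be post-processed with plain Python. The sketch below assumes that `model.predict` returns a list of dictionaries with `span`, `label` and `score` keys; `group_by_label` is a hypothetical helper, and `sample_entities` is hand-written stand-in data with made-up scores, not real output from this model.

```python
from collections import defaultdict


def group_by_label(entities, min_score=0.0):
    """Collect entity spans per label, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Stand-in for the result of `model.predict(...)`; labels and scores are illustrative.
sample_entities = [
    {"span": "Ann Patchett", "label": "PER", "score": 0.99},
    {"span": "Bel Canto", "label": "MEDIA", "score": 0.97},
]
print(group_by_label(sample_entities, min_score=0.5))
```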

Downstream Use

You can fine-tune this model on your own dataset.


from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example, CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("span_marker_model_id-finetuned")
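
Before training on your own data, it can help to check that every record has one tag per token. The sketch below assumes records shaped like CoNLL2003 rows, with a `tokens` list and an integer `ner_tags` list of equal length; `find_misaligned_records` and the sample records are hypothetical and only illustrate the expected shape.

```python
def find_misaligned_records(records):
    """Return indices of records that lack a column or have unequal column lengths."""
    bad_indices = []
    for index, record in enumerate(records):
        tokens = record.get("tokens")
        tags = record.get("ner_tags")
        if tokens is None or tags is None or len(tokens) != len(tags):
            bad_indices.append(index)
    return bad_indices


# Illustrative records; the integer tag ids are placeholders, not this model's label ids.
sample_records = [
    {"tokens": ["Vienna", "is", "in", "Austria"], "ner_tags": [1, 0, 0, 1]},
    {"tokens": ["James", "Cameron", "directed", "it"], "ner_tags": [1, 2, 0]},
]
print(find_misaligned_records(sample_records))
```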

Training Details

Training Set Metrics

Training set	Min	Mean	Max
Sentence length	2	21.6493	237
Entities per sentence	0	1.5369	36

Training Hyperparameters

learning_rate: 1e-05
train_batch_size: 16
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
mixed_precision_training: Native AMP
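
As a rough illustration of how `learning_rate`, `lr_scheduler_type: linear` and `lr_scheduler_warmup_ratio: 0.1` interact, the sketch below computes a learning rate that rises linearly during warmup and then decays linearly to zero. It is a plain-Python approximation, not the trainer's own scheduler code. The total of 17,350 steps is an assumption estimated from the step-to-epoch ratio in the training results (1000 steps at epoch 0.0576); the card does not state it.

```python
def linear_schedule_lr(step, total_steps, base_lr=1e-05, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: scale up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale down from base_lr to 0 at the final step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


ESTIMATED_TOTAL_STEPS = 17350  # assumption, estimated from the results table
for step in (0, 1000, 1735, 10000, 17350):
    print(step, linear_schedule_lr(step, ESTIMATED_TOTAL_STEPS))
```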

Training Results

Epoch	Step	Validation Loss	Validation Precision	Validation Recall	Validation F1	Validation Accuracy
0.0576	1000	0.0142	0.8714	0.7729	0.8192	0.9698
0.1153	2000	0.0107	0.8316	0.8815	0.8558	0.9744
0.1729	3000	0.0092	0.8717	0.8797	0.8757	0.9780
0.2306	4000	0.0082	0.8811	0.8886	0.8848	0.9798
0.2882	5000	0.0084	0.8523	0.9163	0.8831	0.9790
0.3459	6000	0.0079	0.8700	0.9113	0.8902	0.9802
0.4035	7000	0.0070	0.9107	0.8859	0.8981	0.9822
0.4611	8000	0.0069	0.9259	0.8797	0.9022	0.9827
0.5188	9000	0.0067	0.9061	0.8965	0.9013	0.9829
0.5764	10000	0.0066	0.9034	0.8996	0.9015	0.9829
0.6341	11000	0.0064	0.9160	0.8996	0.9077	0.9839
0.6917	12000	0.0066	0.8952	0.9121	0.9036	0.9832
0.7494	13000	0.0062	0.9165	0.9009	0.9086	0.9841
0.8070	14000	0.0062	0.9010	0.9121	0.9065	0.9835
0.8647	15000	0.0062	0.9084	0.9127	0.9105	0.9842
0.9223	16000	0.0060	0.9151	0.9098	0.9125	0.9846
0.9799	17000	0.0060	0.9149	0.9113	0.9131	0.9848
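
The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below applies the standard formula to rounded values from the table, so small differences in the last decimal place are possible; `f1_score` is a local helper defined here, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the first and last rows of the table
rows = [
    (0.8714, 0.7729, 0.8192),
    (0.9149, 0.9113, 0.9131),
]
for precision, recall, reported in rows:
    print(f"computed={f1_score(precision, recall):.4f} reported={reported:.4f}")
```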

Framework Versions

Python: 3.8.16
SpanMarker: 1.5.0
Transformers: 4.29.0.dev0
PyTorch: 1.10.1
Datasets: 2.15.0
Tokenizers: 0.13.2

Citation

BibTeX

@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}