Fix broken URLs
README.md CHANGED
```diff
@@ -50,14 +50,14 @@ model-index:
 
 # SpanMarker with roberta-large on FewNERD
 
-This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [roberta-large](https://huggingface.co/
+This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [roberta-large](https://huggingface.co/roberta-large) as the underlying encoder. See [train.py](train.py) for the training script.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
-- **Encoder:** [roberta-large](https://huggingface.co/
+- **Encoder:** [roberta-large](https://huggingface.co/roberta-large)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
 - **Training Dataset:** [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
@@ -179,7 +179,7 @@ trainer.save_model("tomaarsen/span-marker-roberta-large-fewnerd-fine-super-finet
 </details>
 
 ### ⚠️ Tokenizer Warning
-The [roberta-large](https://huggingface.co/
+The [roberta-large](https://huggingface.co/roberta-large) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
 
 In short, it is recommended to preprocess your inference text such that all words and punctuation are separated by a space. Some potential approaches to convert regular text into this format are NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or spaCy [`Doc`](https://spacy.io/api/doc#iter) and join the resulting words with a space.
 
```
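For reference, a minimal sketch of the preprocessing recommended in the tokenizer warning above, using NLTK's `word_tokenize` to space-separate words and punctuation before inference. This example is not part of the commit: the repository id and the `SpanMarkerModel.from_pretrained` / `predict` calls are assumptions based on the SpanMarker library's usual interface.

```python
# Minimal sketch, assuming the model is published under the repository id below
# and that the span_marker library exposes SpanMarkerModel.from_pretrained / .predict.
import nltk
from nltk.tokenize import word_tokenize
from span_marker import SpanMarkerModel

nltk.download("punkt")  # tokenizer data required by word_tokenize

raw_text = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."
# "Paris." becomes "Paris ." so the input matches the spacing style seen during training.
preprocessed = " ".join(word_tokenize(raw_text))

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-roberta-large-fewnerd-fine-super")
for entity in model.predict(preprocessed):
    print(entity["span"], entity["label"], round(entity["score"], 3))
```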