tomaarsen (HF staff) committed
Commit cb20c77
1 Parent(s): a4a6054

Fix broken URLs

Files changed (1):
  README.md +3 -3

README.md CHANGED
@@ -50,14 +50,14 @@ model-index:
 
 # SpanMarker with roberta-large on FewNERD
 
-This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [roberta-large](https://huggingface.co/models/roberta-large) as the underlying encoder. See [train.py](train.py) for the training script.
+This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [roberta-large](https://huggingface.co/roberta-large) as the underlying encoder. See [train.py](train.py) for the training script.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
-- **Encoder:** [roberta-large](https://huggingface.co/models/roberta-large)
+- **Encoder:** [roberta-large](https://huggingface.co/roberta-large)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
 - **Training Dataset:** [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
@@ -179,7 +179,7 @@ trainer.save_model("tomaarsen/span-marker-roberta-large-fewnerd-fine-super-finet
 </details>
 
 ### ⚠️ Tokenizer Warning
-The [roberta-large](https://huggingface.co/models/roberta-large) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
+The [roberta-large](https://huggingface.co/roberta-large) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
 
 In short, it is recommended to preprocess your inference text such that all words and punctuation are separated by a space. Some potential approaches to convert regular text into this format are NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or spaCy [`Doc`](https://spacy.io/api/doc#iter) and join the resulting words with a space.
 
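The preprocessing recommended in the tokenizer warning can be sketched with a small helper. This is a minimal regex-based approximation (the `space_punctuation` name is hypothetical, not part of this repo); the NLTK `word_tokenize` and spaCy tokenizers mentioned in the README handle many more edge cases (contractions, abbreviations, quotes):

```python
import re

def space_punctuation(text: str) -> str:
    """Separate words and punctuation marks with single spaces.

    Rough stand-in for NLTK's word_tokenize: matches runs of word
    characters or single punctuation marks, then rejoins with spaces,
    so "Paris." becomes "Paris ." as the model expects.
    """
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

print(space_punctuation("Amelia flew her plane, a Lockheed Vega, to Paris."))
# Amelia flew her plane , a Lockheed Vega , to Paris .
```

Text preprocessed this way matches the space-separated style the model saw during training on FewNERD.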