---
license: apache-2.0
language:
- es
- ca
- fr
- pt
- it
- ro
library_name: generic
tags:
- text2text-generation
- punctuation
- fullstop
- truecase
- capitalization
widget:
- text: "hola amigo cómo estás es un día lluvioso hoy"
- text: "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt"
---
# Model
This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization)
for text in the 6 most popular Romance languages:
* Spanish
* French
* Portuguese
* Catalan
* Italian
* Romanian
Together, these languages cover approximately 97% of native speakers of the Romance language family.
The model comprises a SentencePiece tokenizer, a Transformer encoder, and MLP prediction heads.
This model predicts the following punctuation per input subtoken:
* .
* ,
* ?
* ¿
* ACRONYM
Though rare in these languages (relative to English), the special token `ACRONYM` allows fully punctuating tokens such as "`pm`" → "`p.m.`".
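As a rough illustration of what these labels mean, the sketch below applies a single predicted label to a plain token. The function name `apply_punct_label` and the label strings are hypothetical, chosen for this example; the model itself predicts per subtoken, and this is not the `punctuators` implementation.

```python
def apply_punct_label(token: str, label: str) -> str:
    """Apply one predicted punctuation label to a token (illustrative only)."""
    if label == "ACRONYM":
        # Fully punctuate the token: a period after every character,
        # so "pm" becomes "p.m."
        return "".join(ch + "." for ch in token)
    if label in {".", ",", "?"}:
        # Ordinary punctuation is appended after the token.
        return token + label
    if label == "¿":
        # The inverted question mark is placed before the token.
        return "¿" + token
    # Any other label (e.g. a null label) leaves the token unchanged.
    return token


print(apply_punct_label("pm", "ACRONYM"))
print(apply_punct_label("cómo", "¿"))
```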
**Widget notes** If you use the widget, it'll take a minute to load the model since a "generic" library is used.
Further, the widget does not respect multi-line output, so fullstop predictions are annotated with "\n".
# Usage
The model is released as a `SentencePiece` tokenizer and an `ONNX` graph.
The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
```bash
pip install punctuators
```
If this package is broken, please let me know in the community tab (I update it for each model and break it a lot!).
<details open>
<summary>Example Usage</summary>
```python
from typing import List
from punctuators.models import PunctCapSegModelONNX
# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_romance")
# Define some input texts to punctuate, at least one per language
input_texts: List[str] = [
    "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
    "hola amigo cómo estás es un día lluvioso hoy",
    "hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat",
    "ciao amico come va oggi è stata una giornata piovosa",
    "olá amigo como tá indo estava chuvoso hoje",
    "salut l'ami comment ça va il pleuvait aujourd'hui",
    "salut prietene cum stă treaba azi a fost ploios",
]
# One list of output sentences is returned per input text
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```
Exact output may vary based on the model version; here is the current output:
</details>
<details open>
<summary>Expected Output</summary>
```text
Input: este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt
Outputs:
    Este modelo fue entrenado en un GPU A100.
    En realidad, no se que dice esta frase lo traduje con NMT.

Input: hola amigo cómo estás es un día lluvioso hoy
Outputs:
    Hola, amigo.
    ¿Cómo estás?
    Es un día lluvioso hoy.

Input: hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat
Outputs:
    Hola, amic.
    Com va avui?
    Ha estat un dia plujós.
    El català prediu massa puntuació per com s'ha entrenat.

Input: ciao amico come va oggi è stata una giornata piovosa
Outputs:
    Ciao amico, come va?
    Oggi è stata una giornata piovosa.

Input: olá amigo como tá indo estava chuvoso hoje
Outputs:
    Olá, amigo, como tá indo?
    Estava chuvoso hoje.

Input: salut l'ami comment ça va il pleuvait aujourd'hui
Outputs:
    Salut l'ami.
    Comment ça va?
    Il pleuvait aujourd'hui.

Input: salut prietene cum stă treaba azi a fost ploios
Outputs:
    Salut prietene, cum stă treaba azi?
    A fost ploios.
</details>
If you prefer your output to not be broken into separate sentences, you can disable sentence boundary detection
in the API call:
```python
input_texts: List[str] = [
    "hola amigo cómo estás es un día lluvioso hoy",
]
results: List[str] = m.infer(input_texts, apply_sbd=False)
print(results[0])
```
Instead of a `List[List[str]]` (a list of output sentences for each input), we get a `List[str]` (one output
string per input, containing all of its sentences):
```text
Hola, amigo. ¿Cómo estás? Es un día lluvioso hoy.
```
# Training Data
For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).
Catalan is not included in StatMT's News Crawl.
For completeness of the Romance language family, ~500k lines of `OpenSubtitles` were used for Catalan.
Due to this, Catalan performance may be sub-par and may over-predict punctuation and sentence breaks, which is typical of OpenSubtitles.
# Training Parameters
This model was trained by concatenating between 1 and 14 random sentences.
The concatenation points became sentence boundary targets,
text was lower-cased to produce true-case targets,
and punctuation was removed to create punctuation targets.
Batches were built by randomly sampling from each language.
Each example is language-homogeneous (i.e., we only concatenate sentences from the same language).
Batches were multilingual. Neither language tags nor language-specific paths are utilized in the graph.
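The example construction described above can be sketched roughly as follows. This is a simplified, word-level illustration under stated assumptions: the names `WordTarget`, `sample_sentences`, and `make_example` are hypothetical, whitespace splitting stands in for the SentencePiece tokenizer, and the actual training pipeline is not shown in this card.

```python
import random
from typing import List, NamedTuple, Optional, Tuple


class WordTarget(NamedTuple):
    pre_punct: Optional[str]   # "¿" or None
    post_punct: Optional[str]  # ".", "," or "?" or None
    upper_mask: List[bool]     # per character: was it upper-case in the source?
    full_stop: bool            # True if a sentence ends after this word


def sample_sentences(corpus: List[str], rng: random.Random,
                     max_sentences: int = 14) -> List[str]:
    """Pick between 1 and `max_sentences` random sentences from one language."""
    n = rng.randint(1, min(max_sentences, len(corpus)))
    return rng.sample(corpus, n)


def make_example(sentences: List[str]) -> Tuple[str, List[WordTarget]]:
    """Build a lower-cased, unpunctuated input and its per-word targets."""
    words: List[str] = []
    targets: List[WordTarget] = []
    for sentence in sentences:
        first_index = len(targets)
        for token in sentence.split():
            pre = "¿" if token.startswith("¿") else None
            core = token.lstrip("¿")
            post = core[-1] if core and core[-1] in ".,?" else None
            core = core.rstrip(".,?")
            if not core:
                continue  # token consisted of punctuation only
            words.append(core.lower())
            targets.append(
                WordTarget(pre, post, [c.isupper() for c in core], False)
            )
        # The concatenation point becomes a sentence boundary target.
        if len(targets) > first_index:
            targets[-1] = targets[-1]._replace(full_stop=True)
    return " ".join(words), targets


text, targets = make_example(["Hola, amigo.", "¿Cómo estás?"])
print(text)
```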
The maximum length during training was 256 subtokens.
The `punctuators` package can punctuate inputs of any length.
This is accomplished behind the scenes by splitting the input into overlapping subsegments of 256 tokens, and combining the results.
If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.
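If you run the raw graph yourself, one way to handle long inputs is an overlapping sliding window, sketched below. This is not the `punctuators` implementation: the function names, the overlap of 16 tokens, and the rule of splitting each overlap in half between neighbouring segments are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def split_with_overlap(token_ids: List[int], max_len: int = 256,
                       overlap: int = 16) -> List[Tuple[int, List[int]]]:
    """Split token IDs into (start_offset, segment) pairs of at most `max_len`."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    stride = max_len - overlap
    segments: List[Tuple[int, List[int]]] = []
    start = 0
    while True:
        segments.append((start, token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break
        start += stride
    return segments


def merge_predictions(segment_preds: List[Tuple[int, List[str]]],
                      total_len: int, overlap: int = 16) -> List[Optional[str]]:
    """Combine per-segment predictions into one prediction per token.

    Inside each overlap, the first half is taken from the earlier segment and
    the second half from the later one, so edge positions with the least
    context are avoided.
    """
    merged: List[Optional[str]] = [None] * total_len
    half = overlap // 2
    for index, (start, preds) in enumerate(segment_preds):
        skip = 0 if index == 0 else half
        for offset in range(skip, len(preds)):
            merged[start + offset] = preds[offset]
    return merged
```

Each segment would be passed through the model separately, and the per-token predictions fed to `merge_predictions` together with the start offsets returned by `split_with_overlap`.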
# Contact
Contact me at [email protected] with requests or issues, or just let me know on the community tab.
# Metrics
Test sets were generated with 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all others).
Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all letters.
Since punctuation is subjective (e.g., see "hello friend how's it going" in the above examples) punctuation metrics can be misleading.
Also, keep in mind that the data is noisy. Catalan is especially noisy, since it's OpenSubtitles (note how Catalan has 50 instances of "¿", which should not appear).
Note that we call the label "¿" "pre-punctuation" since it is unique in that it appears before words, and thus
we predict it separately from the other punctuation tokens.
Generally, periods are easy, commas are harder, question marks are hard, and acronyms are rare and noisy.
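For reference, the columns in the reports below follow the standard definitions of precision, recall, and F1 (shown as percentages), with support being the number of reference labels. The sketch below computes them from aligned label sequences; the function name `per_label_report` is hypothetical, and the evaluation tooling actually used for this card is not stated here.

```python
from typing import Dict, List, Tuple


def per_label_report(targets: List[str],
                     preds: List[str]) -> Dict[str, Tuple[float, float, float, int]]:
    """Return {label: (precision, recall, f1, support)} with scores in percent."""
    report: Dict[str, Tuple[float, float, float, int]] = {}
    for label in sorted(set(targets) | set(preds)):
        pairs = list(zip(targets, preds))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
        recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1, tp + fn)
    return report


targets = [".", ",", "<NULL>", "?", "."]
preds = [".", "<NULL>", "<NULL>", ".", "."]
for label, scores in per_label_report(targets, preds).items():
    print(label, scores)
```

Because the macro average weights every label equally, a rare label with a low score (such as acronyms or "¿") pulls it down far more than it affects the micro or weighted averages.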
Expand any of the following tabs to see metrics for that language.
<details>
<summary>Spanish metrics</summary>
```text
Pre-punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)         99.92     99.97     99.95    572069
¿ (label_id: 1)              81.93     60.46     69.57      1095
-------------------
micro avg                    99.90     99.90     99.90    573164
macro avg                    90.93     80.22     84.76    573164
weighted avg                 99.89     99.90     99.89    573164

Punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)         98.70     98.44     98.57    517310
<ACRONYM> (label_id: 1)      39.68     86.21     54.35        58
. (label_id: 2)              87.72     90.41     89.04     29267
, (label_id: 3)              73.17     74.68     73.92     25422
? (label_id: 4)              69.49     59.26     63.97      1107
-------------------
micro avg                    96.90     96.90     96.90    573164
macro avg                    73.75     81.80     75.97    573164
weighted avg                 96.94     96.90     96.92    573164

True-casing report:
label                    precision    recall        f1   support
LOWER (label_id: 0)          99.85     99.73     99.79   2164982
UPPER (label_id: 1)          92.01     95.32     93.64     69437
-------------------
micro avg                    99.60     99.60     99.60   2234419
macro avg                    95.93     97.53     96.71   2234419
weighted avg                 99.61     99.60     99.60   2234419

Fullstop report:
label                    precision    recall        f1   support
NOSTOP (label_id: 0)        100.00     99.98     99.99    543228
FULLSTOP (label_id: 1)       99.66     99.93     99.80     32931
-------------------
micro avg                    99.98     99.98     99.98    576159
macro avg                    99.83     99.96     99.89    576159
weighted avg                 99.98     99.98     99.98    576159
```
</details>
<details>
<summary>Portuguese metrics</summary>
```text
Pre-punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)        100.00    100.00    100.00    539822
¿ (label_id: 1)               0.00      0.00      0.00         0
-------------------
micro avg                   100.00    100.00    100.00    539822
macro avg                   100.00    100.00    100.00    539822
weighted avg                100.00    100.00    100.00    539822

Punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)         98.77     98.27     98.52    481148
<ACRONYM> (label_id: 1)       0.00      0.00      0.00         0
. (label_id: 2)              87.63     90.63     89.11     29090
, (label_id: 3)              74.44     78.69     76.50     28549
? (label_id: 4)              66.30     52.27     58.45      1035
-------------------
micro avg                    96.74     96.74     96.74    539822
macro avg                    81.79     79.96     80.65    539822
weighted avg                 96.82     96.74     96.77    539822

True-casing report:
label                    precision    recall        f1   support
LOWER (label_id: 0)          99.90     99.82     99.86   2082598
UPPER (label_id: 1)          94.75     97.08     95.90     70555
-------------------
micro avg                    99.73     99.73     99.73   2153153
macro avg                    97.32     98.45     97.88   2153153
weighted avg                 99.73     99.73     99.73   2153153

Fullstop report:
label                    precision    recall        f1   support
NOSTOP (label_id: 0)        100.00     99.98     99.99    509905
FULLSTOP (label_id: 1)       99.72     99.98     99.85     32909
-------------------
micro avg                    99.98     99.98     99.98    542814
macro avg                    99.86     99.98     99.92    542814
weighted avg                 99.98     99.98     99.98    542814
```
</details>
<details>
<summary>Romanian metrics</summary>
```text
Pre-punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)        100.00    100.00    100.00    580702
¿ (label_id: 1)               0.00      0.00      0.00         0
-------------------
micro avg                   100.00    100.00    100.00    580702
macro avg                   100.00    100.00    100.00    580702
weighted avg                100.00    100.00    100.00    580702

Punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)         98.56     98.47     98.51    520647
<ACRONYM> (label_id: 1)      52.00     79.89     63.00       179
. (label_id: 2)              87.29     89.37     88.32     29852
, (label_id: 3)              75.26     74.69     74.97     29218
? (label_id: 4)              60.73     55.46     57.98       806
-------------------
micro avg                    96.74     96.74     96.74    580702
macro avg                    74.77     79.57     76.56    580702
weighted avg                 96.74     96.74     96.74    580702

Truecasing report:
label                    precision    recall        f1   support
LOWER (label_id: 0)          99.84     99.75     99.79   2047297
UPPER (label_id: 1)          93.56     95.65     94.59     77424
-------------------
micro avg                    99.60     99.60     99.60   2124721
macro avg                    96.70     97.70     97.19   2124721
weighted avg                 99.61     99.60     99.60   2124721

Fullstop report:
label                    precision    recall        f1   support
NOSTOP (label_id: 0)        100.00     99.96     99.98    550858
FULLSTOP (label_id: 1)       99.26     99.94     99.60     32833
-------------------
micro avg                    99.95     99.95     99.95    583691
macro avg                    99.63     99.95     99.79    583691
weighted avg                 99.96     99.95     99.96    583691
```
</details>
<details>
<summary>Italian metrics</summary>
```text
Pre-punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)        100.00    100.00    100.00    577636
¿ (label_id: 1)               0.00      0.00      0.00         0
-------------------
micro avg                   100.00    100.00    100.00    577636
macro avg                   100.00    100.00    100.00    577636
weighted avg                100.00    100.00    100.00    577636

Punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)         98.10     97.73     97.91    522727
<ACRONYM> (label_id: 1)      41.76     48.72     44.97        78
. (label_id: 2)              81.71     86.70     84.13     28881
, (label_id: 3)              61.72     63.24     62.47     24703
? (label_id: 4)              62.55     41.78     50.10      1247
-------------------
micro avg                    95.58     95.58     95.58    577636
macro avg                    69.17     67.63     67.92    577636
weighted avg                 95.64     95.58     95.60    577636

Truecasing report:
label                    precision    recall        f1   support
LOWER (label_id: 0)          99.76     99.70     99.73   2160781
UPPER (label_id: 1)          91.18     92.76     91.96     72471
-------------------
micro avg                    99.47     99.47     99.47   2233252
macro avg                    95.47     96.23     95.85   2233252
weighted avg                 99.48     99.47     99.48   2233252

Fullstop report:
label                    precision    recall        f1   support
NOSTOP (label_id: 0)         99.99     99.98     99.99    547875
FULLSTOP (label_id: 1)       99.72     99.91     99.82     32742
-------------------
micro avg                    99.98     99.98     99.98    580617
macro avg                    99.86     99.95     99.90    580617
weighted avg                 99.98     99.98     99.98    580617
```
</details>
<details>
<summary>French metrics</summary>
```text
Pre-punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)        100.00    100.00    100.00    614010
¿ (label_id: 1)               0.00      0.00      0.00         0
-------------------
micro avg                   100.00    100.00    100.00    614010
macro avg                   100.00    100.00    100.00    614010
weighted avg                100.00    100.00    100.00    614010

Punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)         98.72     98.57     98.65    556366
<ACRONYM> (label_id: 1)      38.46     71.43     50.00        49
. (label_id: 2)              86.41     88.56     87.47     28969
, (label_id: 3)              72.15     72.80     72.47     27183
? (label_id: 4)              75.81     67.78     71.57      1443
-------------------
micro avg                    96.88     96.88     96.88    614010
macro avg                    74.31     79.83     76.03    614010
weighted avg                 96.91     96.88     96.89    614010

Truecasing report:
label                    precision    recall        f1   support
LOWER (label_id: 0)          99.84     99.80     99.82   2127174
UPPER (label_id: 1)          93.72     94.73     94.22     66496
-------------------
micro avg                    99.65     99.65     99.65   2193670
macro avg                    96.78     97.27     97.02   2193670
weighted avg                 99.65     99.65     99.65   2193670

Fullstop report:
label                    precision    recall        f1   support
NOSTOP (label_id: 0)         99.99     99.94     99.97    584331
FULLSTOP (label_id: 1)       98.92     99.90     99.41     32661
-------------------
micro avg                    99.94     99.94     99.94    616992
macro avg                    99.46     99.92     99.69    616992
weighted avg                 99.94     99.94     99.94    616992
```
</details>
<details>
<summary>Catalan metrics</summary>
```text
Pre-punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)         99.97    100.00     99.98    143817
¿ (label_id: 1)               0.00      0.00      0.00        50
-------------------
micro avg                    99.97     99.97     99.97    143867
macro avg                    49.98     50.00     49.99    143867
weighted avg                 99.93     99.97     99.95    143867

Punctuation report:
label                    precision    recall        f1   support
<NULL> (label_id: 0)         97.61     97.73     97.67    119040
<ACRONYM> (label_id: 1)       0.00      0.00      0.00        28
. (label_id: 2)              74.02     79.46     76.65     15282
, (label_id: 3)              60.88     50.75     55.36      5836
? (label_id: 4)              64.94     60.28     62.52      3681
-------------------
micro avg                    92.90     92.90     92.90    143867
macro avg                    59.49     57.64     58.44    143867
weighted avg                 92.76     92.90     92.80    143867

Truecasing report:
label                    precision    recall        f1   support
LOWER (label_id: 0)          99.81     99.83     99.82    422395
UPPER (label_id: 1)          97.09     96.81     96.95     24854
-------------------
micro avg                    99.66     99.66     99.66    447249
macro avg                    98.45     98.32     98.39    447249
weighted avg                 99.66     99.66     99.66    447249

Fullstop report:
label                    precision    recall        f1   support
NOSTOP (label_id: 0)         99.93     99.63     99.78    123867
FULLSTOP (label_id: 1)       97.97     99.59     98.77     22000
-------------------
micro avg                    99.63     99.63     99.63    145867
macro avg                    98.95     99.61     99.28    145867
weighted avg                 99.63     99.63     99.63    145867
```
</details>