|
--- |
|
license: apache-2.0 |
|
language: |
|
- es |
|
- ca |
|
- fr |
|
- pt |
|
- it |
|
- ro |
|
library_name: generic |
|
tags: |
|
- text2text-generation |
|
- punctuation |
|
- fullstop |
|
- truecase |
|
- capitalization |
|
widget: |
|
- text: "hola amigo cómo estás es un día lluvioso hoy" |
|
- text: "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt" |
|
--- |
|
|
|
# Model |
|
This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization) for text in the six most widely spoken Romance languages:
|
|
|
* Spanish |
|
* French |
|
* Portuguese |
|
* Catalan |
|
* Italian |
|
* Romanian |
|
|
|
Together, these languages cover approximately 97% of native speakers of the Romance language family. |
|
|
|
The model comprises a SentencePiece tokenizer, a Transformer encoder, and MLP prediction heads. |
|
|
|
This model predicts the following punctuation per input subtoken: |
|
|
|
* . |
|
* , |
|
* ? |
|
* ¿ |
|
* ACRONYM |
|
|
|
Though acronyms are rare in these languages (relative to English), the special token `ACRONYM` allows fully punctuating tokens such as "`pm`" → "`p.m.`".
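
As an illustration, here is a minimal sketch of how an `ACRONYM` prediction could be applied during post-processing (a hypothetical helper, not the actual implementation in `punctuators`):

```python
def apply_acronym_label(token: str) -> str:
    # Hypothetical post-processing for a subtoken tagged ACRONYM:
    # insert a period after each letter, e.g. "pm" -> "p.m.".
    return "".join(f"{char}." for char in token)

print(apply_acronym_label("pm"))  # p.m.
```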
|
|
|
**Widget notes:** If you use the widget, it will take a minute to load the model, since a "generic" library is used. Further, the widget does not respect multi-line output, so full stop predictions are annotated with "\n".
|
|
|
# Usage |
|
The model is released as a `SentencePiece` tokenizer and an `ONNX` graph. |
|
|
|
The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
|
|
|
```bash |
|
pip install punctuators |
|
``` |
|
|
|
If this package is broken, please let me know in the community tab (I update it for each model and break it a lot!). |
|
|
|
<details open> |
|
|
|
<summary>Example Usage</summary> |
|
|
|
```python |
|
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model.
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_romance")

# Define some input texts to punctuate, at least one per language.
input_texts: List[str] = [
    "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
    "hola amigo cómo estás es un día lluvioso hoy",
    "hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat",
    "ciao amico come va oggi è stata una giornata piovosa",
    "olá amigo como tá indo estava chuvoso hoje",
    "salut l'ami comment ça va il pleuvait aujourd'hui",
    "salut prietene cum stă treaba azi a fost ploios",
]

results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
|
``` |
|
|
|
Exact output may vary based on the model version; here is the current output: |
|
|
|
</details> |
|
|
|
<details open> |
|
|
|
<summary>Expected Output</summary> |
|
|
|
```text |
|
Input: este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt |
|
Outputs: |
|
Este modelo fue entrenado en un GPU A100. |
|
En realidad, no se que dice esta frase lo traduje con NMT. |
|
|
|
Input: hola amigo cómo estás es un día lluvioso hoy |
|
Outputs: |
|
Hola, amigo. |
|
¿Cómo estás? |
|
Es un día lluvioso hoy. |
|
|
|
Input: hola amic com va avui ha estat un dia plujós el català prediu massa puntuació per com s'ha entrenat |
|
Outputs: |
|
Hola, amic. |
|
Com va avui? |
|
Ha estat un dia plujós. |
|
El català prediu massa puntuació per com s'ha entrenat. |
|
|
|
Input: ciao amico come va oggi è stata una giornata piovosa |
|
Outputs: |
|
Ciao amico, come va? |
|
Oggi è stata una giornata piovosa. |
|
|
|
Input: olá amigo como tá indo estava chuvoso hoje |
|
Outputs: |
|
Olá, amigo, como tá indo? |
|
Estava chuvoso hoje. |
|
|
|
Input: salut l'ami comment ça va il pleuvait aujourd'hui |
|
Outputs: |
|
Salut l'ami. |
|
Comment ça va? |
|
Il pleuvait aujourd'hui. |
|
|
|
Input: salut prietene cum stă treaba azi a fost ploios |
|
Outputs: |
|
Salut prietene, cum stă treaba azi? |
|
A fost ploios. |
|
``` |
|
|
|
</details> |
|
|
|
If you prefer that your output not be broken into separate sentences, you can disable sentence boundary detection in the API call:
|
|
|
```python |
|
input_texts: List[str] = [
    "hola amigo cómo estás es un día lluvioso hoy",
]

results: List[str] = m.infer(input_texts, apply_sbd=False)
print(results[0])
|
``` |
|
|
|
Instead of a `List[List[str]]` (a list of output sentences for each input), we get a `List[str]` (one punctuated string per input):
|
|
|
```text |
|
Hola, amigo. ¿Cómo estás? Es un día lluvioso hoy. |
|
``` |
|
|
|
|
|
# Training Data |
|
For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/). |
|
|
|
Catalan is not included in StatMT's News Crawl. For completeness of the Romance language family, ~500k lines of `OpenSubtitles` were used for Catalan instead. As a result, Catalan performance may be sub-par: the model may over-predict punctuation and sentence breaks, which is typical of models trained on OpenSubtitles.
|
|
|
# Training Parameters |
|
This model was trained on examples created by concatenating between 1 and 14 random sentences. The concatenation points became sentence boundary targets, the text was lower-cased to produce true-case targets, and punctuation was removed to create punctuation targets, roughly as sketched below.
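
As a rough sketch of that procedure (illustrative only; the actual data pipeline is not part of this release, and the helper below is a simplification):

```python
import random
import re
from typing import List, Tuple

def make_example(corpus: List[str]) -> Tuple[str, List[str]]:
    """Build one training example from a pool of clean sentences.

    The concatenated, lower-cased, punctuation-free text is the model
    input; the original sentences carry the boundary, casing, and
    punctuation targets.
    """
    k = random.randint(1, min(14, len(corpus)))
    targets = random.sample(corpus, k)
    text = " ".join(targets)
    # Strip the punctuation this model predicts, then lower-case.
    text = re.sub(r"[.,?¿]", "", text).lower()
    return text, targets

text, targets = make_example(
    ["Hola, amigo.", "¿Cómo estás?", "Es un día lluvioso hoy."]
)
print(text)     # e.g. "hola amigo cómo estás es un día lluvioso hoy"
print(targets)  # the sentences whose punctuation/casing are the targets
```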
|
|
|
Batches were built by randomly sampling from each language, so batches were multilingual, but each example is language-homogeneous (i.e., only sentences from the same language are concatenated). Neither language tags nor language-specific paths are used in the graph.
|
|
|
The maximum length during training was 256 subtokens. The `punctuators` package can nonetheless punctuate inputs of any length; this is accomplished behind the scenes by splitting the input into overlapping subsegments of 256 tokens and combining the results.
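
A minimal sketch of that overlapping-window idea (the stride and merge strategy here are assumptions; see the `punctuators` source for the real logic):

```python
from typing import List

def overlapping_windows(
    token_ids: List[int], max_len: int = 256, stride: int = 128
) -> List[List[int]]:
    # Split a long token sequence into overlapping windows of at most
    # max_len tokens; predictions in the overlaps can then be merged,
    # e.g. by preferring the window in which a token is most central.
    if len(token_ids) <= max_len:
        return [token_ids]
    return [
        token_ids[start : start + max_len]
        for start in range(0, len(token_ids) - stride, stride)
    ]

print([len(w) for w in overlapping_windows(list(range(600)))])
# [256, 256, 256, 216]
```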
|
|
|
If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained. |
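
If you do drive the raw graph yourself, here is a hedged sketch using `onnxruntime` and `sentencepiece` (the local file names are assumptions; read the real input/output names from the session rather than guessing them):

```python
import onnxruntime as ort
import sentencepiece as spm

# Assumed local file names; use the artifacts downloaded from this repo.
sp = spm.SentencePieceProcessor(model_file="sp.model")
session = ort.InferenceSession("model.onnx")

# Inspect the graph for the actual input/output names and shapes.
print([(i.name, i.shape) for i in session.get_inputs()])
print([(o.name, o.shape) for o in session.get_outputs()])

ids = sp.EncodeAsIds("hola amigo cómo estás es un día lluvioso hoy")
# Only 256 positional embeddings are trained, so keep inputs at or
# below that length (or window them as sketched above).
assert len(ids) <= 256
```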
|
|
|
# Contact |
|
Contact me at [email protected] with requests or issues, or just let me know on the community tab. |
|
|
|
# Metrics |
|
Test sets were generated with 3,000 lines of held-out data per language (OpenSubtitles for Catalan, News Crawl for all others). |
|
Examples were derived by concatenating 10 sentences per example, removing all punctuation, and lower-casing all letters. |
|
|
|
Since punctuation is subjective (e.g., see the "hola amigo cómo estás" example above, which admits several valid punctuations), punctuation metrics can be misleading.
|
|
|
Also, keep in mind that the data is noisy. Catalan is especially noisy, since it is OpenSubtitles (note that the Catalan test set contains 50 instances of "¿", which should not appear in Catalan).
|
|
|
Note that we call the label "¿" "pre-punctuation", since it is unique in that it appears before words; thus it is predicted separately from the other punctuation tokens.
|
|
|
Generally, periods are easy, commas are harder, question marks are hard, and acronyms are rare and noisy.
|
|
|
Expand any of the following tabs to see metrics for that language. |
|
|
|
|
|
<details> |
|
|
|
<summary>Spanish metrics</summary> |
|
|
|
```text |
|
Pre-punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 99.92 99.97 99.95 572069 |
|
¿ (label_id: 1) 81.93 60.46 69.57 1095 |
|
------------------- |
|
micro avg 99.90 99.90 99.90 573164 |
|
macro avg 90.93 80.22 84.76 573164 |
|
weighted avg 99.89 99.90 99.89 573164 |
|
|
|
Punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 98.70 98.44 98.57 517310 |
|
<ACRONYM> (label_id: 1) 39.68 86.21 54.35 58 |
|
. (label_id: 2) 87.72 90.41 89.04 29267 |
|
, (label_id: 3) 73.17 74.68 73.92 25422 |
|
? (label_id: 4) 69.49 59.26 63.97 1107 |
|
------------------- |
|
micro avg 96.90 96.90 96.90 573164 |
|
macro avg 73.75 81.80 75.97 573164 |
|
weighted avg 96.94 96.90 96.92 573164 |
|
|
|
True-casing report: |
|
label precision recall f1 support |
|
LOWER (label_id: 0) 99.85 99.73 99.79 2164982 |
|
UPPER (label_id: 1) 92.01 95.32 93.64 69437 |
|
------------------- |
|
micro avg 99.60 99.60 99.60 2234419 |
|
macro avg 95.93 97.53 96.71 2234419 |
|
weighted avg 99.61 99.60 99.60 2234419 |
|
|
|
Fullstop report: |
|
label precision recall f1 support |
|
NOSTOP (label_id: 0) 100.00 99.98 99.99 543228 |
|
FULLSTOP (label_id: 1) 99.66 99.93 99.80 32931 |
|
------------------- |
|
micro avg 99.98 99.98 99.98 576159 |
|
macro avg 99.83 99.96 99.89 576159 |
|
weighted avg 99.98 99.98 99.98 576159 |
|
``` |
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
|
<summary>Portuguese metrics</summary> |
|
|
|
```text |
|
Pre-punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 100.00 100.00 100.00 539822 |
|
¿ (label_id: 1) 0.00 0.00 0.00 0 |
|
------------------- |
|
micro avg 100.00 100.00 100.00 539822 |
|
macro avg 100.00 100.00 100.00 539822 |
|
weighted avg 100.00 100.00 100.00 539822 |
|
|
|
Punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 98.77 98.27 98.52 481148 |
|
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 0 |
|
. (label_id: 2) 87.63 90.63 89.11 29090 |
|
, (label_id: 3) 74.44 78.69 76.50 28549 |
|
? (label_id: 4) 66.30 52.27 58.45 1035 |
|
------------------- |
|
micro avg 96.74 96.74 96.74 539822 |
|
macro avg 81.79 79.96 80.65 539822 |
|
weighted avg 96.82 96.74 96.77 539822 |
|
|
|
True-casing report: |
|
label precision recall f1 support |
|
LOWER (label_id: 0) 99.90 99.82 99.86 2082598 |
|
UPPER (label_id: 1) 94.75 97.08 95.90 70555 |
|
------------------- |
|
micro avg 99.73 99.73 99.73 2153153 |
|
macro avg 97.32 98.45 97.88 2153153 |
|
weighted avg 99.73 99.73 99.73 2153153 |
|
|
|
Fullstop report: |
|
label precision recall f1 support |
|
NOSTOP (label_id: 0) 100.00 99.98 99.99 509905 |
|
FULLSTOP (label_id: 1) 99.72 99.98 99.85 32909 |
|
------------------- |
|
micro avg 99.98 99.98 99.98 542814 |
|
macro avg 99.86 99.98 99.92 542814 |
|
weighted avg 99.98 99.98 99.98 542814 |
|
|
|
``` |
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
|
<summary>Romanian metrics</summary> |
|
|
|
```text |
|
Pre-punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 100.00 100.00 100.00 580702 |
|
¿ (label_id: 1) 0.00 0.00 0.00 0 |
|
------------------- |
|
micro avg 100.00 100.00 100.00 580702 |
|
macro avg 100.00 100.00 100.00 580702 |
|
weighted avg 100.00 100.00 100.00 580702 |
|
|
|
Punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 98.56 98.47 98.51 520647 |
|
<ACRONYM> (label_id: 1) 52.00 79.89 63.00 179 |
|
. (label_id: 2) 87.29 89.37 88.32 29852 |
|
, (label_id: 3) 75.26 74.69 74.97 29218 |
|
? (label_id: 4) 60.73 55.46 57.98 806 |
|
------------------- |
|
micro avg 96.74 96.74 96.74 580702 |
|
macro avg 74.77 79.57 76.56 580702 |
|
weighted avg 96.74 96.74 96.74 580702 |
|
|
|
Truecasing report: |
|
label precision recall f1 support |
|
LOWER (label_id: 0) 99.84 99.75 99.79 2047297 |
|
UPPER (label_id: 1) 93.56 95.65 94.59 77424 |
|
------------------- |
|
micro avg 99.60 99.60 99.60 2124721 |
|
macro avg 96.70 97.70 97.19 2124721 |
|
weighted avg 99.61 99.60 99.60 2124721 |
|
|
|
Fullstop report: |
|
label precision recall f1 support |
|
NOSTOP (label_id: 0) 100.00 99.96 99.98 550858 |
|
FULLSTOP (label_id: 1) 99.26 99.94 99.60 32833 |
|
------------------- |
|
micro avg 99.95 99.95 99.95 583691 |
|
macro avg 99.63 99.95 99.79 583691 |
|
weighted avg 99.96 99.95 99.96 583691 |
|
|
|
``` |
|
</details> |
|
|
|
<details> |
|
|
|
<summary>Italian metrics</summary> |
|
|
|
```text |
|
Pre-punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 100.00 100.00 100.00 577636 |
|
¿ (label_id: 1) 0.00 0.00 0.00 0 |
|
------------------- |
|
micro avg 100.00 100.00 100.00 577636 |
|
macro avg 100.00 100.00 100.00 577636 |
|
weighted avg 100.00 100.00 100.00 577636 |
|
|
|
Punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 98.10 97.73 97.91 522727 |
|
<ACRONYM> (label_id: 1) 41.76 48.72 44.97 78 |
|
. (label_id: 2) 81.71 86.70 84.13 28881 |
|
, (label_id: 3) 61.72 63.24 62.47 24703 |
|
? (label_id: 4) 62.55 41.78 50.10 1247 |
|
------------------- |
|
micro avg 95.58 95.58 95.58 577636 |
|
macro avg 69.17 67.63 67.92 577636 |
|
weighted avg 95.64 95.58 95.60 577636 |
|
|
|
Truecasing report: |
|
label precision recall f1 support |
|
LOWER (label_id: 0) 99.76 99.70 99.73 2160781 |
|
UPPER (label_id: 1) 91.18 92.76 91.96 72471 |
|
------------------- |
|
micro avg 99.47 99.47 99.47 2233252 |
|
macro avg 95.47 96.23 95.85 2233252 |
|
weighted avg 99.48 99.47 99.48 2233252 |
|
|
|
Fullstop report: |
|
label precision recall f1 support |
|
NOSTOP (label_id: 0) 99.99 99.98 99.99 547875 |
|
FULLSTOP (label_id: 1) 99.72 99.91 99.82 32742 |
|
------------------- |
|
micro avg 99.98 99.98 99.98 580617 |
|
macro avg 99.86 99.95 99.90 580617 |
|
weighted avg 99.98 99.98 99.98 580617 |
|
``` |
|
</details> |
|
|
|
<details> |
|
|
|
<summary>French metrics</summary> |
|
|
|
```text |
|
Pre-punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 100.00 100.00 100.00 614010 |
|
¿ (label_id: 1) 0.00 0.00 0.00 0 |
|
------------------- |
|
micro avg 100.00 100.00 100.00 614010 |
|
macro avg 100.00 100.00 100.00 614010 |
|
weighted avg 100.00 100.00 100.00 614010 |
|
|
|
Punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 98.72 98.57 98.65 556366 |
|
<ACRONYM> (label_id: 1) 38.46 71.43 50.00 49 |
|
. (label_id: 2) 86.41 88.56 87.47 28969 |
|
, (label_id: 3) 72.15 72.80 72.47 27183 |
|
? (label_id: 4) 75.81 67.78 71.57 1443 |
|
------------------- |
|
micro avg 96.88 96.88 96.88 614010 |
|
macro avg 74.31 79.83 76.03 614010 |
|
weighted avg 96.91 96.88 96.89 614010 |
|
|
|
Truecasing report: |
|
label precision recall f1 support |
|
LOWER (label_id: 0) 99.84 99.80 99.82 2127174 |
|
UPPER (label_id: 1) 93.72 94.73 94.22 66496 |
|
------------------- |
|
micro avg 99.65 99.65 99.65 2193670 |
|
macro avg 96.78 97.27 97.02 2193670 |
|
weighted avg 99.65 99.65 99.65 2193670 |
|
|
|
Fullstop report: |
|
label precision recall f1 support |
|
NOSTOP (label_id: 0) 99.99 99.94 99.97 584331 |
|
FULLSTOP (label_id: 1) 98.92 99.90 99.41 32661 |
|
------------------- |
|
micro avg 99.94 99.94 99.94 616992 |
|
macro avg 99.46 99.92 99.69 616992 |
|
weighted avg 99.94 99.94 99.94 616992 |
|
|
|
``` |
|
</details> |
|
|
|
<details> |
|
|
|
<summary>Catalan metrics</summary> |
|
|
|
```text |
|
Pre-punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 99.97 100.00 99.98 143817 |
|
¿ (label_id: 1) 0.00 0.00 0.00 50 |
|
------------------- |
|
micro avg 99.97 99.97 99.97 143867 |
|
macro avg 49.98 50.00 49.99 143867 |
|
weighted avg 99.93 99.97 99.95 143867 |
|
|
|
Punctuation report: |
|
label precision recall f1 support |
|
<NULL> (label_id: 0) 97.61 97.73 97.67 119040 |
|
<ACRONYM> (label_id: 1) 0.00 0.00 0.00 28 |
|
. (label_id: 2) 74.02 79.46 76.65 15282 |
|
, (label_id: 3) 60.88 50.75 55.36 5836 |
|
? (label_id: 4) 64.94 60.28 62.52 3681 |
|
------------------- |
|
micro avg 92.90 92.90 92.90 143867 |
|
macro avg 59.49 57.64 58.44 143867 |
|
weighted avg 92.76 92.90 92.80 143867 |
|
|
|
Truecasing report: |
|
label precision recall f1 support |
|
LOWER (label_id: 0) 99.81 99.83 99.82 422395 |
|
UPPER (label_id: 1) 97.09 96.81 96.95 24854 |
|
------------------- |
|
micro avg 99.66 99.66 99.66 447249 |
|
macro avg 98.45 98.32 98.39 447249 |
|
weighted avg 99.66 99.66 99.66 447249 |
|
|
|
Fullstop report: |
|
label precision recall f1 support |
|
NOSTOP (label_id: 0) 99.93 99.63 99.78 123867 |
|
FULLSTOP (label_id: 1) 97.97 99.59 98.77 22000 |
|
------------------- |
|
micro avg 99.63 99.63 99.63 145867 |
|
macro avg 98.95 99.61 99.28 145867 |
|
weighted avg 99.63 99.63 99.63 145867 |
|
|
|
``` |
|
</details> |