|
---
license: apache-2.0
language:
- es
- ca
- fr
- pt
- it
- ro
library_name: generic
tags:
- text2text-generation
- punctuation
- fullstop
- truecase
- capitalization
widget:
- text: "hola amigo cómo estás es un día lluvioso hoy"
- text: "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt"
---
|
|
|
# Model |
|
This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization)
for text in the six most widely spoken Romance languages:
|
|
|
* Spanish |
|
* French |
|
* Portuguese |
|
* Catalan |
|
* Italian |
|
* Romanian |
|
|
|
Together, these languages cover approximately 97% of native speakers of the Romance language family. |
|
|
|
This model predicts the following punctuation per input subtoken: |
|
|
|
* . |
|
* , |
|
* ? |
|
* ¿ |
|
* ACRONYM |
|
|
|
Though acronyms are rare in these languages (relative to English), the special token `ACRONYM` allows fully punctuating tokens such as "`pm`" → "`p.m.`".
|
|
|
# Usage |
|
The model is released as a `SentencePiece` tokenizer and an `ONNX` graph. |
|
|
|
The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
|
|
|
```bash
pip install punctuators
```
|
|
|
If this package is broken, please let me know in the community tab (I update it for each model and break it a lot!). |
|
|
|
<details open> |
|
|
|
<summary>Example Usage</summary> |
|
|
|
```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model.
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_romance")

# Define some input texts to punctuate
input_texts: List[str] = [
    "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
    "hola amigo cómo estás es un día lluvioso hoy",
]

results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```
|
|
|
Exact output may vary based on the model version; the current output is shown in the next section.
|
|
|
</details> |
|
|
|
<details open> |
|
|
|
<summary>Expected Output</summary> |
|
|
|
```text |
|
``` |
|
|
|
</details> |
|
|
|
|
|
# Training Data |
|
For all languages except Catalan, this model was trained on ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).
|
|
|
Catalan is not included in StatMT's News Crawl.
To round out the Romance language family, ~500k lines of `OpenSubtitles` data were used for Catalan.
As a result, Catalan performance may be sub-par: the model may over-predict punctuation and sentence breaks in Catalan, which is typical of models trained on `OpenSubtitles`.
|
|
|
# Training Parameters |
|
This model was trained on examples built by concatenating between 1 and 14 random sentences.
The concatenation points became sentence-boundary targets, the text was lower-cased to produce true-case targets, and punctuation was removed to create punctuation targets.
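
As a rough illustration of that procedure, here is a minimal sketch of how such examples and targets could be derived from clean text. This is not the actual training code: the function, the word-level (rather than subtoken-level) targets, and the label encoding are all illustrative assumptions.

```python
import random
from typing import List, Tuple

PUNCT = ".,?¿"

def make_example(corpus: List[str]) -> Tuple[str, List[str], List[bool], List[bool]]:
    """Build one hypothetical training example from clean, punctuated sentences.

    Returns the lower-cased, punctuation-free input text plus per-word targets:
    the punctuation to restore, whether a sentence boundary follows the word,
    and whether the word was capitalized.
    """
    # Concatenate between 1 and 14 random sentences of the same language.
    sentences = random.sample(corpus, k=random.randint(1, min(14, len(corpus))))
    words: List[str] = []
    punct_targets: List[str] = []      # e.g. "." or "," ("" = no punctuation)
    boundary_targets: List[bool] = []  # True at concatenation points
    case_targets: List[bool] = []      # True if the word was capitalized
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            stripped = token.strip(PUNCT)
            punct_targets.append(token[-1] if token[-1] in PUNCT else "")
            boundary_targets.append(i == len(tokens) - 1)
            case_targets.append(stripped[:1].isupper())
            words.append(stripped.lower())
    return " ".join(words), punct_targets, boundary_targets, case_targets

# Example: two short "sentences" stand in for a real corpus.
text, punct, bounds, case = make_example(["Hola amigo.", "¿Cómo estás?"])
print(text)  # e.g. "cómo estás hola amigo" (sentence order is random)
```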
|
|
|
Batches were built by randomly sampling from each language.
Each example is language-homogeneous (i.e., only sentences from the same language are concatenated), but batches were multilingual.
Neither language tags nor language-specific paths are used in the graph.
|
|
|
The maximum length during training was 256 subtokens.
The `punctuators` package can nonetheless punctuate inputs of any length: behind the scenes, the input is split into overlapping subsegments of 256 tokens and the results are combined.
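
The exact splitting and merging logic lives in the `punctuators` package, but the idea can be sketched as follows. The 64-token overlap and the merge heuristic described in the comments are assumptions; only the 256-token maximum comes from the model.

```python
from typing import List

def overlapping_windows(ids: List[int], max_len: int = 256, overlap: int = 64) -> List[List[int]]:
    """Split a long token-ID sequence into overlapping windows of at most max_len.

    Consecutive windows share `overlap` tokens. A merge step (not shown) can then
    reconcile predictions in the shared regions, e.g. by preferring the prediction
    made farther from a window edge, where the model has more context.
    """
    if len(ids) <= max_len:
        return [ids]
    stride = max_len - overlap
    windows = []
    for start in range(0, len(ids), stride):
        windows.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
    return windows

# A 600-token input becomes windows covering [0:256], [192:448], [384:600].
print([len(w) for w in overlapping_windows(list(range(600)))])  # [256, 256, 216]
```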
|
|
|
If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained. |
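
For reference, a minimal sketch of running the raw graph with `onnxruntime` and `sentencepiece` might look like the following. The file names, the single token-ID input, and the output layout are assumptions here; inspect the actual graph's inputs and outputs before relying on them.

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# File names are assumptions; use the tokenizer and graph shipped with this repo.
sp = spm.SentencePieceProcessor(model_file="sp.model")
session = ort.InferenceSession("model.onnx")
print([i.name for i in session.get_inputs()])  # check the real input names

ids = sp.encode("hola amigo cómo estás es un día lluvioso hoy")
ids = ids[:256]  # stay within the 256 trained positional embeddings

# Assumes a single int64 token-ID input; the outputs would be punctuation,
# true-case, and sentence-boundary logits in some graph-specific order.
inputs = {session.get_inputs()[0].name: np.array([ids], dtype=np.int64)}
outputs = session.run(None, inputs)
```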
|
|
|
# Metrics |
|
|