license: apache-2.0
language:
- es
- ca
- fr
- pt
- it
- ro
library_name: generic
tags:
- text2text-generation
- punctuation
- fullstop
- truecase
- capitalization
widget:
- text: hola amigo cómo estás es un día lluvioso hoy
- text: >-
este modelo fue entrenado en un gpu a100 en realidad no se que dice esta
frase lo traduje con nmt
Model
This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization) for text in the six most widely spoken Romance languages:
- Spanish
- French
- Portuguese
- Catalan
- Italian
- Romanian
Together, these languages cover approximately 97% of native speakers of the Romance language family.
This model predicts the following punctuation per input subtoken:
- .
- ,
- ?
- ¿
- ACRONYM
Though rare in these languages (relative to English), the special token ACRONYM allows fully punctuating tokens such as "pm" → "p.m.".
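To make the label set above concrete, here is a minimal sketch of how one predicted label might be applied to a token. The function name `apply_punctuation` is hypothetical and is not part of the punctuators package. It also assumes, for simplicity, that labels are applied to whole tokens rather than subtokens, and that "¿" attaches before the token while the other marks attach after it.

```python
def apply_punctuation(token: str, label: str) -> str:
    """Apply one predicted punctuation label to a single token (illustrative only)."""
    if label == "ACRONYM":
        # Put a period after every character: "pm" -> "p.m."
        return "".join(ch + "." for ch in token)
    if label == "¿":
        # Assumption: the inverted question mark is placed before the token.
        return "¿" + token
    if label in {".", ",", "?"}:
        # Assumption: these marks are placed after the token.
        return token + label
    # Any other label (e.g. a "no punctuation" class) leaves the token unchanged.
    return token


print(apply_punctuation("pm", "ACRONYM"))  # p.m.
print(apply_punctuation("hoy", "."))       # hoy.
```

A single ACRONYM class avoids having to predict a separate period after each character of an abbreviation.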
Usage
The model is released as a SentencePiece tokenizer and an ONNX graph.
The easiest way to use this model is to install the punctuators package:
pip install punctuators
If this package is broken, please let me know in the community tab (I update it for each model and break it a lot!).
Example Usage
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_romance")

# Define some input texts to punctuate
input_texts: List[str] = [
    "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
    "hola amigo cómo estás es un día lluvioso hoy",
]
# One list of output sentences per input text
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
Exact output may vary based on the model version; here is the current output:
Expected Output
Training Data
For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's News Crawl.
Catalan is not included in StatMT's News Crawl.
For completeness of the Romance language family, ~500k lines of OpenSubtitles were used for Catalan.
Due to this, Catalan performance may be sub-par, and the model may over-predict punctuation and sentence breaks, which is typical of OpenSubtitles.
Training Parameters
This model was trained by concatenating between 1 and 14 random sentences. The concatenation points became sentence boundary targets, text was lower-cased to produce true-case targets, and punctuation was removed to create punctuation targets.
Batches were built by randomly sampling from each language. Each example is language-homogeneous (i.e., we only concatenate sentences from the same language), but batches were multilingual. Neither language tags nor language-specific paths are used in the graph.
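The example construction described above can be sketched roughly as follows. This is not the actual training code: `make_example` and its parameters are hypothetical names, the stripped characters are limited to the punctuation marks the model predicts, and boundaries are recorded at the word level rather than per subtoken. All sentences passed in are assumed to come from a single language.

```python
import random
from typing import List, Tuple

# Punctuation characters the model predicts (see the list earlier in this card).
PUNCT = set(".,?¿")


def make_example(
    sentences: List[str], rng: random.Random, min_sents: int = 1, max_sents: int = 14
) -> Tuple[str, List[str], List[int]]:
    """Build one (model input, reference sentences, boundary indices) triple."""
    # Pick between min_sents and max_sents random sentences from a non-empty pool.
    n = rng.randint(min_sents, min(max_sents, len(sentences)))
    chosen = rng.sample(sentences, n)
    # The model input is lower-cased with punctuation removed; the original
    # sentences serve as the true-case and punctuation references.
    cleaned = []
    for sent in chosen:
        text = "".join(ch for ch in sent.lower() if ch not in PUNCT)
        cleaned.append(" ".join(text.split()))
    # Each concatenation point becomes a sentence boundary target,
    # stored here as the index of the last word of each sentence.
    boundaries, count = [], 0
    for text in cleaned:
        count += len(text.split())
        boundaries.append(count - 1)
    return " ".join(cleaned), chosen, boundaries


pool = ["Hola, amigo.", "¿Cómo estás?", "Es un día lluvioso hoy."]
model_input, references, boundaries = make_example(pool, random.Random(0))
print(model_input, boundaries)
```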
The maximum length during training was 256 subtokens.
The punctuators package can punctuate inputs of any length.
This is accomplished behind the scenes by splitting the input into overlapping subsegments of 256 tokens, and combining the results.
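As a rough illustration of the splitting step, the sketch below computes overlapping token spans for a long input. It is not the punctuators implementation: the function name `overlapping_windows` and the overlap of 32 tokens are assumptions chosen for the example, and how the package merges predictions in the overlapping regions is not shown.

```python
from typing import List, Tuple


def overlapping_windows(
    num_tokens: int, max_len: int = 256, overlap: int = 32
) -> List[Tuple[int, int]]:
    """Return [start, end) spans covering num_tokens with the given overlap."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    if num_tokens <= 0:
        return []
    stride = max_len - overlap
    spans = []
    start = 0
    while True:
        end = min(start + max_len, num_tokens)
        spans.append((start, end))
        if end >= num_tokens:
            break
        start += stride
    return spans


print(overlapping_windows(600))  # [(0, 256), (224, 480), (448, 600)]
```

Each span stays within the 256-subtoken training length, and the overlap gives tokens near a window edge context from both sides in at least one window.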
If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.