metadata

license: apache-2.0
language:
  - es
  - ca
  - fr
  - pt
  - it
  - ro
library_name: generic
tags:
  - text2text-generation
  - punctuation
  - fullstop
  - truecase
  - capitalization
widget:
  - text: hola amigo cómo estás es un día lluvioso hoy
  - text: >-
      este modelo fue entrenado en un gpu a100 en realidad no se que dice esta
      frase lo traduje con nmt

Model

This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization) for text in the six most widely spoken Romance languages:

Spanish
French
Portuguese
Catalan
Italian
Romanian

Together, these languages cover approximately 97% of native speakers of the Romance language family.

This model predicts the following punctuation per input subtoken:

.
,
?
¿
ACRONYM

Though acronyms are rarer in these languages than in English, the special token ACRONYM allows the model to fully punctuate tokens such as "pm" → "p.m.".
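As a rough illustration of what these labels mean for a single token, here is a minimal sketch. The helper `apply_punctuation` is hypothetical and not part of the punctuators package. It assumes that ACRONYM places a period after every character (as in the "pm" example above) and that ¿ is attached before the token while the other marks follow it.

```python
def apply_punctuation(token: str, label: str) -> str:
    """Apply one predicted punctuation label to one token (illustrative only)."""
    if label == "ACRONYM":
        # Assumption: a period follows every character, e.g. "pm" -> "p.m."
        return "".join(ch + "." for ch in token)
    if label == "¿":
        # The inverted question mark goes before the token.
        return label + token
    if label in {".", ",", "?"}:
        # Ordinary marks go after the token.
        return token + label
    # Any other label is treated as "no punctuation".
    return token


print(apply_punctuation("pm", "ACRONYM"))
print(apply_punctuation("cómo", "¿"))
```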

Usage

The model is released as a SentencePiece tokenizer and an ONNX graph.

The easiest way to use this model is to install the punctuators package:

pip install punctuators

If this package is broken, please let me know in the community tab (I update it for each model and break it a lot!).

Example Usage

from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_romance")

# Define some input texts to punctuate
input_texts: List[str] = [
    "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
    "hola amigo cómo estás es un día lluvioso hoy",
]
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()

Exact output may vary based on the model version; here is the current output:

Expected Output

Training Data

For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's News Crawl.

Catalan is not included in StatMT's News Crawl. For completeness of the Romance language family, ~500k lines of OpenSubtitles were used for Catalan. As a result, Catalan performance may be sub-par, and the model may over-predict punctuation and sentence breaks, which is typical of models trained on OpenSubtitles.

Training Parameters

Each training example was built by concatenating between 1 and 14 random sentences. The concatenation points became sentence boundary targets. The text was then lower-cased and stripped of punctuation to form the model input, with the original casing and punctuation serving as the true-case and punctuation targets.
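The example construction described above can be sketched roughly as follows. This is not the actual training code: `make_example` and its return format are hypothetical, targets are kept at the word level rather than per subtoken, and only the four punctuation marks listed earlier are stripped.

```python
import random
from typing import List, Tuple

PUNCT = set(".,?¿")  # marks removed from the input (assumed set)


def make_example(
    sentences: List[str], rng: random.Random, max_sentences: int = 14
) -> Tuple[str, str, List[int]]:
    """Return (model_input, target_text, boundary_word_indices)."""
    # Pick between 1 and 14 random sentences and concatenate them.
    n = rng.randint(1, min(max_sentences, len(sentences)))
    chosen = rng.sample(sentences, n)
    target = " ".join(chosen)

    # The last word of each sentence is a sentence-boundary target.
    boundaries: List[int] = []
    words_so_far = 0
    for sentence in chosen:
        words_so_far += len(sentence.split())
        boundaries.append(words_so_far - 1)

    # The input is the same text with punctuation removed and lower-cased;
    # the original text carries the punctuation and true-case targets.
    model_input = "".join(ch for ch in target if ch not in PUNCT).lower()
    return model_input, target, boundaries


corpus = ["Hola, amigo.", "¿Cómo estás?", "Es un día lluvioso hoy."]
print(make_example(corpus, random.Random(0)))
```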

Batches were built by randomly sampling from each language. Each example is language-homogeneous (i.e., we only concatenate sentences from the same language), but batches were multilingual. Neither language tags nor language-specific paths are used in the graph.
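One way to picture this sampling scheme is the sketch below. It is an assumption-laden illustration rather than the real data loader: `sample_batch` is a hypothetical name, languages are drawn uniformly (the actual sampling weights are not stated here), and the language code is returned only for inspection, since the model itself never sees a language tag.

```python
import random
from typing import Dict, List, Tuple


def sample_batch(
    corpora: Dict[str, List[str]], batch_size: int, rng: random.Random
) -> List[Tuple[str, List[str]]]:
    """Build a multilingual batch whose examples are each monolingual."""
    languages = sorted(corpora)
    batch: List[Tuple[str, List[str]]] = []
    for _ in range(batch_size):
        # Choose one language per example (uniform choice is an assumption).
        lang = rng.choice(languages)
        pool = corpora[lang]
        # Draw 1 to 14 sentences, all from that single language.
        n = rng.randint(1, min(14, len(pool)))
        batch.append((lang, rng.sample(pool, n)))
    return batch


corpora = {
    "es": ["Hola, amigo.", "¿Cómo estás?"],
    "fr": ["Bonjour.", "Il pleut aujourd'hui."],
}
for lang, sentences in sample_batch(corpora, batch_size=4, rng=random.Random(0)):
    print(lang, sentences)
```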

The maximum length during training was 256 subtokens. The punctuators package can punctuate inputs of any length. This is accomplished behind the scenes by splitting the input into overlapping subsegments of 256 tokens, and combining the results.
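The following is a minimal sketch of the overlapping-window idea, not the punctuators implementation. The function names, the overlap of 16 tokens, and the merge rule (cut each overlap at its midpoint) are all assumptions chosen for illustration; the package may split and combine differently.

```python
from typing import List, Sequence, Tuple


def split_windows(
    n_tokens: int, max_len: int = 256, overlap: int = 16
) -> List[Tuple[int, int]]:
    """Return [start, end) spans that cover range(n_tokens) with overlap."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    spans: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        # Step back so consecutive windows share `overlap` tokens.
        start = end - overlap
    return spans


def merge_predictions(
    window_preds: Sequence[Sequence[int]], overlap: int = 16
) -> List[int]:
    """Stitch per-window predictions, cutting each overlap at its midpoint."""
    merged: List[int] = []
    last = len(window_preds) - 1
    for i, preds in enumerate(window_preds):
        lo = 0 if i == 0 else overlap // 2
        hi = len(preds) if i == last else len(preds) - (overlap - overlap // 2)
        merged.extend(preds[lo:hi])
    return merged


spans = split_windows(600)
print(spans)
# Stand-in "predictions": each window simply echoes its token positions.
fake_preds = [list(range(s, e)) for s, e in spans]
print(merge_predictions(fake_preds) == list(range(600)))
```

Cutting at the midpoint means every kept prediction has at least half the overlap as context on each side, which is one common reason to overlap windows in the first place.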

If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.

1-800-BAD-CODE/punctuation_fullstop_truecase_romance
