|
---
license: apache-2.0
language:
- es
- ca
- fr
- pt
- it
- ro
library_name: generic
tags:
- text2text-generation
- punctuation
- fullstop
- truecase
- capitalization
widget:
- text: "hola amigo cómo estás es un día lluvioso hoy"
- text: "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt"
---
|
|
|
# Model |
|
This model restores punctuation, predicts full stops (sentence boundaries), and predicts true-casing (capitalization)
for text in the six most widely spoken Romance languages:
|
|
|
* Spanish |
|
* French |
|
* Portuguese |
|
* Catalan |
|
* Italian |
|
* Romanian |
|
|
|
Together, these languages cover approximately 97% of native speakers of the Romance language family. |
|
|
|
This model predicts the following punctuation per input subtoken: |
|
|
|
* . |
|
* , |
|
* ? |
|
* ¿ |
|
* ACRONYM |
|
|
|
Though acronyms are rare in these languages (relative to English), the special token `ACRONYM` allows fully punctuating tokens such as "`pm`" → "`p.m.`".
|
|
|
# Usage |
|
The model is released as a `SentencePiece` tokenizer and an `ONNX` graph. |
|
|
|
The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
|
|
|
```bash
pip install punctuators
```
|
|
|
If this package is broken, please let me know in the community tab (I update it for each model and break it a lot!). |
|
|
|
<details open> |
|
|
|
<summary>Example Usage</summary> |
|
|
|
```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model.
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_romance")

# Define some input texts to punctuate
input_texts: List[str] = [
    "este modelo fue entrenado en un gpu a100 en realidad no se que dice esta frase lo traduje con nmt",
    "hola amigo cómo estás es un día lluvioso hoy",
]

results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```
|
|
|
Exact output may vary based on the model version; the current output is shown in the next section.
|
|
|
</details> |
|
|
|
<details open> |
|
|
|
<summary>Expected Output</summary> |
|
|
|
```text |
|
``` |
|
|
|
</details> |
|
|
|
|
|
# Training Data |
|
For all languages except Catalan, this model was trained on ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).
|
|
|
Catalan is not included in StatMT's News Crawl.
To round out the Romance language family, ~500k lines of `OpenSubtitles` data were used for Catalan.
As a result, Catalan performance may be sub-par: the model may over-predict punctuation and sentence breaks in Catalan, which is typical of models trained on `OpenSubtitles`.
|
|
|
# Training Parameters |
|
This model was trained on examples built by concatenating between 1 and 14 random sentences.
The concatenation points became sentence-boundary targets, the text was lower-cased to produce true-case targets, and punctuation was removed to create punctuation targets.
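
As a rough illustration of that procedure, here is a minimal sketch of how such examples and targets could be derived from clean text. This is not the actual training code: the function, the word-level (rather than subtoken-level) targets, and the label encoding are all illustrative assumptions.

```python
import random
from typing import List, Tuple

PUNCT = ".,?¿"

def make_example(corpus: List[str]) -> Tuple[str, List[str], List[bool], List[bool]]:
    """Build one hypothetical training example from clean, punctuated sentences.

    Returns the lower-cased, punctuation-free input text plus per-word targets:
    the punctuation to restore, whether a sentence boundary follows the word,
    and whether the word was capitalized.
    """
    # Concatenate between 1 and 14 random sentences of the same language.
    sentences = random.sample(corpus, k=random.randint(1, min(14, len(corpus))))
    words: List[str] = []
    punct_targets: List[str] = []      # e.g. "." or "," ("" = no punctuation)
    boundary_targets: List[bool] = []  # True at concatenation points
    case_targets: List[bool] = []      # True if the word was capitalized
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            stripped = token.strip(PUNCT)
            punct_targets.append(token[-1] if token[-1] in PUNCT else "")
            boundary_targets.append(i == len(tokens) - 1)
            case_targets.append(stripped[:1].isupper())
            words.append(stripped.lower())
    return " ".join(words), punct_targets, boundary_targets, case_targets

# Example: two short "sentences" stand in for a real corpus.
text, punct, bounds, case = make_example(["Hola amigo.", "¿Cómo estás?"])
print(text)  # e.g. "cómo estás hola amigo" (sentence order is random)
```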
|
|
|
Batches were built by randomly sampling from each language.
Each example is language-homogeneous (i.e., only sentences from the same language are concatenated), but batches were multilingual.
Neither language tags nor language-specific paths are used in the graph.
|
|
|
The maximum length during training was 256 subtokens.
The `punctuators` package can nonetheless punctuate inputs of any length: behind the scenes, the input is split into overlapping subsegments of 256 tokens and the results are combined.
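
The exact splitting and merging logic lives in the `punctuators` package, but the idea can be sketched as follows. The 64-token overlap and the merge heuristic described in the comments are assumptions; only the 256-token maximum comes from the model.

```python
from typing import List

def overlapping_windows(ids: List[int], max_len: int = 256, overlap: int = 64) -> List[List[int]]:
    """Split a long token-ID sequence into overlapping windows of at most max_len.

    Consecutive windows share `overlap` tokens. A merge step (not shown) can then
    reconcile predictions in the shared regions, e.g. by preferring the prediction
    made farther from a window edge, where the model has more context.
    """
    if len(ids) <= max_len:
        return [ids]
    stride = max_len - overlap
    windows = []
    for start in range(0, len(ids), stride):
        windows.append(ids[start:start + max_len])
        if start + max_len >= len(ids):
            break
    return windows

# A 600-token input becomes windows covering [0:256], [192:448], [384:600].
print([len(w) for w in overlapping_windows(list(range(600)))])  # [256, 256, 216]
```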
|
|
|
If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained. |
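
For reference, a minimal sketch of running the raw graph with `onnxruntime` and `sentencepiece` might look like the following. The file names, the single token-ID input, and the output layout are assumptions here; inspect the actual graph's inputs and outputs before relying on them.

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# File names are assumptions; use the tokenizer and graph shipped with this repo.
sp = spm.SentencePieceProcessor(model_file="sp.model")
session = ort.InferenceSession("model.onnx")
print([i.name for i in session.get_inputs()])  # check the real input names

ids = sp.encode("hola amigo cómo estás es un día lluvioso hoy")
ids = ids[:256]  # stay within the 256 trained positional embeddings

# Assumes a single int64 token-ID input; the outputs would be punctuation,
# true-case, and sentence-boundary logits in some graph-specific order.
inputs = {session.get_inputs()[0].name: np.array([ids], dtype=np.int64)}
outputs = session.run(None, inputs)
```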
|
|
|
# Metrics |
|
|