1-800-BAD-CODE/punct_cap_seg_47_language

Text2Text Generation

sentence-boundary-detection


1-800-BAD-CODE committed on Feb 22, 2023

Commit

6d889b7


1 Parent(s): 0575d5b

Update README.md

Files changed (1)

README.md +21 -0


@@ -125,7 +125,20 @@ This model predicts the following set of "post" punctuation tokens:
 | ፧    | Ethiopic question mark | Amharic |
 # Usage
 # Training Details
@@ -143,4 +156,12 @@ This model was trained on news data, and may not perform well on conversational
 This is also a base-sized model with many languages and many tasks, so capacity may be limited.
 # Evaluation

 | ፧    | Ethiopic question mark | Amharic |
+## Pre-Punctuation Tokens
+This model predicts the following set of "pre" punctuation tokens:
+| Token  | Description | Relevant Languages |
+| ---: | :---------- | :----------- |
+| ¿    | Inverted question mark | Spanish |
 # Usage
+This model is released in two parts:
+1. The ONNX graph
+2. The SentencePiece tokenizer
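As a rough illustration of how the two parts fit together, the sketch below loads an ONNX graph with `onnxruntime` and a tokenizer with `sentencepiece`. The function names and both file paths are hypothetical and supplied by the caller; the README excerpt does not state the file names or the graph's input and output names, so the sketch inspects them rather than assuming them.

```python
def load_model_parts(onnx_path, tokenizer_path):
    """Load the ONNX graph and the SentencePiece tokenizer from local files.

    Both paths are assumptions supplied by the caller; the actual file
    names are not given in this part of the README.
    """
    # Imported lazily so the module can be imported without these packages.
    import onnxruntime as ort
    import sentencepiece as spm

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    tokenizer = spm.SentencePieceProcessor(model_file=tokenizer_path)
    return session, tokenizer


def describe_graph(session):
    """Return (name, shape) pairs for the graph's inputs and outputs."""
    inputs = [(node.name, node.shape) for node in session.get_inputs()]
    outputs = [(node.name, node.shape) for node in session.get_outputs()]
    return inputs, outputs
```

Calling `describe_graph` on the loaded session is one way to discover which tensors the graph expects before feeding it token IDs from the tokenizer.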
 # Training Details
 This is also a base-sized model with many languages and many tasks, so capacity may be limited.
+This model also predicts punctuation only once per subword.
+This implies that some acronyms, e.g., 'U.S.', cannot be properly punctuated.
+This concession was accepted on two grounds:
+1. Such acronyms are rare, especially in the context of multilingual models.
+2. Punctuated acronyms are typically pronounced as individual characters, e.g., 'U.S.' vs. 'NATO'.
+   Since the expected use-case of this model is the output of an ASR system, it is presumed that such
+   pronunciations would be transcribed as separate tokens, e.g., 'u s' vs. 'us' (though this depends on the model's pre-processing).
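The one-prediction-per-subword limitation can be sketched in a few lines. This is a toy illustration, not the model's decoding code: `punctuate_subwords` is a hypothetical helper, the subword splits are assumed rather than produced by the real tokenizer, and capitalization and spacing are ignored.

```python
def punctuate_subwords(subwords, post_tokens):
    """Append at most one predicted punctuation token after each subword.

    `post_tokens` holds one entry per subword: a punctuation string,
    or None when no punctuation is predicted for that position.
    """
    if len(subwords) != len(post_tokens):
        raise ValueError("expected exactly one prediction per subword")
    return "".join(sw + (p or "") for sw, p in zip(subwords, post_tokens))


# A single subword has a single slot, so the period after 'u' has nowhere to go.
print(punctuate_subwords(["us"], ["."]))            # us.

# If the ASR output arrives as separate tokens, each one gets its own slot.
print(punctuate_subwords(["u", "s"], [".", "."]))   # u.s.
```

The second call reflects the assumption in the text that an ASR system would transcribe a spelled-out acronym as separate tokens.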
 # Evaluation