---
license: apache-2.0
language:
- en
tags:
- punctuation
- true casing
- sentence boundary detection
- token classification
- nlp
---

# Model Overview

This model accepts as input lower-cased, unpunctuated, unsegmented English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).

# Usage

The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

```bash
pip install punctuators
```

Running the following script should load this model and run some texts:

<details open>

<summary>Example Usage</summary>

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model.
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Define some input texts to punctuate
input_texts: List[str] = [
    "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
    "i live in the us where george hw bush was once president",
]

# Each input text yields a list of segmented, punctuated, true-cased sentences
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```

</details>

<details open>

<summary>Expected Output</summary>

```text

```

Note that "Friend" in this context is a proper noun, which is why the model consistently upper-cases tokens in similar contexts.

</details>

# Model Details

This model generally follows the graph shown below, with a brief description of each step following it.

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1678575121699-62d34c813eebd640a4f97587.png)

1. **Encoding**:
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.

2. **Post-punctuation**:
The encoded sequence is then fed into a classification network to predict "post" punctuation tokens.
Post punctuation tokens are those that may appear after a word: essentially, most ordinary punctuation.
Post punctuation is predicted once per subword; further discussion is below.

3. **Re-encoding**
All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) are dependent on the "post" punctuation.
Therefore, we must condition all further predictions on the post-punctuation tokens.
For this task, the predicted punctuation tokens are fed into an embedding layer, where embeddings represent each possible punctuation token.
Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.
The concatenated joint representation is re-encoded to confer global context to each time step, incorporating the punctuation predictions into all subsequent tasks (a sketch of this conditioning appears after this list).

4. **Sentence boundary detection**
In parallel with the "pre" punctuation head shown in the graph, another classification network predicts sentence boundaries from the re-encoded text.
In all languages, sentence boundaries can occur only where a potential full stop is predicted, hence the conditioning.

5. **Shift and concat sentence boundaries**
In many languages, the first character of each sentence should be upper-cased.
Thus, we should feed the sentence boundary information to the true-case classification network.
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.

6. **True-case prediction**
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the length of the longest possible subword, and the extra predictions are ignored.)
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".

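The conditioning in steps 2 through 6 is easiest to see in tensor shapes. Below is a minimal, illustrative PyTorch sketch of the re-encoding, boundary-shifting, and per-character true-casing steps. It is not this model's actual code (the real graph ships as ONNX); the layer choices and every dimension except the 512-dimensional encoding and 4-dimensional punctuation embedding are invented for the example.

```python
import torch
import torch.nn as nn

B, T, D = 2, 32, 512   # batch size, subword time steps, encoder dimension
NUM_PUNCT = 5          # <NULL>, <ACRONYM>, ".", ",", "?"
PUNCT_EMB = 4          # punctuation embedding size, per step 3
MAX_SUBWORD_LEN = 16   # invented longest-subword length, per step 6

encoded = torch.randn(B, T, D)  # stand-in for the Transformer encoder output

# Step 2: predict one "post" punctuation token per subword
post_punct = nn.Linear(D, NUM_PUNCT)(encoded).argmax(dim=-1)          # (B, T)

# Step 3: embed the punctuation predictions, concatenate to the encodings,
# and re-encode so every time step sees the punctuation decisions in context
punct_emb = nn.Embedding(NUM_PUNCT, PUNCT_EMB)(post_punct)            # (B, T, 4)
joint = torch.cat([encoded, punct_emb], dim=-1)                       # (B, T, 516)
re_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D + PUNCT_EMB, nhead=4, batch_first=True),
    num_layers=1,
)
re_encoded = re_encoder(joint)                                        # (B, T, 516)

# Step 4: predict sentence boundaries from the re-encoded sequence
boundaries = nn.Linear(D + PUNCT_EMB, 2)(re_encoded).argmax(dim=-1)   # (B, T)

# Step 5: shift right by one, so time step N carries "token N-1 ended a
# sentence", i.e., "token N starts a sentence"; concat to the re-encoding
starts = torch.roll(boundaries, shifts=1, dims=1)
starts[:, 0] = 1  # the first token always starts a sentence
tc_input = torch.cat([re_encoded, starts.unsqueeze(-1).float()], dim=-1)

# Step 6: one casing decision per character, up to the longest subword;
# positions beyond each subword's actual length are ignored
char_case_logits = nn.Linear(D + PUNCT_EMB + 1, MAX_SUBWORD_LEN)(tc_input)
upper_case = char_case_logits.sigmoid() > 0.5                         # (B, T, 16)
print(upper_case.shape)  # torch.Size([2, 32, 16])
```
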
## Punctuation Tokens

This model predicts the following set of punctuation tokens:

| Token | Description |
| ---: | :---------- |
| `<NULL>` | Predict no punctuation |
| `<ACRONYM>` | Every character in this subword ends with a period |
| `.` | Latin full stop |
| `,` | Latin comma |
| `?` | Latin question mark |

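To illustrate the `<ACRONYM>` class: the usage example above contains "george hw bush", and an `<ACRONYM>` prediction for the subword "hw" would render it as "H.W.". A hypothetical post-processing snippet (not the library's actual decoding code) might look like:

```python
def render_acronym(subword: str) -> str:
    """Hypothetical rendering of an <ACRONYM> prediction: every character
    in the subword is upper-cased and followed by a period."""
    return "".join(c.upper() + "." for c in subword)

print(render_acronym("hw"))  # H.W.
```
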
# Training Details

This model was trained in the NeMo framework.

## Training Data

This model was trained with News Crawl data from WMT.

Approximately 10M lines were used from the years 2021 and 2012.
The latter was included in an attempt to reduce bias: annual news is typically dominated by a few topics, e.g., 2021 contained a lot of COVID discussion.

# Limitations

This model was trained on news data, and may not perform well on conversational or informal data.

# Evaluation

When reviewing these metrics, keep in mind that

1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% for most languages.
3. Punctuation can be subjective, e.g.,

   `Hello Frank, how's it going?`

   or

   `Hello Frank. How's it going?`

   When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.

## Test Data and Example Generation

Each test example was generated using the following procedure (a sketch of the procedure in code follows the list):

1. Concatenate 10 random sentences
2. Lower-case the concatenated sentence
3. Remove all punctuation

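A minimal sketch of this generation procedure, assuming a held-out, deduplicated News Crawl file with one sentence per line (the file name and the exact punctuation-stripping rule are illustrative):

```python
import random
import re

# Illustrative input: one held-out News Crawl sentence per line
with open("news_crawl_heldout.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

random.seed(1234)
random.shuffle(sentences)

examples = []
for i in range(0, len(sentences) - 9, 10):
    # 1. Concatenate 10 random sentences
    text = " ".join(sentences[i : i + 10])
    # 2. Lower-case the concatenated sentence
    text = text.lower()
    # 3. Remove all punctuation (here, just the tokens this model predicts)
    text = re.sub(r"[.,?]", "", text)
    examples.append(text)
```
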
The data is a held-out portion of News Crawl, which has been deduplicated.
2,000 lines of data were used, generating 2,000 unique examples of 10 sentences each.

Examples longer than the model's maximum length (256) were truncated.
The number of affected sentences can be estimated from the "full stop" support: with 2,000 examples and 10 sentences per example, we expect 20,000 full stop targets in total.