---
license: apache-2.0
language:
- en
tags:
- punctuation
- true casing
- sentence boundary detection
- token classification
- nlp
---

# Model Overview

This model accepts as input lower-cased, unpunctuated, unsegmented English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).

# Usage

The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

```bash
pip install punctuators
```

Running the following script should load this model and run some texts:

<details open>

<summary>Example Usage</summary>

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model.
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Define some input texts to punctuate
input_texts: List[str] = [
    "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
    "i live in the us where george hw bush was once president",
]

# Each input text yields a list of segmented, punctuated, true-cased sentences
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```

</details>

<details open>

<summary>Expected Output</summary>

```text

```

Note that "Friend" in this context is a proper noun, which is why the model consistently upper-cases tokens in similar contexts.

</details>

# Model Details

This model generally follows the graph shown below, with a brief description of each step following it.

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1678575121699-62d34c813eebd640a4f97587.png)

1. **Encoding**:
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.

2. **Post-punctuation**:
The encoded sequence is then fed into a classification network to predict "post" punctuation tokens.
Post punctuation tokens are those that may appear after a word: essentially, most ordinary punctuation.
Post punctuation is predicted once per subword; further discussion is below.

3. **Re-encoding**
All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) are dependent on the "post" punctuation.
Therefore, we must condition all further predictions on the post-punctuation tokens.
For this task, the predicted punctuation tokens are fed into an embedding layer, where embeddings represent each possible punctuation token.
Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.
The concatenated joint representation is re-encoded to confer global context to each time step, incorporating the punctuation predictions into all subsequent tasks (a sketch of this conditioning appears after this list).

4. **Sentence boundary detection**
In parallel with the "pre" punctuation head shown in the graph, another classification network predicts sentence boundaries from the re-encoded text.
In all languages, sentence boundaries can occur only where a potential full stop is predicted, hence the conditioning.

5. **Shift and concat sentence boundaries**
In many languages, the first character of each sentence should be upper-cased.
Thus, we should feed the sentence boundary information to the true-case classification network.
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.

6. **True-case prediction**
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the length of the longest possible subword, and the extra predictions are ignored.)
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".

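The conditioning in steps 2 through 6 is easiest to see in tensor shapes. Below is a minimal, illustrative PyTorch sketch of the re-encoding, boundary-shifting, and per-character true-casing steps. It is not this model's actual code (the real graph ships as ONNX); the layer choices and every dimension except the 512-dimensional encoding and 4-dimensional punctuation embedding are invented for the example.

```python
import torch
import torch.nn as nn

B, T, D = 2, 32, 512   # batch size, subword time steps, encoder dimension
NUM_PUNCT = 5          # <NULL>, <ACRONYM>, ".", ",", "?"
PUNCT_EMB = 4          # punctuation embedding size, per step 3
MAX_SUBWORD_LEN = 16   # invented longest-subword length, per step 6

encoded = torch.randn(B, T, D)  # stand-in for the Transformer encoder output

# Step 2: predict one "post" punctuation token per subword
post_punct = nn.Linear(D, NUM_PUNCT)(encoded).argmax(dim=-1)          # (B, T)

# Step 3: embed the punctuation predictions, concatenate to the encodings,
# and re-encode so every time step sees the punctuation decisions in context
punct_emb = nn.Embedding(NUM_PUNCT, PUNCT_EMB)(post_punct)            # (B, T, 4)
joint = torch.cat([encoded, punct_emb], dim=-1)                       # (B, T, 516)
re_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D + PUNCT_EMB, nhead=4, batch_first=True),
    num_layers=1,
)
re_encoded = re_encoder(joint)                                        # (B, T, 516)

# Step 4: predict sentence boundaries from the re-encoded sequence
boundaries = nn.Linear(D + PUNCT_EMB, 2)(re_encoded).argmax(dim=-1)   # (B, T)

# Step 5: shift right by one, so time step N carries "token N-1 ended a
# sentence", i.e., "token N starts a sentence"; concat to the re-encoding
starts = torch.roll(boundaries, shifts=1, dims=1)
starts[:, 0] = 1  # the first token always starts a sentence
tc_input = torch.cat([re_encoded, starts.unsqueeze(-1).float()], dim=-1)

# Step 6: one casing decision per character, up to the longest subword;
# positions beyond each subword's actual length are ignored
char_case_logits = nn.Linear(D + PUNCT_EMB + 1, MAX_SUBWORD_LEN)(tc_input)
upper_case = char_case_logits.sigmoid() > 0.5                         # (B, T, 16)
print(upper_case.shape)  # torch.Size([2, 32, 16])
```
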
## Punctuation Tokens

This model predicts the following set of punctuation tokens:

| Token | Description |
| ---: | :---------- |
| `<NULL>` | Predict no punctuation |
| `<ACRONYM>` | Every character in this subword ends with a period |
| `.` | Latin full stop |
| `,` | Latin comma |
| `?` | Latin question mark |

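To illustrate the `<ACRONYM>` class: the usage example above contains "george hw bush", and an `<ACRONYM>` prediction for the subword "hw" would render it as "H.W.". A hypothetical post-processing snippet (not the library's actual decoding code) might look like:

```python
def render_acronym(subword: str) -> str:
    """Hypothetical rendering of an <ACRONYM> prediction: every character
    in the subword is upper-cased and followed by a period."""
    return "".join(c.upper() + "." for c in subword)

print(render_acronym("hw"))  # H.W.
```
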
# Training Details

This model was trained in the NeMo framework.

## Training Data

This model was trained with News Crawl data from WMT.

Approximately 10M lines were used from the years 2021 and 2012.
The latter was included in an attempt to reduce bias: annual news is typically dominated by a few topics, e.g., 2021 contained a lot of COVID discussion.

# Limitations

This model was trained on news data, and may not perform well on conversational or informal data.

# Evaluation

When reviewing these metrics, keep in mind that

1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% for most languages.
3. Punctuation can be subjective, e.g.,

   `Hello Frank, how's it going?`

   or

   `Hello Frank. How's it going?`

   When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.

## Test Data and Example Generation

Each test example was generated using the following procedure (a sketch of the procedure in code follows the list):

1. Concatenate 10 random sentences
2. Lower-case the concatenated sentence
3. Remove all punctuation

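A minimal sketch of this generation procedure, assuming a held-out, deduplicated News Crawl file with one sentence per line (the file name and the exact punctuation-stripping rule are illustrative):

```python
import random
import re

# Illustrative input: one held-out News Crawl sentence per line
with open("news_crawl_heldout.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

random.seed(1234)
random.shuffle(sentences)

examples = []
for i in range(0, len(sentences) - 9, 10):
    # 1. Concatenate 10 random sentences
    text = " ".join(sentences[i : i + 10])
    # 2. Lower-case the concatenated sentence
    text = text.lower()
    # 3. Remove all punctuation (here, just the tokens this model predicts)
    text = re.sub(r"[.,?]", "", text)
    examples.append(text)
```
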
The data is a held-out portion of News Crawl, which has been deduplicated.
2,000 lines of data were used, generating 2,000 unique examples of 10 sentences each.

Examples longer than the model's maximum length (256) were truncated.
The number of affected sentences can be estimated from the "full stop" support: with 2,000 examples and 10 sentences per example, we expect 20,000 full stop targets in total.