---
license: apache-2.0
language:
- en
tags:
- punctuation
- true casing
- sentence boundary detection
- token classification
- nlp
---
# Model Overview
This model accepts lower-cased, unpunctuated English text as input and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation) in a single pass.
In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.
# Usage
The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
```bash
pip install punctuators
```
Let's punctuate my weekend recap, as well as a few interesting sentences with acronyms and abbreviations that I made up or found on Wikipedia:
<details open>
<summary>Example Usage</summary>
```python
from typing import List

from punctuators.models import PunctCapSegModelONNX
# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_en")
# Define some input texts to punctuate
input_texts: List[str] = [
# Literally my weekend
"i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends",
"despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live",
"when i got home i trained this model on the lambda cloud on an a100 gpu with about 10 million lines of text the total budget was less than 5 dollars",
# Real acronyms in sentences that I made up
"george hw bush was the president of the us for 8 years",
"i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter",
# See how the model performs on made-up acronyms
"i went to the fgw store and bought a new tg optical scope",
# First few sentences from today's featured article summary on wikipedia
"it's that man again itma was a radio comedy programme that was broadcast by the bbc for twelve series from 1939 to 1949 featuring tommy handley in the central role itma was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations parts of the scripts were rewritten in the hours before the broadcast to ensure topicality"
]
results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```
Exact output may vary based on the model version; here is the current output:
</details>
<details open>
<summary>Expected Output</summary>
```text
In: i woke up at 6 am and took the dog for a hike in the metacomet mountains we like to take morning adventures on the weekends
Out: I woke up at 6 a.m. and took the dog for a hike in the Metacomet Mountains.
Out: We like to take morning adventures on the weekends.
In: despite being mid march it snowed overnight and into the morning here in connecticut it was snowier up in the mountains than in the farmington valley where i live
Out: Despite being mid March, it snowed overnight and into the morning.
Out: Here in Connecticut, it was snowier up in the mountains than in the Farmington Valley where I live.
In: when i got home i trained this model on the lambda cloud on an a100 gpu with about 10 million lines of text the total budget was less than 5 dollars
Out: When I got home, I trained this model on the Lambda Cloud.
Out: On an A100 GPU with about 10 million lines of text, the total budget was less than 5 dollars.
In: george hw bush was the president of the us for 8 years
Out: George H.W. Bush was the president of the U.S. for 8 years.
In: i saw mr smith at the store he was shopping for a new lawn mower i suggested he get one of those new battery operated ones they're so much quieter
Out: I saw Mr. Smith at the store he was shopping for a new lawn mower.
Out: I suggested he get one of those new battery operated ones.
Out: They're so much quieter.
In: i went to the fgw store and bought a new tg optical scope
Out: I went to the FGW store and bought a new TG optical scope.
In: it's that man again itma was a radio comedy programme that was broadcast by the bbc for twelve series from 1939 to 1949 featuring tommy handley in the central role itma was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations parts of the scripts were rewritten in the hours before the broadcast to ensure topicality
Out: It's that man again.
Out: ITMA was a radio comedy programme that was broadcast by the BBC for Twelve Series from 1939 to 1949, featuring Tommy Handley.
Out: In the central role, ITMA was a character driven comedy whose satirical targets included officialdom and the proliferation of minor wartime regulations.
Out: Parts of the scripts were rewritten in the hours before the broadcast to ensure topicality.
```
</details>
# Model Details
This model implements the graph shown below, with brief descriptions for each step following.
![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1678575121699-62d34c813eebd640a4f97587.png)
1. **Encoding**
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 32k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.
2. **Punctuation**
The encoded sequence is then fed into a feed-forward classification network to predict punctuation tokens.
Punctuation is predicted once per subword, to allow acronyms to be properly punctuated.
An indirect benefit of per-subword prediction is that the model can run in a graph generalized for continuous-script languages, e.g., Chinese.
3. **Sentence boundary detection**
For sentence boundary detection, we condition the model on punctuation via embeddings.
Each punctuation prediction is used to select an embedding for that token, which is concatenated to the encoded representation.
The SBD head analyzes both the encoding of the un-punctuated sequence and the punctuation predictions, and predicts which tokens are sentence boundaries.
4. **Shift and concat sentence boundaries**
In English, the first character of each sentence should be upper-cased.
Thus, we should feed the sentence boundary information to the true-case classification network.
Since the true-case classification network is feed-forward and has no temporal context, each time step must embed whether it is the first word of a sentence.
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
Concatenating this with the encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
5. **True-case prediction**
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the length of the longest possible subword, and the extra predictions are ignored.)
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
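
The shift and true-casing steps can be illustrated with a minimal sketch that operates on plain Python lists rather than tensors. The function names are hypothetical and this is not the model's actual code. Marking the first token as a sentence start is an assumption, since the description only defines the shift for token `N` given token `N-1`.

```python
from typing import List


def shift_boundaries_right(is_boundary: List[int]) -> List[int]:
    """Turn "token ends a sentence" flags into "token starts a sentence" flags.

    If token N-1 is a sentence boundary, token N is the first word of a sentence.
    Assumption: the very first token is treated as a sentence start.
    """
    if not is_boundary:
        return []
    return [1] + is_boundary[:-1]


def apply_true_case(subword: str, upper_mask: List[bool]) -> str:
    """Upper-case each character whose mask entry is True.

    The mask may be longer than the subword (one entry per character of the
    longest possible subword); the extra entries are ignored.
    """
    return "".join(
        ch.upper() if i < len(upper_mask) and upper_mask[i] else ch
        for i, ch in enumerate(subword)
    )


# Token 2 ends a sentence, so token 3 starts the next one.
starts = shift_boundaries_right([0, 0, 1, 0, 0])

# Per-character predictions capture acronyms and bi-capitalized words.
acronym = apply_true_case("nato", [True] * 8)
mixed = apply_true_case("macdonald", [True, False, False, True] + [False] * 8)
```

Here `starts` is `[1, 0, 0, 1, 0]`, `acronym` is `"NATO"`, and `mixed` is `"MacDonald"`.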
The model's maximum length is 256 subtokens, due to the limit of the trained embeddings.
However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package
as described above will transparently predict on overlapping subsegments of long inputs and fuse the results before returning output,
allowing inputs to be arbitrarily long.
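
To make the idea of overlapping subsegments concrete, here is a rough sketch of how a long token sequence could be split into windows of at most 256 subtokens. This is not the `punctuators` implementation: the function name and the `overlap` value of 16 are assumptions chosen for illustration, and the fusing of predictions across windows is left out because it is not described here.

```python
from typing import List, Tuple


def overlapping_windows(
    num_tokens: int, max_len: int = 256, overlap: int = 16
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs covering `num_tokens` tokens.

    Each window holds at most `max_len` tokens and shares `overlap` tokens
    with the previous window. `overlap` is a hypothetical setting.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    stride = max_len - overlap
    windows: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + max_len, num_tokens)
        windows.append((start, end))
        if end >= num_tokens:
            break
        start += stride
    return windows


# A 600-token input needs three windows; a short input needs only one.
long_windows = overlapping_windows(600)
short_windows = overlapping_windows(100)
```

With these settings, `long_windows` is `[(0, 256), (240, 496), (480, 600)]` and `short_windows` is `[(0, 100)]`.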
## Punctuation Tokens
This model predicts the following set of punctuation tokens:
| Token | Description |
| ---: | :---------- |
| NULL | Predict no punctuation |
| ACRONYM | Every character in this subword ends with a period |
| . | Latin full stop |
| , | Latin comma |
| ? | Latin question mark |
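
As an illustration of how these tokens could be applied to a subword, consider the sketch below. The function name is hypothetical, and it assumes that labels arrive as the strings shown in the table and that any tokenizer word-boundary marker has already been stripped from the subword.

```python
def punctuate_subword(subword: str, label: str) -> str:
    """Apply one predicted punctuation token to one subword."""
    if label == "NULL":
        # No punctuation predicted for this subword.
        return subword
    if label == "ACRONYM":
        # Every character in the subword is followed by a period.
        return "".join(ch + "." for ch in subword)
    if label in {".", ",", "?"}:
        # Ordinary punctuation is appended after the subword.
        return subword + label
    raise ValueError(f"Unknown punctuation label: {label!r}")


acronym = punctuate_subword("us", "ACRONYM")
full_stop = punctuate_subword("years", ".")
unchanged = punctuate_subword("president", "NULL")
```

Here `acronym` is `"u.s."`, `full_stop` is `"years."`, and `unchanged` is `"president"`. True-casing is a separate prediction, so the acronym is still lower-cased at this point.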
# Training Details
## Training Framework
This model was trained on a forked branch of the [NeMo](https://github.com/NVIDIA/NeMo) framework.
## Training Data
This model was trained with News Crawl data from WMT.
Approximately 10M lines were used from the years 2021 and 2012.
The latter was used to attempt to reduce bias: annual news is typically dominated by a few topics, and 2021 is dominated by COVID discussions.
# Limitations
## Domain
This model was trained on news data, and may not perform well on conversational or informal data.
## Noisy Training Data
The training data was noisy, and no manual cleaning was performed.
### Acronyms and Abbreviations
Acronyms and abbreviations are especially noisy; the tables below show how often each variant of a token appears in the training data.
| Token | Count |
| -: | :- |
| Mr | 115232 |
| Mr. | 108212 |

| Token | Count |
| -: | :- |
| U.S. | 85324 |
| US | 37332 |
| U.S | 354 |
| U.s | 108 |
| u.S. | 65 |
Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
### Sentence Boundary Detection Targets
An assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence.
However, a non-negligible portion of the training data contains multiple sentences per line.
Thus, the SBD head may miss an obvious sentence boundary if it's similar to an error seen in the training data.
# Evaluation
When interpreting these metrics, keep in mind that:
1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and sometimes incorrect.
When conditioning on reference punctuation, true-casing and SBD metrics are much higher w.r.t. the reference targets.
3. Punctuation can be subjective. E.g.,
`Hello Frank, how's it going?`
or
`Hello Frank. How's it going?`
When the sentences are longer and more practical, these ambiguities abound and affect all 3 analytics.
## Test Data and Example Generation
Each test example was generated using the following procedure:
1. Concatenate 10 random sentences
2. Lower-case the concatenated sentence
3. Remove all punctuation
The data is a held-out portion of News Crawl, which has been deduplicated.
2,000 lines of data were used, generating 2,000 unique examples of 10 sentences each.
Examples longer than the model's maximum length (256) were truncated.
The number of affected sentences can be estimated from the "full stop" support: with 2,000 examples and 10 sentences per example, we expect 20,000 full stop targets total.
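
The generation procedure above can be sketched roughly as follows. This is not the actual evaluation script: the function name is hypothetical, sampling without replacement is an assumption, and so is the set of removed characters, which is limited here to the punctuation the model predicts (`.`, `,`, `?`) so that apostrophes survive as they do in the usage examples.

```python
import random
from typing import List, Tuple


def make_example(
    sentences: List[str],
    rng: random.Random,
    num_sentences: int = 10,
    punctuation: str = ".,?",  # assumption: only the predicted punctuation is stripped
) -> Tuple[str, str]:
    """Build one (model_input, reference) pair from a pool of sentences."""
    # 1. Concatenate random sentences (sampled without replacement here).
    reference = " ".join(rng.sample(sentences, num_sentences))
    # 2. Lower-case the concatenated text.
    lowered = reference.lower()
    # 3. Remove punctuation, then normalize whitespace.
    stripped = lowered.translate(str.maketrans("", "", punctuation))
    model_input = " ".join(stripped.split())
    return model_input, reference


pool = ["Hello Frank, how's it going?", "I live in the northern U.S."]
model_input, reference = make_example(pool, random.Random(0), num_sentences=2)
```

The `reference` keeps the original casing and punctuation as targets, while `model_input` is the lower-cased, unpunctuated text fed to the model.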
## Results
# Fun Facts
Some fun facts are examined in this section.
## Embeddings
Let's examine the embeddings (see graph above) to see if the model meaningfully employed them.
We show here the cosine similarity between the embeddings of each token:
| | NULL | ACRONYM | . | , | ? |
| - | - | - | - | - | - |
| NULL | 1.00 | | | | |
| ACRONYM | -0.49 | 1.00 | | | |
| . | -1.00 | 0.48 | 1.00 | | |
| , | 1.00 | -0.48 | -1.00 | 1.00 | |
| ? | -1.00 | 0.49 | 1.00 | -1.00 | 1.00 |
Recall that these embeddings are used to predict sentence boundaries... thus we should expect full stops to cluster.
Indeed, we see that `NULL` and "`,`" are exactly the same, because neither has any implication for sentence boundaries.
Next, we see that "`.`" and "`?`" are exactly the same, because w.r.t. SBD these are exactly the same: strong full stop implications.
(Though, we may expect some difference between these tokens, given that "`.`" is predicted after abbreviations, e.g., 'Mr.', that are not full stops.)
Further, we see that "`.`" and "`?`" are exactly the opposite of `NULL`.
This is expected since these tokens typically imply sentence boundaries, whereas `NULL` and "`,`" never do.
Lastly, we see that `ACRONYM` is similar to, but not the same as, the full stops "`.`" and "`?`",
and far from, but not the opposite of, `NULL` and "`,`".
Intuition suggests this is because acronyms can be full stops ("I live in the northern U.S. It's cold here.") or not ("It's 5 a.m. and I'm tired.").
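
For reference, a similarity table like the one above can be computed with a few lines of standard-library Python. The vectors below are made-up two-dimensional toy values, not the model's learned embeddings, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities, in the insertion order of the dict keys."""
    vectors = list(embeddings.values())
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors only: NULL and "," point the same way, "." points the opposite way.
toy = {
    "NULL": [1.0, 0.0],
    ",": [2.0, 0.0],
    ".": [-3.0, 0.0],
}
matrix = similarity_matrix(toy)
```

Because cosine similarity ignores vector length, a value of 1.00 means two embeddings point in the same direction, and -1.00 means they point in opposite directions.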