1-800-BAD-CODE committed
Commit b51be78 • Parent(s): 5e360b5
Update README.md

README.md CHANGED
---

# Model Overview

This model accepts as input lower-cased, unpunctuated English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation) in one pass.

In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.

# Usage

The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

```
pip install punctuators
```

Running the following script should load this model and run some texts:

<details open>
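The script itself is collapsed in the rendered README; the sketch below shows what such a script likely looks like. The class name `PunctCapSegModelONNX` and the pretrained alias `pcs_en` are assumptions here, so check the punctuators README for the exact names.

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Load the model. "pcs_en" is assumed to be the pretrained alias for this model;
# the first call downloads the ONNX graph and tokenizer to the local cache.
m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Lower-cased, unpunctuated inputs, as the model expects.
input_texts: List[str] = [
    "hello friend how's it going it's 5 am and i'm already up",
    "i live in the northern us it's cold here",
]

# infer() returns, for each input, a list of punctuated, true-cased sentences.
results: List[List[str]] = m.infer(input_texts)

for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for sentence in output_texts:
        print(f"  {sentence}")
    print()
```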
@@ -185,3 +183,31 @@

Examples longer than the model's maximum length (256) were truncated.
The number of affected sentences can be estimated from the "full stop" support: with 2,000 examples and 10 sentences per example, we expect 20,000 full stop targets total.
## Results

# Fun Facts

Some fun facts are examined in this section.

## Embeddings

Let's examine the embeddings (see graph above) to see if the model meaningfully employed them.

We show here the cosine similarity between the embeddings of each token:

|         | NULL  | ACRONYM | .     | ,     | ?    |
| -       | -     | -       | -     | -     | -    |
| NULL    | 1.00  |         |       |       |      |
| ACRONYM | -0.93 | 1.00    |       |       |      |
| .       | -1.00 | 0.94    | 1.00  |       |      |
| ,       | 1.00  | -0.94   | -1.00 | 1.00  |      |
| ?       | -1.00 | 0.93    | 1.00  | -1.00 | 1.00 |
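As an aside, a similarity matrix like this is easy to compute from the raw embedding vectors. The sketch below is illustrative only: `emb` is a hypothetical NumPy array holding one embedding per token, not an artifact shipped with this repo.

```python
import numpy as np

# Hypothetical stand-in for the punctuation-token embeddings
# (one row per token: NULL, ACRONYM, ".", ",", "?").
tokens = ["NULL", "ACRONYM", ".", ",", "?"]
rng = np.random.default_rng(0)
emb = rng.standard_normal((len(tokens), 8))  # placeholder values

# Normalize rows to unit length; pairwise cosine similarity is then
# just the matrix of dot products.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cos_sim = unit @ unit.T

for label, row in zip(tokens, cos_sim):
    print(label, np.round(row, 2))
```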
Recall that these embeddings are used to predict sentence boundaries... thus we should expect full stops to cluster.

Indeed, we see that `NULL` and `COMMA` are exactly the same, because neither has any implication on sentence boundaries.

Next, we see that periods and question marks are exactly the same, and exactly the opposite of `NULL`.
This is expected, since these tokens typically imply sentence boundaries, whereas `NULL` and commas do not.

Lastly, we see that `ACRONYM` is quite similar, but not identical, to periods and question marks, and almost, but not quite, the opposite of `NULL` and commas.
Intuition suggests this is because acronyms can act as full stops ("I live in the northern U.S. It's cold here.") or not ("It's 5 a.m. and I'm tired").