---
license: apache-2.0
language:
- en
tags:
- punctuation
- true casing
- sentence boundary detection
- token classification
- nlp
---

# Model Overview
This model accepts as input lower-cased, unpunctuated, unsegmented English text and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).

# Usage
The easiest way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

```bash
pip install punctuators
```

Running the following script should load this model and punctuate some example texts:
<details open>

<summary>Example Usage</summary>

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Instantiate this model.
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Define some input texts to punctuate
input_texts: List[str] = [
    "hello friend how's it going it's snowing outside right now in connecticut a large storm is moving in",
    "i live in the us where george hw bush was once president",
]

results: List[List[str]] = m.infer(input_texts)
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()
```

</details>
53
+
54
+ <details open>
55
+
56
+ <summary>Expected Output</summary>
57
+
58
+ ```text
59
+
60
+ ```
61
+
62
+ Note that "Friend" in this context is a proper noun, which is why the model consistently upper-cases tokens in similar contexts.
63
+
64
+ </details>
65
+
# Model Details

This model generally follows the graph shown below, with a brief description of each step following.

![graph.png](https://s3.amazonaws.com/moonup/production/uploads/1678575121699-62d34c813eebd640a4f97587.png)

1. **Encoding**:
The model begins by tokenizing the text with a subword tokenizer.
The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.

2. **Post-punctuation**:
The encoded sequence is then fed into a classification network to predict "post" punctuation tokens.
Post-punctuation tokens are those that may appear after a word, which covers most ordinary punctuation.
Post-punctuation is predicted once per subword; further discussion follows below.

3. **Re-encoding**:
All subsequent tasks (true-casing, sentence boundary detection, and "pre" punctuation) depend on "post" punctuation.
Therefore, all further predictions must be conditioned on the post-punctuation tokens.
For this, the predicted punctuation tokens are fed into an embedding layer, where each possible punctuation token is represented by an embedding.
Each time step is mapped to a 4-dimensional embedding, which is concatenated to the 512-dimensional encoding.
The concatenated joint representation is re-encoded to confer global context to each time step, incorporating the punctuation predictions into all subsequent tasks.

4. **Sentence boundary detection**:
Parallel to the "pre" punctuation, another classification network predicts sentence boundaries from the re-encoded text.
In all languages, sentence boundaries can occur only where a potential full stop is predicted, hence the conditioning.
93
+
94
+ 6. **Shift and concat sentence boundaries**
95
+ In many languages, the first character of each sentence should be upper-cased.
96
+ Thus, we should feed the sentence boundary information to the true-case classification network.
97
+ Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
98
+ Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
99
+ Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.

6. **True-case prediction**:
Armed with the knowledge of punctuation and sentence boundaries, a classification network predicts true-casing.
Since true-casing should be done on a per-character basis, the classification network makes `N` predictions per token, where `N` is the length of the subtoken.
(In practice, `N` is the longest possible subword, and the extra predictions are ignored.)
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
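
The shift-and-concat and per-character casing steps can be sketched in plain Python. This is an illustrative toy, not the model's actual implementation: the inputs and names here are hypothetical, and the real model operates on tensors inside the ONNX graph.

```python
# A toy sketch of the shift-and-concat and per-character true-casing steps.
# All inputs are hypothetical; the real model does this on tensors.

# Hypothetical per-token sentence-boundary decisions from the SBD head.
boundaries = [0, 0, 1, 0, 0, 1, 0]

# Shift right by one: if token N-1 ends a sentence, token N starts one.
# The very first token is assumed to start a sentence.
is_sentence_start = [1] + boundaries[:-1]

def apply_casing(token: str, upper_flags: list) -> str:
    """Upper-case the characters of `token` flagged by the true-case head.

    Extra flags beyond the token length are simply ignored, as described above.
    """
    return "".join(ch.upper() if flag else ch for ch, flag in zip(token, upper_flags))

print(is_sentence_start)                  # [1, 0, 0, 1, 0, 0, 1]
print(apply_casing("nato", [1, 1, 1, 1])) # NATO
print(apply_casing("macdonald", [1, 0, 0, 1, 0, 0, 0, 0, 0]))  # MacDonald
```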

## Punctuation Tokens
This model predicts the following set of punctuation tokens:

| Token | Description |
| ---: | :---------- |
| `<NULL>` | Predict no punctuation |
| `<ACRONYM>` | Every character in this subword ends with a period |
| `.` | Latin full stop |
| `,` | Latin comma |
| `?` | Latin question mark |

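
As a rough illustration of how these tokens map onto text, a hypothetical helper might apply one prediction per subword (the `punctuators` package handles this internally; this sketch is not its actual API):

```python
# Toy application of per-subword punctuation predictions. Inputs are
# hypothetical; real predictions come from the classification head.
def apply_punct(subword: str, punct: str) -> str:
    if punct == "<NULL>":
        return subword
    if punct == "<ACRONYM>":
        # Every character of the subword ends with a period, e.g. "hw" -> "h.w."
        return "".join(ch + "." for ch in subword)
    return subword + punct  # '.', ',', or '?'

print(apply_punct("hw", "<ACRONYM>"))  # h.w.
print(apply_punct("president", "."))   # president.
print(apply_punct("us", "<NULL>"))     # us
```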

# Training Details
This model was trained in the NeMo framework.

## Training Data
This model was trained with News Crawl data from WMT.

Approximately 10M lines were used from the years 2021 and 2012.
The latter year was included to reduce topical bias: annual news is typically dominated by a few topics, e.g., 2021 contained a lot of COVID discussion.

# Limitations
This model was trained on news data, and may not perform well on conversational or informal data.

# Evaluation
When interpreting these metrics, keep in mind that:

1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect.
   When conditioning on reference punctuation, true-casing and SBD are practically 100% correct for most languages.
3. Punctuation can be subjective, e.g.,

   `Hello Frank, how's it going?`

   or

   `Hello Frank. How's it going?`

   When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.

## Test Data and Example Generation
Each test example was generated using the following procedure:

1. Concatenate 10 random sentences
2. Lower-case the concatenated sentence
3. Remove all punctuation

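This procedure can be sketched as follows, with toy stand-in sentences (the actual test data is held-out News Crawl, and the real procedure draws 10 sentences rather than the 3 used here for brevity):

```python
import random
import re

# Stand-in sentences; the real pool is held-out News Crawl data.
pool = [
    "It's snowing in Connecticut.",
    "A large storm is moving in!",
    "How's it going, Frank?",
]

def make_example(sentences, n=3, seed=0):
    rng = random.Random(seed)
    picked = [rng.choice(sentences) for _ in range(n)]  # 1. concatenate random sentences
    text = " ".join(picked).lower()                     # 2. lower-case
    return re.sub(r"[^\w\s']", "", text)                # 3. remove punctuation

example = make_example(pool)
print(example)
```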
The data is a held-out portion of News Crawl, which has been deduplicated.
2,000 lines of data were used, generating 2,000 unique examples of 10 sentences each.

Examples longer than the model's maximum length (256) were truncated.
The number of affected sentences can be estimated from the "full stop" support: with 2,000 examples and 10 sentences per example, we expect 20,000 full stop targets in total.
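
As a sanity check on that estimate, the shortfall between the expected and reported full-stop support approximates the number of truncated sentences (the support value below is a hypothetical placeholder, not a real reported figure):

```python
# With 2,000 examples of 10 sentences each, we expect 20,000 full-stop
# targets; any shortfall in the reported support approximates how many
# sentence targets were lost to truncation.
num_examples = 2_000
sentences_per_example = 10
expected_full_stops = num_examples * sentences_per_example

reported_support = 18_500  # hypothetical value; read this from the metrics table
truncated_sentences = expected_full_stops - reported_support
print(expected_full_stops, truncated_sentences)  # 20000 1500
```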