1-800-BAD-CODE committed
Commit 5e360b5 • Parent(s): e7a5edc
Update README.md
README.md CHANGED
@@ -74,11 +74,11 @@ This model implements the graph shown below, with brief descriptions for each step:
 
 1. **Encoding**:
 The model begins by tokenizing the text with a subword tokenizer.
-The tokenizer used here is a `SentencePiece` model with a vocabulary size of
+The tokenizer used here is a `SentencePiece` model with a vocabulary size of 32k.
 Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.
 
 2. **Punctuation**:
-The encoded sequence is then fed into a classification network to predict punctuation tokens.
+The encoded sequence is then fed into a feed-forward classification network to predict punctuation tokens.
 Punctuation is predicted once per subword, to allow acronyms to be properly punctuated.
 An indirect benefit of per-subword prediction is that it allows the model to run in a graph generalized for continuous-script languages, e.g., Chinese.
 
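To make the punctuation step in the hunk above concrete, here is a minimal, hypothetical sketch of a per-subword classification head. The layer layout and label count are illustrative assumptions; only the 512 model dimension and the once-per-subword prediction come from the README text.

```python
# Hypothetical sketch of a per-subword punctuation head. Assumptions:
# hidden size, label count, and layer layout are illustrative, not the
# model's actual code; d_model=512 matches the README.
import torch
import torch.nn as nn

class PunctuationHead(nn.Module):
    def __init__(self, d_model: int = 512, num_punct_labels: int = 5):
        super().__init__()
        # Small feed-forward classifier applied to every encoded subword.
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, num_punct_labels),
        )

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        # encoded: [batch, num_subwords, d_model] from the Transformer encoder.
        # One punctuation prediction per subword, so each piece of an acronym
        # (e.g., "U", "S") can receive its own punctuation token.
        return self.classifier(encoded)

logits = PunctuationHead()(torch.randn(2, 10, 512))
print(logits.shape)  # torch.Size([2, 10, 5]): one distribution per subword
```

Predicting once per subword, rather than once per word, is what lets acronyms be punctuated internally.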
@@ -100,9 +100,9 @@ Since true-casing should be done on a per-character basis, the classification network
 (In practice, `N` is the longest possible subword, and the extra predictions are ignored).
 This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
 
-
-
-as described above will transparently predict on overlapping subsegments of
+The model's maximum length is 256 subtokens, due to the limit of the trained embeddings.
+However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package
+as described above will transparently predict on overlapping subsegments of long inputs and fuse the results before returning output,
 allowing inputs to be arbitrarily long.
 
 ## Punctuation Tokens
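The lines added in this hunk describe splitting long inputs into overlapping subsegments of at most 256 subtokens and fusing the per-segment predictions. A simplified, self-contained sketch of that idea follows; the overlap size and the fusing rule are assumptions, and the package's actual logic may differ.

```python
# Simplified sketch of overlap-and-fuse inference over long inputs.
# Assumptions: OVERLAP and the fusing rule are illustrative; only the
# 256-subtoken limit comes from the README text above.
MAX_LEN = 256   # trained embedding limit
OVERLAP = 32    # assumed overlap between adjacent segments
STRIDE = MAX_LEN - OVERLAP

def split_segments(token_ids):
    """Split a long subtoken sequence into overlapping segments of <= MAX_LEN."""
    return [token_ids[i:i + MAX_LEN]
            for i in range(0, max(len(token_ids) - OVERLAP, 1), STRIDE)]

def fuse(per_segment_preds):
    """Fuse per-token predictions from overlapping segments: for overlapped
    positions, keep the later segment's prediction (one arbitrary rule; the
    punctuators package's fusing logic may differ)."""
    fused = []
    for i, preds in enumerate(per_segment_preds):
        fused = fused[:i * STRIDE] + preds
    return fused

# Sanity check: with identity "predictions", fusing recovers the full input.
ids = list(range(600))
assert fuse(split_segments(ids)) == ids
```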
@@ -116,12 +116,10 @@ This model predicts the following set of punctuation tokens:
 | , | Latin comma |
 | ? | Latin question mark |
 
-
-
-
-
 # Training Details
-
+
+## Training Framework
+This model was trained on a forked branch of the [NeMo](https://github.com/NVIDIA/NeMo) framework.
 
 ## Training Data
 This model was trained with News Crawl data from WMT.
@@ -150,11 +148,15 @@ Acronyms and abbreviations are especially noisy; the table below shows how many
 | U.S | 354 |
 | U.s | 108 |
 | u.S. | 65 |
-| u.s | 2 |
 
 Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
 
 
+Further, an assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence.
+However, a non-negligible portion of the training data contains multiple sentences in one line.
+Thus, the SBD head may miss an obvious sentence boundary if it's similar to an error seen in the training data.
+
+
 # Evaluation
 In these metrics, keep in mind that
 1. The data is noisy
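Since the change above points readers to the punctuators package for long-input inference, a usage sketch may help. Hedged: the class name, model alias, and output structure below follow my reading of that repository's README and are not specified by this commit; verify them against the repo before relying on them.

```python
# Usage sketch for the punctuators package (assumptions: entry points and
# the "pcs_en" alias follow the repo README and may change; verify there).
from punctuators.models import PunctCapSegModelONNX

model = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Inputs may be arbitrarily long; the package predicts on overlapping
# <=256-subtoken segments and fuses the results, as described above.
texts = ["hello world how are you today i visited nato headquarters with mr macdonald"]
for sentences in model.infer(texts):
    for sentence in sentences:
        print(sentence)
```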