1-800-BAD-CODE committed on
Commit
5e360b5
1 Parent(s): e7a5edc

Update README.md

Files changed (1)
  1. README.md +13 -11
README.md CHANGED
@@ -74,11 +74,11 @@ This model implements the graph shown below, with brief descriptions for each st
74
 
75
  1. **Encoding**:
76
  The model begins by tokenizing the text with a subword tokenizer.
77
- The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
78
  Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.
79
 
80
  2. **Punctuation**:
81
- The encoded sequence is then fed into a classification network to predict punctuation tokens.
82
  Punctuation is predicted once per subword, to allow acronyms to be properly punctuated.
83
  An indirect benefit of per-subword prediction is to allow the model to run in a graph generalized for continuous-script languages, e.g., Chinese.
84
 
@@ -100,9 +100,9 @@ Since true-casing should be done on a per-character basis, the classification ne
100
  (In practice, `N` is the longest possible subword, and the extra predictions are ignored).
101
  This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
102
 
103
-
104
- The model's maximum length is 256 subtokens. However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package
105
- as described above will transparently predict on overlapping subsegments of longer input texts and fuse the results before returning output,
106
  allowing inputs to be arbitrarily long.
107
 
108
  ## Punctuation Tokens
@@ -116,12 +116,10 @@ This model predicts the following set of punctuation tokens:
116
  | , | Latin comma |
117
  | ? | Latin question mark |
118
 
119
-
120
-
121
-
122
-
123
  # Training Details
124
- This model was trained in the NeMo framework.
 
 
125
 
126
  ## Training Data
127
  This model was trained with News Crawl data from WMT.
@@ -150,11 +148,15 @@ Acronyms and abbreviations are especially noisy; the table below shows how many
150
  | U.S | 354 |
151
  | U.s | 108 |
152
  | u.S. | 65 |
153
- | u.s | 2 |
154
 
155
  Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
156
 
157
 
 
 
 
 
 
158
  # Evaluation
159
  In these metrics, keep in mind that
160
  1. The data is noisy
 
74
 
75
  1. **Encoding**:
76
  The model begins by tokenizing the text with a subword tokenizer.
77
+ The tokenizer used here is a `SentencePiece` model with a vocabulary size of 32k.
78
  Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.
79
 
80
  2. **Punctuation**:
81
+ The encoded sequence is then fed into a feed-forward classification network to predict punctuation tokens.
82
  Punctuation is predicted once per subword, to allow acronyms to be properly punctuated.
83
  An indirect benefit of per-subword prediction is to allow the model to run in a graph generalized for continuous-script languages, e.g., Chinese.
84
 
 
100
  (In practice, `N` is the longest possible subword, and the extra predictions are ignored).
101
  This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
102
 
103
+ The model's maximum length is 256 subtokens, due to the limit of the trained embeddings.
104
+ However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package
105
+ as described above will transparently predict on overlapping subsegments of long inputs and fuse the results before returning output,
106
  allowing inputs to be arbitrarily long.
107
 
108
  ## Punctuation Tokens
 
116
  | , | Latin comma |
117
  | ? | Latin question mark |
118
 
 
 
 
 
119
  # Training Details
120
+
121
+ ## Training Framework
122
+ This model was trained on a forked branch of the [NeMo](https://github.com/NVIDIA/NeMo) framework.
123
 
124
  ## Training Data
125
  This model was trained with News Crawl data from WMT.
 
148
  | U.S | 354 |
149
  | U.s | 108 |
150
  | u.S. | 65 |
 
151
 
152
  Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
153
 
154
 
155
+ Further, an assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence.
156
+ However, a non-negligible portion of the training data contains multiple sentences in one line.
157
+ Thus, the SBD head may miss an obvious sentence boundary if it's similar to an error seen in the training data.
158
+
159
+
160
  # Evaluation
161
  In these metrics, keep in mind that
162
  1. The data is noisy