1-800-BAD-CODE/punctuation_fullstop_truecase_english

@@ -74,11 +74,11 @@ This model implements the graph shown below, with brief descriptions for each st
 1. **Encoding**:
 The model begins by tokenizing the text with a subword tokenizer.
-The tokenizer used here is a `SentencePiece` model with a vocabulary size of 64k.
 Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.
 2. **Punctuation**:
-The encoded sequence is then fed into a classification network to predict punctuation tokens.
 Punctuation is predicted once per subword, to allow acronyms to be properly punctuated.
 An indirect benefit of per-subword prediction is to allow the model to run in a graph generalized for continuous-script languages, e.g., Chinese.
@@ -100,9 +100,9 @@ Since true-casing should be done on a per-character basis, the classification ne
 (In practice, `N` is the longest possible subword, and the extra predictions are ignored).
 This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
-The model's maximum length is 256 subtokens. However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package
-as described above will transparently predict on overlapping subsegments of longer input texts and fuse the results before returning output,
 allowing inputs to be arbitrarily long.
 ## Punctuation Tokens
@@ -116,12 +116,10 @@ This model predicts the following set of punctuation tokens:
 | ,    | Latin comma |
 | ?    | Latin question mark |
 # Training Details
-This model was trained in the NeMo framework.
 ## Training Data
 This model was trained with News Crawl data from WMT.
@@ -150,11 +148,15 @@ Acronyms and abbreviations are especially noisy; the table below shows how many
 | U.S | 354 |
 | U.s | 108 |
 | u.S. | 65 |
-| u.s | 2 |
 Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
 # Evaluation
 In these metrics, keep in mind that
 1. The data is noisy

 1. **Encoding**:
 The model begins by tokenizing the text with a subword tokenizer.
+The tokenizer used here is a `SentencePiece` model with a vocabulary size of 32k.
 Next, the input sequence is encoded with a base-sized Transformer, consisting of 6 layers with a model dimension of 512.
 2. **Punctuation**:
+The encoded sequence is then fed into a feed-forward classification network to predict punctuation tokens.
 Punctuation is predicted once per subword, to allow acronyms to be properly punctuated.
 An indirect benefit of per-subword prediction is to allow the model to run in a graph generalized for continuous-script languages, e.g., Chinese.
 (In practice, `N` is the longest possible subword, and the extra predictions are ignored).
 This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
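
The per-character scheme above can be illustrated with a minimal sketch. It assumes the true-casing head has already been thresholded into `N` boolean "upper-case this character" flags per subword. The function name `apply_truecase`, the value `N = 6`, and the subword segmentation shown are hypothetical and chosen only for illustration; they are not taken from the model or the `punctuators` package.

```python
from typing import List


def apply_truecase(subwords: List[str], upper_flags: List[List[bool]]) -> List[str]:
    """Apply per-character casing flags to each subword.

    Every subword carries N flags (N = longest possible subword). Only the
    first len(subword) flags are used; zip() drops the extra predictions.
    """
    cased = []
    for token, flags in zip(subwords, upper_flags):
        chars = [ch.upper() if flag else ch for ch, flag in zip(token, flags)]
        cased.append("".join(chars))
    return cased


N = 6  # hypothetical length of the longest subword
subwords = ["na", "to", "mac", "donald"]  # hypothetical segmentation
flags = [
    [True, True] + [False] * (N - 2),   # "na"     -> "NA"
    [True, True] + [False] * (N - 2),   # "to"     -> "TO"
    [True] + [False] * (N - 1),         # "mac"    -> "Mac"
    [True] + [False] * (N - 1),         # "donald" -> "Donald"
]
print(apply_truecase(subwords, flags))  # ['NA', 'TO', 'Mac', 'Donald']
```

Because each character gets its own decision, fully upper-cased acronyms and words with a capital in the middle fall out of the same mechanism, with no special cases.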
+The model's maximum length is 256 subtokens, due to the limit of the trained embeddings.
+However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package
+as described above will transparently predict on overlapping subsegments of long inputs and fuse the results before returning output,
 allowing inputs to be arbitrarily long.
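
One way to picture the overlapping-window approach is the sketch below. It is not the `punctuators` implementation: the function `predict_long`, the `overlap` value, and the fusion rule (averaging scores where windows overlap) are assumptions made for illustration, and `predict` stands in for a model call that returns one score per subtoken.

```python
from typing import Callable, List, Sequence


def predict_long(
    tokens: Sequence[int],
    predict: Callable[[Sequence[int]], List[float]],
    max_len: int = 256,   # model limit stated in the text
    overlap: int = 16,    # assumed overlap between consecutive windows
) -> List[float]:
    """Score an arbitrarily long sequence with a length-limited model.

    Windows of at most `max_len` subtokens are scored independently, and
    positions covered by two windows receive the average of both scores.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    stride = max_len - overlap
    sums = [0.0] * len(tokens)
    counts = [0] * len(tokens)
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_len]
        for offset, score in enumerate(predict(window)):
            sums[start + offset] += score
            counts[start + offset] += 1
        if start + max_len >= len(tokens):
            break  # this window reached the end of the input
        start += stride
    return [s / c for s, c in zip(sums, counts)]


def dummy_predict(window: Sequence[int]) -> List[float]:
    """Stand-in for the model: echoes each subtoken id as its score."""
    return [float(t) for t in window]


scores = predict_long(list(range(600)), dummy_predict)
print(len(scores))  # 600
```

Averaging is only one possible fusion rule. Another common choice is to keep predictions from the center of each window and discard those near the edges, where the model has less context.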
 ## Punctuation Tokens
 | ,    | Latin comma |
 | ?    | Latin question mark |
 # Training Details
+## Training Framework
+This model was trained on a forked branch of the [NeMo](https://github.com/NVIDIA/NeMo) framework.
 ## Training Data
 This model was trained with News Crawl data from WMT.
 | U.S | 354 |
 | U.s | 108 |
 | u.S. | 65 |
 Thus, the model's acronym and abbreviation predictions may be a bit unpredictable.
+Further, an assumption for sentence boundary detection targets is that each line of the input data is exactly one sentence.
+However, a non-negligible portion of the training data contains multiple sentences in one line.
+Thus, the SBD head may miss an obvious sentence boundary if it's similar to an error seen in the training data.
 # Evaluation
 In these metrics, keep in mind that
 1. The data is noisy