1-800-BAD-CODE committed
Commit e7a5edc • Parent(s): e2feff2
Update README.md

README.md CHANGED
@@ -15,6 +15,7 @@ This model accepts as input lower-cased, unpunctuated English text and performs
 
 In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.
 
+
 # Usage
 The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
 
@@ -22,6 +23,7 @@ The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):
 pip install punctuators
 ```
 
+
 Running the following script should load this model and run some texts:
 <details open>
 
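The script itself sits in the collapsed `<details>` block and is not part of this diff. As a rough sketch of what usage via the punctuators package typically looks like — the `PunctCapSegModelONNX` class, the `"pcs_en"` model key, and the exact return structure are assumptions here, not taken from this README — it would be along these lines:

```python
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Assumption: "pcs_en" is the punctuators key that maps to this model;
# the collapsed example in the README shows the exact name to use.
model: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained("pcs_en")

input_texts: List[str] = [
    "hello friend how's it going i haven't seen you since we visited the us last year",
]

# Assumption: infer() returns, for each input, a list of punctuated,
# true-cased sentences after sentence boundary detection.
results: List[List[str]] = model.infer(input_texts)
for text, sentences in zip(input_texts, results):
    print(f"Input: {text}")
    for sentence in sentences:
        print(f"  {sentence}")
```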
@@ -99,6 +101,10 @@ Since true-casing should be done on a per-character basis, the classification ne
 This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".
 
 
+The model's maximum length is 256 subtokens. However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package
+as described above will transparently predict on overlapping subsegments of longer input texts and fuse the results before returning output,
+allowing inputs to be arbitrarily long.
+
 ## Punctuation Tokens
 This model predicts the following set of punctuation tokens:
 
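How punctuators splits and fuses long inputs is internal to the package and not described beyond the added sentence above. Purely as an illustration of the overlapping-window idea — not the package's actual implementation, and with made-up window and overlap sizes — splitting a long subtoken sequence might look like this:

```python
from typing import List


def overlapping_windows(subtokens: List[str], max_len: int = 256, overlap: int = 32) -> List[List[str]]:
    """Illustration only: split a subtoken sequence into windows of at most
    max_len, each overlapping the next by `overlap` subtokens, so every window
    fits the model's 256-subtoken limit and the overlapping regions can be used
    to reconcile predictions when stitching the outputs back together."""
    if len(subtokens) <= max_len:
        return [subtokens]
    stride = max_len - overlap
    windows: List[List[str]] = []
    for start in range(0, len(subtokens), stride):
        windows.append(subtokens[start:start + max_len])
        if start + max_len >= len(subtokens):
            break
    return windows


# A 600-subtoken input becomes three windows of lengths 256, 256, and 152,
# with consecutive windows sharing 32 subtokens.
print([len(w) for w in overlapping_windows([f"tok{i}" for i in range(600)])])
```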
@@ -133,7 +139,7 @@ The training data was noisy, and no manual cleaning was utilized.
 Acronyms and abbreviations are especially noisy; the table below shows how many variations of each token appear in the training data.
 
 | Token | Count |
-
+| -: | :- |
 | Mr | 115232 |
 | Mr. | 108212 |
 
@@ -153,7 +159,7 @@ Thus, the model's acronym and abbreviation predictions may be a bit unpredictabl
 In these metrics, keep in mind that
 1. The data is noisy
 2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and sometimes incorrect.
-When conditioning on reference punctuation, true-casing and SBD
+3. When conditioning on reference punctuation, true-casing and SBD metrics are much higher w.r.t. the reference targets.
 4. Punctuation can be subjective. E.g.,
 
 `Hello Frank, how's it going?`