docusco-bert

Model description

docusco-bert is a fine-tuned BERT model that is ready to use for token classification. The model was trained on data sampled from the Corpus of Contemporary American English (COCA) and classifies tokens and token sequences according to a system developed for the DocuScope dictionary-based tagger. Descriptions of the categories are included in a table below.

About DocuScope

DocuScope is a dictionary-based tagger that has been developed at Carnegie Mellon University by David Kaufer and Suguru Ishizaki since the early 2000s. Its categories are rhetorical in their orientation (as opposed to part-of-speech tags, for example, which are morphosyntactic).

DocuScope has been used in a wide variety of studies. Examples include a short analysis of King Lear and a published study of Tweets.

Intended uses & limitations

How to use

The model was trained on data with tags formatted using IOB, as in common tasks such as Named Entity Recognition (NER). Thus, you can use this model with a Transformers NER pipeline.

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("browndw/docusco-bert")
model = AutoModelForTokenClassification.from_pretrained("browndw/docusco-bert")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Globalization is the process of interaction and integration among people, companies, and governments worldwide."
ds_results = nlp(example)
print(ds_results)
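
Because the tags follow the IOB scheme, a category that covers several tokens arrives as one B- prediction followed by I- predictions. The following is a minimal sketch of merging those token-level predictions into spans. It assumes each prediction is a dict with `entity`, `start`, and `end` keys and that labels take the form `B-Category` or `I-Category`. The helper `group_iob` and the sample predictions are hypothetical illustrations, not actual output from the model.

```python
def group_iob(predictions, text):
    """Merge token-level IOB predictions into (category, text) spans."""
    spans = []
    for pred in predictions:
        if pred["entity"] == "O":
            continue  # untagged tokens do not belong to any span
        prefix, _, category = pred["entity"].partition("-")
        last = spans[-1] if spans else None
        # An I- tag extends the previous span only if the category matches.
        if prefix == "I" and last is not None and last["category"] == category:
            last["end"] = pred["end"]
        else:
            spans.append(
                {"category": category, "start": pred["start"], "end": pred["end"]}
            )
    return [(s["category"], text[s["start"]:s["end"]]) for s in spans]


# Hypothetical predictions, hand-written for illustration only.
sample_text = "according to the report"
sample_predictions = [
    {"entity": "B-Citation", "start": 0, "end": 9},
    {"entity": "I-Citation", "start": 10, "end": 12},
    {"entity": "B-SyntacticComplexity", "start": 13, "end": 16},
    {"entity": "B-InformationReportVerbs", "start": 17, "end": 23},
]

spans = group_iob(sample_predictions, sample_text)
for category, span_text in spans:
    print(f"{category}: {span_text}")
```

Recent versions of Transformers also accept an `aggregation_strategy` argument on the pipeline (for example, `aggregation_strategy="simple"`), which may make a helper like this unnecessary.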

Limitations and bias

This model is limited by its training dataset of American English texts. Moreover, the current version is trained on only a small subset of the corpus. The goal is to train later versions on more data, which should increase accuracy.

Training data

This model was fine-tuned on data from the Corpus of Contemporary American English (COCA). The training data contain chunks of text randomly sampled from 5 text types: Academic, Fiction, Magazine, News, and Spoken.

Typically, BERT models are trained on sentence segments. However, DocuScope tags can span sentences. Thus, data were split into chunks that don't split B + I sequences and that end with sentence-final punctuation marks (i.e., period, question mark, or exclamation point).
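
The chunking rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used to prepare the training data: it assumes the input is a list of (token, tag) pairs with IOB tags, and the function name `split_into_chunks` and the sample tags are hypothetical.

```python
SENTENCE_FINAL = {".", "?", "!"}


def split_into_chunks(tagged_tokens):
    """Split (token, tag) pairs into chunks that end on sentence-final
    punctuation and never separate a B- tag from its I- continuations."""
    chunks, current = [], []
    for i, (token, tag) in enumerate(tagged_tokens):
        current.append((token, tag))
        has_next = i + 1 < len(tagged_tokens)
        next_tag = tagged_tokens[i + 1][1] if has_next else None
        # If the next token is tagged I-, a sequence is still in progress.
        sequence_continues = next_tag is not None and next_tag.startswith("I-")
        if token in SENTENCE_FINAL and not sequence_continues:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)  # trailing tokens with no final punctuation
    return chunks


# Hypothetical tagging in which one sequence crosses a sentence boundary.
sample = [
    ("Stop", "B-Interactive"),
    (".", "I-Interactive"),
    ("Now", "I-Interactive"),
    (".", "O"),
    ("Go", "B-Interactive"),
    ("!", "O"),
]

chunks = split_into_chunks(sample)
for chunk in chunks:
    print(" ".join(token for token, _ in chunk))
```

In this sample the first period does not end a chunk because the token after it is tagged I-, so the B + I sequence stays intact.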

Additionally, the order of the chunks was randomized prior to sampling, and stratified sampling was used to provide enough training data for low-frequency categories. The resulting training data consist of:

21,460,177 tokens
15,796,305 chunks

The specific counts for each category appear in the following table.

Category	Count
O	3528038
Syntactic Complexity	2032808
Character	1413771
Description	1224744
Narrative	1159201
Negative	651012
Academic Terms	620932
Interactive	594908
Information Exposition	578228
Positive	463914
Force Stressed	432631
Information Topics	394155
First Person	249744
Metadiscourse Cohesive	240822
Strategic	238255
Public Terms	234213
Reasoning	213775
Information Place	187249
Information States	173146
Information Report Verbs	119092
Confidence High	112861
Confidence Hedged	110008
Future	96101
Inquiry	94995
Contingent	94860
Information Change	89063
Metadiscourse Interactive	84033
Updates	81424
Citation	71241
Facilitate	50451
Uncertainty	35644
Academic Writing Moves	29352
Information Change Positive	28475
Responsibility	25362
Citation Authority	22414
Information Change Negative	15612
Confidence Low	2876
Citation Hedged	895
-	-
Total	15796305

Training procedure

This model was trained on a single 2.3 GHz Dual-Core Intel Core i5 with recommended hyperparameters from the original BERT paper.

Eval results

Overall

metric	test
f1	.927
accuracy	.943

By category

category	precision	recall	f1-score	support
AcademicTerms	0.91	0.92	0.92	486399
AcademicWritingMoves	0.76	0.82	0.79	20017
Character	0.94	0.95	0.94	1260272
Citation	0.92	0.94	0.93	50812
CitationAuthority	0.86	0.88	0.87	17798
CitationHedged	0.91	0.94	0.92	632
ConfidenceHedged	0.94	0.96	0.95	90393
ConfidenceHigh	0.92	0.94	0.93	113569
ConfidenceLow	0.79	0.81	0.80	2556
Contingent	0.92	0.94	0.93	81366
Description	0.87	0.89	0.88	1098598
Facilitate	0.87	0.90	0.89	41760
FirstPerson	0.96	0.98	0.97	330658
ForceStressed	0.93	0.94	0.93	436188
Future	0.90	0.93	0.92	93365
InformationChange	0.88	0.91	0.89	72813
InformationChangeNegative	0.83	0.85	0.84	12740
InformationChangePositive	0.82	0.86	0.84	22994
InformationExposition	0.94	0.95	0.95	468078
InformationPlace	0.95	0.96	0.96	147688
InformationReportVerbs	0.91	0.93	0.92	95563
InformationStates	0.95	0.95	0.95	139429
InformationTopics	0.90	0.92	0.91	328152
Inquiry	0.85	0.89	0.87	79030
Interactive	0.95	0.96	0.95	602857
MetadiscourseCohesive	0.97	0.98	0.98	195548
MetadiscourseInteractive	0.92	0.94	0.93	73159
Narrative	0.92	0.94	0.93	1023452
Negative	0.88	0.89	0.88	645810
Positive	0.87	0.89	0.88	409775
PublicTerms	0.91	0.92	0.91	184108
Reasoning	0.93	0.95	0.94	169208
Responsibility	0.83	0.87	0.85	21819
Strategic	0.88	0.90	0.89	193768
SyntacticComplexity	0.95	0.96	0.96	1635918
Uncertainty	0.87	0.91	0.89	33684
Updates	0.91	0.93	0.92	77760
-	-	-	-	-
micro avg	0.92	0.93	0.93	10757736
macro avg	0.90	0.92	0.91	10757736
weighted avg	0.92	0.93	0.93	10757736
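
For readers less familiar with these metrics, the sketch below shows how an F1 score follows from precision and recall, and how macro and weighted averages differ. It uses three rows copied from the table above. Because it covers only a subset of categories, the averages it prints will not match the table's averages. The variable and function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) for three rows of the table above.
rows = {
    "AcademicWritingMoves": (0.76, 0.82, 20017),
    "ConfidenceLow": (0.79, 0.81, 2556),
    "FirstPerson": (0.96, 0.98, 330658),
}

f1_by_category = {name: f1_score(p, r) for name, (p, r, _) in rows.items()}

# Macro average: every category counts equally.
macro_f1 = sum(f1_by_category.values()) / len(f1_by_category)

# Weighted average: each category counts in proportion to its support.
total_support = sum(support for _, _, support in rows.values())
weighted_f1 = (
    sum(f1_by_category[name] * support for name, (_, _, support) in rows.items())
    / total_support
)

for name, score in f1_by_category.items():
    print(f"{name}: {score:.2f}")
print(f"macro avg (subset): {macro_f1:.2f}")
print(f"weighted avg (subset): {weighted_f1:.2f}")
```

The weighted average is pulled toward high-support categories such as FirstPerson, while the macro average gives low-support categories such as ConfidenceLow the same influence as any other.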

DocuScope Category Descriptions

Category (Cluster)	Description	Examples
Academic Terms	Abstract, rare, specialized, or disciplinary-specific terms that are indicative of informationally dense writing	market price, storage capacity, regulatory, distribution
Academic Writing Moves	Phrases and terms that indicate academic writing moves, which are common in research genres and are derived from the work of Swales (1981) and Cotos et al. (2015, 2017)	in the first section, the problem is that, payment methodology, point of contention
Character	References multiple dimensions of a character or human being as a social agent, both individual and collective	Pauline, her, personnel, representatives
Citation	Language that indicates the attribution of information to, or citation of, another source.	according to, is proposing that, quotes from
Citation Authorized	Referencing the citation of another source that is represented as true and not arguable	confirm that, provide evidence, common sense
Citation Hedged	Referencing the citation of another source that is presented as arguable	suggest that, just one opinion
Confidence Hedged	Referencing language that presents a claim as uncertain	tends to get, maybe, it seems that
Confidence High	Referencing language that presents a claim with certainty	most likely, ensure that, know that, obviously
Confidence Low	Referencing language that presents a claim as extremely unlikely	unlikely, out of the question, impossible
Contingent	Referencing contingency, typically contingency in the world, rather than contingency in one's knowledge	subject to, if possible, just in case, hypothetically
Description	Language that evokes sights, sounds, smells, touches and tastes, as well as scenes and objects	stay quiet, gas-fired, solar panels, soft, on my desk
Facilitate	Language that enables or directs one through specific tasks and actions	let me, worth a try, I would suggest
First Person	This cluster captures first person.	I, as soon as I, we have been
Force Stressed	Language that is forceful and stressed, often using emphatics, comparative forms, or superlative forms	really good, the sooner the better, necessary
Future	Referencing future actions, states, or desires	will be, hope to, expected changes
Information Change	Referencing changes of information, particularly changes that are more neutral	changes, revised, growth, modification to
Information Change Negative	Referencing negative change	going downhill, slow erosion, get worse
Information Change Positive	Referencing positive change	improving, accrued interest, boost morale
Information Exposition	Information in the form of expository devices, or language that describes or explains, frequently in regards to quantities and comparisons	final amount, several, three, compare, 80%
Information Place	Language designating places	the city, surrounding areas, Houston, home
Information Report Verbs	Informational verbs and verb phrases of reporting	report, posted, release, point out
Information States	Referencing information states, or states of being	is, are, existing, been
Information Topics	Referencing topics, usually nominal subjects or objects, that indicate the “aboutness” of a text	time, money, stock price, phone interview
Inquiry	Referencing inquiry, or language that points to some kind of inquiry or investigation	find out, let me know if you have any questions, wondering if
Interactive	Addresses from the author to the reader or from persons in the text to other persons. The address comes in the language of everyday conversation, colloquy, exchange, questions, attention-getters, feedback, interactive genre markers, and the use of the second person.	can you, thank you for, please see, sounds good to me
Metadiscourse Cohesive	The use of words to build cohesive markers that help the reader navigate the text and signal linkages in the text, which are often additive or contrastive	or, but, also, on the other hand, notwithstanding, that being said
Metadiscourse Interactive	The use of words to build cohesive markers that interact with the reader	I agree, let’s talk, by the way
Narrative	Language that involves people, description, and events extending in time	today, tomorrow, during the, this weekend
Negative	Referencing dimensions of negativity, including negative acts, emotions, relations, and values	does not, sorry for, problems, confusion
Positive	Referencing dimensions of positivity, including actions, emotions, relations, and values	thanks, approval, agreement, looks good
Public Terms	Referencing public terms, concepts from public language, media, the language of authority, institutions, and responsibility	discussion, amendment, corporation, authority, settlement
Reasoning	Language that has a reasoning focus, supporting inferences about cause, consequence, generalization, concession, and linear inference either from premise to conclusion or conclusion to premise	because, therefore, analysis, even if, as a result, indicating that
Responsibility	Referencing the language of responsibility	supposed to, requirements, obligations
Strategic	This dimension is active when the text structures strategies, activism, advantage-seeking, game-playing cognition, plans, and goal-seeking.	plan, trying to, strategy, decision, coordinate, look at the
Syntactic Complexity	The features in this category are often what are called “function words,” like determiners and prepositions.	the, to, for, in, a lot of
Uncertainty	References uncertainty, when confidence levels are unknown	kind of, I have no idea, for some reason
Updates	References updates that anticipate someone searching for information and receiving it	already, a new, now that, here are some

BibTeX entry and citation info

@incollection{ishizaki2012computer,
  title    = {Computer-aided rhetorical analysis},
  author   = {Ishizaki, Suguru and Kaufer, David},
  booktitle= {Applied natural language processing: Identification, investigation and resolution},
  pages    = {276--296},
  year     = {2012},
  publisher= {IGI Global},
  url      = {https://www.igi-global.com/chapter/content/61054}
}

@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
