	---
	license: agpl-3.0
	metrics:
	- precision
	- recall
	- f1
	- accuracy
	model-index:
	- name: directquote-variedStyles
	  results: []
	datasets:
	- DirectQuote
	language:
	- en
	pipeline_tag: token-classification
	library_name: transformers
	---
	<!--- WHISP DEVELOPMENT LOGO ~ RESPONSIVE TO LIGHT/DARK MODE --->
	<picture>
	<source media="(prefers-color-scheme: dark)" srcset="https://i.imgur.com/eO4igg9.png" height="37" style="height: 37px">
	<img src="https://i.imgur.com/ihiqdVt.png" height="37" style="height: 37px">
	</picture>

	# quote extraction & attribution on [DirectQuote](https://arxiv.org/abs/2110.07827) dataset with BERT-based token classification 💬
	this repository stores the code to train and run inference with a DistilBERT model on the DirectQuote corpus (Zhang & Liu, 2021).

	directquote-variedStyles 💬 is a fine-tuned [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) model that performs token classification on a modified version of the [DirectQuote](https://arxiv.org/abs/2110.07827) dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.2339
	- Precision: 0.7440
	- Recall: 0.9090
	- F1: 0.8182
	- Accuracy: 0.9355

	## Model description

	directquote-variedStyles performs Quote Extraction and Attribution (QEA) on texts, enabling NLP applications to suitably process quotations in texts and corpora. Further implementations of QEA have been proposed in the realm of 'modular journalism' (See: ['Talking sense: using machine learning to understand quotes'](https://www.theguardian.com/info/2021/nov/25/talking-sense-using-machine-learning-to-understand-quotes)).
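
As a rough illustration of how the model might be called, the sketch below wraps the standard `transformers` token-classification pipeline. The repository id `your-namespace/directquote-variedStyles` is a placeholder, not the real Hub path, and the helper names `load_quote_tagger` and `spans_by_label` are hypothetical. The label strings that come back depend on the model's own config, so treat the grouping as a starting point only.

```python
from typing import Dict, List

# Placeholder repository id (assumption); substitute the actual Hub path of this model.
MODEL_ID = "your-namespace/directquote-variedStyles"


def load_quote_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline; transformers is imported lazily."""
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into labelled spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def spans_by_label(entities: List[Dict]) -> Dict[str, List[str]]:
    """Group aggregated pipeline output by its 'entity_group' label."""
    grouped: Dict[str, List[str]] = {}
    for ent in entities:
        grouped.setdefault(ent["entity_group"], []).append(ent["word"])
    return grouped
```

With the real repository id in place, `spans_by_label(load_quote_tagger()(text))` would return the quotation and speaker spans found in `text`, keyed by label.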


	## Intended uses & limitations

	More information needed

	## Training and evaluation data

	the [DirectQuote dataset](https://arxiv.org/abs/2110.07827) presented by Zhang and Liu (2021) is a manually annotated corpus of 19,760 paragraphs containing 10,279 direct quotations. As per the authors, it is "the largest and most complete corpus focusing on direct quotations in news texts" [1].
	```
	# DirectQuote Distribution of Data Sources
	| Region      | Name                                | Numbers |
	|-------------|-------------------------------------|---------|
	| U.S.        | Associated Press                    | 438     |
	|             | Cable News Network                  | 627     |
	|             | American Broadcasting Company       | 240     |
	|             | New York Times                      | 5,642   |
	|             | CBS Broadcasting                    | 4,890   |
	| UK          | British Broadcasting Corporation    | 926     |
	|             | Reuters                             | 5,836   |
	|             | The Guardian                        | 4,302   |
	| Canada      | The Globe and Mail                  | 1,955   |
	|             | The Star                            | 13,769  |
	| New Zealand | NZ Herald                           | 115     |
	| Australia   | Australian Broadcasting Corporation | 312     |
	|             | Sydney Morning Herald               | 93      |
	```
	Quote extraction and attribution appears to be an underserved area of NLP. A small handful of systems do perform this task, most notably Stanford's [CoreNLP model bundle](https://stanfordnlp.github.io/CoreNLP/quote.html) [2]. Quote Extraction and Attribution (QEA) solutions generally fall into one of two broad categories: (1) rule-based systems that identify quotation marks and common verbiage associated with a quotation (see [Textacy QEA](https://textacy.readthedocs.io/en/latest/api_reference/extract.html#textacy.extract.triples.direct_quotations) [3]), or (2) probabilistic model-based approaches that typically rely on LSTMs and other neural network architectures.

	Existing solutions of both categories lack the comparative speed and accuracy of newer, transformer-based systems. CoreNLP _does not_ support GPU-optimised inference. Rule-based systems, such as Textacy, are significantly faster but fall well short on precision: Textacy refused to process 28% of documents in a 1,000-document sample of the Whisp corpus. This issue is compounded by the vast array of quotation mark 'styles' available within Unicode; as listed below, there are well over a dozen distinct quotation marks.
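
To make the rule-based category concrete, here is a minimal sketch of a quotation-mark matcher. It is not Textacy's or CoreNLP's implementation; the pair list is a small, assumed subset of the styles a real system would need, and the function name `find_quotes` is hypothetical. Every new style means another pair to maintain, and nested or unbalanced marks are not handled at all.

```python
import re

# Assumed subset of opening/closing pairs; real text uses many more.
QUOTE_PAIRS = [
    ('"', '"'),            # ASCII double quotes
    ("\u201c", "\u201d"),  # curly double quotes
    ("\u00ab", "\u00bb"),  # guillemets
    ("\u201e", "\u201d"),  # low-9 opening, curly closing
    ("\u300c", "\u300d"),  # corner brackets
]


def find_quotes(text, pairs=QUOTE_PAIRS):
    """Return quoted substrings in order of appearance (no nesting support)."""
    found = []
    for open_q, close_q in pairs:
        # Non-greedy match between one opening mark and the next closing mark.
        pattern = re.escape(open_q) + r"(.+?)" + re.escape(close_q)
        for match in re.finditer(pattern, text):
            found.append((match.start(), match.group(1)))
    return [quote for _, quote in sorted(found)]
```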

	### Modifications to the DirectQuote Corpus
	As per the [CoreNLP documentation](https://stanfordnlp.github.io/CoreNLP/quote.html) on quote extraction and attribution (QEA), there exists a multitude of varying quotation styles (12+), any of which may appear in texts ingested by Whisp. For the reasons outlined in the introduction, it is necessary to adapt the DirectQuote dataset to represent a wider range of quotation styles.

	> Considers regular ascii (“”, ‘’, ``’’, and `’) as well as “smart” and international quotation marks as follows: “”,‘’, «», ‹›, 「」, 『』, „”, and ‚’.
	>
	> From CoreNLP Docs ~ Pipeline > Quote Extraction And Attribution

	I have included 11 quotation 'sets' to replace/populate pre-existing quotation marks in the DirectQuote dataset. These sets cover both ASCII and Unicode quotation marks, including a small variety of international styles, largely confined to those used by French- and German-speaking populations in Europe. Chinese-style quotation marks have not been included due to the limited overlap in publishing between Mandarin and English content.
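
The substitution can be sketched roughly as follows. This is an illustration under assumptions, not the script used to build the training data: the pairs in `QUOTE_STYLES` are examples only (the 11 sets used for this model are not listed in this card), the function name `restyle_quotes` is hypothetical, and the source corpus is assumed to hold each quotation mark as its own token. Swapping tokens one-for-one keeps the token-level labels aligned.

```python
import random

# Example opening/closing pairs only (assumption); not the actual 11 sets.
QUOTE_STYLES = [
    ('"', '"'),
    ("'", "'"),
    ("\u201c", "\u201d"),  # curly double
    ("\u2018", "\u2019"),  # curly single
    ("\u00ab", "\u00bb"),  # French guillemets
    ("\u2039", "\u203a"),  # single guillemets
    ("\u201e", "\u201d"),  # German-style double
    ("\u201a", "\u2019"),  # German-style single
]


def restyle_quotes(tokens, rng, open_mark="\u201c", close_mark="\u201d"):
    """Replace the given open/close marks with one randomly chosen style.

    The token count is unchanged, so per-token labels stay aligned.
    """
    new_open, new_close = rng.choice(QUOTE_STYLES)
    swap = {open_mark: new_open, close_mark: new_close}
    return [swap.get(tok, tok) for tok in tokens]
```

Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible across runs.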

	## Training procedure

	### Token Labels
	The DirectQuote corpus provides the following five labels, encoded in the IOB1 format:

	* LeftSpeaker — Quotation, the corresponding speaker is in the preceding text
	* RightSpeaker — Quotation, the corresponding speaker is in the following text
	* Unknown — Quotation, no corresponding speaker
	* Speaker — Speaker
	* Out — N/A
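
For readers unfamiliar with IOB1, the sketch below turns a tag sequence into labelled spans. In IOB1 a span normally starts with an `I-` tag, and `B-` appears only when a span directly follows another span of the same type. The exact tag strings (`I-LeftSpeaker`, `O`, and so on) are assumptions made for illustration, and `iob1_to_spans` is a hypothetical helper; check the model's `id2label` mapping for the real names.

```python
def iob1_to_spans(tags):
    """Return (label, start, end_exclusive) tuples from a list of IOB1 tags."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if tag == "O":
            label = None
        # A span begins on a B- tag, or on an I- tag whose label differs
        # from the span currently being built.
        starts_new = label is not None and (prefix == "B" or label != current)
        if current is not None and (label is None or starts_new):
            spans.append((current, start, i))
            current = None
        if starts_new:
            start, current = i, label
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans
```

For example, `["I-Speaker", "O", "I-LeftSpeaker", "I-LeftSpeaker"]` yields one `Speaker` span covering token 0 and one `LeftSpeaker` span covering tokens 2 and 3.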

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 2e-05
	- train_batch_size: 16
	- eval_batch_size: 16
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 5
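
To show what the `linear` scheduler implies for the learning rate, here is a small dependency-free sketch. It assumes no warmup steps (none are listed above) and takes the total of 4,880 optimisation steps from the training results table (976 steps per epoch over 5 epochs). The function `linear_lr` is a hypothetical helper that mirrors the usual linear-decay rule, not code from the training run.

```python
BASE_LR = 2e-5
TOTAL_STEPS = 4880  # 976 steps per epoch x 5 epochs, per the results table


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Learning rate at a given step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup (unused when warmup_steps=0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the end of each epoch under these assumptions.
for epoch in range(1, 6):
    print(epoch, linear_lr(976 * epoch))
```

Under these assumptions the rate falls steadily from 2e-05 at step 0 to zero at the final step.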

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
	| 0.3392        | 1.0   | 976  | 0.2050          | 0.7040    | 0.8297 | 0.7617 | 0.9327   |
	| 0.1915        | 2.0   | 1952 | 0.1996          | 0.7417    | 0.8990 | 0.8128 | 0.9337   |
	| 0.1668        | 3.0   | 2928 | 0.2023          | 0.7373    | 0.9066 | 0.8132 | 0.9369   |
	| 0.1447        | 4.0   | 3904 | 0.2125          | 0.7458    | 0.9107 | 0.8200 | 0.9367   |
	| 0.1248        | 5.0   | 4880 | 0.2339          | 0.7440    | 0.9090 | 0.8182 | 0.9355   |
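
As a quick sanity check on the table above, F1 can be recomputed from the reported precision and recall as their harmonic mean. The values below are copied from the table; small differences in the fourth decimal place are expected because the inputs are already rounded.

```python
# (epoch, precision, recall, reported F1), copied from the results table.
ROWS = [
    (1, 0.7040, 0.8297, 0.7617),
    (2, 0.7417, 0.8990, 0.8128),
    (3, 0.7373, 0.9066, 0.8132),
    (4, 0.7458, 0.9107, 0.8200),
    (5, 0.7440, 0.9090, 0.8182),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


for epoch, p, r, reported in ROWS:
    print(f"epoch {epoch}: recomputed F1 = {f1_score(p, r):.4f}, reported = {reported:.4f}")

# Epoch with the highest reported F1.
best_epoch = max(ROWS, key=lambda row: row[3])[0]
```

By this measure epoch 4 has the highest reported F1 (0.8200), while the headline figures at the top of this card correspond to the final epoch 5 checkpoint.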


	### Framework versions

	- Transformers 4.25.1
	- Pytorch 1.10.2+cu113
	- Datasets 2.8.0
	- Tokenizers 0.13.2

	## References
	[1] Zhang, Y., & Liu, Y. (2021, October 15). DirectQuote: A dataset for direct quotation extraction and attribution in news articles. arXiv.Org. https://arxiv.org/abs/2110.07827

	[2] Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

	[3] Chartbeat Labs, & DeWilde, B. (2016, February). Information Extraction. Textacy ~ NLP, before and after spaCy. https://textacy.readthedocs.io/en/latest/api_reference/extract.html#textacy.extract.triples.direct_quotations