Processors
This library includes processors for several traditional tasks. These processors can be used to process a dataset into examples that can be fed to a model.
Processors
All processors follow the same architecture which is that of the DataProcessor. The processor returns a list of InputExample. These InputExample can be converted to InputFeatures in order to be fed to the model.
Base class for data converters for sequence classification data sets.
Gets a collection of InputExample for the dev set.
Gets an example from a dict with tensorflow tensors.
Gets the list of labels for this data set.
Gets a collection of InputExample for the test set.
Gets a collection of InputExample for the train set.
Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts examples to the correct format.
( guid: str text_a: str text_b: typing.Optional[str] = None label: typing.Optional[str] = None )
A single training/test example for simple sequence classification.
Serializes this instance to a JSON string.
( input_ids: typing.List[int] attention_mask: typing.Optional[typing.List[int]] = None token_type_ids: typing.Optional[typing.List[int]] = None label: typing.Union[int, float, NoneType] = None )
A single set of features of data. Property names are the same names as the corresponding inputs to a model.
Serializes this instance to a JSON string.
GLUE
General Language Understanding Evaluation (GLUE) is a benchmark that evaluates the performance of models across a diverse set of existing NLU tasks. It was released together with the paper GLUE: A multi-task benchmark and analysis platform for natural language understanding
This library hosts a total of 10 processors for the following tasks: MRPC, MNLI, MNLI (mismatched), CoLA, SST2, STSB, QQP, QNLI, RTE and WNLI.
Those processors are:
MrpcProcessor
MnliProcessor
MnliMismatchedProcessor
Sst2Processor
StsbProcessor
QqpProcessor
QnliProcessor
RteProcessor
WnliProcessor
Additionally, the following method can be used to load values from a data file and convert them to a list of InputExample.
automethod,transformers.data.processors.glue.glue_convert_examples_to_features
Example usage
An example using these processors is given in the run_glue.py script.
XNLI
The Cross-Lingual NLI Corpus (XNLI) is a benchmark that evaluates the quality of cross-lingual text representations. XNLI is crowd-sourced dataset based on MultiNLI <http://www.nyu.edu/projects/bowman/multinli/>: pairs of text are labeled with textual entailment annotations for 15 different languages (including both high-resource language such as English and low-resource languages such as Swahili).
It was released together with the paper XNLI: Evaluating Cross-lingual Sentence Representations
This library hosts the processor to load the XNLI data:
XnliProcessor
Please note that since the gold labels are available on the test set, evaluation is performed on the test set.
An example using these processors is given in the run_xnli.py script.
SQuAD
The Stanford Question Answering Dataset (SQuAD) is a benchmark that evaluates the performance of models on question answering. Two versions are available, v1.1 and v2.0. The first version (v1.1) was released together with the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text. The second version (v2.0) was released alongside the paper Know What You Donβt Know: Unanswerable Questions for SQuAD.
This library hosts a processor for each of the two versions:
Processors
Those processors are:
SquadV1Processor
SquadV2Processor
They both inherit from the abstract class SquadProcessor
Processor for the SQuAD data set. overridden by SquadV1Processor and SquadV2Processor, used by the version 1.1 and version 2.0 of SQuAD, respectively.
Returns the evaluation example from the data directory.
Creates a list of SquadExample
using a TFDS dataset.
Examples:
>>> import tensorflow_datasets as tfds
>>> dataset = tfds.load("squad")
>>> training_examples = get_examples_from_dataset(dataset, evaluate=False)
>>> evaluation_examples = get_examples_from_dataset(dataset, evaluate=True)
Returns the training examples from the data directory.
Additionally, the following method can be used to convert SQuAD examples into
SquadFeatures
that can be used as model inputs.
automethod,transformers.data.processors.squad.squad_convert_examples_to_features
These processors as well as the aforementionned method can be used with files containing the data as well as with the tensorflow_datasets package. Examples are given below.
Example usage
Here is an example using the processors as well as the conversion method using data files:
# Loading a V2 processor
processor = SquadV2Processor()
examples = processor.get_dev_examples(squad_v2_data_dir)
# Loading a V1 processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(squad_v1_data_dir)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
doc_stride=args.doc_stride,
max_query_length=max_query_length,
is_training=not evaluate,
)
Using tensorflow_datasets is as easy as using a data file:
# tensorflow_datasets only handle Squad V1.
tfds_examples = tfds.load("squad")
examples = SquadV1Processor().get_examples_from_dataset(tfds_examples, evaluate=evaluate)
features = squad_convert_examples_to_features(
examples=examples,
tokenizer=tokenizer,
max_seq_length=max_seq_length,
doc_stride=args.doc_stride,
max_query_length=max_query_length,
is_training=not evaluate,
)
Another example using these processors is given in the run_squad.py script.