Summarization
#1 by brirrer - opened
Hi!
Is it possible to fine-tune this model specifically for summarization tasks? I've attempted this, but I consistently encounter errors regarding invalid training data.
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Seq2SeqTrainer, Seq2SeqTrainingArguments
tokenizer = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-large")
model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-large")
from datasets import load_dataset
train_dataset = load_dataset("json", data_files="data.json", split="train")
eval_dataset = load_dataset("json", data_files="data_eval.json")
training_args = Seq2SeqTrainingArguments(
    output_dir="/var/tmp/output",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer
)
trainer.train()
I've tried two types of datasets:
#version 1 [{'input_text':'a', 'target_text':'1'}, {'input_text':'b', 'target_text':'2'}]
#version 2 {'input_text':['a', 'b'], 'target_text':['1','2']}
Hi, the task you describe is a sequence-to-sequence task. These models are encoder-only and are not made for sequence generation, so you would need a decoder as well.
You can look into extractive summarization or using our model as the encoder: https://huggingface.co/docs/transformers/model_doc/encoder-decoder
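To make the extractive option concrete: an extractive summarizer does not generate new text, it only selects sentences that already exist in the input. The sketch below is a minimal illustration using only the Python standard library. It scores sentences by average word frequency, which is a stand-in I chose for illustration and not how our model works; in practice you would replace the scoring step with something model-based, for example sentence embeddings from the encoder. The names split_sentences and extractive_summary are hypothetical helpers, not part of any library.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    # Word frequencies over the whole input text.
    freq = Counter(word for words in tokenized for word in words)

    def score(words):
        # Assumption for this sketch: a sentence is "important" when its
        # words are frequent in the document (average word frequency).
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(tokenized[i]),
        reverse=True,
    )
    # Restore document order for the selected sentences.
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


text = (
    "De kat zit op de mat. "
    "De kat slaapt op de mat in de zon. "
    "Morgen regent het misschien."
)
print(extractive_summary(text, num_sentences=1))
```

Because the output is always a subset of the input sentences, no decoder is needed, which is why this kind of approach can work with an encoder-only model.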
pdelobelle changed discussion status to closed