Seeking Guidance on Improving ROUGE Scores in Text Summarization with BART-Large-CNN
#49 · opened by cooper521
I want to know how to configure the settings to reproduce (or improve on) the reported scores. The "Evaluation results" section on the model page lists:
- ROUGE-1 on cnn_dailymail: 42.949
- ROUGE-2 on cnn_dailymail: 20.815
- ROUGE-L on cnn_dailymail: 30.619
However, when I evaluated bart-large-cnn on the CNN/DailyMail test set myself, the scores were significantly lower:
- ROUGE-1: 0.2891
- ROUGE-2: 0.1383
- ROUGE-L: 0.2104
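(If I understand the scales correctly, `rouge_score` returns F-measures in [0, 1] while the model card reports values on a 0-100 scale, so my numbers correspond to roughly 28.9 / 13.8 / 21.0, still well below the reported results. A quick sanity check, assuming the card numbers are percentage-scale ROUGE F1:)

my_scores = {"rouge1": 0.2891, "rouge2": 0.1383, "rougeL": 0.2104}
# Scale the rouge_score F-measures to the 0-100 range used on the model card
print({k: round(v * 100, 2) for k, v in my_scores.items()})
# {'rouge1': 28.91, 'rouge2': 13.83, 'rougeL': 21.04}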
Here is the code I used:
# Imports
import datasets
from accelerate import Accelerator
from rouge_score import rouge_scorer
from tqdm import tqdm
from transformers import AutoTokenizer, BartForConditionalGeneration
model_name = '/root/.cache/huggingface/hub/bart-large-cnn'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)
dataset = datasets.load_from_disk('data/cnn_daily')['test']
# Initialize Accelerate and the ROUGE scorer
accelerator = Accelerator()
model, tokenizer = accelerator.prepare(model, tokenizer)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# Generate summaries
predicted_texts = []
reference_texts = []
batch_size = 8
for i in tqdm(range(0, len(dataset), batch_size), desc="Processing batches"):
    # For a quick test, only evaluate the first two batches
    if i >= 2 * batch_size:
        break
    batch = dataset[i: i + batch_size]
    input_texts = batch['article']
    target_texts = batch['highlights']
    inputs = tokenizer(
        input_texts,
        return_tensors="pt",
        add_special_tokens=True,
        padding=True,
        truncation=True,
        max_length=1024
    ).to(model.device)
    # Generate summary IDs (sampling with temperature 0.7)
    summary_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=200,
        top_k=0
    )
    # Decode predictions and collect references
    predicted_texts_batch = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
    predicted_texts.extend(predicted_texts_batch)
    reference_texts.extend(target_texts)
# Compute per-example ROUGE scores (rouge_score expects (target, prediction))
all_scores = []
for ref, pred in zip(reference_texts, predicted_texts):
    scores = scorer.score(ref, pred)
    all_scores.append(scores)

# Average the F-measures across examples
avg_scores = {key: 0.0 for key in all_scores[0].keys()}
for score in all_scores:
    for key, value in score.items():
        avg_scores[key] += value.fmeasure
for key, value in avg_scores.items():
    avg_scores[key] = value / len(all_scores)
print("Average ROUGE scores:", avg_scores)