Seeking Guidance on Improving ROUGE Scores in Text Summarization with BART-Large-CNN

#49
by cooper521 - opened

I want to know how to configure the settings to achieve scores closer to those reported. I observed that the evaluation results in the "Evaluation results" column are as follows:

  • ROUGE-1 on cnn_dailymail: 42.949
  • ROUGE-2 on cnn_dailymail: 20.815
  • ROUGE-L on cnn_dailymail: 30.619

However, when I evaluated bart-large-cnn on the CNN/DailyMail test split myself, the scores were significantly lower:

  • ROUGE-1: 0.2891
  • ROUGE-2: 0.1383
  • ROUGE-L: 0.2104
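
(Note on scale: rouge_score reports F-measures between 0 and 1, so these values correspond to roughly 0.2891 × 100 ≈ 28.9, 13.8, and 21.0 on the 0–100 convention used in the model card, which is still well below the reported 42.9 / 20.8 / 30.6.)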

Here is the code I used:
# Importing libraries
import datasets
from tqdm import tqdm
from transformers import AutoTokenizer, BartForConditionalGeneration
from accelerate import Accelerator
from rouge_score import rouge_scorer

# Load the local model, tokenizer, and test split
model_name = '/root/.cache/huggingface/hub/bart-large-cnn'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)
dataset = datasets.load_from_disk('data/cnn_daily')['test']

# Initialize the Accelerator (places the model on the available device) and the ROUGE scorer
accelerator = Accelerator()
model, tokenizer = accelerator.prepare(model, tokenizer)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Generate summaries
predicted_texts = []
reference_texts = []
batch_size = 8

for i in tqdm(range(0, len(dataset), batch_size), desc="Processing batches"):
    # Quick test: only evaluate the first two batches (16 examples)
    if i >= 2 * batch_size:
        break
    
    batch = dataset[i: i + batch_size]
    input_texts = batch['article']
    target_texts = batch['highlights']
    
    inputs = tokenizer(
        input_texts, 
        return_tensors="pt", 
        add_special_tokens=True, 
        padding=True, 
        truncation=True, 
        max_length=1024
    ).to(model.device)
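    # Articles longer than BART's 1024-token input limit are truncated here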
    
    # Generate summary IDs by sampling (temperature 0.7; top_k=0 disables top-k filtering)
    summary_ids = model.generate(
        **inputs, 
        do_sample=True, 
        temperature=0.7, 
        max_new_tokens=200, 
        top_k=0
    )
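    # Note: do_sample=True makes decoding stochastic; the figures reported in the
    # model card were presumably produced with deterministic beam search, so
    # sampled outputs will generally score lower on ROUGE.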

    # Decode and extend lists
    predicted_texts_batch = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
    predicted_texts.extend(predicted_texts_batch)
    reference_texts.extend(target_texts)

# Compute ROUGE scores
all_scores = []
for ref, pred in zip(reference_texts, predicted_texts):
    scores = scorer.score(ref, pred)
    all_scores.append(scores)

# Calculate average scores
avg_scores = {key: 0.0 for key in all_scores[0].keys()}
for score in all_scores:
    for key, value in score.items():
        avg_scores[key] += value.fmeasure

for key, value in avg_scores.items():
    avg_scores[key] = value / len(all_scores)

print("Average ROUGE scores:", avg_scores)
