---
tags:
  - generated_from_trainer
  - summarization
  - book summary
datasets:
  - kmfoda/booksum
metrics:
  - rouge
model-index:
  - name: long-t5-tglobal-large-booksum-WIP
    results:
      - task:
          type: summarization
          name: Summarization
        dataset:
          name: kmfoda/booksum
          type: kmfoda/booksum
          config: kmfoda--booksum
          split: test
        metrics:
          - name: ROUGE-1
            type: rouge
            value: 25.6136
            verified: true
          - name: ROUGE-2
            type: rouge
            value: 2.8652
            verified: true
          - name: ROUGE-L
            type: rouge
            value: 12.4913
            verified: true
          - name: ROUGE-LSUM
            type: rouge
            value: 23.1102
            verified: true
          - name: loss
            type: loss
            value: 5.004334926605225
            verified: true
          - name: gen_len
            type: gen_len
            value: 89.4354
            verified: true
---

long-t5-tglobal-large-booksum-WIP

This is a WIP checkpoint, fine-tuned from the vanilla (original) google/long-t5-tglobal-large for roughly 10 epochs. It is not ready to be used for inference.

This model is a fine-tuned version of google/long-t5-tglobal-large on the kmfoda/booksum dataset. It achieves the following results on the evaluation set (a sketch of how such ROUGE scores are computed follows the list):

  • Loss: 4.9519
  • Rouge1: 21.8058
  • Rouge2: 2.9343
  • Rougel: 10.3717
  • Rougelsum: 20.1537
  • Gen Len: 106.055
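
The ROUGE values here and in the metadata above are the standard rouge scores reported on a 0-100 scale. The exact evaluation script is not part of this card, but a minimal sketch of computing comparable scores with the Hugging Face evaluate library, on placeholder texts, looks like this:

```python
# Illustrative only: compute ROUGE the way the HF summarization examples do.
# The predictions/references below are placeholders, not booksum data.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the cat sat on the mat"]       # model summaries
references = ["a cat was sitting on the mat"]  # gold summaries

scores = rouge.compute(predictions=predictions, references=references)
# Scale to 0-100 to match the numbers reported above.
print({k: round(v * 100, 4) for k, v in scores.items()})
```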

Model description

This checkpoint tests fine-tuning only on booksum, with 16384 input / 1024 output token lengths for the whole run (vs. a previous large WIP checkpoint that started from a partially trained pubmed checkpoint).
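
As an illustration of what the 16384/1024 setting refers to, here is a preprocessing sketch that truncates inputs to 16384 tokens and targets to 1024 tokens. The booksum field names ("chapter", "summary_text") are assumptions about the dataset schema, not something this card specifies:

```python
# Illustrative preprocessing sketch for the 16384/1024 input/target lengths.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-large")

def preprocess(example):
    model_inputs = tokenizer(
        example["chapter"],                   # assumed booksum source-text field
        max_length=16384,
        truncation=True,
    )
    labels = tokenizer(
        text_target=example["summary_text"],  # assumed booksum summary field
        max_length=1024,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```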

Intended uses & limitations

This is a WIP checkpoint, fine-tuned from the vanilla (original) google/long-t5-tglobal-large for roughly 10 epochs. It is not ready to be used for inference.
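
If you still want to experiment with the checkpoint (for example, to continue fine-tuning or sanity-check generations), loading follows the standard transformers pattern. The repository id below is a placeholder; substitute the actual Hub id of this model:

```python
# Experimental use only: the card states this WIP checkpoint is not ready for inference.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "<user>/long-t5-tglobal-large-booksum-WIP"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Long document text to summarize ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=16384)
with torch.no_grad():
    summary_ids = model.generate(**inputs, max_new_tokens=512, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```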

Training and evaluation data

This checkpoint is fine-tuned only on booksum (vs. a previous large WIP checkpoint that started from a partially trained pubmed checkpoint).
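
For reference, a minimal sketch of pulling the same dataset from the Hub with the datasets library (the split and field layout is whatever kmfoda/booksum provides; inspect it rather than relying on this sketch):

```python
# Load the kmfoda/booksum dataset used for fine-tuning and evaluation.
from datasets import load_dataset

dataset = load_dataset("kmfoda/booksum")
print(dataset)                     # available splits and row counts
print(dataset["test"][0].keys())   # inspect the available fields
```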

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a rough TrainingArguments sketch follows the list):

  • learning_rate: 0.0004
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 31060
  • distributed_type: multi-GPU
  • num_devices: 4
  • gradient_accumulation_steps: 32
  • total_train_batch_size: 128
  • total_eval_batch_size: 4
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • num_epochs: 3.0
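
As a rough, non-authoritative translation of the list above into Seq2SeqTrainingArguments (the run was distributed over 4 GPUs, which is what turns the per-device batch size of 1 into a total train batch size of 128; the bf16 flag is an assumption based on the bf16 eval below, not something stated in the list):

```python
# Rough sketch of the hyperparameters above as Seq2SeqTrainingArguments.
# 1 per-device batch x 4 GPUs x 32 accumulation steps = total train batch size 128.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="long-t5-tglobal-large-booksum-WIP",
    learning_rate=4e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=3.0,
    lr_scheduler_type="cosine",
    seed=31060,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    bf16=True,                  # assumption, matching the bf16 eval reported below
    predict_with_generate=True,
)
```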

Training results

| Training Loss | Epoch | Step | Validation Loss | Rouge1  | Rouge2 | Rougel  | Rougelsum | Gen Len |
|:-------------:|:-----:|:----:|:---------------:|:-------:|:------:|:-------:|:---------:|:-------:|
| 5.0389        | 0.99  | 37   | 5.1884          | 29.995  | 4.4045 | 12.8837 | 27.557    | 219.03  |
| 4.8986        | 1.0   | 75   | 5.1286          | 26.921  | 3.7193 | 11.3605 | 25.3492   | 276.005 |
| 4.5928        | 2.0   | 150  | 4.9900          | 26.6667 | 3.7342 | 11.8223 | 24.7087   | 178.775 |
| 4.6159        | 3.0   | 225  | 4.9519          | 21.8058 | 2.9343 | 10.3717 | 20.1537   | 106.055 |

Eval in bf16 (bfloat16) precision:

***** eval metrics *****
  epoch                   =        3.0
  eval_gen_len            =    103.075
  eval_loss               =     4.9501
  eval_rouge1             =    21.6345
  eval_rouge2             =      2.877
  eval_rougeL             =     10.386
  eval_rougeLsum          =    20.0148
  eval_runtime            = 0:06:02.75
  eval_samples            =        200
  eval_samples_per_second =      0.551
  eval_steps_per_second   =      0.138
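
The bf16 eval above simply runs the evaluation loop with the model weights in bfloat16; a minimal loading sketch (placeholder repo id again):

```python
# Load the checkpoint in bfloat16, mirroring the bf16 evaluation above.
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "<user>/long-t5-tglobal-large-booksum-WIP",  # placeholder repo id
    torch_dtype=torch.bfloat16,
).eval()
```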

Framework versions

  • Transformers 4.25.0.dev0
  • Pytorch 1.13.0+cu117
  • Datasets 2.6.1
  • Tokenizers 0.13.1