Batch Inference causes degraded performance

#43
by tanliboy - opened

I want to bring attention to this issue: batch inference with Gemma-2-9b-it in lm-evaluation-harness leads to significantly degraded performance.

1st run, with automatic batch size selection (resolved to batch size 1):

| Tasks  | Version | Filter | n-shot | Metric                  | Value  | Stderr   |
|--------|---------|--------|--------|-------------------------|--------|----------|
| ifeval | 2       | none   | 0      | inst_level_loose_acc    | 0.7674 | N/A      |
|        |         | none   | 0      | inst_level_strict_acc   | 0.7554 | N/A      |
|        |         | none   | 0      | prompt_level_loose_acc  | 0.6784 | ± 0.0201 |
|        |         | none   | 0      | prompt_level_strict_acc | 0.6636 | ± 0.0203 |

2nd run, with batch size 32:

| Tasks  | Version | Filter | n-shot | Metric                  | Value  | Stderr   |
|--------|---------|--------|--------|-------------------------|--------|----------|
| ifeval | 2       | none   | 0      | inst_level_loose_acc    | 0.0528 | N/A      |
|        |         | none   | 0      | inst_level_strict_acc   | 0.0528 | N/A      |
|        |         | none   | 0      | prompt_level_loose_acc  | 0.0462 | ± 0.0090 |
|        |         | none   | 0      | prompt_level_strict_acc | 0.0462 | ± 0.0090 |
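For reference, a minimal sketch of how the two runs can be reproduced through the harness's Python API (assuming lm-eval >= 0.4; the `dtype=bfloat16` setting and exact `model_args` are my assumptions, not taken from the runs above). The CLI equivalent only differs in `--batch_size auto` vs. `--batch_size 32`:

```python
import lm_eval

# Run ifeval twice: once with automatic batch size detection,
# once with a fixed batch size of 32, mirroring the two runs above.
for batch_size in ("auto", 32):
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=google/gemma-2-9b-it,dtype=bfloat16",  # assumed load config
        tasks=["ifeval"],
        batch_size=batch_size,
    )
    print(batch_size, results["results"]["ifeval"])
```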

It is likely related to the known sliding-window attention issue in Gemma-2.
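One way to narrow it down is to compare greedy generations at batch size 1 against a left-padded batch: if sliding-window masking (or padding handling) is broken for batched inputs, the outputs will diverge. Below is a minimal sketch, not from the original runs; the prompts are placeholders, and `attn_implementation="eager"` follows the Gemma-2 release guidance for its soft-capped attention:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"
tok = AutoTokenizer.from_pretrained(model_id, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # recommended for Gemma-2 at release
)

prompts = ["Write a haiku about the sea.", "Name three prime numbers."]

# Reference: generate one prompt at a time (batch size 1).
singles = []
for p in prompts:
    enc = tok(p, return_tensors="pt").to(model.device)
    out = model.generate(**enc, max_new_tokens=64, do_sample=False)
    singles.append(tok.decode(out[0, enc.input_ids.shape[1]:],
                              skip_special_tokens=True))

# Batched: same prompts, left-padded into a single batch.
batch = tok(prompts, return_tensors="pt", padding=True).to(model.device)
out = model.generate(**batch, max_new_tokens=64, do_sample=False)
batched = [tok.decode(o[batch.input_ids.shape[1]:], skip_special_tokens=True)
           for o in out]

# With correct padding and attention masking, greedy decoding
# should produce identical text in both settings.
for s, b in zip(singles, batched):
    print("match:", s == b)
```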
