Batch Inference causes degraded performance
#43
opened by tanliboy
I want to bring attention to this issue: batch inference with Gemma-2-9b-it on lm-evaluation-harness
leads to significantly degraded performance.
1st run with auto batch size (batch size = 1 after auto detection)
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| ifeval | 2 | none | 0 | inst_level_loose_acc | ↑ | 0.7674 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.7554 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.6784 | ± | 0.0201 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.6636 | ± | 0.0203 |
2nd run with batch size = 32
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| ifeval | 2 | none | 0 | inst_level_loose_acc | ↑ | 0.0528 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.0528 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.0462 | ± | 0.0090 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.0462 | ± | 0.0090 |
It is likely related to the sliding-window attention issue.
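For reference, the two runs can be reproduced with commands along these lines, a sketch based on the standard lm-evaluation-harness CLI; the exact model ID (`google/gemma-2-9b-it` is assumed here) and any extra `--model_args` from the original runs may differ:

```shell
# Run 1: automatic batch size detection (resolved to batch size = 1 above)
lm_eval --model hf \
  --model_args pretrained=google/gemma-2-9b-it \
  --tasks ifeval \
  --batch_size auto

# Run 2: fixed batch size of 32, where the ifeval scores collapse
lm_eval --model hf \
  --model_args pretrained=google/gemma-2-9b-it \
  --tasks ifeval \
  --batch_size 32
```

Only the `--batch_size` value differs between the two invocations, which isolates batching as the variable behind the score drop.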