Batch Inference causes degraded performance
#43
opened by tanliboy
I want to bring attention to this issue: batch inference with Gemma-2-9b-it on lm-evaluation-harness
leads to significantly degraded performance.
1st run with auto batch size (batch size = 1 after auto detection)
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| ifeval | 2 | none | 0 | inst_level_loose_acc | ↑ | 0.7674 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.7554 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.6784 | ± | 0.0201 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.6636 | ± | 0.0203 |
2nd run with batch size = 32
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| ifeval | 2 | none | 0 | inst_level_loose_acc | ↑ | 0.0528 | ± | N/A |
| | | none | 0 | inst_level_strict_acc | ↑ | 0.0528 | ± | N/A |
| | | none | 0 | prompt_level_loose_acc | ↑ | 0.0462 | ± | 0.0090 |
| | | none | 0 | prompt_level_strict_acc | ↑ | 0.0462 | ± | 0.0090 |
It is likely related to the sliding-window attention issue.
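For reference, the two runs can be reproduced with commands along these lines, a sketch based on the standard lm-evaluation-harness CLI; the exact model ID (`google/gemma-2-9b-it` is assumed here) and any extra `--model_args` from the original runs may differ:

```shell
# Run 1: automatic batch size detection (resolved to batch size = 1 above)
lm_eval --model hf \
  --model_args pretrained=google/gemma-2-9b-it \
  --tasks ifeval \
  --batch_size auto

# Run 2: fixed batch size of 32, where the ifeval scores collapse
lm_eval --model hf \
  --model_args pretrained=google/gemma-2-9b-it \
  --tasks ifeval \
  --batch_size 32
```

Only the `--batch_size` value differs between the two invocations, which isolates batching as the variable behind the score drop.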