
Goodhart's Law on Benchmarks

| Capability | Description | miniG | Gemini-Flash | GLM-4-9B-Chat | Llama 3.1 8B Instruct |
|---|---|---|---|---|---|
| MMLU | Representation of questions in 57 subjects (incl. STEM, humanities, and others) | 85.45 | 78.9 | 72.4 | 69.4 |
| IFEval | Evaluation of instruction-following using verifiable prompts | 74.22 | - | 69 | 80.4 |
| GSM8K | Challenging math problems (5-shot evaluation) | 75.89 (5-shot) | 86.2 (11-shot) | 79.6 | 84.5 (8-shot CoT) |
| HumanEval | Python code generation on a held-out dataset | 79.88 | 74.3 | 71.8 | 72.6 |
| GPQA | Challenging dataset of questions from biology, physics, and chemistry | 37.37 | 39.5 | 34.3 (base) | 34.2 |
| Context Window | Maximum context length the model can handle | 1M | 1M | 128K | 128K |
| Input | Supported input modalities | | Text, image, audio, video | Text only | Text only |

1. miniG is a 14B-parameter model derived from the weights of the 9B-parameter glm-4-9b-chat-1m model. It was further pre-trained on a selected corpus of 20B tokens while retaining its long-context capabilities, then fine-tuned on a dataset of more than 120M conversation entries synthesized from that corpus through cross-page clustering, similar to RAG. miniG also underwent two-stage multimodal training for single-image input; the second stage reinitialized 5B parameters of a Vision Transformer from glm-4v-9b for Locked-Image Tuning.
2. miniG's outputs are formatted similarly to those of Gemini 1.5 Flash, but the model was not trained on data generated by the Gemini models.
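
The GSM8K row above reports each model's score under a different prompting setup, which matters when reading the comparison. The following is a minimal sketch of how one might check which scores are directly comparable before ranking them. The scores and setup labels are copied from the table; the helper names `comparable_to` and `distinct_setups` are hypothetical and not part of any library, and `None` marks a model whose setup the table does not state.

```python
# GSM8K scores and setup labels as listed in the table above.
# A setup of None means the table gives no setup for that model.
gsm8k = {
    "miniG": (75.89, "5-shot"),
    "Gemini-Flash": (86.2, "11-shot"),
    "GLM-4-9B-Chat": (79.6, None),
    "Llama 3.1 8B Instruct": (84.5, "8-shot CoT"),
}


def comparable_to(reference, results):
    """Return the models whose stated setup matches the reference model's.

    Models with an unstated setup (None) are never treated as comparable.
    """
    ref_setup = results[reference][1]
    return sorted(
        name
        for name, (_, setup) in results.items()
        if name != reference and setup is not None and setup == ref_setup
    )


def distinct_setups(results):
    """Return the set of setup labels that appear, including None."""
    return {setup for _, setup in results.values()}


if __name__ == "__main__":
    print(comparable_to("miniG", gsm8k))   # prints []
    print(len(distinct_setups(gsm8k)))     # prints 4
```

Under these labels no other model shares miniG's 5-shot setup, so the GSM8K column is best read as indicative rather than as a like-for-like ranking.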