Capability | Description | miniG | Gemini-Flash | GLM-4-9B-Chat | Llama 3.1 8B Instruct |
---|---|---|---|---|---|
MMLU | Multiple-choice questions across 57 subjects (incl. STEM, humanities, and others) | 85.45 | 78.9 | 72.4 | 69.4 |
IFEval | Instruction-following evaluation using verifiable prompts | 74.22 | - | 69 | 80.4 |
GSM8K | Challenging grade-school math word problems (shot counts noted per model; see the prompt sketch below) | 75.89 (5-shot) | 86.2 (11-shot) | 79.6 | 84.5 (8-shot CoT) |
HumanEval | Python code generation on a held-out dataset (0-shot) | 79.88 | 74.3 | 71.8 | 72.6 |
GPQA | Challenging graduate-level questions in biology, physics, and chemistry | 37.37 | 39.5 | 34.3 (base) | 34.2 |
Context Window | Maximum context length the model can handle | 1M | 1M | 128K | 128K |
Input | Supported input modalities | Text, image (single model) | Text, image, audio, video | Text only | Text only |
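
Several of the scores above are reported with an explicit shot count (5-shot GSM8K, 0-shot HumanEval, and so on). As a rough illustration of what that means, the sketch below assembles an n-shot prompt by prepending n solved exemplars to the target question. The exemplars and the helper name are hypothetical; this is not the evaluation harness behind the numbers in the table.

```python
# Minimal sketch of n-shot prompt construction (illustrative only; the exact
# prompts and exemplars used for the reported scores are not specified here).

def build_few_shot_prompt(exemplars, question, n_shots):
    """Concatenate n_shots solved exemplars, then append the target question."""
    parts = []
    for q, a in exemplars[:n_shots]:
        parts.append(f"Question: {q}\nAnswer: {a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

# Hypothetical exemplars standing in for GSM8K-style problems.
exemplars = [
    ("If 3 pens cost $6, how much do 5 pens cost?",
     "Each pen costs $2, so 5 pens cost $10. #### 10"),
    ("A train travels 60 km in 1 hour. How far does it travel in 3 hours?",
     "60 * 3 = 180 km. #### 180"),
]

prompt = build_few_shot_prompt(
    exemplars,
    "Tom has 4 apples and buys 7 more. How many apples does he have now?",
    n_shots=2,
)
print(prompt)
```

A 0-shot setting simply omits the exemplars, while chain-of-thought (CoT) variants keep the worked reasoning steps in each exemplar answer so the model is encouraged to reason before giving its final result.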