metadata
language:
- en
license: apache-2.0
datasets:
- Locutusque/TM-DATA-V2
- LLM360/TxT360
- mlfoundations/dclm-baseline-1.0
- Skylion007/openwebtext
- JeanKaddour/minipile
- eminorhan/gutenberg_en
model-index:
- name: TinyMistral-248M-v3
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: HuggingFaceH4/ifeval
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 16.39
name: strict accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: BBH
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 1.78
name: normalized accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: hendrycks/competition_math
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 0
name: exact match
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 0
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 5.15
name: acc_norm
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 1.47
name: accuracy
source:
url: >-
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3
name: Open LLM Leaderboard
TinyMistral-248M-v3 is still in training; it has been trained on approximately 21 billion tokens so far.
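For quick experimentation, the checkpoint can be loaded with the standard `transformers` causal-LM classes. The snippet below is a minimal sketch, assuming the repository id `M4-ai/TinyMistral-248M-v3` (the id used in the leaderboard links in the metadata) and default generation settings; it is not an official usage recipe.

```python
# Minimal sketch: load the checkpoint and generate a short continuation.
# Assumes the repo id M4-ai/TinyMistral-248M-v3 (as referenced in the
# leaderboard links above) and the standard transformers causal-LM API.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "M4-ai/TinyMistral-248M-v3"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The history of the printing press", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```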
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
Open LLM Leaderboard | N/A | |||||||
- arc_challenge | 1 | none | 25 | acc | ↑ | 0.2005 | ± | 0.0117 |
| | | none | 25 | acc_norm | ↑ | 0.2406 | ± | 0.0125 |
- gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.0083 | ± | 0.0025 |
| | | strict-match | 5 | exact_match | ↑ | 0.0000 | ± | 0.0000 |
- hellaswag | 1 | none | 10 | acc | ↑ | 0.2724 | ± | 0.0044 |
| | | none | 10 | acc_norm | ↑ | 0.2838 | ± | 0.0045 |
- mmlu | 2 | none | | acc | ↑ | 0.2290 | ± | 0.0035 |
- humanities | 2 | none | | acc | ↑ | 0.2380 | ± | 0.0062 |
- formal_logic | 1 | none | 5 | acc | ↑ | 0.2460 | ± | 0.0385 |
- high_school_european_history | 1 | none | 5 | acc | ↑ | 0.1818 | ± | 0.0301 |
- high_school_us_history | 1 | none | 5 | acc | ↑ | 0.2647 | ± | 0.0310 |
- high_school_world_history | 1 | none | 5 | acc | ↑ | 0.2911 | ± | 0.0296 |
- international_law | 1 | none | 5 | acc | ↑ | 0.2149 | ± | 0.0375 |
- jurisprudence | 1 | none | 5 | acc | ↑ | 0.2685 | ± | 0.0428 |
- logical_fallacies | 1 | none | 5 | acc | ↑ | 0.2209 | ± | 0.0326 |
- moral_disputes | 1 | none | 5 | acc | ↑ | 0.2457 | ± | 0.0232 |
- moral_scenarios | 1 | none | 5 | acc | ↑ | 0.2369 | ± | 0.0142 |
- philosophy | 1 | none | 5 | acc | ↑ | 0.1865 | ± | 0.0221 |
- prehistory | 1 | none | 5 | acc | ↑ | 0.1975 | ± | 0.0222 |
- professional_law | 1 | none | 5 | acc | ↑ | 0.2432 | ± | 0.0110 |
- world_religions | 1 | none | 5 | acc | ↑ | 0.3099 | ± | 0.0355 |
- other | 2 | none | | acc | ↑ | 0.2375 | ± | 0.0076 |
- business_ethics | 1 | none | 5 | acc | ↑ | 0.3200 | ± | 0.0469 |
- clinical_knowledge | 1 | none | 5 | acc | ↑ | 0.2226 | ± | 0.0256 |
- college_medicine | 1 | none | 5 | acc | ↑ | 0.1965 | ± | 0.0303 |
- global_facts | 1 | none | 5 | acc | ↑ | 0.1800 | ± | 0.0386 |
- human_aging | 1 | none | 5 | acc | ↑ | 0.3004 | ± | 0.0308 |
- management | 1 | none | 5 | acc | ↑ | 0.1942 | ± | 0.0392 |
- marketing | 1 | none | 5 | acc | ↑ | 0.2735 | ± | 0.0292 |
- medical_genetics | 1 | none | 5 | acc | ↑ | 0.3000 | ± | 0.0461 |
- miscellaneous | 1 | none | 5 | acc | ↑ | 0.2478 | ± | 0.0154 |
- nutrition | 1 | none | 5 | acc | ↑ | 0.2222 | ± | 0.0238 |
- professional_accounting | 1 | none | 5 | acc | ↑ | 0.2021 | ± | 0.0240 |
- professional_medicine | 1 | none | 5 | acc | ↑ | 0.1912 | ± | 0.0239 |
- virology | 1 | none | 5 | acc | ↑ | 0.2590 | ± | 0.0341 |
- social sciences | 2 | none | | acc | ↑ | 0.2203 | ± | 0.0075 |
- econometrics | 1 | none | 5 | acc | ↑ | 0.2368 | ± | 0.0400 |
- high_school_geography | 1 | none | 5 | acc | ↑ | 0.2020 | ± | 0.0286 |
- high_school_government_and_politics | 1 | none | 5 | acc | ↑ | 0.1865 | ± | 0.0281 |
- high_school_macroeconomics | 1 | none | 5 | acc | ↑ | 0.2205 | ± | 0.0210 |
- high_school_microeconomics | 1 | none | 5 | acc | ↑ | 0.2143 | ± | 0.0267 |
- high_school_psychology | 1 | none | 5 | acc | ↑ | 0.1908 | ± | 0.0168 |
- human_sexuality | 1 | none | 5 | acc | ↑ | 0.2672 | ± | 0.0388 |
- professional_psychology | 1 | none | 5 | acc | ↑ | 0.2386 | ± | 0.0172 |
- public_relations | 1 | none | 5 | acc | ↑ | 0.1727 | ± | 0.0362 |
- security_studies | 1 | none | 5 | acc | ↑ | 0.2367 | ± | 0.0272 |
- sociology | 1 | none | 5 | acc | ↑ | 0.2488 | ± | 0.0306 |
- us_foreign_policy | 1 | none | 5 | acc | ↑ | 0.2600 | ± | 0.0441 |
- stem | 2 | none | | acc | ↑ | 0.2157 | ± | 0.0073 |
- abstract_algebra | 1 | none | 5 | acc | ↑ | 0.2200 | ± | 0.0416 |
- anatomy | 1 | none | 5 | acc | ↑ | 0.1778 | ± | 0.0330 |
- astronomy | 1 | none | 5 | acc | ↑ | 0.1908 | ± | 0.0320 |
- college_biology | 1 | none | 5 | acc | ↑ | 0.2778 | ± | 0.0375 |
- college_chemistry | 1 | none | 5 | acc | ↑ | 0.2200 | ± | 0.0416 |
- college_computer_science | 1 | none | 5 | acc | ↑ | 0.2100 | ± | 0.0409 |
- college_mathematics | 1 | none | 5 | acc | ↑ | 0.2100 | ± | 0.0409 |
- college_physics | 1 | none | 5 | acc | ↑ | 0.2157 | ± | 0.0409 |
- computer_security | 1 | none | 5 | acc | ↑ | 0.2700 | ± | 0.0446 |
- conceptual_physics | 1 | none | 5 | acc | ↑ | 0.2638 | ± | 0.0288 |
- electrical_engineering | 1 | none | 5 | acc | ↑ | 0.2483 | ± | 0.0360 |
- elementary_mathematics | 1 | none | 5 | acc | ↑ | 0.2037 | ± | 0.0207 |
- high_school_biology | 1 | none | 5 | acc | ↑ | 0.1774 | ± | 0.0217 |
- high_school_chemistry | 1 | none | 5 | acc | ↑ | 0.2020 | ± | 0.0282 |
- high_school_computer_science | 1 | none | 5 | acc | ↑ | 0.2500 | ± | 0.0435 |
- high_school_mathematics | 1 | none | 5 | acc | ↑ | 0.2148 | ± | 0.0250 |
- high_school_physics | 1 | none | 5 | acc | ↑ | 0.2053 | ± | 0.0330 |
- high_school_statistics | 1 | none | 5 | acc | ↑ | 0.1481 | ± | 0.0242 |
- machine_learning | 1 | none | 5 | acc | ↑ | 0.3125 | ± | 0.0440 |
- truthfulqa_gen | 3 | none | 0 | bleu_acc | ↑ | 0.2362 | ± | 0.0149 |
| | | none | 0 | bleu_diff | ↑ | -1.0138 | ± | 0.2569 |
| | | none | 0 | bleu_max | ↑ | 7.9522 | ± | 0.4088 |
| | | none | 0 | rouge1_acc | ↑ | 0.2595 | ± | 0.0153 |
| | | none | 0 | rouge1_diff | ↑ | -1.9129 | ± | 0.4349 |
| | | none | 0 | rouge1_max | ↑ | 21.7885 | ± | 0.7307 |
| | | none | 0 | rouge2_acc | ↑ | 0.1200 | ± | 0.0114 |
| | | none | 0 | rouge2_diff | ↑ | -1.9771 | ± | 0.3475 |
| | | none | 0 | rouge2_max | ↑ | 9.0199 | ± | 0.5842 |
| | | none | 0 | rougeL_acc | ↑ | 0.2570 | ± | 0.0153 |
| | | none | 0 | rougeL_diff | ↑ | -1.8812 | ± | 0.4185 |
| | | none | 0 | rougeL_max | ↑ | 19.6284 | ± | 0.6850 |
- truthfulqa_mc1 | 2 | none | 0 | acc | ↑ | 0.1983 | ± | 0.0140 |
- truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.3861 | ± | 0.0147 |
- winogrande | 1 | none | 5 | acc | ↑ | 0.4972 | ± | 0.0141 |
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
- mmlu | 2 | none | | acc | ↑ | 0.2290 | ± | 0.0035 |
- humanities | 2 | none | | acc | ↑ | 0.2380 | ± | 0.0062 |
- other | 2 | none | | acc | ↑ | 0.2375 | ± | 0.0076 |
- social sciences | 2 | none | | acc | ↑ | 0.2203 | ± | 0.0075 |
- stem | 2 | none | | acc | ↑ | 0.2157 | ± | 0.0073 |
Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
agieval_nous | 0 | none | | acc_norm | ↑ | 0.2133 | ± | 0.0081 |
- agieval_aqua_rat | 1 | none | 0 | acc | ↑ | 0.2047 | ± | 0.0254 |
| | | none | 0 | acc_norm | ↑ | 0.1969 | ± | 0.0250 |
- agieval_logiqa_en | 1 | none | 0 | acc | ↑ | 0.2043 | ± | 0.0158 |
| | | none | 0 | acc_norm | ↑ | 0.2304 | ± | 0.0165 |
- agieval_lsat_ar | 1 | none | 0 | acc | ↑ | 0.1739 | ± | 0.0250 |
| | | none | 0 | acc_norm | ↑ | 0.1957 | ± | 0.0262 |
- agieval_lsat_lr | 1 | none | 0 | acc | ↑ | 0.1549 | ± | 0.0160 |
| | | none | 0 | acc_norm | ↑ | 0.1608 | ± | 0.0163 |
- agieval_lsat_rc | 1 | none | 0 | acc | ↑ | 0.1636 | ± | 0.0226 |
| | | none | 0 | acc_norm | ↑ | 0.2119 | ± | 0.0250 |
- agieval_sat_en | 1 | none | 0 | acc | ↑ | 0.2670 | ± | 0.0309 |
| | | none | 0 | acc_norm | ↑ | 0.2621 | ± | 0.0307 |
- agieval_sat_en_without_passage | 1 | none | 0 | acc | ↑ | 0.2670 | ± | 0.0309 |
| | | none | 0 | acc_norm | ↑ | 0.2621 | ± | 0.0307 |
- agieval_sat_math | 1 | none | 0 | acc | ↑ | 0.2182 | ± | 0.0279 |
| | | none | 0 | acc_norm | ↑ | 0.2318 | ± | 0.0285 |
arc_challenge | 1 | none | 0 | acc | ↑ | 0.1945 | ± | 0.0116 |
| | | none | 0 | acc_norm | ↑ | 0.2372 | ± | 0.0124 |
truthfulqa_mc2 | 2 | none | 0 | acc | ↑ | 0.3861 | ± | 0.0147 |
Groups | Version | Filter | n-shot | Metric | | Value | | Stderr |
---|---|---|---|---|---|---|---|---|
agieval_nous | 0 | none | | acc_norm | ↑ | 0.2133 | ± | 0.0081 |
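The tables above follow the output format of EleutherAI's lm-evaluation-harness. Below is a hedged sketch of how comparable numbers could be reproduced, assuming the harness's v0.4+ Python API (`lm_eval.simple_evaluate`) and the same assumed repo id; the task name and 25-shot setting mirror the `arc_challenge` row in the first table.

```python
# Sketch of reproducing one row of the tables above with lm-evaluation-harness.
# Assumes the v0.4+ API that exposes lm_eval.simple_evaluate; the 25-shot
# arc_challenge setting mirrors the first results table.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=M4-ai/TinyMistral-248M-v3",  # assumed repo id
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])  # acc / acc_norm with stderr
```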
Open LLM Leaderboard Evaluation Results
Detailed results can be found on the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=M4-ai/TinyMistral-248M-v3).
Metric | Value |
---|---|
Avg. | 4.13 |
IFEval (0-Shot) | 16.39 |
BBH (3-Shot) | 1.78 |
MATH Lvl 5 (4-Shot) | 0.00 |
GPQA (0-shot) | 0.00 |
MuSR (0-shot) | 5.15 |
MMLU-PRO (5-shot) | 1.47 |