---
language:
  - en
license: mit
library_name: transformers
model-index:
  - name: free-evo-qwen72b-v0.8-re
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 79.86
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 91.34
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 78
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 74.85
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 87.77
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 75.89
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 53.31
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 45.32
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 16.24
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 14.21
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 20.96
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 43
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re
          name: Open LLM Leaderboard
---

# Model Card for free-evo-qwen72b-v0.8

**Developed by:** Freewheelin AI Technical Team

4th May 2024: avg. 81.28 on the Open LLM Leaderboard

| Metric              | Value |
|---------------------|------:|
| Avg.                | 81.28 |
| ARC (25-Shot)       | 79.86 |
| HellaSwag (10-Shot) | 91.32 |
| MMLU (5-Shot)       | 78.00 |
| TruthfulQA (0-shot) | 74.85 |
| Winogrande (5-shot) | 87.77 |
| GSM8k (5-shot)      | 75.89 |

## Method

### Process

You need two models with the same architecture.

- Choose one model and fine-tune it to create a gap between the original model and the fine-tuned one. It doesn't matter whether the evaluation score goes up or down.
- Merge the two models.
- Evaluate the merged model.
- If you need to raise the score on a specific part of the evaluation, fine-tune the model for that part. (It's unlikely to work exactly as you expect, but you can try.)
- Merge the models again.
- Evaluate again.
- Repeat until the average evaluation score is higher than the original model's.

That's it. Simple. You can build a framework to automate this process; a minimal sketch of such a loop follows.
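
The sketch below shows only the control flow of the process above. The helpers `fine_tune`, `merge`, and `evaluate` are hypothetical placeholders, not a real API: wire them to your own trainer, your merge tool of choice (e.g. mergekit), and your evaluation harness.

```python
# Hypothetical sketch of the merge-evaluate loop described above; the three
# helpers are placeholders to be wired to your own tooling.

def fine_tune(model_path: str) -> str:
    """Fine-tune the checkpoint at model_path; return the new checkpoint path."""
    ...

def merge(path_a: str, path_b: str) -> str:
    """Merge two same-architecture checkpoints; return the merged path."""
    ...

def evaluate(model_path: str) -> float:
    """Run the benchmark suite; return the average score."""
    ...

def evolve(base: str, max_rounds: int = 10) -> str:
    """Fine-tune, merge, and evaluate until the average beats the base model."""
    target = evaluate(base)                # the score to beat
    best, best_score = base, target
    candidate = fine_tune(base)            # create a gap from the base model
    for _ in range(max_rounds):
        merged = merge(best, candidate)    # merge the two models
        score = evaluate(merged)           # evaluate the merged model
        if score > best_score:
            best, best_score = merged, score
        if best_score > target:            # average now beats the original
            break
        candidate = fine_tune(merged)      # optionally retarget a weak benchmark
    return best
```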

### Base Architecture

- QWEN2

### Base Models

- several QWEN2-based models
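
The card's metadata lists `transformers` as the library, so the standard loading pattern applies. Below is a minimal usage sketch, assuming the repo id that appears in the leaderboard links above; note that `device_map="auto"` requires the `accelerate` package and that a 72B checkpoint needs multiple GPUs or CPU offloading.

```python
# Minimal transformers usage sketch; the repo id is taken from the
# leaderboard links in this card's metadata.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "freewheelin/free-evo-qwen72b-v0.8-re"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```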

## Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=freewheelin/free-evo-qwen72b-v0.8-re).

| Metric              | Value |
|---------------------|------:|
| Avg.                | 32.17 |
| IFEval (0-Shot)     | 53.31 |
| BBH (3-Shot)        | 45.32 |
| MATH Lvl 5 (4-Shot) | 16.24 |
| GPQA (0-shot)       | 14.21 |
| MuSR (0-shot)       | 20.96 |
| MMLU-PRO (5-shot)   | 43.00 |