The model did not achieve the MT-Bench score you claimed, and its JGLUE score is low.
I tried rerunning MT-Bench in two ways:
- Using GPT-4 as the judge (same as the method you used)
- Using GPT-4o as the judge
The results are shown in images 1 and 2. I noticed that your model scores significantly lower than models from Elyza and Swallow.
Next, I evaluated your model on the JGLUE benchmark (JGLUE contains many tasks similar to those on the Nejumi leaderboard) and got quite poor results (image 3). Across 8 tasks it averaged only 63. I have to say, this is a really poor result. I have evaluated many models capable of Japanese, such as Sakana, Rinna-llama3, Qwen2, GLM4, Elyza, and Swallow, all of which scored 70 or higher.
I would like to raise a question about the actual quality of your model.
You can follow the README.md at https://github.com/team-hatakeyama-phase2/llm-leaderboard
python scripts/run_jmtbench_eval.py
Put config.yaml at llm-leaderboard/configs as follows:
wandb:
  log: True
  entity: "weblab-geniac1" # the Tanuki team uses weblab-geniac1
  project: "leaderboard_neo" # use leaderboard_sft for SFT validation, leaderboard_test for testing
  run_name: '0809_dpo_07-gpt4' # enter the wandb_name used during training, e.g. 04_hallucination-tanuki_8B_lora-with_hallucination

github_version: v2.0.0 # for recording

testmode: false

# if you don't use api, please set "api" as "false"
# if you use api, please select from "openai", "anthoropic", "google", "cohere"
api: false

model:
  use_wandb_artifacts: false
  artifacts_path: ""
  pretrained_model_name_or_path: '/storage5/personal/shioya/po_model/polab-experiments/8B/pass4_exp002-0809_dpo_07-zero2' # path where the trained model is saved
  trust_remote_code: true
  device_map: "auto"
  load_in_8bit: false
  load_in_4bit: false

generator:
  do_sample: false
  num_beams: 1 # https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/text_generation
  top_p: 1.0
  top_k: 0
  temperature: 0.1
  repetition_penalty: 1.0

tokenizer:
  use_wandb_artifacts: false
  artifacts_path: ""
  pretrained_model_name_or_path: "/storage5/personal/shioya/po_model/polab-experiments/8B/pass4_exp002-0809_dpo_07-zero2" # path where the trained model is saved
  use_fast: false

# for llm-jp-eval
max_seq_length: 2048
dataset_artifact: "wandb-japan/llm-leaderboard/jaster:v11" # if you use artifacts, please fill here (if not, fill null)
dataset_dir: "/jaster/1.2.6/evaluation/test"
target_dataset: "all" # {all, jamp, janli, jcommonsenseqa, jemhopqa, jnli, jsem, jsick, jsquad, jsts, niilc, chabsa}
log_dir: "./logs"
torch_dtype: "bf16" # {fp16, bf16, fp32}
custom_prompt_template: null
# if you use this, please include {instruction} and {input}. If you use few shots, please include {few_shots} additionally.
# example of prompt template with fewshots
# "以下はタスクを説明する指示と、追加の背景情報を提供する入力の組み合わせです。要求を適切に満たす回答を書いてください。\n### 指示:\n{instruction}\n{few_shots}\n### 入力:\n{input}\n### 回答:\n"
# example of prompt template without fewshots
# "以下はタスクを説明する指示と、追加の背景情報を提供する入力の組み合わせです。要求を適切に満たす回答を書いてください。\n### 指示:\n{instruction}\n### 入力:\n{input}\n### 回答:\n"
# example of Llama2 format; please change tokenizer.bos_token
# "<tokenizer.bos_token>[INST] <<SYS>>\n あなたは誠実で優秀な日本人のアシスタントです。 \n<</SYS>>\n\n {instruction} \n\n {input} [/INST]"
custom_fewshots_template: null
# Please include {input} and {output} as variables
# example of fewshots template
# "\n### 入力:\n{input}\n### 回答:\n{output}"

metainfo:
  basemodel_name: "0809_dpo_07-gpt4" # enter the wandb_name used during training, e.g. 04_hallucination-tanuki_8B_lora-with_hallucination
  model_type: "open llm" # {open llm, commercial api}
  instruction_tuning_method: "None" # {"None", "Full", "LoRA", ...}
  instruction_tuning_data: ["None"] # {"None", "jaster", "dolly_ja", "oasst_ja", ...}
  num_few_shots: 0
  llm-jp-eval-version: "1.1.0"

# for mtbench
mtbench:
  question_artifacts_path: 'wandb-japan/llm-leaderboard/mtbench_ja_question:v0' # if testmode is true, small dataset will be used
  referenceanswer_artifacts_path: 'wandb-japan/llm-leaderboard/mtbench_ja_referenceanswer:v0' # if testmode is true, small dataset will be used
  judge_prompt_artifacts_path: 'wandb-japan/llm-leaderboard/mtbench_ja_prompt:v1'
  bench_name: 'japanese_mt_bench'
  model_id: null # cannot use '<', '>', ':', '"', '/', '\\', '|', '?', '*', '.'
  question_begin: null
  question_end: null
  max_new_token: 1024
  num_choices: 1
  num_gpus_per_model: 1
  num_gpus_total: 1
  max_gpu_memory: null
  dtype: bfloat16 # None or float32 or float16 or bfloat16
  # for gen_judgment
  judge_model: 'gpt-4'
  mode: 'single'
  baseline_model: null
  parallel: 1
  first_n: null
  # for conv template (added)
  custom_conv_template: true
  # the following variables will be used when custom_conv_template is set as true
  conv_name: "custom"
  conv_system_template: "<s>{system_message}"
  conv_system_message: "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
  conv_roles: "('指示', '応答')"
  conv_sep: "\n\n### "
  conv_sep2: "</s>"
  conv_stop_token_ids: "[2,6]"
  conv_stop_str: "### 指示:"
  conv_sep_style: "custom"
  conv_role_message_separator: ":\n"
  conv_role_only_separator: ":\n"
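If you want to sanity-check the model outside the harness, here is a minimal sketch (not the leaderboard code itself) that loads the same model path and mirrors the generator block above; the prompt is hand-assembled from the conv_* values in the mtbench section and may not match the harness's exact formatting.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/storage5/personal/shioya/po_model/polab-experiments/8B/pass4_exp002-0809_dpo_07-zero2"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # torch_dtype: "bf16"
    trust_remote_code=True,
)

# Hand-assembled from conv_system_message / conv_roles / conv_sep above; the
# tokenizer adds the BOS token itself, so "<s>" is omitted here.
prompt = (
    "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"
    "\n\n### 指示:\n日本の首都はどこですか?"
    "\n\n### 応答:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Mirrors the generator block (greedy decoding; sampling params are inert with do_sample=False).
outputs = model.generate(
    **inputs,
    do_sample=False,
    num_beams=1,
    repetition_penalty=1.0,
    max_new_tokens=256,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))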
I hope this helps.
Regarding the official evaluation of the Japanese MT-Bench, we need to use Leaderboard 3.
https://github.com/wandb/llm-leaderboard
This is because the program versions before Leaderboard Neo are somewhat outdated and do not support the chat template of the model we've built. If you intend to use Neo, you will need to make slight modifications to the program code.
https://github.com/team-hatakeyama-phase2/llm-leaderboard
Specifically, our model is trained with a chat template that appends an end-of-sequence (EOS) token at the end of each assistant turn in a multi-turn conversation. However, versions prior to Leaderboard Neo could not append the EOS token at that point, which we have observed can lower the score by about one point.
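To make the EOS difference concrete, here is a small, purely illustrative sketch of the two prompt constructions (the token strings follow the conv_* values in the config above; the authoritative template is the one shipped with the model's tokenizer).

# Hypothetical illustration of the EOS-placement difference described above.
EOS = "</s>"
SYSTEM = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"

def build_prompt(turns, eos_each_turn):
    """turns: list of (user_message, assistant_message or None for the pending turn)."""
    prompt = SYSTEM
    for user, assistant in turns:
        prompt += f"\n\n### 指示:\n{user}\n\n### 応答:\n"
        if assistant is not None:
            prompt += assistant
            if eos_each_turn:
                prompt += EOS  # Tanuki's training data closes every assistant turn with EOS
    return prompt

turns = [("自己紹介してください。", "こんにちは、タヌキです。"), ("続けてください。", None)]
print(build_prompt(turns, eos_each_turn=True))   # matches how the model was trained
print(build_prompt(turns, eos_each_turn=False))  # what pre-Neo harnesses effectively produce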
Additionally, when using GPT-4 instead of GPT-4o as the judge in Neo, there is a bug in the judging step where, for some questions (around 2-3 out of 160), no score is returned. In such cases the system records a score of -1, which unfairly lowers the overall evaluation. As stated in the README, we handled this by averaging over only the questions with valid scores, excluding those that returned -1.
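As a sketch of that workaround (with made-up scores):

# Ignore judgments where the judge failed to return a score (recorded as -1).
scores = [8, 9, -1, 7, 10, -1, 6]

valid = [s for s in scores if s != -1]
mean_excluding_failures = sum(valid) / len(valid)   # 8.0: failed judgments are dropped
mean_naive = sum(scores) / len(scores)              # about 5.4: the -1 entries drag it down
print(mean_excluding_failures, mean_naive)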
The relatively lower score on JGLUE compared to other similar models is, in a sense, intentional. Benchmarks like JGLUE and jaster mostly consist of tasks that test the ability to give short, concise answers, such as multiple-choice questions, which we believe are not commonly required in real chatbot applications. Tanuki was therefore developed specifically to excel at the dialogue and writing tasks measured by MT-Bench.
We have written an article on this background in Japanese, so you might find it useful.
https://zenn.dev/matsuolab/articles/95fa297ef12a14
Leaderboard 3 results (8B):
https://wandb.ai/weblab-geniac1/llm-leaderboard3/reports/8b-nejumi-leaderboard3-all--Vmlldzo5Mjk2MTQz?accessToken=22frkj9myy7xugl8u6j4g39v4l1tsldydghnt7w1ieq2fdx5q6aymvqobrqjeu6v

Example config for the 8B model with Leaderboard 3:
wandb:
  run_name: "Tanuki-8B-dpo-v1.0" # use run_name defined above

# if you don't use api, please set "api" as "false"
# if you use api, please select from "openai", "anthoropic", "google", "cohere", "vllm"
api: vllm

batch_size: 256 # 256 recommended for vllm, 32 for api

# test mode
testmode: false

run:
  jaster: true
  jmmlu_robustness: true # if this is set as true, jaster should be set as true
  mtbench: true
  jbbq: true
  lctg: true
  toxicity: true
  jtruthfulqa: true
  aggregate: true

num_gpus: 8

model:
  use_wandb_artifacts: false
  pretrained_model_name_or_path: "weblab-GENIAC/Tanuki-8B-dpo-v1.0"
  chat_template: "weblab-GENIAC/Tanuki-8B-dpo-v1.0"
  # size_category: "<10B"
  size_category: "50B≤"
  release_date: "8/12/2024"

num_gpus_per_model: 8
num_gpus_total: 8
tensor_parallel_size: 8
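For reference, api: vllm together with tensor_parallel_size: 8 means the model is served through vLLM rather than called via transformers directly; a minimal standalone equivalent (not the leaderboard code, and assuming the vllm package and 8 GPUs as configured) looks like this:

from vllm import LLM, SamplingParams

llm = LLM(
    model="weblab-GENIAC/Tanuki-8B-dpo-v1.0",
    tensor_parallel_size=8,  # matches tensor_parallel_size in the config above
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.0, max_tokens=1024)  # greedy decoding, for illustration
outputs = llm.generate(["自己紹介してください。"], params)
print(outputs[0].outputs[0].text)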
Thank you for your answer. Based on the config you ran, I understand that you are using GPT-4 as the judge model and that the reference answers come from https://github.com/Stability-AI/FastChat/blob/jp-stable/fastchat/llm_judge/data/japanese_mt_bench/reference_answer/base-gpt4o-with-human-annotation.jsonl.
I used the code from https://github.com/Stability-AI/FastChat/blob/jp-stable/fastchat/llm_judge to run the same config as yours, and the results are shown in the image.
I would like to ask why there is such a significant discrepancy in the results.
The discrepancy is primarily due to differences in chat templates.
The Stability-AI/FastChat repo is outdated for this model because it does not support its chat template.
Please use the repositories linked in the previous responses.
In general, the performance of open models such as Mistral and Llama is sensitive to the chat template, while proprietary LLMs like GPT-4, Claude, and Gemini are robust to different chat templates and input formats.
It's true that Stability AI's repo is outdated, but I have fixed it and updated the chat templates. In fact, in my evaluation your model, Swallow-8B, and Elyza-8B all use the same chat template.
If possible, please provide me with the chat template you use to evaluate the model. I want to reproduce the results.
Thank you very much.
I think that it's provided as follows: https://huggingface.co/weblab-GENIAC/Tanuki-8B-dpo-v1.0/blob/main/tokenizer_config.json#L110
I am afraid that your updates to Stability AI's repo may differ from the maintainers'.
Could you use https://github.com/team-hatakeyama-phase2/llm-leaderboard for Neo and https://github.com/wandb/llm-leaderboard for Leaderboard 3?
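For reference, a quick way to see exactly what that chat template produces is the standard transformers API (minimal sketch; the system message is just the one from the Neo config above and can be dropped if the template does not accept a system role):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("weblab-GENIAC/Tanuki-8B-dpo-v1.0")

messages = [
    {"role": "system", "content": "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"},
    {"role": "user", "content": "自己紹介してください。"},
]

# add_generation_prompt=True appends the assistant header the model expects
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)

Comparing this string with the prompt your harness actually builds should show where the formats diverge.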