---
license: apache-2.0
language:
  - ru
  - en
base_model:
  - jinaai/jina-embeddings-v3
---

JinaJudge: Proxy Judgement for Russian LLM Arena

Description

This model is trained to replicate the judgement patterns of GPT-4-1106-Preview in the Russian LLM Arena, enabling faster and more cost-effective evaluation of language models. While its focus is Russian LLM evaluation, it can also be used for English-centric models.


Model Details

This is an iterative update of the kaleinaNyan/jina-v3-rullmarena-judge-300924 model:

  • Increased the amount of training data (a modest increase, roughly 1.5x).
  • Updated the data composition to fix erroneous judgements where GPT-4 picked English responses over Russian ones.
  • The validation set was updated as well to exclude such errors.
  • The test set did not change (it contained no such erroneous judgements).

Evaluation

The validation process was based on existing judgements from the Russian LLM Arena. These judgements were filtered and simplified to match the three-class structure used in training (A wins / tie / B wins).
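
For illustration, a minimal sketch of that simplification, assuming the source verdicts use the common five-way arena labels (the exact label strings are an assumption, not taken from the arena data):

# Hypothetical mapping from five-way arena verdicts to the three training classes.
FIVE_TO_THREE = {
    "A>>B": 0, "A>B": 0,   # A is better than B
    "A=B": 1,              # tie
    "B>A": 2, "B>>A": 2,   # B is better than A
}

def simplify(verdict: str) -> int:
    return FIVE_TO_THREE[verdict]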

NOTE: values in parentheses show the change relative to the previous model.

Models evaluated:

  • gemma-2-9b-it-sppo-iter3
  • glm-4-9b-chat
  • gpt-3.5-turbo-1106
  • mistral-7b-instruct-v0.3
  • storm-7b

Validation Performance (old validation set):

  • Accuracy: 79.97% (-0.78)
  • Precision: 78.25% (-0.31)
  • Recall: 78.25% (-1.23)
  • F1-score: 78.25% (-0.75)

NOTE: the cause of the drop (the subset of corrected judgements or something else) is still being investigated and will be reported later.

Validation Performance (new validation set):

  • Accuracy: 83.59% (+2.48)
  • Precision: 80.97% (+2.14)
  • Recall: 80.97% (+1.22)
  • F1-score: 80.97% (+1.77)

For the test phase, new GPT-4 judgements were generated for the kolibri-mistral-0427-upd model.

Test Performance:

  • Accuracy: 85.09% (+2.37)
  • Precision: 83.20% (+3.09)
  • Recall: 83.20% (+0.78)
  • F1-score: 83.20% (+2.02)

Usage Example

from transformers import AutoModel

jina = AutoModel.from_pretrained("kaleinaNyan/jina-v3-rullmarena-judge-041024", trust_remote_code=True)

prompt_template = """
<user prompt>
{user_prompt}
<end>
<assistant A answer>
{assistant_a}
<end>
<assistant B answer>
{assistant_b}
<end>
""".strip()

user_prompt = "your prompt"
assistant_a = "assistant a response"
assistant_b = "assistant b response"

example = prompt_template.format(
    user_prompt=user_prompt,
    assistant_a=assistant_a,
    assistant_b=assistant_b,
)

judgement = jina([example])[0].argmax().item()  # .item() turns the 0-dim tensor into a plain int key

judgement_map = {
  0: "A is better than B",
  1: "A == B",
  2: "B is better than A"
}

print(judgement_map[judgement])
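
Pairwise judges are often sensitive to answer order, so a common mitigation is to score both orderings and fall back to a tie when they disagree. A minimal sketch building on the snippet above (the symmetric aggregation rule is an assumption, not part of the model):

# Hypothetical helper: judge both (A, B) and (B, A) to reduce position bias.
def judge_symmetric(user_prompt, answer_a, answer_b):
    ab = prompt_template.format(user_prompt=user_prompt, assistant_a=answer_a, assistant_b=answer_b)
    ba = prompt_template.format(user_prompt=user_prompt, assistant_a=answer_b, assistant_b=answer_a)
    first, second = [logits.argmax().item() for logits in jina([ab, ba])]
    # The two orders agree when the second verdict mirrors the first (0 <-> 2, 1 <-> 1).
    return first if first == 2 - second else 1  # disagreement -> treat as a tie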

Generated ranking

The ranking was obtained using a modified version of the Russian LLM Arena code. All judgements were regenerated with the jina-judge model; regenerating the whole leaderboard takes about 16 minutes on an RTX 3090 (roughly 23 seconds per model).
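
Scores of this kind are typically produced by fitting Bradley-Terry strengths to the pairwise judgements and reporting each model's predicted win rate against a fixed baseline (which is why gpt-3.5-turbo-0125 sits at exactly 50.0 with a zero-width interval). A rough, self-contained sketch of that computation, not the exact arena code:

import numpy as np

def bradley_terry(wins, iters=200):
    # wins[i, j]: how often model i beat model j (count a tie as 0.5 for each
    # side); the diagonal is assumed to be zero.
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):  # MM updates for the Bradley-Terry likelihood
        for i in range(n):
            games = wins[i] + wins[:, i]      # games played between i and each j
            mask = np.arange(n) != i
            p[i] = wins[i].sum() / (games[mask] / (p[i] + p[mask])).sum()
        p /= np.exp(np.log(p).mean())         # fix the scale (geometric mean = 1)
    return p

def arena_score(p, baseline):
    # Predicted win rate (in %) against the baseline model. The 95% CIs in the
    # table come from refitting on bootstrap resamples of the judgements.
    return 100 * p / (p + p[baseline])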

| Model | Score | 95% CI | Average #Tokens |
|-------|-------|--------|-----------------|
| gpt-4-1106-preview | 82.8 | (-2.2, 2.3) | 541 |
| gpt-4o-mini | 75.3 | (-2.5, 2.9) | 448 |
| qwen-2.5-72b-it | 73.1 | (-3.4, 3.1) | 557 |
| gemma-2-9b-it-sppo-iter3 | 70.6 | (-3.9, 2.8) | 509 |
| gemma-2-27b-it | 68.7 | (-2.8, 3.8) | 472 |
| t-lite-instruct-0.1 | 67.5 | (-3.8, 3.8) | 810 |
| gemma-2-9b-it | 67.0 | (-3.7, 3.3) | 459 |
| suzume-llama-3-8B-multilingual-orpo-borda-half | 62.4 | (-3.5, 3.7) | 682 |
| glm-4-9b-chat | 61.5 | (-3.7, 3.0) | 568 |
| phi-3-medium-4k-instruct | 60.4 | (-3.5, 3.7) | 566 |
| sfr-iterative-dpo-llama-3-8b-r | 57.2 | (-3.9, 2.2) | 516 |
| c4ai-command-r-v01 | 55.0 | (-3.9, 3.1) | 529 |
| suzume-llama-3-8b-multilingual | 51.9 | (-2.8, 3.7) | 641 |
| mistral-nemo-instruct-2407 | 51.9 | (-3.8, 3.7) | 403 |
| yandex_gpt_pro | 50.3 | (-3.4, 3.1) | 345 |
| gpt-3.5-turbo-0125 | 50.0 | (0.0, 0.0) | 220 |
| hermes-2-theta-llama-3-8b | 49.3 | (-3.4, 3.9) | 485 |
| starling-lm-7b-beta | 48.3 | (-3.8, 4.0) | 629 |
| llama-3-8b-saiga-suzume-ties | 47.9 | (-3.9, 5.0) | 763 |
| llama-3-smaug-8b | 47.6 | (-3.6, 3.1) | 524 |
| vikhr-it-5.4-fp16-orpo-v2 | 46.8 | (-2.5, 2.7) | 379 |
| aya-23-8b | 46.1 | (-3.9, 3.9) | 554 |
| saiga_llama3_8b_v6 | 44.8 | (-3.4, 3.3) | 471 |
| qwen2-7b-instruct | 43.6 | (-3.0, 2.7) | 340 |
| vikhr-it-5.2-fp16-cp | 43.6 | (-4.1, 3.3) | 543 |
| openchat-3.5-0106 | 42.8 | (-3.9, 3.3) | 492 |
| kolibri-mistral-0427-upd | 42.3 | (-4.2, 3.2) | 551 |
| paralex-llama-3-8b-sft | 41.8 | (-3.2, 3.7) | 688 |
| llama-3-instruct-8b-sppo-iter3 | 41.7 | (-3.4, 3.3) | 502 |
| gpt-3.5-turbo-1106 | 41.5 | (-2.9, 2.1) | 191 |
| mistral-7b-instruct-v0.3 | 41.1 | (-4.3, 3.5) | 469 |
| gigachat_pro | 40.9 | (-3.4, 3.6) | 294 |
| openchat-3.6-8b-20240522 | 39.1 | (-3.2, 4.1) | 428 |
| vikhr-it-5.3-fp16-32k | 38.8 | (-3.5, 3.3) | 519 |
| hermes-2-pro-llama-3-8b | 38.4 | (-3.2, 3.1) | 463 |
| kolibri-vikhr-mistral-0427 | 34.5 | (-2.9, 3.5) | 489 |
| vikhr-it-5.3-fp16 | 33.5 | (-3.5, 3.8) | 523 |
| llama-3-instruct-8b-simpo | 32.7 | (-3.9, 3.6) | 417 |
| meta-llama-3-8b-instruct | 32.1 | (-3.4, 3.3) | 450 |
| neural-chat-7b-v3-3 | 25.9 | (-2.7, 3.6) | 927 |
| gigachat_lite | 25.4 | (-2.8, 2.5) | 276 |
| snorkel-mistral-pairrm-dpo | 10.3 | (-2.0, 2.3) | 773 |
| storm-7b | 3.7 | (-1.3, 1.6) | 419 |