---
language:
  - en
  - ko
license: llama3
library_name: transformers
tags:
  - ko
  - eval
  - llm-eval
base_model:
  - meta-llama/Meta-Llama-3-8B-Instruct
datasets:
  - nayohan/feedback-collection-ko
  - nayohan/feedback-collection-ko-chat
pipeline_tag: text-generation
---

Introduction

This model was trained by translating the prometheus-eval/Feedback-Collection dataset into Korean and fine-tuning llama3-8b-it on the translated data. Train dataset: nayohan/feedback-collection-ko
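If you want to inspect the training data directly, a minimal sketch with the datasets library is shown below (the "train" split name is an assumption, not something stated on this card):

from datasets import load_dataset

# Load the Korean feedback-collection training data (assumes a "train" split exists)
feedback_ko = load_dataset("nayohan/feedback-collection-ko", split="train")
print(feedback_ko[0].keys())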

Loading the Model

Use the following Python code to load the model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nayohan/llama3-8b-it-prometheus-ko"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  device_map="auto",
  torch_dtype=torch.bfloat16
)

Generating Text

The system prompt is fixed. Set the score rubric according to the task at hand, then fill in orig_instruction, orig_response, and orig_reference_answer with the sample you want to evaluate.

system_prompt = """###Task Description: An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\"
4. Please do not generate any other opening, closing, and explanations."""

sample = {
  'orig_instruction': "λ‚˜λŠ” 첨단 기술 ν”„λ‘œμ νŠΈλ₯Ό μ§„ν–‰ν•˜λŠ” νŒ€μ— μžˆλ‹€. κ·ΈλŸ¬λ‚˜ 졜근 ν”„λ‘œμ νŠΈ λ°©ν–₯을 놓고 νŒ€μ›λ“€ 사이에 지속적인 κ°ˆλ“±μ΄ λ°œμƒν•˜κ³  μžˆλ‹€. ν•œ 그룹은 급진적이고 μœ„ν—˜ν•˜μ§€λ§Œ 잠재적으둜 κ²Œμž„μ„ λ°”κΏ€ 수 μžˆλŠ” 접근법을 κ°•λ ₯ν•˜κ²Œ μ˜Ήν˜Έν•˜κ³  μžˆλ‹€. λŒ€μ‘°μ μœΌλ‘œ, λ‹€λ₯Έ 그룹은 보닀 μΈ‘μ •λ˜κ³  더 μ•ˆμ „ν•˜λ©° μž…μ¦λœ μ „λž΅μ„ μ„ ν˜Έν•œλ‹€. 결과적으둜 우리 νŒ€μ€ λΆ„μ—΄λ˜μ–΄ 진전을 이룰 수 μ—†λ‹€. 우리의 λŒ€ν™”λ₯Ό μ€‘μž¬ν•˜κ³  해결을 μ΄λŒμ–΄λ‚Ό 수 μžˆλŠ” AI λͺ¨λΈμ΄ ν•„μš”ν•˜λ‹€. μ΄λŸ¬ν•œ 상황에 λŒ€μ‘ν•˜μ—¬ AI λͺ¨λΈμ€ 무엇을 말해야 ν•˜λŠ”κ°€?",
  'orig_response': "κ·ΈλŸ¬λ‹ˆκΉŒ ν”„λ‘œμ νŠΈ λ°©ν–₯에 ν•©μ˜κ°€ μ•ˆ λ˜λŠ” νŒ€μ— μžˆλŠ” κ±° μ•„λ‹ˆμ•Ό? λ‹€λ“€ 잘 λ§žλ„λ‘ λ°°μ›Œμ•Ό ν•  것 κ°™λ„€μš”. μ–΄μ©Œλ©΄ 동전을 λ˜μ§€κ³  μ–΄λŠ μͺ½μ΄ μŠΉλ¦¬ν•˜λŠ”μ§€ 봐야 ν•  것 κ°™μ•„μš”. κ·Έλ ‡κ²Œ ν•˜λ©΄ λ…ΌμŸμ΄ μ—†κ³  λͺ¨λ‘κ°€ μΌν„°λ‘œ λŒμ•„κ°ˆ 수 μžˆμŠ΅λ‹ˆλ‹€. μœ„ν—˜ν•˜λ“  μ•ˆμ „ν•˜λ“  μƒκ΄€μ—†μ–΄μš”. ν•˜λ‚˜λ₯Ό κ³¨λΌμ„œ κ·Έλƒ₯ κ°€μ„Έμš”. κ²Œλ‹€κ°€, λͺ¨λ“  것이 λ¬΄λ„ˆμ§€λ©΄ μ„œλ‘œ λΉ„λ‚œν•˜κ³  λ„˜μ–΄κ°ˆ 수 μžˆμŠ΅λ‹ˆλ‹€. μ•„λ‹ˆλ©΄ 더 쒋은 것은, μ–΄λ–€ 그룹의 아이디어가 더 λ‚˜μ€μ§€ 보기 μœ„ν•œ 경쟁이 μ™œ μ•ˆ 돼? νŒ¨λ°°μžλŠ” 우승자λ₯Ό μœ„ν•΄ 점심을 사야 ν•΄μš”.",
  'orig_reference_answer': "이 νŒ€μ˜ λͺ¨λ“  μ‚¬λžŒλ“€μ΄ ν”„λ‘œμ νŠΈμ— 열정적이고 μ„±κ³΅ν•˜κΈ°λ₯Ό μ›ν•œλ‹€λŠ” 것은 λΆ„λͺ…ν•˜λ©°, μ΄λŠ” λͺ¨λ“  ν•΄κ²°μ˜ ν›Œλ₯­ν•œ μΆœλ°œμ μ΄λ‹€. λ˜ν•œ κ°ˆλ“±μ€ μœ„ν—˜κ³Ό ν˜μ‹ μ— λŒ€ν•œ μ„œλ‘œ λ‹€λ₯Έ κ΄€μ μ—μ„œ λ°œμƒν•œλ‹€λŠ” 것도 λΆ„λͺ…ν•©λ‹ˆλ‹€. λ‘˜ λ‹€ ν”„λ‘œμ νŠΈμ˜ 성곡에 μ€‘μš”ν•œ κ³ λ € μ‚¬ν•­μž…λ‹ˆλ‹€. 두 접근법 λͺ¨λ‘μ—μ„œ μœ νš¨ν•œ 점을 μΈμ •ν•˜λŠ” κ²ƒμœΌλ‘œ μ‹œμž‘ν•˜κ² μŠ΅λ‹ˆλ‹€. 급진적인 접근법을 μ˜Ήν˜Έν•˜λŠ” νŒ€μ€ 높은 보상과 획기적인 ν˜μ‹ μ˜ 잠재λ ₯에 μ˜ν•΄ μ£Όλ„λ˜λ©°, μ΄λŠ” λͺ¨λ“  첨단 ν”„λ‘œμ νŠΈμ—μ„œ ν›Œλ₯­ν•˜κ³  ν•„μˆ˜μ μž…λ‹ˆλ‹€.",
  'orig_criteria':'λͺ¨ν˜•μ€ λŒ€ν™”μ—μ„œ κ°ˆλ“± 해결을 μ–Όλ§ˆλ‚˜ 효과적으둜 μ²˜λ¦¬ν•˜λŠ”κ°€?',
  'orig_score1_description':'λͺ¨λΈμ€ κ°ˆλ“±μ΄λ‚˜ μ˜€ν•΄λ₯Ό κ°€μ€‘μ‹œμΌœ 문제λ₯Ό μ€‘μž¬ν•˜κ±°λ‚˜ ν•΄κ²°ν•  수 μžˆλŠ” λŠ₯λ ₯을 보이지 μ•ŠλŠ”λ‹€.',
  'orig_score2_description':'이 λͺ¨λΈμ€ κ°ˆλ“±μ— λŒ€ν•œ 인식이 μžˆμ§€λ§Œ 이λ₯Ό ν•΄κ²°ν•˜λ €λŠ” μ‹œλ„λŠ” νš¨κ³Όκ°€ μ—†κ±°λ‚˜ 잘λͺ»λœ 지침을 가지고 μžˆλ‹€.',
  'orig_score3_description':'이 λͺ¨λΈμ€ κ°ˆλ“±μ„ μ λ‹Ήνžˆ μ²˜λ¦¬ν•˜μ—¬ 일뢀 성곡적인 ν•΄κ²° μ „μˆ μ„ λ³΄μ—¬μ£Όμ§€λ§Œ 더 일관성이 μžˆμ„ 수 μžˆλ‹€.',
  'orig_score4_description':'이 λͺ¨λΈμ€ κ°ˆλ“±μ„ 잘 μ²˜λ¦¬ν•˜μ—¬ κΈ΄μž₯을 ν™•μ‚°μ‹œν‚€κ³  해결을 효과적으둜 μ•ˆλ‚΄ν•˜μ§€λ§Œ λ―Έμ„Έν•œ λ―Έλ„λŸΌμ΄ μžˆμŠ΅λ‹ˆλ‹€.',
  'orig_score5_description':'이 λͺ¨λΈμ€ κ°ˆλ“±μ„ ν›Œλ₯­ν•˜κ²Œ κ΄€λ¦¬ν•˜κ³ , μ§€μ†μ μœΌλ‘œ κΈ΄μž₯을 ν™•μ‚°μ‹œν‚€λ©°, λŒ€ν™”λ₯Ό νƒ€ν˜‘μœΌλ‘œ μ•ˆλ‚΄ν•˜κ³  긍정적인 λŒ€ν™” ν™˜κ²½μ„ μ‘°μ„±ν•œλ‹€.',
  'orig_feedback': '제곡된 응닡은 λ‹Ήλ©΄ν•œ 문제λ₯Ό μ‘°μ •ν•˜κ±°λ‚˜ ν•΄κ²°ν•˜λŠ” λŠ₯λ ₯을 보여주지 μ•ŠλŠ”λ‹€. λŒ€μ‹  νŒ€μ˜ 우렀λ₯Ό μ‚¬μ†Œν™”ν•˜κ³  잠재적인 결과에 λŒ€ν•œ κ³ λ € 없이 동전을 λ˜μ§€κ±°λ‚˜ λŒ€νšŒλ₯Ό κ°œμ΅œν•˜λŠ” 것과 같은 비건섀적 μ†”λ£¨μ…˜μ„ μ œμ•ˆν•œλ‹€. λ˜ν•œ 응닡은 상황이 잘λͺ»λ˜λ©΄ νŒ€ ꡬ성원듀이 μ„œλ‘œλ₯Ό λΉ„λ‚œν•΄μ•Ό ν•œλ‹€λŠ” 것을 μ•”μ‹œν•œλ‹€. κ°ˆλ“±μ„ λ”μš± μ•…ν™”μ‹œν‚¨λ‹€. 건섀적인 λŒ€ν™”λ₯Ό μž₯λ €ν•˜κ±°λ‚˜ 두 접근법 μ‚¬μ΄μ˜ 쀑간 지점을 μ°ΎλŠ” κ²ƒμ˜ μ€‘μš”μ„±μ„ μΈμ •ν•˜μ§€ μ•ŠλŠ”λ‹€. λ”°λΌμ„œ 전체 μ μˆ˜λŠ” 1이닀.',
  'orig_score': 1,
}

instruction = f"""###The instruction to evaluate: {sample['orig_instruction']}
  ###Response to evaluate: {sample['orig_response']}
  ###Reference Answer (Score 5): {sample['orig_reference_answer']}
  ###Score Rubrics: [{sample['orig_criteria']}]
  Score 1: {sample['orig_score1_description']}
  Score 2: {sample['orig_score2_description']}
  Score 3: {sample['orig_score3_description']}
  Score 4: {sample['orig_score4_description']}
  Score 5: {sample['orig_score5_description']}
  ###Feedback:"""

# for training
# output = f"""{sample['orig_feedback']}
#   [RESULT] {sample['orig_score']}"""
    
conversation = [
  {"role": "system", "content": system_prompt},
  {"role": "user", "content": instruction},
  # {"role": "assistant", "content": output}
]

input_ids = tokenizer.apply_chat_template(
  conversation,
  tokenize=True,
  add_generation_prompt=True,
  return_tensors='pt'
).to("cuda")

output = model.generate(input_ids, max_new_tokens=512)
output_text = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
print(output_text)

If you don't have a reference answer, the model can also work without one: it evaluates orig_response against orig_instruction using only the score rubric. Use the following template instead.

instruction = f"""###The instruction to evaluate: {sample['orig_instruction']}
  ###Response to evaluate: {sample['orig_response']}
  ###Score Rubrics: [{sample['orig_criteria']}]
  Score 1: {sample['orig_score1_description']}
  Score 2: {sample['orig_score2_description']}
  Score 3: {sample['orig_score3_description']}
  Score 4: {sample['orig_score4_description']}
  Score 5: {sample['orig_score5_description']}
  ###Feedback:"""

Because the feedback in the training data was sometimes truncated, the generated feedback may itself occasionally be cut short.

# Result with orig_reference_answer 
# OUTPUT: 이 λŒ€μ‘μ€ κ°ˆλ“± 해결에 λŒ€ν•œ 이해가 뢀쑱함을 보여쀀닀. 동전을 λ˜μ§€κ±°λ‚˜ κ²½μŸμ„ μ œμ•ˆν•˜λŠ” 것과 같이 제곡된 μ œμ•ˆμ€ 문제의 λ³΅μž‘μ„±μ„ λ¬΄μ‹œν•˜κ³  νŒ€ λ‚΄μ˜ λ‹€μ–‘ν•œ 관점을 λ¬΄μ‹œν•œλ‹€. 응닡은 두 μ ‘κ·Όλ²•μ˜ 잠재적 κ°€μΉ˜λ₯Ό μΈμ •ν•˜μ§€ μ•ŠμœΌλ©° νŒ€ ꡬ성원 κ°„μ˜ 이해와 쑴쀑을 μ΄‰μ§„ν•˜μ§€λ„ μ•ŠλŠ”λ‹€. λ˜ν•œ 응닡은 νŒ€μ˜ μ—΄μ •κ³Ό ν”„λ‘œμ νŠΈμ— λŒ€ν•œ ν—Œμ‹ μ„ μΈμ •ν•˜μ§€ μ•ŠλŠ”λ‹€. λ”°λΌμ„œ 전체 μ μˆ˜λŠ” 1이닀.
    [RESULT] 1
# Result without orig_reference_answer 
# OUTPUT: λŒ€μ‘μ€ κ°ˆλ“± 해결에 λŒ€ν•œ 이해λ₯Ό λ‚˜νƒ€λ‚΄μ§€ μ•ŠλŠ”λ‹€. AI λͺ¨λΈμ€ κ°ˆλ“±μ„ ν•΄κ²°ν•˜κΈ°λ³΄λ‹€λŠ” κ°ˆλ“±μ„ μ•…ν™”μ‹œν‚€λŠ” 것을 μ œμ•ˆν•˜λ©°, μ΄λŠ” 점수 λ£¨λΈŒλ¦­μ— 따라 μš”κ΅¬ 사항에 μ–΄κΈ‹λ‚œλ‹€. 동전을 λ˜μ§€κ³  κ²½μŸμ„ μ œμ•ˆν•˜λŠ” 것은 νŒ€ ꡬ성원 κ°„μ˜ κΈ΄μž₯을 ν™•μ‚°μ‹œν‚€λŠ” 데 도움이 λ˜μ§€ μ•Šκ³  였히렀 더 λ§Žμ€ κ°ˆλ“±μ„ μ΄‰λ°œν•  수 μžˆλ‹€. λ˜ν•œ, νŒ€ ꡬ성원이 더 λ‚˜μ€ 아이디어λ₯Ό κ°–λŠ” 것이 μ•„λ‹ˆλΌ "더 λ‚˜μ€" 아이디어λ₯Ό κ°–λŠ”λ‹€λŠ” 것을 μ•”μ‹œν•˜λŠ” 것은 νŒ€ ꡬ성원 κ°„μ˜ 화합을 μ΄‰μ§„ν•˜μ§€ μ•ŠλŠ”λ‹€. λ”°λΌμ„œ 전체 μ μˆ˜λŠ” 1이닀.
    [RESULT] 1

If you just want to get a score from the evaluation, you can use the following extract_score function.

import re

def extract_score(text):
    # Return the integer score that follows the [RESULT] tag, or 0 if no score is found.
    match = re.search(r'\[RESULT\]\s+([0-5])', text)
    return int(match.group(1)) if match else 0

predict_score = extract_score(output_text)
print(predict_score) # 1
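Putting the pieces together, here is a minimal end-to-end sketch; the evaluate helper below is illustrative (it is not part of the original code) and reuses the no-reference template from above:

def evaluate(sample):
    # Build the rubric prompt (without a reference answer), generate feedback, and extract the score.
    instruction = f"""###The instruction to evaluate: {sample['orig_instruction']}
  ###Response to evaluate: {sample['orig_response']}
  ###Score Rubrics: [{sample['orig_criteria']}]
  Score 1: {sample['orig_score1_description']}
  Score 2: {sample['orig_score2_description']}
  Score 3: {sample['orig_score3_description']}
  Score 4: {sample['orig_score4_description']}
  Score 5: {sample['orig_score5_description']}
  ###Feedback:"""
    conversation = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": instruction},
    ]
    input_ids = tokenizer.apply_chat_template(
        conversation, tokenize=True, add_generation_prompt=True, return_tensors='pt'
    ).to("cuda")
    output = model.generate(input_ids, max_new_tokens=512)
    output_text = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
    return output_text, extract_score(output_text)

feedback, score = evaluate(sample)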

Heatmap Visualization

[eng->eng] We randomly sampled 200 evaluation examples from the training data, extracted the scores from the model-generated feedback, and compared them to the reference scores. Since the training and test data are not separated here, this only indicates how well the model fit its training data.

[ko->ko] We sampled 200 evaluation examples from the test set; llama3-8b-it-prometheus-ko was trained only on the train split.

  • prometheus-7b-v1.0 (English train -> English inference) # 3 samples failed to output a score, 197 in total
  • llama3-8b-it-prometheus-ko (Korean train -> Korean inference) # 200 in total

(Figure: heatmaps comparing predicted scores against reference scores for the two models)
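As a rough sketch of how such a heatmap can be reproduced from the extracted scores (sklearn and matplotlib are assumptions here, and gold_scores / pred_scores are hypothetical lists of reference and predicted scores):

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# gold_scores / pred_scores: integer scores (1-5) for the sampled examples (hypothetical names)
cm = confusion_matrix(gold_scores, pred_scores, labels=[1, 2, 3, 4, 5])

plt.imshow(cm, cmap='Blues')          # rows = reference scores, columns = predicted scores
plt.xticks(range(5), [1, 2, 3, 4, 5])
plt.yticks(range(5), [1, 2, 3, 4, 5])
plt.xlabel('predicted score')
plt.ylabel('reference score')
plt.colorbar()
plt.show()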

Citation

@misc{kim2023prometheus,
    title={Prometheus: Inducing Fine-grained Evaluation Capability in Language Models},
    author={Seungone Kim and Jamin Shin and Yejin Cho and Joel Jang and Shayne Longpre and Hwaran Lee and Sangdoo Yun and Seongjin Shin and Sungdong Kim and James Thorne and Minjoon Seo},
    year={2023},
    eprint={2310.08491},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Our training code can be found here: [TBD]