prometheus-2-llama-3-8b

A version of prometheus-7b-v2.0 fine-tuned with Llama 3.1 8B Instruct as the base model.

Training hyperparameters:

3 epochs
Learning rate 1e-5
Effective batch size 4
Cosine annealing
~5% warmup
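
As a rough illustration of how the learning rate evolves under these settings, here is a minimal sketch of a cosine schedule with warmup. The card does not state the warmup shape or the final learning rate, so linear warmup and decay to zero are assumptions, and the function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.05):
    """Learning rate at a given optimizer step.

    Assumptions (not stated in the model card): warmup is linear,
    and the cosine phase decays all the way to zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine annealing from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # illustrative step count, not the real training length
    for s in (0, 25, 50, 525, 1000):
        print(s, lr_at_step(s, total))
```

With an effective batch size of 4, the total number of optimizer steps is roughly 3 × (number of training examples) / 4.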

Supports both feedback (Likert-scale) evaluation and preference evaluation. Uses the same prompts as prometheus-7b-v2.0. See the examples below.

Feedback Evaluation

ABSOLUTE_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{}

###Response to evaluate:
{}

###Reference Answer (Score 5):
{}

###Score Rubrics:
{}

###Feedback: """

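from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda:0'
model = AutoModelForCausalLM.from_pretrained("zli12321/prometheus2-llama3.1-8B").to(device)
tokenizer = AutoTokenizer.from_pretrained("zli12321/prometheus2-llama3.1-8B")

# Define your own instruction, response, reference, and rubric here.
# The empty strings below are placeholders only.
instruction, response, reference, rubric = "", "", "", ""
prompt = ABSOLUTE_PROMPT.format(instruction, response, reference, rubric)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
input_length = input_ids.shape[1]  # number of prompt tokens
outputs = model.generate(input_ids, output_logits=True, return_dict_in_generate=True, max_new_tokens=4096)
# Prints the prompt followed by the generated feedback; decode
# outputs.sequences[0][input_length:] instead to print only the new tokens.
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))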
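
The prompt asks the model to answer in the form "Feedback: ... [RESULT] N", so the score can be pulled out of the decoded text with a small parser. The following is a minimal sketch, not part of the original card; the helper name `parse_absolute_output` is hypothetical, and it assumes the model actually follows the requested format.

```python
import re


def parse_absolute_output(text):
    """Split 'Feedback: ... [RESULT] N' into (feedback, score).

    Returns score=None when no valid 1-5 integer follows the marker.
    For a clean feedback string, pass only the generated text rather
    than the prompt plus completion.
    """
    # Use the last marker: the prompt itself mentions "[RESULT]" in its
    # format description, so earlier occurrences may not be the verdict.
    head, sep, tail = text.rpartition("[RESULT]")
    if not sep:
        return text.strip(), None
    match = re.match(r"\s*([1-5])\b", tail)
    score = int(match.group(1)) if match else None
    feedback = head.strip()
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    return feedback, score


if __name__ == "__main__":
    sample = "Feedback: The answer is accurate. [RESULT] 4"
    print(parse_absolute_output(sample))
```

Returning `None` instead of raising makes it easier to count malformed generations when scoring a whole dataset.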

Preference Evaluation Template

To run preference evaluation, follow the same steps as above, but build the prompt from the preference evaluation template below.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.

###Instruction:
{}

###Response A:
{}

###Response B:
{}

###Reference Answer:
{}

###Score Rubric:
{}

###Feedback:
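
A minimal sketch of how this template might be filled in and how the A/B verdict could be read back. The names `RELATIVE_PROMPT`, `build_relative_prompt`, and `parse_preference` are hypothetical and not part of the original card; the trailing space after "###Feedback:" is assumed to match the feedback template.

```python
import re

# Template text copied from the preference evaluation template in this card.
RELATIVE_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.

###Instruction:
{}

###Response A:
{}

###Response B:
{}

###Reference Answer:
{}

###Score Rubric:
{}

###Feedback: """


def build_relative_prompt(instruction, response_a, response_b, reference, rubric):
    """Fill the five slots in the order they appear in the template."""
    return RELATIVE_PROMPT.format(instruction, response_a, response_b, reference, rubric)


def parse_preference(generated_text):
    """Return 'A' or 'B' from '... [RESULT] A', or None if it is missing."""
    _, sep, tail = generated_text.rpartition("[RESULT]")
    if not sep:
        return None
    match = re.match(r"\s*([AB])\b", tail)
    return match.group(1) if match else None


if __name__ == "__main__":
    prompt = build_relative_prompt("Q", "first answer", "second answer", "ref", "rubric")
    print(prompt)
    print(parse_preference("Feedback: B is more precise. [RESULT] B"))
```

The resulting prompt can be tokenized and passed to `model.generate` in the same way as the feedback evaluation prompt.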

Citations

@misc{kim2023prometheus,
    title={Prometheus: Inducing Fine-grained Evaluation Capability in Language Models},
    author={Seungone Kim and Jamin Shin and Yejin Cho and Joel Jang and Shayne Longpre and Hwaran Lee and Sangdoo Yun and Seongjin Shin and Sungdong Kim and James Thorne and Minjoon Seo},
    year={2023},
    eprint={2310.08491},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{kim2024prometheus,
    title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
    author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},
    year={2024},
    eprint={2405.01535},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
