Warning: this model might be overhyped

Opened by rombodawg

[Attached image: overhyped.png]

Hi. What datasets were used for the fine tuning?

@hiauiarau Like with V2.5, I didn't fine-tune; I only applied the last step of my method, which is to merge the already fine-tuned model (Llama-3.1-Nemotron-70B-Instruct-HF) with the original instruct and base models to reduce the loss introduced by fine-tuning. So my version is basically Llama-3.1-Nemotron-70B-Instruct-HF without the loss from tuning.
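For anyone trying to picture what that last step does, here is a minimal sketch of the underlying idea: plain task-vector averaging back onto the base model. It is not the exact TIES procedure mergekit runs (TIES additionally trims low-magnitude deltas and resolves sign conflicts before averaging), and the single-file paths are placeholders, since real 70B checkpoints are sharded across many files.

```python
# Minimal sketch of the "merge the fine-tune back with instruct + base" idea.
# NOT the exact TIES algorithm (which also trims small deltas and elects signs);
# paths are placeholders -- real 70B checkpoints are sharded across many files.
from safetensors.torch import load_file, save_file

base = load_file("llama-3.1-70b-base.safetensors")
instruct = load_file("llama-3.1-70b-instruct.safetensors")
nemotron = load_file("nemotron-70b-instruct.safetensors")

merged = {}
for name, base_w in base.items():
    # Task vectors: what each fine-tune changed relative to the shared base.
    delta_instruct = instruct[name] - base_w
    delta_nemotron = nemotron[name] - base_w
    # Equal 1:1 weighting of both deltas added back onto the base, mirroring the
    # weight: 1 / density: 1 settings in the mergekit config shared further down.
    merged[name] = base_w + (delta_instruct + delta_nemotron) / 2

save_file(merged, "merged-model.safetensors")
```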

Although I've learned that my method does have the downside of some instruction-following loss.

@rombodawg And what do you mean by instruction-following loss? When the instructions differ from one fine-tune to another?

No, it's mainly just a drop on "IFEval" on the Open LLM Leaderboard, which is supposed to be an instruction-following benchmark, but it could also be something else causing it.

What do you think it could be related to? And can it be overcome by preparing another LoRA adapter for the IFEval task?

The merge YAML:

models:
  - model: ./mergekit/models/llama-3.1-70b-instruct
    parameters:
      weight: 1
      density: 1
  - model: ./mergekit/models/Nemotron-70B-Instruct-HF
    parameters:
      weight: 1
      density: 1
merge_method: ties
base_model: ./mergekit/models/llama-3.1-70b-base
parameters:
  weight: 1
  density: 1
  normalize: true
  int8_mask: true
dtype: bfloat16
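For reference, a config like this is normally run through mergekit's `mergekit-yaml` command-line entry point. A minimal sketch, assuming mergekit is installed; the config filename and output directory below are placeholders:

```python
# Runs mergekit's CLI on the config above. Assumes `pip install mergekit`;
# "merge-config.yml" and "./merged-output" are placeholder names.
import subprocess

subprocess.run(
    ["mergekit-yaml", "merge-config.yml", "./merged-output"],
    check=True,
)
```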

So it uses llama-3.1-70b (base) + llama-3.1-70b-instruct + nemotron-70b-instruct with a regular 1:1:1 weighting.

Just to be sure I'm not missing anything in the picture: if I re-run that merge and compare the parameter values from the merge and the uploaded weights, they should match 1:1, right? There is no SFT/DPO or training session of any kind, correct?
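If you do want to check that, here is a minimal sketch of the comparison, assuming both the re-run merge and the downloaded upload sit in local directories (the paths are placeholders). For a 70B model you would realistically iterate over the safetensors shards rather than loading both full models at once:

```python
# Checks whether two checkpoints are numerically identical, parameter by parameter.
# Paths are placeholders; for 70B models, stream the safetensors shards instead of
# materializing both full models in memory.
import torch
from transformers import AutoModelForCausalLM

def checkpoints_match(path_a: str, path_b: str) -> bool:
    a = AutoModelForCausalLM.from_pretrained(path_a, torch_dtype=torch.bfloat16).state_dict()
    b = AutoModelForCausalLM.from_pretrained(path_b, torch_dtype=torch.bfloat16).state_dict()
    if a.keys() != b.keys():
        return False
    # torch.equal demands exact bit-for-bit equality; swap in torch.allclose
    # if the re-run used different hardware or a different accumulation order.
    return all(torch.equal(a[k], b[k]) for k in a)

print(checkpoints_match("./rerun-merge", "./downloaded-upload"))
```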

> No, it's mainly just a drop on "IFEval" on the Open LLM Leaderboard, which is supposed to be an instruction-following benchmark, but it could also be something else causing it.

Well, you are “averaging” the weights of an instruction fine-tuned model with its base model.
It's kind of intuitive that it would lose some instruction-following capacity.
