Leaderboard evaluation failed.

#3
by adamo1139 - opened

I was curious how this one would pan out on the leaderboard, but it failed evaluation for some reason.
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/brucethemoose/Yi-34B-200K-DARE-merge-v5_eval_request_False_bfloat16_Original.json

The 200K models are unreliable on the leaderboard. I think they can make the evaluation servers OOM because they try to allocate the full 200K context at load time.

HF staff just have to fix it manually; maybe we should flag it for them in a post on the requests page? I don't really care about the leaderboard position, I just want the best 200K model possible, and a datapoint against the other merges and 200K models would help :P

I will open a discussion on the leaderboard community page about this later today (in about 12 hours) unless I see you doing it first.
I assume you've only tried the model yourself via exl2/gguf quants due to limited VRAM, yes? Can you check whether it loads in transformers if you set load_in_4bit=True and manually edit max_position_embeddings to a lower value? I have limited bandwidth, so I can't download it this month to verify that myself.
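Roughly what I have in mind (untested on my side, so treat it as a sketch; the repo name comes from the request file linked above, and I'm assuming the merge exposes max_position_embeddings like other Yi/Llama-style configs):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

repo = "brucethemoose/Yi-34B-200K-DARE-merge-v5"

# Pull the config first and shrink the advertised context window,
# so the load doesn't account for the full 200K positions.
config = AutoConfig.from_pretrained(repo)
config.max_position_embeddings = 4096

model = AutoModelForCausalLM.from_pretrained(
    repo,
    config=config,
    load_in_4bit=True,   # bitsandbytes 4-bit quantization on the fly
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Quick smoke test: if this generates anything sensible, the weights load fine.
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```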

Yeah, bnb in transformers is how I always test them first, before quantizing. In fact, I go through a few merge variants with bnb 4-bit and pick the best one.

Transformers is quite a RAM hog at long context, though. I can fit 3K context with bnb, and 47K context with exllamav2 at 4bpw.
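For reference, this is roughly how I cap the context when loading the exl2 quant (the model directory is a placeholder for wherever the 4bpw quant lives, and the exllamav2 API may have shifted since I set this up):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Yi-34B-200K-DARE-merge-v5-4bpw-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 47104  # cap the context so the KV cache fits in VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # cache sized from the capped max_seq_len
model.load_autosplit(cache)               # split layers across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
print(generator.generate_simple("Hello", ExLlamaV2Sampler.Settings(), 20))
```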

adamo1139 changed discussion status to closed
