8-kv-heads
#17
opened by ArthurZ
No description provided.
@ArthurZ I'm waiting on this as well.
ArthurZ changed pull request status to open
ArthurZ changed pull request status to merged
Can you explain the precise rationale for why this change was made? The reason this configuration existed is that a 405B model at bf16 isn't loadable on 8 GPUs on any hardware we knew of. Is the intended use case one where the weights are loaded and then dynamically quantized, so that this configuration leads to faster and more efficient loads since the duplicate heads aren't needed?
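For anyone wanting to sanity-check the loading constraint mentioned above, here is a rough sketch. The 80 GB per-GPU figure assumes A100/H100-class hardware, and the `duplicate_kv_heads` helper is a hypothetical illustration of why duplicated KV heads appear in some configs; neither is taken from this PR:

```python
import torch

# Back-of-the-envelope: why a 405B model at bf16 doesn't fit on 8 GPUs
# (assumes 80 GB per GPU, i.e. A100/H100-class hardware).
params = 405e9
weights_gb = params * 2 / 1e9        # 2 bytes/param in bf16 -> ~810 GB
gpu_budget_gb = 8 * 80               # 640 GB total across 8 GPUs
assert weights_gb > gpu_budget_gb    # the weights alone already don't fit

# Hypothetical illustration of KV-head duplication: replicating the
# 8 native KV heads to 16 keeps num_key_value_heads divisible by a
# larger tensor-parallel degree, at the cost of storing duplicate weights.
def duplicate_kv_heads(w_kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat a (num_kv_heads, head_dim, hidden_size) weight n_rep times."""
    return w_kv.repeat_interleave(n_rep, dim=0)

w = torch.randn(8, 128, 16384)            # 8 KV heads
print(duplicate_kv_heads(w, 2).shape)     # torch.Size([16, 128, 16384])
```

Under that reading, dropping the duplicated heads would halve the K/V projection weights to download and shard, which would explain faster and more efficient loads when the model is quantized after loading.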