So this is like LoRA adapters, but for the GPU rich?
Does it need to keep all the models in VRAM? Or does it dynamically swap layers between RAM/disk and VRAM as needed? Thanks!
It loads them all into VRAM, but you can run them quantized, so you can go a long way with 7-8B models. We also recently released Kraken Lora, which routes to different LoRA adapters, further lowering VRAM use.
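If it helps to picture it: one shared base model stays in VRAM, and only the small adapter weights differ per expert, so routing a request is just a matter of activating a different adapter. Here's a rough sketch of that pattern using HuggingFace PEFT (the model IDs and adapter names are placeholders, not our actual setup):

```python
# Rough sketch of per-request LoRA adapter switching on one shared base model.
# Model and adapter names below are placeholders, not Kraken's actual config.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)

# Load several LoRA adapters on top of the same weights; only the small
# adapter matrices are added to VRAM, not full copies of the model.
model = PeftModel.from_pretrained(base, "my-org/code-lora", adapter_name="code")  # placeholder adapter
model.load_adapter("my-org/math-lora", adapter_name="math")                       # placeholder adapter

def generate(prompt: str, expert: str) -> str:
    model.set_adapter(expert)  # route this request to the chosen adapter
    inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(generate("Write a Python function to reverse a string.", expert="code"))
```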
Oh nice, Kraken Lora sounds great for my use case :-)
I was afraid it required loading all of the models, but LoRA support means I can have one single fast model and one big model that is better at reasoning, and then use LoRA adapters on top of them instead of keeping multiple full sets of weights, while still having an easy format like this to specify it. I'm also curious whether Kraken could self-update its routing mechanism while loaded, to optimize which expert it chooses, but that is something extra I expect to dig into myself, just airing some thoughts :-)
Thanks for your work!
Not a bad idea, and it's possible. Training a 0.5B classifier model is compute-efficient.
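Roughly: fine-tune a small backbone with a classification head on (prompt, expert) pairs, then route by argmax over the logits at inference time. A minimal sketch of that kind of router (the 0.5B backbone, expert labels, and data are placeholders, not our actual router):

```python
# Rough sketch of fine-tuning a small classifier to route prompts to experts.
# The backbone, labels, and training data below are placeholders.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

experts = ["fast-general", "reasoning", "code"]  # hypothetical expert labels

# Tiny illustrative dataset of (prompt, expert index) pairs.
data = Dataset.from_dict({
    "text": [
        "What's the capital of France?",
        "Prove that the sum of two even numbers is even.",
        "Write a quicksort in Rust.",
    ],
    "label": [0, 1, 2],
})

base_id = "Qwen/Qwen2-0.5B"  # placeholder ~0.5B backbone
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForSequenceClassification.from_pretrained(base_id, num_labels=len(experts))
model.config.pad_token_id = tokenizer.pad_token_id  # needed for batched classification

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="router", num_train_epochs=3, per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()

# At inference, argmax over the router's logits picks which expert
# (or LoRA adapter) handles the prompt.
```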