Quantized model coming?
Do you plan to release a quantized version, or do you have tips on the best way to quantize this model for local inference?
Hi, I just uploaded two quantized versions: an 8-bit and a 4-bit, converted with BitsAndBytes. Would be curious how they work for you!
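In case anyone wants to reproduce the conversion, here is a minimal sketch of a typical BitsAndBytes quantization pass with transformers. The NF4 quant type and bfloat16 compute dtype are illustrative choices, not necessarily the exact settings used for these uploads:

```python
# Minimal sketch: quantize on load with BitsAndBytes, then save the result.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

path = "OpenGVLab/InternVL-Chat-V1-5"

# 4-bit config; swap for BitsAndBytesConfig(load_in_8bit=True) for the 8-bit variant.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # illustrative choice
    bnb_4bit_compute_dtype=torch.bfloat16,   # illustrative choice
)

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
)

# Saving bitsandbytes-quantized weights requires a reasonably recent transformers release.
model.save_pretrained("InternVL-Chat-V1-5-4bit")
tokenizer.save_pretrained("InternVL-Chat-V1-5-4bit")
```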
A GGUF would be nice for people who do not have NVIDIA GPUs.
UPD: oh, it is not supported yet: https://github.com/ggerganov/llama.cpp/issues/6803
Thank you for your interest. We plan to release the quantized model in the next few days. We have successfully implemented 8-bit quantization but encountered some issues with 4-bit.
Do you plan to look into the issues with the 4-bit version?
https://huggingface.co/failspy/InternVL-Chat-V1-5-4bit
This fits on my 3090, but the response is just an empty string "" for my test image with the following prompt (question):
```python
question = "Please describe the picture in detail"
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(question, response)
```
I was following https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5#model-usage, just switched path = "OpenGVLab/InternVL-Chat-V1-5" to path = "failspy/InternVL-Chat-V1-5-4bit" and used load_in_4bit instead of load_in_8bit, but got "" as the response from the model. :( At least I hope this means I will eventually be able to run a working one; the online demo impressed me big time!
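For reference, a sketch of what I am running, adapted from the model-card usage snippet; the only changes are the path and the 4-bit flag:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# path switched from "OpenGVLab/InternVL-Chat-V1-5" to the 4-bit repo
path = "failspy/InternVL-Chat-V1-5-4bit"

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    load_in_4bit=True,        # instead of load_in_8bit=True
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# pixel_values, generation_config, and model.chat(...) are unchanged from the
# model-card snippet (its load_image helper, then the chat call quoted above).
```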
I have the same issue with failspy/InternVL-Chat-V1-5-4bit on my 3090 Ti... possibly something with the quantization_config in the config.json?
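If someone wants to check that guess, here is a quick way to dump the quantization_config the repo actually ships (purely to inspect, not a fix):

```python
# Print the quantization_config section of the repo's config.json to see which
# BitsAndBytes settings (compute dtype, quant type, ...) the model was saved with.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("failspy/InternVL-Chat-V1-5-4bit", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

print(json.dumps(cfg.get("quantization_config", {}), indent=2))
```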
Possibly? I'm curious whether anyone has actually run this locally in 24 GB of VRAM...
I tried it locally on an NVIDIA A10G with 22 GB VRAM using the AQM quant; not enough memory. How much VRAM is needed in total for the 4-bit quant?