FP8 quantized model now available! (requires only half the original model's VRAM)

#33
by mysticbeing - opened

Runs on a single H100 or A100 (80 GB): https://huggingface.co/mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC

Weight-and-activation quantization to FP8 is virtually lossless: text generated by the FP8 model is nearly indistinguishable from the unquantized model's output, and it takes close examination to spot any differences.
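For intuition, here is a minimal NumPy sketch of the per-tensor dynamic scaling idea behind FP8 (e4m3) quantization: pick a scale so the largest activation maps to the e4m3 maximum (448), round to the format's 4 significant bits, then dequantize. This is an illustrative simulation only (it ignores exponent underflow and subnormals), not the actual kernel or library implementation used for this checkpoint.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3


def round_to_e4m3(x):
    """Simulate e4m3 mantissa rounding: 3 stored mantissa bits plus the
    implicit leading bit = 4 significant bits. Exponent range limits are
    ignored for simplicity (illustrative sketch only)."""
    m, e = np.frexp(x)            # x = m * 2**e, with 0.5 <= |m| < 1
    m = np.round(m * 16.0) / 16.0  # keep 4 significant bits
    return np.ldexp(m, e)


def fp8_dynamic_quant(x):
    """Per-tensor dynamic FP8 quantization: scale so max |x| hits the
    e4m3 maximum, round, then dequantize back to high precision."""
    scale = np.abs(x).max() / FP8_E4M3_MAX
    q = round_to_e4m3(x / scale)
    return q * scale


x = np.random.default_rng(0).normal(size=1024)
xq = fp8_dynamic_quant(x)
# With 4 significant bits, per-element relative error is bounded by 1/32.
rel_err = np.abs(x - xq).max() / np.abs(x).max()
```

Because the scale is recomputed from each tensor's own max ("dynamic" quantization), no calibration dataset is needed, which is one reason the quality loss is so small in practice.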
