fbaldassarri/meta-llama_Llama-3.1-8B-Instruct-auto_round-int4-gs128-sym

Text Generation

text-generation-inference

4-bit precision

intel/auto-round


fbaldassarri committed 3 days ago

Commit 56cbb02 • 1 Parent(s): a9df14a

Upload README.md

Files changed (1)

README.md +91 -3


@@ -1,3 +1,91 @@
----
-license: llama3.1
----

+---
+language:
+- en
+- de
+- fr
+- it
+- pt
+- hi
+- es
+- th
+license: llama3.1
+library_name: transformers
+tags:
+- autoround
+- intel
+- gptq
+- woq
+- meta
+- pytorch
+- llama
+- llama-3
+model_name: Llama 3.1 8B Instruct
+base_model: meta-llama/Llama-3.1-8B-Instruct
+inference: false
+model_creator: meta-llama
+pipeline_tag: text-generation
+prompt_template: '{prompt}
+  '
+quantized_by: fbaldassarri
+---
+## Model Information
+Quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), using torch.float32 for quantization tuning.
+- 4 bits (INT4)
+- group size = 128
+- symmetric quantization
+Fast, low-memory inference: 2-3X speedup, with a slight accuracy drop at W4G128 (4-bit weights, group size 128).
+Quantization framework: [Intel AutoRound](https://github.com/intel/auto-round)
+Note: this INT4 version of Llama-3.1-8B-Instruct has been quantized to run inference on CPU.
+## Replication Recipe
+### Step 1 Install Requirements
+I suggest installing the requirements into a dedicated Python virtualenv or a conda environment.
+```
+python -m pip install <package> --upgrade
+```
+- accelerate==1.0.1
+- auto_gptq==0.7.1
+- neural_compressor==3.1
+- torch==2.3.0+cpu
+- torchaudio==2.5.0+cpu
+- torchvision==0.18.0+cpu
+- transformers==4.45.2
+### Step 2 Build Intel AutoRound wheel from source
+```
+python -m pip install git+https://github.com/intel/auto-round.git
+```
+### Step 3 Script for Quantization
+```
+# Load the full-precision base model and its tokenizer
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from auto_round import AutoRound
+
+model_name = "meta-llama/Llama-3.1-8B-Instruct"
+model = AutoModelForCausalLM.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+# 4-bit weights, group size 128, symmetric quantization
+bits, group_size, sym = 4, 128, True
+autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym)
+autoround.quantize()
+
+# Save the quantized weights in the auto_round format
+output_dir = "./AutoRound/meta-llama_Llama-3.1-8B-Instruct-auto_round-int4-gs128-sym"
+autoround.save_quantized(output_dir, format='auto_round', inplace=True)
+```
+## License
+[Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)
+## Disclaimer
+This quantized model comes with no warranty. It has been developed for research purposes only.
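
The scheme listed under Model Information (4 bits, group size 128, symmetric) can be illustrated with a minimal pure-Python sketch. It uses plain round-to-nearest with one scale per group and the zero-point fixed at 0. This is not AutoRound's tuned rounding and not its API: the names `quantize_group`, `dequantize_group`, and `quantize_row`, and the choice to map the largest magnitude in a group to the code 7, are assumptions made purely for illustration.

```python
# Illustrative sketch only: plain round-to-nearest, NOT AutoRound's tuned rounding.
from typing import List, Tuple

QMAX = 7          # assumed: largest magnitude is mapped to code 7 (signed 4-bit range is -8..7)
GROUP_SIZE = 128  # number of weights sharing one scale, as in "gs128"


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize one group to signed 4-bit codes with a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric quantization: no zero-point, only a scale. Guard all-zero groups.
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [max(-QMAX - 1, min(QMAX, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Map 4-bit codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(row: List[float], group_size: int = GROUP_SIZE) -> List[Tuple[List[int], float]]:
    """Split a weight row into consecutive groups and quantize each independently."""
    return [quantize_group(row[i:i + group_size]) for i in range(0, len(row), group_size)]
```

Each group stores 128 four-bit codes plus one scale, which is where the memory saving comes from. Under these assumptions the per-weight reconstruction error is bounded by half the group's scale.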