fbaldassarri committed on
Commit 56cbb02
1 Parent(s): a9df14a

Upload README.md

Files changed (1): README.md (+91 -3)

---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.1
library_name: transformers
tags:
- autoround
- intel
- gptq
- woq
- meta
- pytorch
- llama
- llama-3
model_name: Llama 3.1 8B Instruct
base_model: meta-llama/Llama-3.1-8B-Instruct
inference: false
model_creator: meta-llama
pipeline_tag: text-generation
prompt_template: '{prompt}
'
quantized_by: fbaldassarri
---

## Model Information

Quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) using torch.float32 for quantization tuning.
- 4 bits (INT4)
- group size = 128
- symmetric quantization

Fast and low-memory: roughly a 2-3x speedup, with a slight accuracy drop at W4G128.

Quantization framework: [Intel AutoRound](https://github.com/intel/auto-round)

Note: this INT4 version of Llama-3.1-8B-Instruct has been quantized to run inference on CPU.

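To make the W4G128 recipe concrete, here is a toy sketch (my illustration, not AutoRound's actual algorithm or kernels) of symmetric group-wise INT4 round-trip quantization: each group of 128 weights shares one scale derived from the group's absolute maximum.

```
import torch

def fake_quant_w4_sym(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    # Toy symmetric group-wise INT4 round-trip. Conventions vary; here the
    # scale maps each group's absmax to the integer level 7.
    groups = w.reshape(-1, group_size)
    scale = groups.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7)  # 4-bit integer codes
    return (q * scale).reshape(w.shape)  # dequantized ("fake-quantized") weights

w = torch.randn(4, 256)
err = (w - fake_quant_w4_sym(w)).abs().max()
print(f"max round-trip error: {err:.4f}")
```

AutoRound goes beyond this naive nearest rounding by tuning the rounding decisions with signed gradient descent, which is what recovers most of the accuracy at 4 bits.
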
## Replication Recipe

### Step 1 Install Requirements

I suggest installing the requirements into a dedicated Python virtualenv or conda environment (a combined install command for the pinned versions follows the list).

```
python -m pip install <package> --upgrade
```

- accelerate==1.0.1
- auto_gptq==0.7.1
- neural_compressor==3.1
- torch==2.3.0+cpu
- torchaudio==2.5.0+cpu
- torchvision==0.18.0+cpu
- transformers==4.45.2

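For example, assuming a CPU-only setup, the pinned versions above can be installed in one command; the `+cpu` builds of torch, torchaudio, and torchvision come from the PyTorch CPU wheel index, hence the extra index URL:

```
python -m pip install --upgrade \
  accelerate==1.0.1 auto_gptq==0.7.1 neural_compressor==3.1 \
  torch==2.3.0+cpu torchaudio==2.5.0+cpu torchvision==0.18.0+cpu \
  transformers==4.45.2 \
  --extra-index-url https://download.pytorch.org/whl/cpu
```
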
### Step 2 Build Intel AutoRound wheel from source

```
python -m pip install git+https://github.com/intel/auto-round.git
```

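A quick smoke test (my suggestion, not an official check) confirms the package imports:

```
python -c "from auto_round import AutoRound; print('auto_round import OK')"
```
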
### Step 3 Script for Quantization

```
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

# Load the base model and tokenizer in full precision (torch.float32)
model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# W4G128: 4-bit weights, group size 128, symmetric quantization
bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym)
autoround.quantize()

# Export the quantized model in the auto_round format
output_dir = "./AutoRound/meta-llama_Llama-3.1-8B-Instruct-auto_round-int4-gs128-sym"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```

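To sanity-check the exported model (and the CPU inference path noted above), the saved directory can be loaded back through transformers. This is a sketch following the usage pattern in the AutoRound README; the test prompt is arbitrary:

```
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig  # must be imported so transformers can load the auto_round format

# Directory written by save_quantized() in Step 3
output_dir = "./AutoRound/meta-llama_Llama-3.1-8B-Instruct-auto_round-int4-gs128-sym"

model = AutoModelForCausalLM.from_pretrained(output_dir, device_map="cpu")
tokenizer = AutoTokenizer.from_pretrained(output_dir)

prompt = "What does INT4 quantization trade off?"  # arbitrary test prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
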
84
+ ## License
85
+
86
+ [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)
87
+
88
+ ## Disclaimer
89
+
90
+ This quantized model comes with no warrenty. It has been developed only for research purposes.
91
+