---
base_model: google/gemma-2-9b-it
language:
- multilingual
datasets:
- TFMC/imatrix-dataset-for-japanese-llm
library_name: transformers
license: gemma
license_link: https://ai.google.dev/gemma/terms
pipeline_tag: text-generation
tags:
- nlp
- code
quantized_by: ymcki
widget:
- messages:
  - role: user
    content: Can you provide ways to eat combinations of bananas and dragonfruits?
---

Original model: https://huggingface.co/google/gemma-2-9b-it

## Description

The purpose of this repository is to test whether a Japanese-specific
imatrix can improve the performance of a model that was not optimized for Japanese.

It also provides the Q4_0_8_8, Q4_0_4_8 and Q4_0_4_4 ggufs for edge
devices, which bartowski did not otherwise provide. These models should
also be suitable for edge devices with 16GB RAM.

## Prompt format

```
<start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
<end_of_turn>
<start_of_turn>model

```

Note that this model does not support a system prompt.
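
For reference, here is a minimal sketch of passing this format to llama.cpp's `llama-cli` (the model filename and prompt text are placeholders):

```
# Bash $'...' quoting turns \n into real newlines so the template is sent as shown above.
./llama-cli -m gemma-2-9b-it.Q4_0.gguf \
  -p $'<start_of_turn>user\nWrite a haiku about autumn.<end_of_turn>\n<start_of_turn>model\n' \
  -n 256
```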

## Download a file (not the whole branch) from below:

ELYZA-tasks-100 is a fairly standard benchmark for Japanese LLMs.
The perfect score is 5.00. As a reference, bartowski's gemma-2-27b-it.Q6_K.gguf scores 4.04.

| Filename | Quant type | File Size | Split | ELYZA-tasks-100 | Nvidia 3090 (t/s) | Description |
| -------- | ---------- | --------- | ----- | --------------- | ----------------- | ----------- |
| [gemma-2-9b-it.f16.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.f16.gguf) | f16 | 18.49GB | false | 3.75 | 31.9 | Full F16 weights. |
| [gemma-2-9b-it.Q8_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q8_0.gguf) | Q8_0 | 9.83GB | false | 3.06 | 56.1 | Extremely high quality, *recommended for edge devices with 16GB RAM*. |
| [gemma-2-2b-jpn-it-imatrix.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0.gguf) | Q4_0 | 1.63GB | false | 2.89 | 137 | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf) | Q4_0_8_8 | 1.63GB | false | TBD | TBD | Good quality, *recommended for edge devices with <8GB RAM*. |
| [gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf) | Q4_0_4_8 | 1.63GB | false | TBD | TBD | Good quality, *recommended for edge devices with <8GB RAM*. |
| [gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf) | Q4_0_4_4 | 1.63GB | false | TBD | TBD | Good quality, *recommended for edge devices with <8GB RAM*. |
| [gemma-2-9b-it.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-9b-it.Q4_0.gguf) | Q4_0 | 5.44GB | false | 3.64 | 65.1 | Good quality, *recommended for edge devices with 8GB RAM*. |
| [gemma-2-2b-jpn-it.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_8_8.gguf) | Q4_0_8_8 | 1.63GB | false | TBD | TBD | Good quality, *recommended for edge devices with <8GB RAM*. |
| [gemma-2-2b-jpn-it.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_4_8.gguf) | Q4_0_4_8 | 1.63GB | false | TBD | TBD | Good quality, *recommended for edge devices with <8GB RAM*. |
| [gemma-2-2b-jpn-it.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-9b-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_4_4.gguf) | Q4_0_4_4 | 1.63GB | false | TBD | TBD | Good quality, *recommended for edge devices with <8GB RAM*. |

## How to check i8mm and sve support for ARM devices

ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. All ARM architectures >= ARMv8.6-A support i8mm.

ARM sve support is necessary to take advantage of the Q4_0_8_8 gguf. sve is an optional feature starting from ARMv8.2-A, but the majority of ARM chips do not implement it.

For ARM devices with neither feature, it is recommended to use Q4_0_4_4.

With these features supported, inference speed should increase in the order Q4_0_8_8 > Q4_0_4_8 > Q4_0_4_4 > Q4_0, without much effect on the quality of the responses.

This is a [list](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) of ARM SoCs and the instructions they support. It appears to be only a partial list, so it is better to check for i8mm and sve support yourself.

For Apple devices, run:

```
sysctl hw
```
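
The full `sysctl hw` output is long. As a rough shortcut (a sketch, assuming recent macOS on Apple Silicon, where CPU features are exposed as `hw.optional.arm.FEAT_*` keys), you can filter for the relevant flags; a value of 1 means the feature is present, and no matching line generally means it is not:

```
sysctl hw | grep -i -E 'i8mm|sve'
```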

For other ARM devices (i.e. most Android devices), run:

```
cat /proc/cpuinfo
```
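
The i8mm and sve flags, if supported, appear in the `Features` lines. A minimal sketch to list just those flags:

```
# Prints each supported flag of interest once; empty output means neither is supported.
grep -o -E 'i8mm|sve' /proc/cpuinfo | sort -u
```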

There are also Android apps that can display /proc/cpuinfo.

I was told that for Intel/AMD CPU inference, AVX2/AVX512 support can also improve the performance of Q4_0_8_8.

On the other hand, Nvidia 3090 inference is significantly faster with Q4_0 than with the other ggufs, so for GPU inference you are better off using Q4_0.

## Which Q4_0 model to use for ARM devices

| Brand | Series | Model | i8mm | sve | Quant Type |
| ----- | ------ | ----- | ---- | --- | ---------- |
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
| Apple | M | M1 | No | No | Q4_0_4_4 |
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |
| Google | Tensor | G1, G2 | No | No | Q4_0_4_4 |
| Google | Tensor | G3, G4 | Yes | Yes | Q4_0_8_8 |
| Samsung | Exynos | 2200, 2400 | Yes | Yes | Q4_0_8_8 |
| Mediatek | Dimensity | 9000 | Yes | Yes | Q4_0_8_8 |
| Mediatek | Dimensity | 9300 | Yes | No | Q4_0_4_8 |
| Qualcomm | Snapdragon | 8 Gen 1 | Yes | Yes | Q4_0_8_8 |
| Qualcomm | Snapdragon | 8 Gen 2, 8 Gen 3, X Elite | Yes | No | Q4_0_4_8 |
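
If your device is not in the table, the same decision can be scripted from /proc/cpuinfo on Linux/Android shells. This is only a rough sketch following the table's logic, not an official tool:

```
# Pick a Q4_0 variant from the CPU feature flags.
features=$(grep -o -E 'i8mm|sve' /proc/cpuinfo | sort -u)
case "$features" in
  *sve*)  echo "Use Q4_0_8_8" ;;
  *i8mm*) echo "Use Q4_0_4_8" ;;
  *)      echo "Use Q4_0_4_4" ;;
esac
```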

## imatrix quantization

According to this [blog](https://sc-bakushu.hatenablog.com/entry/2024/04/20/050213), adding an imatrix to low-bit quants can significantly improve performance. The best dataset for Japanese is [TFMC/imatrix-dataset-for-japanese-llm](https://huggingface.co/datasets/TFMC/imatrix-dataset-for-japanese-llm). Therefore, I also created imatrix versions of the different Q4_0 quants.

However, based on my benchmarking results, the difference is not significant.

## Convert safetensors to f16 gguf

Make sure you have llama.cpp cloned, then run from the llama.cpp directory:

```
python3 convert_hf_to_gguf.py gemma-2-2b-jpn-it/ --outfile gemma-2-2b-jpn-it.f16.gguf --outtype f16
```
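
If you do not have llama.cpp set up yet, a minimal setup might look like the following. This is only a sketch: the pip step installs the Python dependencies of convert_hf_to_gguf.py, and the Makefile build is assumed here because it places the binaries in the repository root as used below (a cmake build puts them under build/bin instead; build options vary by platform):

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt   # Python deps for convert_hf_to_gguf.py
make -j                           # builds llama-quantize, llama-imatrix, llama-cli in the repo root
```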

## Convert f16 gguf to Q8_0 gguf without imatrix

Make sure you have llama.cpp compiled:

```
./llama-quantize gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it.Q8_0.gguf q8_0
```

## Convert f16 gguf to other ggufs with imatrix

First, prepare the imatrix from the f16 gguf and c4_en_ja_imatrix.txt:

```
./llama-imatrix -m gemma-2-2b-jpn-it.f16.gguf -f c4_en_ja_imatrix.txt -o gemma-2-2b-jpn-it.imatrix --chunks 32
```

Then, quantize the f16 gguf with the imatrix to create an imatrix gguf:

```
./llama-quantize --imatrix gemma-2-2b-jpn-it.imatrix gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf q4_0_8_8
```
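
The other imatrix Q4_0 variants are produced the same way, only with a different target quant type, e.g. (a sketch following the command above):

```
./llama-quantize --imatrix gemma-2-2b-jpn-it.imatrix gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf q4_0_4_8
./llama-quantize --imatrix gemma-2-2b-jpn-it.imatrix gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf q4_0_4_4
```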

## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download ymcki/gemma-2-9b-it-GGUF --include "gemma-2-9b-it.Q8_0.gguf" --local-dir ./
```
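
To grab several quants at once, `--include` also accepts glob patterns (a sketch; adjust the pattern to the files you actually want):

```
huggingface-cli download ymcki/gemma-2-9b-it-GGUF --include "*Q4_0*.gguf" --local-dir ./
```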

## Credits

Thank you bartowski for providing a README.md to get me started.

Thank you YoutechA320U for the ELYZA-tasks-100 auto evaluation tool.