tobchef/Gemma2-9B-IT-Simpo-Infinity-Preference-Q4_K_M-GGUF

This model was converted to GGUF format from BAAI/Gemma2-9B-IT-Simpo-Infinity-Preference using llama.cpp via the ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

个人使用心得

我的使用场景

受限于3060Ti的8G显存,要全部加载到GPU推理,只能使用9B以内的Q4级别的量化gguf模型
我的应用场景主要是文档翻译和本地知识库RAG查询,重点关注模型的语言/推理/指令跟随能力所以很长一段时间内,稳定使用的是Qwen2-7B

9B内开源模型

参考大模型榜单opencompass ,在9B以内开源模型中 Gemma-2-9B-it虽然中文语言大幅落后于Qwen2-7B-Instruct,但中文指令跟随却一骑绝尘甚至超过Qwen2-72B-Instruct 经过BAAI微调增强了中文语言能力之后的Gemma2-9B-IT-Simpo-Infinity-Preference算是补足了短板,我的实际试用体验也是超过了Qwen2-7B

量化心得

Q4_K_M虽然体积和显存占用略高于IQ3~4,但推理速度更快,大多数情况下都是最优选择.
Imatrix量化能提升表现,而且使用不同的Imatrix数据集也会影响表现,对于Gemma2-9B-IT-Simpo-Infinity-Preference-Q4_K_M-GGUF 我对比了以下数据集:
经典的wiki.train,全英文,txt体积10M
wikipedia-cn,我从中再次筛选纯中英文(不含日法俄等词汇)的最短的前25%条目,txt体积10M
《思考，快与慢》中译本全文.txt,体积0.9M

测试结果:

中文Imatrix数据集相比英文会提升中文翻译的表现
《思考，快与慢》数据集能明显改善翻译所用的中文词汇
综合表现是wikipedia-cn最优,特别是一些复杂长句,能最准确的表达原意,虽然某些用词不如《思考，快与慢》,可能是因为《思考，快与慢》作为数据集过于单一不够多样化.

所以本模型使用的Imatrix数据集就是wikipedia-cn,我从中再次筛选纯中英文(不含日法俄等词汇)的最短的前25%条目,txt体积10M

使用后记

使用了几天之后发现以下问题，还是退回使用qwen2-7b了：

Gemma2-9B-IT-Simpo-Infinity-Preference的优势似乎在于理解能力，能更准确的理解指令意图，更准确的理解文本之间的上下文逻辑关系，但是执行和生成能力却很不稳定，特别是文本较长时。
量化模型的推理运行也很不稳定，llama-server容易无故崩溃，挂机的时候很杯具。
qwen2-7b换用英文prompt替代中文prompt时，表现会提升，一定程度拉近了差距。而且qwen最大的优势还是全能和稳定，稳定，稳定。

Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)

brew install llama.cpp

Invoke the llama.cpp server or the CLI.

CLI:

llama-cli --hf-repo tobchef/Gemma2-9B-IT-Simpo-Infinity-Preference-Q4_K_M-GGUF --hf-file gemma2-9b-it-simpo-infinity-preference-q4_k_m-imat.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo tobchef/Gemma2-9B-IT-Simpo-Infinity-Preference-Q4_K_M-GGUF --hf-file gemma2-9b-it-simpo-infinity-preference-q4_k_m-imat.gguf -c 2048

Note: You can also use this checkpoint directly through the usage steps listed in the Llama.cpp repo as well.

Step 1: Clone llama.cpp from GitHub.

git clone https://github.com/ggerganov/llama.cpp

Step 2: Move into the llama.cpp folder and build it with LLAMA_CURL=1 flag along with other hardware-specific flags (for ex: LLAMA_CUDA=1 for Nvidia GPUs on Linux).

cd llama.cpp && LLAMA_CURL=1 make

Step 3: Run inference through the main binary.

./llama-cli --hf-repo tobchef/Gemma2-9B-IT-Simpo-Infinity-Preference-Q4_K_M-GGUF --hf-file gemma2-9b-it-simpo-infinity-preference-q4_k_m-imat.gguf -p "The meaning to life and the universe is"

./llama-server --hf-repo tobchef/Gemma2-9B-IT-Simpo-Infinity-Preference-Q4_K_M-GGUF --hf-file gemma2-9b-it-simpo-infinity-preference-q4_k_m-imat.gguf -c 2048

tobchef
/

Gemma2-9B-IT-Simpo-Infinity-Preference-Q4_K_M-GGUF