Hi. Thank you very much for the nice model. I tried to make quants using the gguf-my-repo HF space, but got this error:
Error: Error quantizing:
```
main: build = 2824 (4426e298)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: quantizing 'tweety-7b-armenian-v24a/tweety-7b-armenian-v24a.fp16.bin' to 'tweety-7b-armenian-v24a/tweety-7b-armenian-v24a.Q8_0.gguf' as Q8_0
llama_model_loader: loaded meta data with 22 key-value pairs and 49 tensors from tweety-7b-armenian-v24a/tweety-7b-armenian-v24a.fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture                      str = llama
llama_model_loader: - kv   1: general.name                              str = .
llama_model_loader: - kv   2: llama.vocab_size                          u32 = 32000
llama_model_loader: - kv   3: llama.context_length                      u32 = 32768
llama_model_loader: - kv   4: llama.embedding_length                    u32 = 4096
llama_model_loader: - kv   5: llama.block_count                         u32 = 32
llama_model_loader: - kv   6: llama.feed_forward_length                 u32 = 14336
llama_model_loader: - kv   7: llama.rope.dimension_count                u32 = 128
llama_model_loader: - kv   8: llama.attention.head_count                u32 = 32
llama_model_loader: - kv   9: llama.attention.head_count_kv             u32 = 8
llama_model_loader: - kv  10: llama.attention.layer_norm_rms_epsilon    f32 = 0.000010
llama_model_loader: - kv  11: llama.rope.freq_base                      f32 = 10000.000000
llama_model_loader: - kv  12: general.file_type                         u32 = 1
llama_model_loader: - kv  13: tokenizer.ggml.model                      str = llama
llama_model_loader: - kv  14: tokenizer.ggml.tokens                     arr[str,32000] = ["", "", "", "", "\xe2\x96...
llama_model_loader: - kv  15: tokenizer.ggml.scores                     arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16: tokenizer.ggml.token_type                 arr[i32,32000] = [2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17: tokenizer.ggml.bos_token_id               u32 = 1
llama_model_loader: - kv  18: tokenizer.ggml.eos_token_id               u32 = 2
llama_model_loader: - kv  19: tokenizer.ggml.unknown_token_id           u32 = 0
llama_model_loader: - kv  20: tokenizer.ggml.add_bos_token              bool = true
llama_model_loader: - kv  21: tokenizer.ggml.add_eos_token              bool = true
llama_model_loader: - type  f32: 10 tensors
llama_model_loader: - type  f16: 39 tensors
GGML_ASSERT: llama.cpp:14705: (qs.n_attention_wv == 0 || qs.n_attention_wv == (int)model.hparams.n_layer) && "n_attention_wv is unexpected"
Aborted (core dumped)
```
Could it be fixed somehow? Is there any way to make a GGUF for your model?
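For reference, my understanding is that the space essentially runs the standard llama.cpp conversion and quantization steps, roughly like this (just a sketch; the exact script and binary names depend on the llama.cpp version, and the paths are placeholders):

```sh
# Convert the Hugging Face checkpoint to an fp16 GGUF
# (the script is named convert_hf_to_gguf.py in newer llama.cpp builds)
python convert-hf-to-gguf.py ./tweety-7b-armenian-v24a --outtype f16 \
    --outfile tweety-7b-armenian-v24a.fp16.gguf

# Quantize the fp16 GGUF to Q8_0
# (the binary is called llama-quantize in newer builds)
./quantize tweety-7b-armenian-v24a.fp16.gguf tweety-7b-armenian-v24a.Q8_0.gguf Q8_0
```

The quantize step is the one that aborts with the GGML_ASSERT above.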
Hi! I have no experience with this and this is not my model, so I can't say for sure, but your output seems to mention Llama while this model is based on Mistral. Maybe that's the issue?
Good luck with your further investigation, and don't hesitate to share your status; I wouldn't mind debugging this further with you.
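In case it helps, one thing we could check is the intermediate fp16 GGUF itself, for instance with the gguf-dump script that ships with llama.cpp's gguf Python package (a sketch, assuming you can grab that file locally; I haven't run this against this model myself):

```sh
# Install the gguf Python package, which provides a gguf-dump helper
pip install gguf

# Dump all metadata keys and the full tensor list of the fp16 file
gguf-dump tweety-7b-armenian-v24a/tweety-7b-armenian-v24a.fp16.bin

# The failing assert compares the number of attn_v tensors to the layer count
# (32 in your log), so counting them in the dump could narrow things down
gguf-dump tweety-7b-armenian-v24a/tweety-7b-armenian-v24a.fp16.bin | grep -c attn_v
```

Your log also reports only 49 tensors in total, which looks low for a 32-layer model, so the tensor list itself might already show whether something went missing during conversion.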