---
base_model: mistralai/Mistral-7B-Instruct-v0.3
language:
- en
pipeline_tag: text-generation
license: apache-2.0
model_creator: Mistral AI
model_name: Mistral-7B-Instruct-v0.3
model_type: mistral
quantized_by: CISC
---

# Mistral-7B-Instruct-v0.3 - SOTA GGUF
- Model creator: [Mistral AI](https://huggingface.co/mistralai)
- Original model: [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)

## Description

This repo contains State Of The Art quantized GGUF format model files for [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3).

Quantization was done with an importance matrix that was trained on ~1M tokens (256 batches of 4096 tokens) of [groups_merged.txt](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384) and [wiki.train.raw](https://raw.githubusercontent.com/pytorch/examples/main/word_language_model/data/wikitext-2/train.txt) concatenated.

The embedded chat template has been extended to support function calling via the OpenAI-compatible `tools` parameter; see the [example](#simple-llama-cpp-python-example-function-calling-code).

## Prompt template: Mistral v3

```
[AVAILABLE_TOOLS] [{"name": "function_name", "description": "Description", "parameters": {...}}, ...][/AVAILABLE_TOOLS][INST] {prompt}[/INST]
```

A minimal sketch of assembling this prompt by hand follows the Compatibility section below.

## Compatibility

These quantised GGUFv3 files are compatible with llama.cpp from February 27th 2024 onwards, as of commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307).

They are also compatible with many third-party UIs and libraries, provided they are built against a recent llama.cpp.
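If you drive the model without the embedded chat template, the prompt shown above can be assembled by hand. A minimal sketch, using an illustrative `get_current_weather` tool definition (BOS insertion and exact whitespace are normally handled by the tokenizer and chat template, so treat this as approximate):

```python
import json

# Illustrative tool definition; any OpenAI-style function schema works here.
tools = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}]

# Mistral v3 function calling prompt, per the template above.
prompt = (
    f"[AVAILABLE_TOOLS] {json.dumps(tools)}[/AVAILABLE_TOOLS]"
    "[INST] What's the weather like in Oslo?[/INST]"
)
print(prompt)
```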
## Explanation of quantisation methods

<details>
  <summary>Click to see details</summary>

The new methods available are:

* GGML_TYPE_IQ1_S - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.56 bits per weight (bpw)
* GGML_TYPE_IQ1_M - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.75 bpw
* GGML_TYPE_IQ2_XXS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.06 bpw
* GGML_TYPE_IQ2_XS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.31 bpw
* GGML_TYPE_IQ2_S - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.5 bpw
* GGML_TYPE_IQ2_M - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.7 bpw
* GGML_TYPE_IQ3_XXS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.06 bpw
* GGML_TYPE_IQ3_XS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.3 bpw
* GGML_TYPE_IQ3_S - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.44 bpw
* GGML_TYPE_IQ3_M - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.66 bpw
* GGML_TYPE_IQ4_XS - 4-bit quantization in super-blocks with an importance matrix applied, effectively using 4.25 bpw
* GGML_TYPE_IQ4_NL - 4-bit non-linearly mapped quantization with an importance matrix applied, effectively using 4.5 bpw

Refer to the Provided Files table below to see what files use which methods, and how.
</details>
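As a rough sanity check on these bpw figures, a file's size can be estimated from the parameter count. A minimal sketch, assuming ~7.25B parameters for Mistral-7B (an approximation; actual files mix quant types across tensors, so real sizes deviate from this estimate):

```python
# Rough GGUF size estimate: parameters * bits-per-weight, converted to bytes.
PARAMS = 7.25e9  # approximate parameter count of Mistral-7B (assumption)

def approx_size_gb(bpw: float) -> float:
    return PARAMS * bpw / 8 / 1e9  # bits -> bytes -> decimal GB

# Compare against the Provided Files table below; e.g. IQ4_XS at 4.25 bpw:
print(f"IQ4_XS ~ {approx_size_gb(4.25):.1f} GB")
```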
## Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| [Mistral-7B-Instruct-v0.3.IQ1_S.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ1_S.gguf) | IQ1_S | 1 | 1.5 GB | 2.5 GB | smallest, significant quality loss - **TBD**: Waiting for [this issue](https://github.com/ggerganov/llama.cpp/issues/5996) to be resolved |
| [Mistral-7B-Instruct-v0.3.IQ1_M.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ1_M.gguf) | IQ1_M | 1 | 1.6 GB | 2.6 GB | very small, significant quality loss |
| [Mistral-7B-Instruct-v0.3.IQ2_XXS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_XXS.gguf) | IQ2_XXS | 2 | 1.8 GB | 2.8 GB | very small, high quality loss |
| [Mistral-7B-Instruct-v0.3.IQ2_XS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_XS.gguf) | IQ2_XS | 2 | 1.9 GB | 2.9 GB | very small, high quality loss |
| [Mistral-7B-Instruct-v0.3.IQ2_S.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_S.gguf) | IQ2_S | 2 | 2.1 GB | 3.1 GB | small, substantial quality loss |
| [Mistral-7B-Instruct-v0.3.IQ2_M.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_M.gguf) | IQ2_M | 2 | 2.2 GB | 3.2 GB | small, greater quality loss |
| [Mistral-7B-Instruct-v0.3.IQ3_XXS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_XXS.gguf) | IQ3_XXS | 3 | 2.5 GB | 3.5 GB | very small, high quality loss |
| [Mistral-7B-Instruct-v0.3.IQ3_XS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_XS.gguf) | IQ3_XS | 3 | 2.7 GB | 3.7 GB | small, substantial quality loss |
| [Mistral-7B-Instruct-v0.3.IQ3_S.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_S.gguf) | IQ3_S | 3 | 2.8 GB | 3.8 GB | small, greater quality loss |
| [Mistral-7B-Instruct-v0.3.IQ3_M.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_M.gguf) | IQ3_M | 3 | 3.0 GB | 4.0 GB | medium, balanced quality - recommended |
| [Mistral-7B-Instruct-v0.3.IQ4_XS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ4_XS.gguf) | IQ4_XS | 4 | 3.4 GB | 4.4 GB | small, substantial quality loss |

Generated importance matrix file: [Mistral-7B-Instruct-v0.3.imatrix.dat](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.imatrix.dat)

**Note**: the above RAM figures assume no GPU offloading with 4K context. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.

## Example `llama.cpp` command

Make sure you are using `llama.cpp` from commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307) or later.

```shell
./main -ngl 33 -m Mistral-7B-Instruct-v0.3.IQ4_XS.gguf --color -c 32768 --temp 0 --repeat-penalty 1.1 -p "[AVAILABLE_TOOLS] {tools}[/AVAILABLE_TOOLS][INST] {prompt}[/INST]"
```

Change `-ngl 33` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

Change `-c 32768` to the desired sequence length.
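As a purely illustrative example, with the `{tools}` and `{prompt}` placeholders filled in using the weather tool from later in this card, the command might look like:

```shell
./main -ngl 33 -m Mistral-7B-Instruct-v0.3.IQ4_XS.gguf --color -c 32768 --temp 0 --repeat-penalty 1.1 \
  -p '[AVAILABLE_TOOLS] [{"name": "get_current_weather", "description": "Get the current weather in a given location", "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}}][/AVAILABLE_TOOLS][INST] What is the weather like in Oslo?[/INST]'
```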
If you want to have a chat-style conversation, replace the `-p` argument with `-i -ins`.

If you are low on VRAM/RAM, try quantizing the K-cache with `-ctk q8_0`, or even `-ctk q4_0` for big memory savings (depending on context size). There is a similar option for the V-cache (`-ctv`), however that is [not working yet](https://github.com/ggerganov/llama.cpp/issues/4425).

For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md).

## How to run from Python code

You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) module.

### How to load this model in Python code, using llama-cpp-python

For full documentation, please see: [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/).

#### First install the package

Run one of the following commands, according to your system:

```shell
# Prebuilt wheel with basic CPU support
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
# Prebuilt wheel with NVidia CUDA acceleration (cu121; use cu122 etc. to match your CUDA version)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# Prebuilt wheel with Metal GPU acceleration
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
# Build base version with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# Or with Vulkan acceleration
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
# Or with Kompute acceleration
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
# Or with SYCL acceleration
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python

# On Windows, to set the CMAKE_ARGS variable in PowerShell, follow this format; e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUDA=on"
pip install llama-cpp-python
```

#### Simple llama-cpp-python example code

```python
from llama_cpp import Llama

# Chat Completion API

llm = Llama(model_path="./Mistral-7B-Instruct-v0.3.IQ4_XS.gguf", n_gpu_layers=33, n_ctx=32768)
print(llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": "Pick a LeetCode challenge and solve it in Python."
        }
    ]
))
```

#### Simple llama-cpp-python example function calling code

```python
import json

from llama_cpp import Llama, LlamaGrammar

# Chat Completion API

# Constrain output to an array of OpenAI-style tool calls
grammar = LlamaGrammar.from_json_schema(json.dumps({
    "type": "array",
    "items": {
        "type": "object",
        "required": [ "name", "arguments" ],
        "properties": {
            "name": {
                "type": "string"
            },
            "arguments": {
                "type": "object"
            }
        }
    }
}))

llm = Llama(model_path="./Mistral-7B-Instruct-v0.3.IQ4_XS.gguf", n_gpu_layers=33, n_ctx=32768)
response = llm.create_chat_completion(
    temperature = 0.0,
    repeat_penalty = 1.1,
    messages = [
        {
            "role": "user",
            "content": "What's the weather like in Oslo and Stockholm?"
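            # The JSON-schema grammar defined above constrains generation,
            # so the model answers this with tool-call JSON objects
            # ({"name", "arguments"}) instead of free-form text.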
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": [ "celsius", "fahrenheit" ]
                    }
                },
                "required": [ "location" ]
            }
        }
    }],
    grammar = grammar
)
print(json.loads(response["choices"][0]["message"]["content"]))

print(llm.create_chat_completion(
    messages = [
        {
            "role": "user",
            "content": "What's the weather like in Oslo?"
        },
        {
            # The tool_calls below are from the response to the request above, with tool_choice active
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call__0_get_current_weather_cmpl-...",
                    "type": "function",
                    "function": {
                        "name": "get_current_weather",
                        "arguments": '{"location": "Oslo, NO", "unit": "celsius"}'
                    }
                }
            ]
        },
        {
            # The tool_call_id comes from tool_calls above; content is the result of the function call you made
            "role": "tool",
            "content": "20",
            "tool_call_id": "call__0_get_current_weather_cmpl-..."
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": [ "celsius", "fahrenheit" ]
                    }
                },
                "required": [ "location" ]
            }
        }
    }],
    #tool_choice={
    #    "type": "function",
    #    "function": {
    #        "name": "get_current_weather"
    #    }
    #}
))
```
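To complete the loop, you must execute the requested function yourself and feed the result back as the `"tool"` message shown above. A minimal sketch, assuming the assistant message carries OpenAI-style `tool_calls` (i.e. a response generated with `tool_choice` active, as in the example id above) and using a stubbed `get_current_weather` in place of a real weather API:

```python
import json

def get_current_weather(location: str, unit: str = "celsius") -> str:
    # Stub standing in for a real weather API call.
    return "20"

# Extract the first tool call from the assistant message, execute the local
# function, and build the "tool" message for the follow-up request.
tool_call = response["choices"][0]["message"]["tool_calls"][0]
arguments = json.loads(tool_call["function"]["arguments"])
tool_message = {
    "role": "tool",
    "content": get_current_weather(**arguments),
    "tool_call_id": tool_call["id"],
}
```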