|
--- |
|
base_model: mistralai/Mistral-7B-Instruct-v0.3 |
|
language: |
|
- en |
|
pipeline_tag: text-generation |
|
license: apache-2.0 |
|
model_creator: Mistral AI |
|
model_name: Mistral-7B-Instruct-v0.3 |
|
model_type: mistral |
|
quantized_by: CISC |
|
--- |
|
|
|
# Mistral-7B-Instruct-v0.3 - SOTA GGUF |
|
- Model creator: [Mistral AI](https://huggingface.co/mistralai) |
|
- Original model: [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) |
|
|
|
<!-- description start --> |
|
## Description |
|
|
|
This repo contains State Of The Art quantized GGUF format model files for [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3). |
|
|
|
Quantization was done with an importance matrix trained on ~1M tokens (256 batches of 4096 tokens) of [groups_merged.txt](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384) and [wiki.train.raw](https://raw.githubusercontent.com/pytorch/examples/main/word_language_model/data/wikitext-2/train.txt) concatenated.
|
|
|
The embedded chat template has been extended to support function calling via the OpenAI-compatible `tools` parameter; see the [example](#simple-llama-cpp-python-example-function-calling-code) below.
|
|
|
<!-- description end --> |
|
|
|
|
|
<!-- prompt-template start --> |
|
## Prompt template: Mistral v3 |
|
|
|
``` |
|
[AVAILABLE_TOOLS] [{"name": "function_name", "description": "Description", "parameters": {...}}, ...][/AVAILABLE_TOOLS][INST] {prompt}[/INST] |
|
``` |
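
For example, with a single `get_current_weather` tool (the same hypothetical tool used in the Python examples below) and a plain user question, the rendered prompt looks like this:

```
[AVAILABLE_TOOLS] [{"name": "get_current_weather", "description": "Get the current weather in a given location", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}}, "required": ["location"]}}][/AVAILABLE_TOOLS][INST] What's the weather like in Oslo?[/INST]
```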
|
|
|
<!-- prompt-template end --> |
|
|
|
|
|
<!-- compatibility_gguf start --> |
|
## Compatibility |
|
|
|
These quantised GGUFv3 files are compatible with llama.cpp from February 27th 2024 onwards, as of commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307).
|
|
|
They are also compatible with many third-party UIs and libraries, provided they are built against a recent version of llama.cpp.
|
|
|
## Explanation of quantisation methods |
|
|
|
<details> |
|
<summary>Click to see details</summary> |
|
|
|
The new methods available are: |
|
|
|
* GGML_TYPE_IQ1_S - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.56 bits per weight (bpw) |
|
* GGML_TYPE_IQ1_M - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.75 bpw |
|
* GGML_TYPE_IQ2_XXS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.06 bpw |
|
* GGML_TYPE_IQ2_XS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.31 bpw |
|
* GGML_TYPE_IQ2_S - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.5 bpw |
|
* GGML_TYPE_IQ2_M - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.7 bpw |
|
* GGML_TYPE_IQ3_XXS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.06 bpw |
|
* GGML_TYPE_IQ3_XS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.3 bpw |
|
* GGML_TYPE_IQ3_S - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.44 bpw |
|
* GGML_TYPE_IQ3_M - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.66 bpw |
|
* GGML_TYPE_IQ4_XS - 4-bit quantization in super-blocks with an importance matrix applied, effectively using 4.25 bpw |
|
* GGML_TYPE_IQ4_NL - 4-bit non-linearly mapped quantization with an importance matrix applied, effectively using 4.5 bpw |
|
|
|
Refer to the Provided Files table below to see what files use which methods, and how. |
|
</details> |
|
<!-- compatibility_gguf end --> |
|
|
|
<!-- README_GGUF.md-provided-files start --> |
|
## Provided files |
|
|
|
| Name | Quant method | Bits | Size | Max RAM required | Use case | |
|
| ---- | ---- | ---- | ---- | ---- | ----- | |
|
| [Mistral-7B-Instruct-v0.3.IQ1_S.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ1_S.gguf) | IQ1_S | 1 | 1.5 GB| 2.5 GB | smallest, significant quality loss - **TBD**: Waiting for [this issue](https://github.com/ggerganov/llama.cpp/issues/5996) to be resolved | |
|
| [Mistral-7B-Instruct-v0.3.IQ1_M.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ1_M.gguf) | IQ1_M | 1 | 1.6 GB| 2.6 GB | very small, significant quality loss | |
|
| [Mistral-7B-Instruct-v0.3.IQ2_XXS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_XXS.gguf) | IQ2_XXS | 2 | 1.8 GB| 2.8 GB | very small, high quality loss | |
|
| [Mistral-7B-Instruct-v0.3.IQ2_XS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_XS.gguf) | IQ2_XS | 2 | 1.9 GB| 2.9 GB | very small, high quality loss | |
|
| [Mistral-7B-Instruct-v0.3.IQ2_S.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_S.gguf) | IQ2_S | 2 | 2.1 GB| 3.1 GB | small, substantial quality loss | |
|
| [Mistral-7B-Instruct-v0.3.IQ2_M.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ2_M.gguf) | IQ2_M | 2 | 2.2 GB| 3.2 GB | small, greater quality loss | |
|
| [Mistral-7B-Instruct-v0.3.IQ3_XXS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_XXS.gguf) | IQ3_XXS | 3 | 2.5 GB| 3.5 GB | very small, high quality loss | |
|
| [Mistral-7B-Instruct-v0.3.IQ3_XS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_XS.gguf) | IQ3_XS | 3 | 2.7 GB| 3.7 GB | small, substantial quality loss | |
|
| [Mistral-7B-Instruct-v0.3.IQ3_S.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_S.gguf) | IQ3_S | 3 | 2.8 GB| 3.8 GB | small, greater quality loss | |
|
| [Mistral-7B-Instruct-v0.3.IQ3_M.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ3_M.gguf) | IQ3_M | 3 | 3.0 GB| 4.0 GB | medium, balanced quality - recommended | |
|
| [Mistral-7B-Instruct-v0.3.IQ4_XS.gguf](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.IQ4_XS.gguf) | IQ4_XS | 4 | 3.4 GB| 4.4 GB | small, substantial quality loss | |
|
|
|
Generated importance matrix file: [Mistral-7B-Instruct-v0.3.imatrix.dat](https://huggingface.co/CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF/blob/main/Mistral-7B-Instruct-v0.3.imatrix.dat) |
|
|
|
**Note**: the above RAM figures assume no GPU offloading and a 4K context. Offloading layers to the GPU reduces RAM usage and uses VRAM instead.
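
If you would rather fetch a single quant programmatically than via the browser, here is a minimal sketch using the `huggingface_hub` package (an extra dependency, `pip install huggingface_hub`, not otherwise required by this card):

```python
from huggingface_hub import hf_hub_download

# Download just the IQ4_XS file from this repo and return its local path
model_path = hf_hub_download(
    repo_id="CISCai/Mistral-7B-Instruct-v0.3-SOTA-GGUF",
    filename="Mistral-7B-Instruct-v0.3.IQ4_XS.gguf",
)

print(model_path)
```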
|
|
|
<!-- README_GGUF.md-provided-files end --> |
|
|
|
<!-- README_GGUF.md-how-to-run start --> |
|
## Example `llama.cpp` command |
|
|
|
Make sure you are using `llama.cpp` from commit [0becb22](https://github.com/ggerganov/llama.cpp/commit/0becb22ac05b6542bd9d5f2235691aa1d3d4d307) or later. |
|
|
|
```shell |
|
./main -ngl 33 -m Mistral-7B-Instruct-v0.3.IQ4_XS.gguf --color -c 32768 --temp 0 --repeat-penalty 1.1 -p "[AVAILABLE_TOOLS] {tools}[/AVAILABLE_TOOLS][INST] {prompt}[/INST]" |
|
``` |
|
|
|
Change `-ngl 33` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration. |
|
|
|
Change `-c 32768` to the desired sequence length. |
|
|
|
If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins` (interactive instruct mode).
|
|
|
If you are low on VRAM or RAM, try quantizing the K-cache with `-ctk q8_0` or even `-ctk q4_0` for big memory savings (depending on context size).

There is a similar option for the V-cache (`-ctv`), however that is [not working yet](https://github.com/ggerganov/llama.cpp/issues/4425).
|
|
|
For other parameters and how to use them, please refer to [the llama.cpp documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md) |
|
|
|
## How to run from Python code |
|
|
|
You can use GGUF models from Python using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) module. |
|
|
|
### How to load this model in Python code, using llama-cpp-python |
|
|
|
For full documentation, please see: [llama-cpp-python docs](https://llama-cpp-python.readthedocs.io/en/latest/). |
|
|
|
#### First install the package |
|
|
|
Run one of the following commands, according to your system: |
|
|
|
```shell |
|
# Prebuilt wheel with basic CPU support |
|
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu |
|
# Prebuilt wheel with NVidia CUDA acceleration (replace cu121 with cu122 etc. to match your CUDA version)

pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
|
# Prebuilt wheel with Metal GPU acceleration |
|
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal |
|
# Build base version with no GPU acceleration |
|
pip install llama-cpp-python |
|
# With NVidia CUDA acceleration |
|
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python |
|
# Or with OpenBLAS acceleration |
|
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python |
|
# Or with CLBLast acceleration |
|
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python |
|
# Or with AMD ROCm GPU acceleration (Linux only) |
|
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python |
|
# Or with Metal GPU acceleration for macOS systems only |
|
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python |
|
# Or with Vulkan acceleration |
|
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python |
|
# Or with Kompute acceleration |
|
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python |
|
# Or with SYCL acceleration |
|
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python |
|
|
|
# On Windows, set the CMAKE_ARGS variable in PowerShell before installing; e.g. for NVidia CUDA:
|
$env:CMAKE_ARGS = "-DLLAMA_CUDA=on" |
|
pip install llama-cpp-python |
|
``` |
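
After installing, a quick sanity check (not from the upstream docs) confirms which version you ended up with:

```python
import llama_cpp

# Prints the installed llama-cpp-python version
print(llama_cpp.__version__)
```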
|
|
|
#### Simple llama-cpp-python example code |
|
|
|
```python |
|
from llama_cpp import Llama |
|
|
|
# Chat Completion API |
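# Load the IQ4_XS quant with all 33 layers offloaded to the GPU and the full 32K context window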
|
|
|
llm = Llama(model_path="./Mistral-7B-Instruct-v0.3.IQ4_XS.gguf", n_gpu_layers=33, n_ctx=32768) |
|
print(llm.create_chat_completion( |
|
messages = [ |
|
{ |
|
"role": "user", |
|
"content": "Pick a LeetCode challenge and solve it in Python." |
|
} |
|
] |
|
)) |
|
``` |
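
`create_chat_completion` returns an OpenAI-style response dict; if you only want the generated text rather than the full structure, a small follow-up (not in the original example) looks like this:

```python
response = llm.create_chat_completion(
    messages = [
        { "role": "user", "content": "Pick a LeetCode challenge and solve it in Python." }
    ]
)

# The assistant's reply is nested under the first choice's "message"
print(response["choices"][0]["message"]["content"])
```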
|
|
|
#### Simple llama-cpp-python example function calling code |
|
|
|
```python |
|
import json

from llama_cpp import Llama, LlamaGrammar
|
|
|
# Chat Completion API |
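# Constrain the first response to a JSON array of {name, arguments} objects so the tool call can be parsed reliably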
|
|
|
grammar = LlamaGrammar.from_json_schema(json.dumps({ |
|
"type": "array", |
|
"items": { |
|
"type": "object", |
|
"required": [ "name", "arguments" ], |
|
"properties": { |
|
"name": { |
|
"type": "string" |
|
}, |
|
"arguments": { |
|
"type": "object" |
|
} |
|
} |
|
} |
|
})) |
|
|
|
llm = Llama(model_path="./Mistral-7B-Instruct-v0.3.IQ4_XS.gguf", n_gpu_layers=33, n_ctx=32768) |
|
response = llm.create_chat_completion( |
|
temperature = 0.0, |
|
repeat_penalty = 1.1, |
|
messages = [ |
|
{ |
|
"role": "user", |
|
"content": "What's the weather like in Oslo and Stockholm?" |
|
} |
|
], |
|
tools=[{ |
|
"type": "function", |
|
"function": { |
|
"name": "get_current_weather", |
|
"description": "Get the current weather in a given location", |
|
"parameters": { |
|
"type": "object", |
|
"properties": { |
|
"location": { |
|
"type": "string", |
|
"description": "The city and state, e.g. San Francisco, CA" |
|
}, |
|
"unit": { |
|
"type": "string", |
|
"enum": [ "celsius", "fahrenheit" ] |
|
} |
|
}, |
|
"required": [ "location" ] |
|
} |
|
} |
|
}], |
|
grammar = grammar |
|
) |
|
print(json.loads(response["choices"][0]["message"]["content"]))
|
|
|
print(llm.create_chat_completion( |
|
temperature = 0.0, |
|
repeat_penalty = 1.1, |
|
messages = [ |
|
{ |
|
"role": "user", |
|
"content": "What's the weather like in Oslo?" |
|
}, |
|
{ # The tool_calls below is taken from the response to the request above, made with tool_choice active
|
"role": "assistant", |
|
"content": None, |
|
"tool_calls": [ |
|
{ |
|
"id": "call__0_get_current_weather_cmpl-...", |
|
"type": "function", |
|
"function": { |
|
"name": "get_current_weather", |
|
"arguments": '{ "location": "Oslo, NO" ,"unit": "celsius"} ' |
|
} |
|
} |
|
] |
|
}, |
|
{ # The tool_call_id matches the id in tool_calls above; content is the result of the function call you made
|
"role": "tool", |
|
"content": "20", |
|
"tool_call_id": "call__0_get_current_weather_cmpl-..." |
|
} |
|
], |
|
tools=[{ |
|
"type": "function", |
|
"function": { |
|
"name": "get_current_weather", |
|
"description": "Get the current weather in a given location", |
|
"parameters": { |
|
"type": "object", |
|
"properties": { |
|
"location": { |
|
"type": "string", |
|
"description": "The city and state, e.g. San Francisco, CA" |
|
}, |
|
"unit": { |
|
"type": "string", |
|
"enum": [ "celsius", "fahrenheit" ] |
|
} |
|
}, |
|
"required": [ "location" ] |
|
} |
|
} |
|
}], |
|
#tool_choice={ |
|
# "type": "function", |
|
# "function": { |
|
# "name": "get_current_weather" |
|
# } |
|
#} |
|
)) |
|
``` |
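
When a request is made with `tool_choice` active (see the commented-out block above), the assistant message in the response carries `tool_calls` in the shape shown above. Below is a minimal sketch of dispatching such a call to your own implementation and preparing the follow-up `tool` message; `get_current_weather` here is a hypothetical stand-in you would implement yourself:

```python
import json

# Hypothetical local implementation of the advertised tool
def get_current_weather(location: str, unit: str = "celsius") -> str:
    return "20"  # stub; call a real weather service here

# `response` is assumed to come from a create_chat_completion call made with tool_choice set,
# so its assistant message contains tool_calls
tool_call = response["choices"][0]["message"]["tool_calls"][0]
arguments = json.loads(tool_call["function"]["arguments"])

# Run the tool and package the result as a "tool" message, exactly as in the example above
tool_result_message = {
    "role": "tool",
    "content": get_current_weather(**arguments),
    "tool_call_id": tool_call["id"],
}
```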
|
|
|
<!-- README_GGUF.md-how-to-run end --> |
|
|