TheBloke committed
Commit 22662e2
1 Parent(s): cea98af

Initial GGUF model commit

Files changed (1)
  1. README.md +12 -43
README.md CHANGED
@@ -47,11 +47,14 @@ GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is

The key benefit of GGUF is that it is an extensible, future-proof format which stores more information about the model as metadata. It also includes significantly improved tokenization code, including for the first time full support for special tokens. This should improve performance, especially with models that use new special tokens and implement custom prompt templates.

- As of August 23rd 2023, only llama.cpp supports GGUF. However, third-party clients and libraries are expected to add support very soon.

Here is a list of clients and libraries, along with their expected timeline for GGUF support. Where possible a link to the relevant issue or PR is provided:
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), awaiting llama-cpp-python support.
- * [KoboldCpp](https://github.com/LostRuins/koboldcpp), [in active development](https://github.com/LostRuins/koboldcpp/issues/387). Test builds are working, but GPU acceleration remains to be tested.
* [LM Studio](https://lmstudio.ai/), in active development - hoped to be ready by August 25th-26th.
* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), will work as soon as ctransformers or llama-cpp-python is updated.
* [ctransformers](https://github.com/marella/ctransformers), [development will start soon](https://github.com/marella/ctransformers/issues/102).
@@ -83,7 +86,9 @@ Here is a list of clients and libraries, along with their expected timeline for

These quantised GGUF files are compatible with llama.cpp from August 21st 2023 onwards, as of commit [6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9](https://github.com/ggerganov/llama.cpp/commit/6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9).

- As of August 23rd 2023 they are not yet compatible with any third-party UIs, libraries or utilities, but this is expected to change very soon.

## Explanation of quantisation methods
<details>
@@ -95,7 +100,6 @@ The new methods available are:
* GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
* GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
* GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
- * GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference from the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.

Refer to the Provided Files table below to see what files use which methods, and how.
</details>
@@ -106,55 +110,20 @@ Refer to the Provided Files table below to see what files use which methods, and

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| [nous-puffin-70b.Q2_K.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q2_K.gguf) | Q2_K | 2 | 29.11 GB | 31.61 GB | smallest, significant quality loss - not recommended for most purposes |
| [nous-puffin-70b.Q3_K_S.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q3_K_S.gguf) | Q3_K_S | 3 | 29.75 GB | 32.25 GB | very small, high quality loss |
| [nous-puffin-70b.Q3_K_M.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q3_K_M.gguf) | Q3_K_M | 3 | 33.10 GB | 35.60 GB | very small, high quality loss |
| [nous-puffin-70b.Q3_K_L.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q3_K_L.gguf) | Q3_K_L | 3 | 36.15 GB | 38.65 GB | small, substantial quality loss |
| [nous-puffin-70b.Q4_K_S.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q4_K_S.gguf) | Q4_K_S | 4 | 38.99 GB | 41.49 GB | small, greater quality loss |
| [nous-puffin-70b.Q4_K_M.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q4_K_M.gguf) | Q4_K_M | 4 | 41.38 GB | 43.88 GB | medium, balanced quality - recommended |
| [nous-puffin-70b.Q5_K_S.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q5_K_S.gguf) | Q5_K_S | 5 | 47.46 GB | 49.96 GB | large, low quality loss - recommended |
| [nous-puffin-70b.Q5_K_M.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q5_K_M.gguf) | Q5_K_M | 5 | 48.75 GB | 51.25 GB | large, very low quality loss - recommended |
- | nous-puffin-70b.Q6_K.bin | q6_K | 6 | 56.82 GB | 59.32 GB | very large, extremely low quality loss |
- | nous-puffin-70b.Q8_0.bin | q8_0 | 8 | 73.29 GB | 75.79 GB | very large, extremely low quality loss - not recommended |

**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
-
- ### Q6_K and Q8_0 files are split and require joining
-
- **Note:** HF does not support uploading files larger than 50GB. Therefore I have uploaded the Q6_K and Q8_0 files as split files.
-
- <details>
- <summary>Click for instructions regarding Q6_K and Q8_0 files</summary>
-
- ### q6_K
- Please download:
- * `nous-puffin-70b.Q6_K.gguf-split-a`
- * `nous-puffin-70b.Q6_K.gguf-split-b`
-
- ### q8_0
- Please download:
- * `nous-puffin-70b.Q8_0.gguf-split-a`
- * `nous-puffin-70b.Q8_0.gguf-split-b`
-
- To join the files, do the following:
-
- Linux and macOS:
- ```
- cat nous-puffin-70b.Q6_K.gguf-split-* > nous-puffin-70b.Q6_K.gguf && rm nous-puffin-70b.Q6_K.gguf-split-*
- cat nous-puffin-70b.Q8_0.gguf-split-* > nous-puffin-70b.Q8_0.gguf && rm nous-puffin-70b.Q8_0.gguf-split-*
- ```
- Windows command line:
- ```
- COPY /B nous-puffin-70b.Q6_K.gguf-split-a + nous-puffin-70b.Q6_K.gguf-split-b nous-puffin-70b.Q6_K.gguf
- del nous-puffin-70b.Q6_K.gguf-split-a nous-puffin-70b.Q6_K.gguf-split-b
-
- COPY /B nous-puffin-70b.Q8_0.gguf-split-a + nous-puffin-70b.Q8_0.gguf-split-b nous-puffin-70b.Q8_0.gguf
- del nous-puffin-70b.Q8_0.gguf-split-a nous-puffin-70b.Q8_0.gguf-split-b
- ```
-
- </details>
-
-

<!-- README_GGUF.md-provided-files end -->

<!-- README_GGUF.md-how-to-run start -->
 
The key benefit of GGUF is that it is an extensible, future-proof format which stores more information about the model as metadata. It also includes significantly improved tokenization code, including for the first time full support for special tokens. This should improve performance, especially with models that use new special tokens and implement custom prompt templates.

+ As of August 24th 2023, llama.cpp and KoboldCpp support GGUF. Other third-party clients and libraries are expected to add support very soon.
+
+ Here is a list of clients and libraries that are known to support GGUF:
+ * [llama.cpp](https://github.com/ggerganov/llama.cpp)
+ * [KoboldCpp](https://github.com/LostRuins/koboldcpp), now supports GGUF as of release 1.41!

Here is a list of clients and libraries, along with their expected timeline for GGUF support. Where possible a link to the relevant issue or PR is provided:
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui), awaiting llama-cpp-python support.
* [LM Studio](https://lmstudio.ai/), in active development - hoped to be ready by August 25th-26th.
* [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui), will work as soon as ctransformers or llama-cpp-python is updated.
* [ctransformers](https://github.com/marella/ctransformers), [development will start soon](https://github.com/marella/ctransformers/issues/102).
 
These quantised GGUF files are compatible with llama.cpp from August 21st 2023 onwards, as of commit [6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9](https://github.com/ggerganov/llama.cpp/commit/6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9).

+ As of August 24th 2023 they are now compatible with KoboldCpp, release 1.41 and later.
+
+ They are not yet compatible with any other third-party UIs, libraries or utilities, but this is expected to change very soon.
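
For reference, a minimal sketch of building a compatible llama.cpp from that commit onwards and loading one of these files; the model path, context size and generation settings below are placeholders, and `-ngl` only applies if llama.cpp was built with GPU support:

```
# Build llama.cpp at or after the commit referenced above
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 6381d4e110bd0ec02843a60bbeb8b6fc37a9ace9   # or any later commit
make                                                    # e.g. `make LLAMA_CUBLAS=1` for an NVIDIA GPU build

# Run a quantised file; -ngl offloads layers to the GPU and can be omitted for CPU-only use
./main -m /path/to/nous-puffin-70b.Q4_K_M.gguf -c 4096 -n 512 -ngl 40 -p "Your prompt here"
```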

## Explanation of quantisation methods
<details>
 
* GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
* GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
* GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.

Refer to the Provided Files table below to see what files use which methods, and how.
</details>
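
As a quick sanity check on the bits-per-weight figures quoted above, the arithmetic below reproduces them from the block structures described. The assumption that each super-block additionally stores fp16 scale fields (a scale and a min for Q4_K/Q5_K, a single scale for Q6_K) is mine; the authoritative layout is llama.cpp's k-quants source.

```
# Hypothetical sanity check: recompute bits-per-weight (bpw) for a 256-weight
# super-block, assuming the fp16 super-block scale fields described above.
awk 'BEGIN {
  q4_k = (256*4 + 8*(6+6) + 2*16) / 256;  # 4-bit weights + 6-bit scales/mins + fp16 scale+min -> 4.5
  q5_k = (256*5 + 8*(6+6) + 2*16) / 256;  # 5-bit weights, same super-block structure -> 5.5
  q6_k = (256*6 + 16*8 + 16) / 256;       # 6-bit weights + 8-bit scales + fp16 scale -> 6.5625
  printf "Q4_K=%.4f  Q5_K=%.4f  Q6_K=%.4f bpw\n", q4_k, q5_k, q6_k
}'
```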
 
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
+ | [nous-puffin-70b.Q6_K.gguf-split-b](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q6_K.gguf-split-b) | Q6_K | 6 | 19.89 GB | 22.39 GB | very large, extremely low quality loss |
| [nous-puffin-70b.Q2_K.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q2_K.gguf) | Q2_K | 2 | 29.11 GB | 31.61 GB | smallest, significant quality loss - not recommended for most purposes |
| [nous-puffin-70b.Q3_K_S.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q3_K_S.gguf) | Q3_K_S | 3 | 29.75 GB | 32.25 GB | very small, high quality loss |
| [nous-puffin-70b.Q3_K_M.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q3_K_M.gguf) | Q3_K_M | 3 | 33.10 GB | 35.60 GB | very small, high quality loss |
| [nous-puffin-70b.Q3_K_L.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q3_K_L.gguf) | Q3_K_L | 3 | 36.15 GB | 38.65 GB | small, substantial quality loss |
+ | [nous-puffin-70b.Q8_0.gguf-split-b](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q8_0.gguf-split-b) | Q8_0 | 8 | 36.53 GB | 39.03 GB | very large, extremely low quality loss - not recommended |
+ | [nous-puffin-70b.Q6_K.gguf-split-a](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q6_K.gguf-split-a) | Q6_K | 6 | 36.70 GB | 39.20 GB | very large, extremely low quality loss |
+ | [nous-puffin-70b.Q8_0.gguf-split-a](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q8_0.gguf-split-a) | Q8_0 | 8 | 36.70 GB | 39.20 GB | very large, extremely low quality loss - not recommended |
| [nous-puffin-70b.Q4_K_S.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q4_K_S.gguf) | Q4_K_S | 4 | 38.99 GB | 41.49 GB | small, greater quality loss |
| [nous-puffin-70b.Q4_K_M.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q4_K_M.gguf) | Q4_K_M | 4 | 41.38 GB | 43.88 GB | medium, balanced quality - recommended |
| [nous-puffin-70b.Q5_K_S.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q5_K_S.gguf) | Q5_K_S | 5 | 47.46 GB | 49.96 GB | large, low quality loss - recommended |
| [nous-puffin-70b.Q5_K_M.gguf](https://huggingface.co/TheBloke/Nous-Puffin-70B-GGUF/blob/main/nous-puffin-70b.Q5_K_M.gguf) | Q5_K_M | 5 | 48.75 GB | 51.25 GB | large, very low quality loss - recommended |

**Note**: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
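
The Q6_K and Q8_0 entries above are split into `-split-a` and `-split-b` parts because single files larger than 50GB cannot be uploaded. After downloading both parts of a quant, they can be rejoined with the same commands used in the earlier revision of this README (Linux/macOS shown; on Windows, `COPY /B` over the two parts does the same):

```
cat nous-puffin-70b.Q6_K.gguf-split-* > nous-puffin-70b.Q6_K.gguf && rm nous-puffin-70b.Q6_K.gguf-split-*
cat nous-puffin-70b.Q8_0.gguf-split-* > nous-puffin-70b.Q8_0.gguf && rm nous-puffin-70b.Q8_0.gguf-split-*
```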
<!-- README_GGUF.md-provided-files end -->

<!-- README_GGUF.md-how-to-run start -->