Commit b58cc00 (parent: 801a3be), committed by TheBloke

Update README.md

Files changed (1): README.md (+4, −7)
README.md CHANGED
@@ -44,7 +44,6 @@ To build cmp-nct's fork of llama.cpp with Falcon 40B support plus preliminary CU
 ```
 git clone https://github.com/cmp-nct/ggllm.cpp
 cd ggllm.cpp
-git checkout cuda-integration
 rm -rf build && mkdir build && cd build && cmake -DGGML_CUBLAS=1 .. && cmake --build . --config Release
 ```
 
@@ -52,25 +51,23 @@ Compiling on Windows: developer cmp-nct notes: 'I personally compile it using VS
 
 Once compiled you can then use `bin/falcon_main` just like you would use llama.cpp. For example:
 ```
-bin/falcon_main -t 8 -ngl 100 -m /workspace/wizard-falcon40b.ggmlv3.q3_K_S.bin -p "What is a falcon?\n### Response:"
+bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon40b-instruct.ggmlv3.q3_K_S.bin -p "What is a falcon?\n### Response:"
 ```
 
-Using `-ngl 100` will offload all layers to GPU. If you do not have enough VRAM for this, either lower the number or try a smaller quant size, as otherwise performance will be severely affected.
+You can specify `-ngl 100` regardless of your VRAM, as it will automatically detect how much VRAM is available and can be used.
 
 Adjust `-t 8` according to what performs best on your system. Do not exceed the number of physical CPU cores you have.
 
+`-b 1` reduces batch size to 1. This slightly lowers prompt evaluation speed, but frees up VRAM to load more of the model onto your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
+
 <!-- compatibility_ggml end -->
 
 ## Provided files
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
 | falcon40b-instruct.ggmlv3.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
-| falcon40b-instruct.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| falcon40b-instruct.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
 | falcon40b-instruct.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
-| falcon40b-instruct.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 23.54 GB | 26.04 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
 | falcon40b-instruct.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
-| falcon40b-instruct.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 28.77 GB | 31.27 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
 | falcon40b-instruct.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
 | falcon40b-instruct.ggmlv3.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
 | falcon40b-instruct.ggmlv3.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
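As a practical aside (not part of the commit): a hedged sketch of applying the updated flag advice, assuming Linux with `lscpu`, an NVIDIA GPU with `nvidia-smi` on the PATH, and the model file in the current directory as in the README example above.

```
# Optional: show free VRAM per GPU in MiB, to judge whether -b 1 is needed.
nvidia-smi --query-gpu=memory.free --format=csv

# Count physical cores (unique core/socket pairs reported by lscpu) and pass
# the count to -t, per the advice not to exceed physical CPU cores.
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
bin/falcon_main -t "$PHYS_CORES" -ngl 100 -b 1 \
  -m falcon40b-instruct.ggmlv3.q3_K_S.bin \
  -p "What is a falcon?\n### Response:"
```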
 
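To sanity-check the "Max RAM required" column against your own machine before downloading, a minimal Linux-only sketch (the 20.48 GB figure is taken from the q3_K_S row; substitute the value for your chosen quant):

```
# MemAvailable in /proc/meminfo is reported in kB; convert to GB and compare
# against the "Max RAM required" column (20.48 GB for q3_K_S above).
required_gb=20.48
avail_gb=$(awk '/^MemAvailable:/ {printf "%.2f", $2 / 1048576}' /proc/meminfo)
if awk -v a="$avail_gb" -v r="$required_gb" 'BEGIN { exit !(a >= r) }'; then
  echo "q3_K_S should fit: ${avail_gb} GB available, ${required_gb} GB needed"
else
  echo "q3_K_S may not fit: ${avail_gb} GB available, ${required_gb} GB needed"
fi
```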