Update README.md
README.md CHANGED
@@ -44,7 +44,6 @@ To build cmp-nct's fork of llama.cpp with Falcon 40B support plus preliminary CUDA acceleration
 ```
 git clone https://github.com/cmp-nct/ggllm.cpp
 cd ggllm.cpp
-git checkout cuda-integration
 rm -rf build && mkdir build && cd build && cmake -DGGML_CUBLAS=1 .. && cmake --build . --config Release
 ```

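If CMake does not pick up CUDA, it is worth confirming the toolkit and driver are visible before configuring. A minimal pre-flight check, assuming a Linux machine with the NVIDIA CUDA toolkit installed:

```
# confirm the CUDA compiler and the driver are visible before running cmake
nvcc --version
nvidia-smi
```

If either command is not found, the `cmake -DGGML_CUBLAS=1` configure step is unlikely to pick up GPU support.
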
@@ -52,25 +51,23 @@ Compiling on Windows: developer cmp-nct notes: 'I personally compile it using VS

 Once compiled you can then use `bin/falcon_main` just like you would use llama.cpp. For example:
 ```
-bin/falcon_main -t 8 -ngl 100 -m
+bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon40b-instruct.ggmlv3.q3_K_S.bin -p "What is a falcon?\n### Response:"
 ```

-
+You can specify `-ngl 100` regardless of your VRAM, as it will automatically detect how much of the available VRAM can be used.

 Adjust `-t 8` according to what performs best on your system. Do not exceed the number of physical CPU cores you have.

+`-b 1` reduces the batch size to 1. This slightly slows prompt evaluation, but frees up VRAM to load more of the model onto your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
+
 <!-- compatibility_ggml end -->

 ## Provided files
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
 | falcon40b-instruct.ggmlv3.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
-| falcon40b-instruct.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| falcon40b-instruct.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
 | falcon40b-instruct.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
-| falcon40b-instruct.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 23.54 GB | 26.04 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
 | falcon40b-instruct.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
-| falcon40b-instruct.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 28.77 GB | 31.27 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
 | falcon40b-instruct.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
 | falcon40b-instruct.ggmlv3.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
 | falcon40b-instruct.ggmlv3.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
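
As a quick way to follow the physical-core advice for `-t` above: hyperthreads do not count, so on a Linux host one option is to read the core topology from `lscpu` and multiply the two figures:

```
# physical cores = "Core(s) per socket" x "Socket(s)"; ignore hyperthreads
lscpu | grep -E '^(Core\(s\) per socket|Socket\(s\))'
```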
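
One pattern worth noting in the table above: every "Max RAM required" figure is the file size plus roughly 2.5 GB of overhead, so you can sanity-check a quant against your free memory before downloading it. On Linux, for example:

```
# e.g. q3_K_S wants ~20.48 GB free (17.98 GB file + ~2.5 GB overhead)
free -h
```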