Commit b58cc00 (parent: 801a3be), committed by TheBloke

Update README.md

Files changed (1): README.md (+4, −7)
README.md CHANGED
@@ -44,7 +44,6 @@ To build cmp-nct's fork of llama.cpp with Falcon 40B support plus preliminary CU
 ```
 git clone https://github.com/cmp-nct/ggllm.cpp
 cd ggllm.cpp
-git checkout cuda-integration
 rm -rf build && mkdir build && cd build && cmake -DGGML_CUBLAS=1 .. && cmake --build . --config Release
 ```
 
@@ -52,25 +51,23 @@ Compiling on Windows: developer cmp-nct notes: 'I personally compile it using VS
 
 Once compiled you can then use `bin/falcon_main` just like you would use llama.cpp. For example:
 ```
-bin/falcon_main -t 8 -ngl 100 -m /workspace/wizard-falcon40b.ggmlv3.q3_K_S.bin -p "What is a falcon?\n### Response:"
+bin/falcon_main -t 8 -ngl 100 -b 1 -m falcon40b-instruct.ggmlv3.q3_K_S.bin -p "What is a falcon?\n### Response:"
 ```
 
-Using `-ngl 100` will offload all layers to GPU. If you do not have enough VRAM for this, either lower the number or try a smaller quant size, as otherwise performance will be severely affected.
+You can specify `-ngl 100` regardless of your VRAM, as it will automatically detect how much VRAM is available and can be used.
 
 Adjust `-t 8` according to what performs best on your system. Do not exceed the number of physical CPU cores you have.
 
+`-b 1` reduces batch size to 1. This slightly lowers prompt evaluation speed, but frees up VRAM to load more of the model onto your GPU. If you find prompt evaluation too slow and have enough spare VRAM, you can remove this parameter.
+
 <!-- compatibility_ggml end -->
 
 ## Provided files
 | Name | Quant method | Bits | Size | Max RAM required | Use case |
 | ---- | ---- | ---- | ---- | ---- | ----- |
 | falcon40b-instruct.ggmlv3.q2_K.bin | q2_K | 2 | 13.74 GB | 16.24 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
-| falcon40b-instruct.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
-| falcon40b-instruct.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K |
 | falcon40b-instruct.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 17.98 GB | 20.48 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors |
-| falcon40b-instruct.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 23.54 GB | 26.04 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K |
 | falcon40b-instruct.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 23.54 GB | 26.04 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors |
-| falcon40b-instruct.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 28.77 GB | 31.27 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K |
 | falcon40b-instruct.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 28.77 GB | 31.27 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors |
 | falcon40b-instruct.ggmlv3.q6_K.bin | q6_K | 6 | 34.33 GB | 36.83 GB | New k-quant method. Uses GGML_TYPE_Q8_K - 6-bit quantization - for all tensors |
 | falcon40b-instruct.ggmlv3.q8_0.bin | q8_0 | 8 | 44.46 GB | 46.96 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
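As a practical aside (not part of the commit): a hedged sketch of applying the updated flag advice, assuming Linux with `lscpu`, an NVIDIA GPU with `nvidia-smi` on the PATH, and the model file in the current directory as in the README example above.

```
# Optional: show free VRAM per GPU in MiB, to judge whether -b 1 is needed.
nvidia-smi --query-gpu=memory.free --format=csv

# Count physical cores (unique core/socket pairs reported by lscpu) and pass
# the count to -t, per the advice not to exceed physical CPU cores.
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
bin/falcon_main -t "$PHYS_CORES" -ngl 100 -b 1 \
  -m falcon40b-instruct.ggmlv3.q3_K_S.bin \
  -p "What is a falcon?\n### Response:"
```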
 
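To sanity-check the "Max RAM required" column against your own machine before downloading, a minimal Linux-only sketch (the 20.48 GB figure is taken from the q3_K_S row; substitute the value for your chosen quant):

```
# MemAvailable in /proc/meminfo is reported in kB; convert to GB and compare
# against the "Max RAM required" column (20.48 GB for q3_K_S above).
required_gb=20.48
avail_gb=$(awk '/^MemAvailable:/ {printf "%.2f", $2 / 1048576}' /proc/meminfo)
if awk -v a="$avail_gb" -v r="$required_gb" 'BEGIN { exit !(a >= r) }'; then
  echo "q3_K_S should fit: ${avail_gb} GB available, ${required_gb} GB needed"
else
  echo "q3_K_S may not fit: ${avail_gb} GB available, ${required_gb} GB needed"
fi
```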