---
license: other
inference: false
---
# Quantised GGMLs of alpaca-lora-65B
Quantised 4bit and 5bit GGMLs of [chansung's alpaca-lora-65B](https://huggingface.co/chansung/alpaca-lora-65b) for CPU inference with [llama.cpp](https://github.com/ggerganov/llama.cpp).
I also have 4bit GPTQ files for GPU inference available here: [TheBloke/alpaca-lora-65B-GPTQ-4bit](https://huggingface.co/TheBloke/alpaca-lora-65B-GPTQ-4bit).
## REQUIRES LATEST LLAMA.CPP (May 12th 2023 - commit b9fd7ee)!
llama.cpp recently made a breaking change to its quantisation methods.
I have re-quantised the GGML files in this repo. Therefore you will require llama.cpp compiled on May 12th or later (commit `b9fd7ee` or later) to use them.
The previous files, which will still work in older versions of llama.cpp, can be found in branch `previous_llama`.
## Provided files
| Name | Quant method | Bits | Size | RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| `alpaca-lora-65B.ggml.q4_0.bin` | q4_0 | 4bit | 40.8GB | 43GB | 4bit. |
| `alpaca-lora-65B.ggml.q5_0.bin` | q5_0 | 5bit | 44.9GB | 47GB | 5bit. Higher quality than 4bit, at the cost of slightly higher resource usage. |
| `alpaca-lora-65B.ggml.q5_1.bin` | q5_1 | 5bit | 49GB | 51GB | 5bit. Slightly higher resource usage and quality than q5_0. |
* The q4_0 file provides lower quality, but maximal compatibility. It will work with past and future versions of llama.cpp.
* The q5_0 file uses the brand new 5bit method released on 26th April. This is the 5bit equivalent of q4_0.
* The q5_1 file uses the brand new 5bit method released on 26th April. This is the 5bit equivalent of q4_1.
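As a rough aid for choosing between the files, here is a minimal Python sketch that picks the highest-quality file whose "RAM required" figure fits a given budget. The file names and RAM figures are copied from the table above; the `pick_file` helper and the idea of passing in a RAM budget by hand are illustrative assumptions, not part of llama.cpp.

```python
# File names and "RAM required" figures (GB) copied from the table above,
# ordered from lowest to highest quality.
PROVIDED_FILES = [
    ("alpaca-lora-65B.ggml.q4_0.bin", 43),
    ("alpaca-lora-65B.ggml.q5_0.bin", 47),
    ("alpaca-lora-65B.ggml.q5_1.bin", 51),
]


def pick_file(available_ram_gb):
    """Return the highest-quality file that fits in the given RAM, or None.

    Hypothetical helper: it only compares against the table's figures and
    does not account for memory used by other processes.
    """
    best = None
    for name, ram_gb in PROVIDED_FILES:
        if ram_gb <= available_ram_gb:
            best = name  # later entries are higher quality
    return best


print(pick_file(48))
```

With a 48GB budget this selects the q5_0 file; with less than 43GB it returns `None`, meaning none of these files is likely to fit.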
## How to run in `llama.cpp`
I use the following command line; adjust for your tastes and needs:
```
./main -t 18 -m alpaca-lora-65B.ggml.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a story about llamas
### Response:"
```
Change `-t 18` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
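If you are unsure of your physical core count, the following Python sketch estimates it from the logical CPU count. It assumes two hardware threads per core, which matches the 8 cores/16 threads example but is not true of every CPU; `guess_physical_cores` is a hypothetical helper, so check your CPU's specifications if in doubt.

```python
import os


def guess_physical_cores(logical_cpus, threads_per_core=2):
    """Estimate physical cores from the logical CPU count.

    Assumption: every core exposes `threads_per_core` hardware threads.
    Pass threads_per_core=1 for CPUs without SMT/Hyper-Threading.
    """
    return max(1, logical_cpus // threads_per_core)


# os.cpu_count() reports logical CPUs and may return None on some platforms.
logical = os.cpu_count() or 1
print(f"-t {guess_physical_cores(logical)}")
```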
If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
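For scripting, here is a minimal Python sketch that assembles the same argument list, switching between the one-shot prompt and chat mode. It uses only the flags shown above; the `build_command` helper and `PROMPT_TEMPLATE` name are hypothetical, and the prompt layout follows the command as written in this README.

```python
import shlex

# Prompt layout taken from the example command above.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n"
    "{instruction}\n"
    "### Response:"
)


def build_command(model, threads, instruction=None):
    """Build the ./main argument list; chat mode when no instruction is given."""
    args = [
        "./main", "-t", str(threads), "-m", model, "--color",
        "-c", "2048", "--temp", "0.7", "--repeat_penalty", "1.1", "-n", "-1",
    ]
    if instruction is None:
        # Chat-style conversation: -i -ins replaces -p <PROMPT>.
        args += ["-i", "-ins"]
    else:
        args += ["-p", PROMPT_TEMPLATE.format(instruction=instruction)]
    return args


cmd = build_command("alpaca-lora-65B.ggml.q4_0.bin", 8, "Write a story about llamas")
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
```

The returned list can also be passed directly to `subprocess.run`, which avoids shell quoting issues with the multi-line prompt.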
## How to run in `text-generation-webui`
Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).
Note: at this time text-generation-webui does not support the new q5 quantisation methods.
**Thireus** has written a [great guide on how to update it to the latest llama.cpp code](https://huggingface.co/TheBloke/wizardLM-7B-GGML/discussions/5) so that these files can be used in the UI.
# Original model card not provided
No model card was provided in [chansung's original repository](https://huggingface.co/chansung/alpaca-lora-65b).
Based on the name, I assume this is the result of fine-tuning using the original GPT 3.5 Alpaca dataset. It is unknown whether the original Stanford data was used, or the [cleaned tloen/alpaca-lora variant](https://github.com/tloen/alpaca-lora).