---
license: other
inference: false
---

# Quantised GGMLs of alpaca-lora-65B

Quantised 4bit and 5bit GGMLs of [chansung's alpaca-lora-65B](https://huggingface.co/chansung/alpaca-lora-65b) for CPU inference with [llama.cpp](https://github.com/ggerganov/llama.cpp).

I also have 4bit GPTQ files for GPU inference available here: [TheBloke/alpaca-lora-65B-GPTQ-4bit](https://huggingface.co/TheBloke/alpaca-lora-65B-GPTQ-4bit).

## REQUIRES LATEST LLAMA.CPP (May 12th 2023 - commit b9fd7ee)!

llama.cpp recently made a breaking change to its quantisation methods.

I have re-quantised the GGML files in this repo to match. You will therefore need llama.cpp compiled on May 12th 2023 or later (commit `b9fd7ee` or later) to use them.

The previous files, which will still work in older versions of llama.cpp, can be found in branch `previous_llama`.

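If you built llama.cpp from a git clone, one way to check whether your checkout already includes commit `b9fd7ee` is sketched below. This is only an illustration: the helper name `has_commit` and the `./llama.cpp` path are my own assumptions, and it relies on `git merge-base --is-ancestor`, which exits with status 0 when the first commit is an ancestor of the second.

```shell
# Succeeds if commit $2 is an ancestor of (or equal to) HEAD
# in the git checkout located at directory $1.
# has_commit is a hypothetical helper name, not part of llama.cpp.
has_commit() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD
}

# Example (assumes your clone lives in ./llama.cpp):
# if has_commit ./llama.cpp b9fd7ee; then
#     echo "checkout is new enough for the re-quantised files"
# else
#     echo "update llama.cpp and rebuild, or use the previous_llama branch"
# fi
```

If the check fails, either the checkout predates the commit or the commit has not been fetched yet; in both cases updating and rebuilding is the safe choice.
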
## Provided files

| Name | Quant method | Bits | Size | RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| `alpaca-lora-65B.ggml.q4_0.bin` | q4_0 | 4bit | 40.8GB | 43GB | 4bit. |
| `alpaca-lora-65B.ggml.q5_0.bin` | q5_0 | 5bit | 44.9GB | 47GB | 5bit. Higher quality than 4bit, at the cost of slightly higher resource usage. |
| `alpaca-lora-65B.ggml.q5_1.bin` | q5_1 | 5bit | 49GB | 51GB | 5bit. Slightly higher resource usage and quality than q5_0. |

* The q4_0 file provides lower quality, but maximal compatibility. It will work with past and future versions of llama.cpp.
* The q5_0 file uses the new 5bit method released on 26th April. This is the 5bit equivalent of q4_0.
* The q5_1 file uses the new 5bit method released on 26th April. This is the 5bit equivalent of q4_1.

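To pick a file based on how much RAM a machine has, the "RAM required" column can be turned into a small lookup. The following is a minimal sketch: the function names `ram_required_gb` and `best_quant_for_ram` are hypothetical, the figures are copied from the table above, and the quality order q4_0 < q5_0 < q5_1 follows the "Use case" column.

```shell
# RAM needed per quant method in GB, copied from the "RAM required" column.
ram_required_gb() {
    case "$1" in
        q4_0) echo 43 ;;
        q5_0) echo 47 ;;
        q5_1) echo 51 ;;
        *) echo "unknown quant method: $1" >&2; return 1 ;;
    esac
}

# Print the highest-quality quant method whose RAM requirement fits in $1 GB.
# Returns non-zero and prints nothing if none of the files fit.
best_quant_for_ram() {
    best=""
    for q in q4_0 q5_0 q5_1; do
        if [ "$1" -ge "$(ram_required_gb "$q")" ]; then
            best=$q
        fi
    done
    [ -n "$best" ] && echo "$best"
}

# Linux-only assumption: total RAM is read from /proc/meminfo (reported in kB).
# total_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
# best_quant_for_ram "$total_gb"
```

Note that this compares against total RAM and treats GB loosely, so leave some headroom for the operating system and other processes.
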
## How to run in `llama.cpp`

I use the following command line; adjust for your tastes and needs:

```
./main -t 18 -m alpaca-lora-65B.ggml.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Write a story about llamas
### Response:"
```
Change `-t 18` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.

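If you are unsure how many physical cores a machine has, the count can be looked up from the shell. This is a rough sketch rather than a definitive method: `physical_cores` is a hypothetical helper name, it assumes `lscpu` from util-linux is available on Linux, and it falls back to the logical CPU count (which may overcount on hyper-threaded systems) when it is not.

```shell
# Print the number of physical CPU cores, falling back to the logical CPU count.
physical_cores() {
    n=0
    if command -v lscpu >/dev/null 2>&1; then
        # Each unique Core,Socket pair in lscpu's parseable output is one physical core.
        n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ') || n=0
    fi
    if [ "${n:-0}" -ge 1 ]; then
        echo "$n"
    else
        # Fallback: logical CPUs currently online.
        getconf _NPROCESSORS_ONLN
    fi
}
```

The result can then be passed straight to the thread flag, for example `./main -t "$(physical_cores)" -m alpaca-lora-65B.ggml.q4_0.bin ...`.
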
If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.

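To avoid retyping the prompt template for every instruction, it can be wrapped in a small shell function. This is a sketch only: `alpaca_prompt` is a hypothetical helper name, and the template text is copied from the command line shown earlier in this section. The commented lines show the one-shot and chat-style invocations side by side, using only the flags already mentioned above.

```shell
# Wrap an instruction ($1) in the Alpaca prompt template used in this README.
alpaca_prompt() {
    printf 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n%s\n### Response:' "$1"
}

# One-shot generation with a templated prompt:
# ./main -t 8 -m alpaca-lora-65B.ggml.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "$(alpaca_prompt 'Write a story about llamas')"

# Chat-style conversation (the -p argument is replaced with -i -ins):
# ./main -t 8 -m alpaca-lora-65B.ggml.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
```
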
## How to run in `text-generation-webui`

Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).

Note: at the time of writing, text-generation-webui does not support the new q5 quantisation methods.

**Thireus** has written a [great guide on how to update it to the latest llama.cpp code](https://huggingface.co/TheBloke/wizardLM-7B-GGML/discussions/5) so that these files can be used in the UI.

# Original model card not provided

No model card was provided in [chansung's original repository](https://huggingface.co/chansung/alpaca-lora-65b).

Based on the name, I assume this is the result of fine-tuning on the original GPT 3.5 Alpaca dataset. It is unknown whether the original Stanford data was used, or the [cleaned tloen/alpaca-lora variant](https://github.com/tloen/alpaca-lora).