	---
	license: other
	inference: false
	---

	# Quantised GGMLs of alpaca-lora-65B

	Quantised 4bit and 5bit GGMLs of [chansung's alpaca-lora-65B](https://huggingface.co/chansung/alpaca-lora-65b) for CPU inference with [llama.cpp](https://github.com/ggerganov/llama.cpp).

	I also have 4bit GPTQ files for GPU inference available here: [TheBloke/alpaca-lora-65B-GPTQ-4bit](https://huggingface.co/TheBloke/alpaca-lora-65B-GPTQ-4bit).

	## REQUIRES LATEST LLAMA.CPP (May 12th 2023 - commit b9fd7ee)!

	llama.cpp recently made a breaking change to its quantisation methods.

	I have re-quantised the GGML files in this repo to match. To use them you need llama.cpp compiled on May 12th or later (commit `b9fd7ee` or later).

	The previous files, which will still work in older versions of llama.cpp, can be found in branch `previous_llama`.
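
One way to confirm that a local llama.cpp checkout is new enough is to ask git whether commit `b9fd7ee` is an ancestor of the checked-out revision. This is a minimal sketch, assuming llama.cpp was obtained with a full `git clone` (a shallow clone may not contain the commit). The function name `has_commit` is illustrative and not part of llama.cpp.

```shell
# Succeeds (exit status 0) if COMMIT is an ancestor of HEAD in the repo at DIR.
# Usage: has_commit DIR COMMIT
has_commit() {
  dir=$1
  commit=$2
  git -C "$dir" merge-base --is-ancestor "$commit" HEAD 2>/dev/null
}
```

For example, `has_commit ./llama.cpp b9fd7ee && echo "new enough"`. If the check fails, either update and rebuild llama.cpp or use the files in the `previous_llama` branch.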

	## Provided files
	| Name | Quant method | Bits | Size | RAM required | Use case |
	| ---- | ---- | ---- | ---- | ---- | ----- |
	| `alpaca-lora-65B.ggml.q4_0.bin` | q4_0 | 4bit | 40.8GB | 43GB | 4bit. |
	| `alpaca-lora-65B.ggml.q5_0.bin` | q5_0 | 5bit | 44.9GB | 47GB | 5bit. Higher quality than 4bit, at cost of slightly higher resources. |
	| `alpaca-lora-65B.ggml.q5_1.bin` | q5_1 | 5bit | 49GB | 51GB | 5bit. Slightly higher resource usage and quality than q5_0. |

	* The q4_0 file provides lower quality, but maximal compatibility. It will work with past and future versions of llama.cpp.
	* The q5_0 file uses the new 5bit method released 26th April. This is the 5bit equivalent of q4_0.
	* The q5_1 file uses the new 5bit method released 26th April. This is the 5bit equivalent of q4_1.
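
The RAM figures in the table can be turned into a quick selection rule: pick the highest-quality file whose stated RAM requirement fits the machine. The sketch below hard-codes the three figures from the table. `pick_quant` is a hypothetical helper name, and the rule does not account for memory already in use by other processes.

```shell
# Print the highest-quality file from the table whose "RAM required"
# figure (in GB) is within the given amount of RAM (whole GB).
# Usage: pick_quant RAM_GB
pick_quant() {
  ram_gb=$1
  if [ "$ram_gb" -ge 51 ]; then
    echo "alpaca-lora-65B.ggml.q5_1.bin"
  elif [ "$ram_gb" -ge 47 ]; then
    echo "alpaca-lora-65B.ggml.q5_0.bin"
  elif [ "$ram_gb" -ge 43 ]; then
    echo "alpaca-lora-65B.ggml.q4_0.bin"
  else
    echo "not enough RAM for any file in this repo" >&2
    return 1
  fi
}
```

`pick_quant 64` prints `alpaca-lora-65B.ggml.q5_1.bin`, while `pick_quant 48` prints `alpaca-lora-65B.ggml.q5_0.bin`.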

	## How to run in `llama.cpp`

	I use the following command line; adjust for your tastes and needs:

	```
	./main -t 18 -m alpaca-lora-65B.ggml.q4_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "Below is an instruction that describes a task. Write a response that appropriately completes the request.
	### Instruction:
	Write a story about llamas
	### Response:"
	```
	Change `-t 18` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
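
To fill in `-t` automatically, the physical core count can be queried from the operating system. This is a rough sketch, assuming `lscpu` is available on Linux or `sysctl` on macOS. `physical_cores` is a hypothetical helper, and it falls back to the logical CPU count when neither tool gives an answer.

```shell
# Print the number of physical CPU cores (best effort).
physical_cores() {
  n=""
  if command -v lscpu >/dev/null 2>&1; then
    # Linux: count unique (core, socket) pairs, which ignores hyperthreads.
    n=$(lscpu -p=CORE,SOCKET 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')
  elif command -v sysctl >/dev/null 2>&1; then
    # macOS: the kernel reports the physical core count directly.
    n=$(sysctl -n hw.physicalcpu 2>/dev/null)
  fi
  # Fall back to the logical CPU count if nothing above gave a positive number.
  case "$n" in
    ''|0|*[!0-9]*) n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1) ;;
  esac
  echo "$n"
}
```

The result can be passed straight to the command above by writing `-t "$(physical_cores)"` in place of `-t 18`.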

	If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
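
For reference, the chat-style variant of the command above can be wrapped in a small function. This is a sketch that reuses the same settings. The `chat` function name and the `LLAMA_MAIN` override variable are hypothetical conveniences, not part of llama.cpp.

```shell
# Start an interactive instruct-mode session: same settings as the
# one-shot command, with -p "<PROMPT>" replaced by -i -ins.
# Usage: chat MODEL_FILE THREADS
chat() {
  model=$1
  threads=$2
  # LLAMA_MAIN (hypothetical) lets you point at a binary other than ./main.
  "${LLAMA_MAIN:-./main}" -t "$threads" -m "$model" --color -c 2048 \
    --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
}
```

For example, `chat alpaca-lora-65B.ggml.q4_0.bin 8` on a machine with 8 physical cores.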

	## How to run in `text-generation-webui`

	Further instructions here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).

	Note: at this time text-generation-webui does not support the new q5 quantisation methods.

	Thireus has written a [great guide on how to update it to the latest llama.cpp code](https://huggingface.co/TheBloke/wizardLM-7B-GGML/discussions/5) so that these files can be used in the UI.

	# Original model card not provided

	No model card was provided in [chansung's original repository](https://huggingface.co/chansung/alpaca-lora-65b).

	Based on the name, I assume this is the result of fine-tuning on the original GPT 3.5 Alpaca dataset. It is unknown whether the original Stanford data was used, or the [cleaned tloen/alpaca-lora variant](https://github.com/tloen/alpaca-lora).