aws-kh committed
Commit 9f6ae3f
1 Parent(s): 54f7e6d

Added model files.

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,150 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ inference: false
+ ---
+
+ # MistralLite-AWQ Model
+
+ MistralLite-AWQ is a version of the [MistralLite](https://huggingface.co/amazon/MistralLite) model that was
+ quantized using the AWQ method developed by [Lin et al. (2023)](https://arxiv.org/abs/2306.00978).
+ The MistralLite-AWQ models are approximately **70% smaller** than MistralLite while maintaining comparable performance.
+
+ Please refer to the [original MistralLite model card](https://huggingface.co/amazon/MistralLite) for details about the model
+ preparation and training processes.
+
+ ## MistralLite-AWQ Variants
+
+ | Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
+ |--------|-------------------:|---------------:|--------:|-----------|
+ | [main](https://huggingface.co/amazon/MistralLite-AWQ/tree/main) | 3.9 GB | 128 | 4 | GEMM |
+ | [MistralLite-AWQ-64g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-64g-4b-GEMM) | 4.0 GB | 64 | 4 | GEMM |
+ | [MistralLite-AWQ-32g-4b-GEMM](https://huggingface.co/amazon/MistralLite-AWQ/tree/MistralLite-AWQ-32g-4b-GEMM) | 4.3 GB | 32 | 4 | GEMM |
+
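Each non-default variant lives on its own branch of the repository, so it has to be selected explicitly when the model is downloaded or loaded. A minimal sketch, assuming vLLM's `revision` argument is used to pick the branch (any branch name from the table above works the same way):

```python
from vllm import LLM

# "revision" selects the Hugging Face branch that stores this variant's weights.
llm = LLM(
    model="amazon/MistralLite-AWQ",
    revision="MistralLite-AWQ-32g-4b-GEMM",
    quantization="awq",
)
```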
+ ## Dependencies
+ - [`autoawq==0.2.5`](https://pypi.org/project/autoawq/0.2.5/) – [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) was used to quantize the MistralLite model; a sketch of a typical quantization call is shown below.
+ - [`vllm==0.4.2`](https://pypi.org/project/vllm/0.4.2/) – [vLLM](https://github.com/vllm-project/vllm) was used to host the models for benchmarking.
+
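The exact quantization script is not included in this release. As a rough sketch, a comparable run with `autoawq==0.2.5` using the `main`-branch settings (`q_group_size=128`, `w_bit=4`, GEMM kernels) would typically look like the following; the output directory name is illustrative.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "amazon/MistralLite"
quant_path = "MistralLite-AWQ"  # illustrative local output directory

# Settings matching the main branch of this repository.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibrate and quantize, then persist the quantized weights and the tokenizer.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```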
+ ## Evaluations
+
+ ### Long Context
+
+ The following benchmark results are shown as _accuracy_ (%) values, unless stated otherwise.
+
+ #### Topic Retrieval
+
+ See https://lmsys.org/blog/2023-06-29-longchat/
+
+ | Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
+ |:---------------------------------------------------|--------------:|--------------:|--------------:|--------------:|--------------:|
+ | _n_tokens_ (approx.) = | _3048_ | _5966_ | _8903_ | _11832_ | _14757_ |
+ | MistralLite | 100 | 100 | 100 | 100 | 98 |
+ | **MistralLite-AWQ** | **100** | **100** | **100** | **100** | **98** |
+ | **MistralLite-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
+ | **MistralLite-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
+ | Mistral-7B-Instruct-v0.1 | 96 | 52 | 2 | 0 | 0 |
+ | Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 |
+ | Mixtral-8x7B-v0.1 | 0 | 0 | 0 | 0 | 0 |
+ | Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 |
+
+ #### [Line Retrieval](https://lmsys.org/blog/2023-06-29-longchat/#longeval-results)
+
+ See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results
+
+ | Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
+ |:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
+ | _n_tokens_ (approx.) = | _4317_ | _6415_ | _8510_ | _10610_ | _12698_ | _14373_ |
+ | MistralLite | 100 | 94 | 86 | 82 | 76 | 66 |
+ | **MistralLite-AWQ** | **96** | **94** | **88** | **80** | **70** | **62** |
+ | **MistralLite-AWQ-64g-4b-GEMM** | **96** | **96** | **90** | **70** | **72** | **60** |
+ | **MistralLite-AWQ-32g-4b-GEMM** | **98** | **96** | **84** | **76** | **70** | **62** |
+ | Mistral-7B-Instruct-v0.1 | 96 | 56 | 38 | 36 | 30 | 30 |
+ | Mistral-7B-Instruct-v0.2 | 100 | 100 | 96 | 98 | 96 | 84 |
+ | Mixtral-8x7B-v0.1 | 54 | 38 | 56 | 66 | 62 | 38 |
+ | Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |
+
+ #### Pass Key Retrieval
+
+ See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101
+
+ | Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
+ |:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
+ | _n_tokens_ (approx.) = | _3272_ | _5405_ | _8338_ | _10205_ | _12071_ | _16072_ |
+ | MistralLite | 100 | 100 | 100 | 100 | 100 | 100 |
+ | **MistralLite-AWQ** | **100** | **100** | **100** | **100** | **100** | **100** |
+ | **MistralLite-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
+ | **MistralLite-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
+ | Mistral-7B-Instruct-v0.1 | 100 | 50 | 30 | 20 | 10 | 10 |
+ | Mistral-7B-Instruct-v0.2 | 100 | 100 | 100 | 100 | 100 | 100 |
+ | Mixtral-8x7B-v0.1 | 100 | 100 | 100 | 100 | 100 | 100 |
+ | Mixtral-8x7B-Instruct-v0.1 | 100 | 100 | 100 | 90 | 100 | 100 |
+
+ #### QuALITY (Question Answering with Long Input Texts, Yes!)
+
+ See https://nyu-mll.github.io/quality/
+
+ | Model Name | Test set Accuracy | Hard subset Accuracy |
+ |:----------|-------------:|-------------:|
+ | MistralLite | 56.8 | 74.5 |
+ | **MistralLite-AWQ** | **55.3** | **71.8** |
+ | **MistralLite-AWQ-64g-4b-GEMM** | **55.2** | **72.9** |
+ | **MistralLite-AWQ-32g-4b-GEMM** | **56.6** | **72.8** |
+ | Mistral-7B-Instruct-v0.1 | 45.2 | 58.9 |
+ | Mistral-7B-Instruct-v0.2 | 55.5 | 74 |
+ | Mixtral-8x7B-v0.1 | 75 | 74.1 |
+ | Mixtral-8x7B-Instruct-v0.1 | 68.7 | 83.3 |
+
+ ## Usage
+
+ ### Inference via vLLM HTTP Host
+
+ #### Launch Host
+ ```bash
+ python -m vllm.entrypoints.openai.api_server \
+     --model amazon/MistralLite-AWQ \
+     --quantization awq
+ ```
+
+ #### Query Host
+ ```bash
+ curl -X POST http://localhost:8000/v1/completions \
+     -H "Content-Type: application/json" \
+     -d '{ "model": "amazon/MistralLite-AWQ",
+           "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
+           "temperature": 0,
+           "echo": false
+         }'
+ ```
+
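The same endpoint can also be queried from Python. A minimal sketch using the `openai` client package (not listed in the dependencies above), assuming the default host and port used by the launch command:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default and ignores the API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="amazon/MistralLite-AWQ",
    prompt="<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    temperature=0,
    max_tokens=256,
)
print(completion.choices[0].text)
```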
+ ### Inference via [vLLM Offline Inference](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference.html)
+ ```python
+ from vllm import LLM, SamplingParams
+
+ prompts = [
+     "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
+ ]
+ sampling_params = SamplingParams(temperature=0, max_tokens=100)
+
+ llm = LLM(model="amazon/MistralLite-AWQ")
+
+ outputs = llm.generate(prompts, sampling_params)
+
+ # Print the prompt and the generated completion for each output.
+ for output in outputs:
+     prompt = output.prompt
+     generated_text = output.outputs[0].text
+     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+ ```
+
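Both examples embed the MistralLite prompt template inline. When building prompts programmatically, a small helper (the function name here is hypothetical) keeps the template in one place:

```python
def build_mistrallite_prompt(question: str) -> str:
    # MistralLite prompt template used in the examples above.
    return f"<|prompter|>{question}</s><|assistant|>"

# Example:
# build_mistrallite_prompt("What are the main challenges to support a long context for LLM?")
```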
+ ## License
+
+ Apache 2.0
+
+ ## Limitations
+
+ Before using the MistralLite-AWQ model, it is important to perform your own
+ independent assessment and to take measures to ensure that your use complies
+ with your own specific quality control practices and standards, and with the
+ local rules, laws, regulations, licenses, and terms that apply to you and your
+ content.
config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "_name_or_path": "/home/ubuntu/.cache/huggingface/hub/models--amazon--MistralLite/snapshots/a6083667f229a8b1503c816c863fd21be053871d",
+   "architectures": [
+     "MistralForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 14336,
+   "max_position_embeddings": 32768,
+   "model_type": "mistral",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 8,
+   "quantization_config": {
+     "bits": 4,
+     "group_size": 128,
+     "modules_to_not_convert": null,
+     "quant_method": "awq",
+     "version": "gemm",
+     "zero_point": true
+   },
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 1000000,
+   "sliding_window": null,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float16",
+   "transformers_version": "4.40.2",
+   "use_cache": true,
+   "vocab_size": 32003
+ }
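Because the checkpoint embeds a `quantization_config` with `"quant_method": "awq"`, recent versions of Transformers (with `autoawq` installed) should be able to load it directly. A minimal sketch, assuming a CUDA device and the `accelerate` package for `device_map="auto"`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Transformers reads the embedded quantization_config and loads the 4-bit AWQ weights.
model = AutoModelForCausalLM.from_pretrained("amazon/MistralLite-AWQ", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("amazon/MistralLite-AWQ")
```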
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "do_sample": true,
+   "eos_token_id": 2,
+   "transformers_version": "4.40.2"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:95960bb0b19cd967e4ef0a1397ec885e6123bbd63b5934a992f253612e42dc1d
+ size 4150929384
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "additional_special_tokens": [
+     "<unk>",
+     "<s>",
+     "</s>",
+     "<|assistant|>",
+     "<|prompter|>"
+   ],
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": true,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,72 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32000": {
+       "content": "[PAD]",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32001": {
+       "content": "<|assistant|>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     },
+     "32002": {
+       "content": "<|prompter|>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<unk>",
+     "<s>",
+     "</s>",
+     "<|assistant|>",
+     "<|prompter|>"
+   ],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": true
+ }
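The `<|prompter|>` and `<|assistant|>` markers used in the prompt template are registered as added special tokens (ids 32002 and 32001), so the tokenizer encodes each marker as a single token. A quick check, assuming the tokenizer files above are loaded from this repository:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amazon/MistralLite-AWQ")

# Each prompt marker maps to a single token id.
print(tokenizer.convert_tokens_to_ids("<|prompter|>"))   # 32002
print(tokenizer.convert_tokens_to_ids("<|assistant|>"))  # 32001
print(tokenizer("<|prompter|>Hello</s><|assistant|>").input_ids)
```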