lmdeploy/llama2-chat-7b-w4


unsubscribe committed on Sep 27, 2023

Commit d881779 • 1 Parent(s): c8da55e

Update README.md

Files changed (1)

README.md +7 -3


@@ -75,9 +75,13 @@ We benchmarked the Llama 2 7B and 13B with 4-bit quantization on NVIDIA GeForce
 | Llama 2 13B | N/A     | 90.7    | 115.8     |
 ```shell
-python benchmark/profile_generation.py \
-  ./workspace \
-  --concurrency 1 --input_seqlen 1 --output_seqlen 512
+pip install nvidia-ml-py
+```
+```bash
+python profile_generation.py \
+ --model-path /path/to/your/model \
+ --concurrency 1 8 --prompt-tokens 0 512 --completion-tokens 2048 512
 ```
 ## 4-bit Weight Quantization