Update README.md
README.md
CHANGED
````diff
@@ -33,20 +33,8 @@ huggingface-cli download alpindale/Llama-2-7b-ONNX --repo-type model --cache-dir
 The `--cache-dir` kwarg is only necessary if your default cache directory (`~/.cache`)
 does not have enough disk space to accommodate the entire repository.
 
-## Using the model
-The repository provides example code for running the models.
-
-```sh
-python llama2_onnx_inference.py --onnx_file FP16/LlamaV2_7B_float16.onnx --embedding_file embeddings.pth --tokenizer_path tokenizer.model --prompt "What is the lightest element?"
-```
-
-Output:
-```
-The lightest element is hydrogen. Hydrogen is the lightest element on the periodic table, with an atomic mass of 1.00794 u (unified atomic mass units).
-```
-
 ### Chat Interface
-
+You can use the Gradio chat interface to run the models.
 
 First, install the required packages:
 ```sh
@@ -67,6 +55,18 @@ python ChatApp/app.py
 You can then navigate to [http://localhost:7860](http://127.0.0.1:7860) in your browser to access the interface.
 
 
+## CLI Interface
+The repository also provides example code for running the models.
+
+```sh
+python llama2_onnx_inference.py --onnx_file FP16/LlamaV2_7B_float16.onnx --embedding_file embeddings.pth --tokenizer_path tokenizer.model --prompt "What is the lightest element?"
+```
+
+Output:
+```
+The lightest element is hydrogen. Hydrogen is the lightest element on the periodic table, with an atomic mass of 1.00794 u (unified atomic mass units).
+```
+
 ## FAQ
 ### Why is the first inference session slow?
 The ONNX Runtime execution provider might need to generate JIT binaries for the underlying hardware. Typically, these binaries are cached and loaded directly in subsequent runs to reduce the overhead.
````
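The disk-space caveat behind `--cache-dir` can be checked up front before starting a large download. A minimal sketch using only the Python standard library — the helper name and the size figure are placeholders, not part of the repository:

```python
import shutil

def has_space(path: str, needed_gb: float) -> bool:
    """True if the filesystem holding `path` has at least `needed_gb` free."""
    return shutil.disk_usage(path).free >= needed_gb * 1024**3

# Substitute the actual repository size; the full set of ONNX weights
# is tens of gigabytes.
if not has_space(".", 30):
    print("Not enough space here; pass a roomier --cache-dir instead.")
```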
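The compile-once-then-cache behavior described in the FAQ can be illustrated with a toy sketch. This mimics the pattern only — it is not ONNX Runtime's actual caching mechanism, and `build_kernel` is a hypothetical stand-in:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def build_kernel(shape):
    """Stand-in for expensive one-time JIT compilation for a given input shape."""
    time.sleep(0.05)  # simulate the compilation cost paid on first use
    return f"binary-for-{shape}"

t0 = time.perf_counter()
build_kernel((1, 4096))  # first call: pays the compilation cost
first = time.perf_counter() - t0

t0 = time.perf_counter()
build_kernel((1, 4096))  # same shape again: served from the cache
second = time.perf_counter() - t0
```

After the first call, repeat invocations with the same shape skip compilation entirely, which is why only the first inference session feels slow.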