imone committed
Commit: c2cfeec
Parent: 1f0281c

Update README

Files changed (1): README.md (+8, -5)
README.md CHANGED
@@ -18,9 +18,11 @@ license: llama2
 
 ## <a id="models"></a> Usage
 
- To use these models, we highly recommend installing the OpenChat OpenAI-compatible API server from [OpenChat repo](https://github.com/imoneoi/openchat), and run the serving commands in the table below. The server is optimized for high-throughput deployment using vLLM and can run on a GPU with at least 48GB RAM, or two consumer GPUs with tensor parallel. To enable tensor parallel, append `--tensor-parallel-size 2` to the serving command.
+ To use these models, we highly recommend installing the OpenChat package by following the [installation guide](https://github.com/imoneoi/openchat/#installation) and using the OpenChat OpenAI-compatible API server by running the serving command from the table below. The server is optimized for high-throughput deployment using [vLLM](https://github.com/vllm-project/vllm) and can run on a GPU with at least 48GB RAM or two consumer GPUs with tensor parallelism. To enable tensor parallelism, append `--tensor-parallel-size 2` to the serving command.
 
- When started, the server listens at `localhost:18888` for requests and is compatible with the [OpenAI ChatCompletion API specifications](https://platform.openai.com/docs/api-reference/chat). See the example request below for reference. Additionally, you can access the [OpenChat Web UI](https://github.com/imoneoi/openchat/#web-ui) for a user-friendly experience.
+ When started, the server listens at `localhost:18888` for requests and is compatible with the [OpenAI ChatCompletion API specifications](https://platform.openai.com/docs/api-reference/chat). See the example request below for reference. Additionally, you can access the [OpenChat Web UI](#web-ui) for a user-friendly experience.
+
+ To deploy the server as an online service, use `--api-keys sk-KEY1 sk-KEY2 ...` to specify allowed API keys and `--disable-log-requests --disable-log-stats --log-file openchat.log` for logging only to a file. We recommend using a [HTTPS gateway](https://fastapi.tiangolo.com/es/deployment/concepts/#security-https) in front of the server for security purposes.
 
 <details>
 <summary>Example request (click to expand)</summary>
@@ -33,14 +35,15 @@ curl http://localhost:18888/v1/chat/completions \
   "messages": [{"role": "user", "content": "You are a large language model named OpenChat. Write a poem to describe yourself"}]
 }'
 ```
+
 </details>
 
 | Model | Size | Context | Weights | Serving |
 |---------------|------|---------|-------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
- | OpenChat 3.1 | 13B | 4096 | [Huggingface](https://huggingface.co/openchat/openchat_v3.1) | `python -m ochat.serving.openai_api_server --model_type openchat_v3.1_llama2 --model openchat/openchat_v3.1 --engine-use-ray --worker-use-ray --max-num-batched-tokens 5120` |
- | OpenChat 3.2 | 13B | 4096 | [Huggingface](https://huggingface.co/openchat/openchat_v3.2) | `python -m ochat.serving.openai_api_server --model_type openchat_v3.2 --model openchat/openchat_v3.2 --engine-use-ray --worker-use-ray --max-num-batched-tokens 5120` |
+ | OpenChat 3.1 | 13B | 4096 | [Huggingface](https://huggingface.co/openchat/openchat_v3.1) | `python -m ochat.serving.openai_api_server --model-type openchat_v3.1_llama2 --model openchat/openchat_v3.1 --engine-use-ray --worker-use-ray --max-num-batched-tokens 5120` |
+ | OpenChat 3.2 | 13B | 4096 | [Huggingface](https://huggingface.co/openchat/openchat_v3.2) | `python -m ochat.serving.openai_api_server --model-type openchat_v3.2 --model openchat/openchat_v3.2 --engine-use-ray --worker-use-ray --max-num-batched-tokens 5120` |
 
- To run inference with Huggingface Transformers (slow and not recommended), follow the conversation template provided below:
+ For inference with Huggingface Transformers (slow and not recommended), follow the conversation template provided below:
 
 <details>
 <summary>Conversation templates (click to expand)</summary>
 
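The example request in the second hunk is cut off by the diff context: only the `curl` opener and the `"messages"` line are visible. For reference, a complete request against the local server might look like the sketch below; the `Content-Type` header and the `"model"` value are assumptions following the OpenAI ChatCompletion format, since those lines fall outside the hunk.

```bash
# Minimal sketch of a complete request to the local server on the
# documented port 18888. The Content-Type header and the "model" value
# are assumptions; they are not visible inside the diff context.
curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openchat_v3.2",
    "messages": [{"role": "user", "content": "You are a large language model named OpenChat. Write a poem to describe yourself"}]
  }'
```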
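Combining the flags from the new deployment paragraph with the table's serving command gives a sketch of a full online-serving invocation. The Bearer-style `Authorization` header in the client call is an assumption carried over from the OpenAI API convention, not something the README states.

```bash
# Sketch: OpenChat 3.2 as an online service with allowed API keys and
# file-only logging. All server flags are taken from the updated README.
python -m ochat.serving.openai_api_server \
  --model-type openchat_v3.2 \
  --model openchat/openchat_v3.2 \
  --engine-use-ray --worker-use-ray \
  --max-num-batched-tokens 5120 \
  --api-keys sk-KEY1 sk-KEY2 \
  --disable-log-requests --disable-log-stats \
  --log-file openchat.log

# Clients then pass one of the allowed keys. Bearer auth is assumed
# here, following the OpenAI convention.
curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-KEY1" \
  -d '{"model": "openchat_v3.2", "messages": [{"role": "user", "content": "Hello"}]}'
```

The HTTPS gateway recommendation fits this setup: the server itself is presumably reached over plain HTTP, so the gateway in front of it terminates TLS.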
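As the updated usage paragraph notes, tensor parallelism only requires appending `--tensor-parallel-size 2` to a serving command from the table. For example, to split OpenChat 3.2 across two consumer GPUs:

```bash
# OpenChat 3.2 on two GPUs via vLLM tensor parallelism; this is the
# table's serving command with the documented flag appended.
python -m ochat.serving.openai_api_server \
  --model-type openchat_v3.2 \
  --model openchat/openchat_v3.2 \
  --engine-use-ray --worker-use-ray \
  --max-num-batched-tokens 5120 \
  --tensor-parallel-size 2
```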