Text Generation
Transformers
PyTorch
English
gpt_neox
text-generation-inference
Inference Endpoints
xzyao committed on
Commit cbdcae0
1 Parent(s): 4b96b8d

update readme

Files changed (1)
  1. README.md +81 -19
README.md CHANGED
@@ -4,10 +4,9 @@ language:
 - en
 ---
 
-# RedPajama-Chat-INCITE-2.8B
+# RedPajama-Instruct-INCITE-2.8B
 
-RedPajama-Chat-INCITE-2.8B-v1, is a large transformer-based language model developed by Together Computer and trained on the RedPajama-Data-1T dataset.
-It is further fine-tuned on GPT-JT's datasets enhance zero/few-shot in-context learning.
+RedPajama-Instruct-INCITE-2.8B-v1 is a large transformer-based language model developed by Together Computer and trained on the RedPajama-Data-1T dataset.
 
 ## Model Details
 - **Developed by**: Together Computer.
@@ -18,52 +17,115 @@ It is further fine-tuned on GPT-JT's datasets enhance zero/few-shot in-context l
 
 # Quick Start
 
+Please note that the model requires `transformers` version >= 4.25.1.
+
 ## GPU Inference
 
 This requires a GPU with 8GB memory.
+
 ```python
+import torch
+import transformers
 from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MIN_TRANSFORMERS_VERSION = '4.25.1'
+
+# check transformers version
+assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'
+
 # init
-tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Chat-INCITE-2.8B-v1")
-model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Chat-INCITE-2.8B-v1", torch_dtype=torch.float16)
+tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Instruct-INCITE-2.8B-v1")
+model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Instruct-INCITE-2.8B-v1", torch_dtype=torch.float16)
 model = model.to('cuda:0')
 # infer
-inputs = tokenizer("Hello", return_tensors='pt').to(model.device)
-outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.8)
-output_str = tokenizer.decode(outputs[0])
+prompt = "Q: The capital of France is?\nA:"
+inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+input_length = inputs.input_ids.shape[1]
+outputs = model.generate(
+  **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
+)
+token = outputs.sequences[0, input_length:]
+output_str = tokenizer.decode(token)
 print(output_str)
+"""
+Paris
+"""
 ```
 
 ## GPU Inference in Int8
 
 This requires a GPU with 6GB memory.
 
+To run inference with int8, please ensure you have installed accelerate and bitsandbytes. You can install them with the following command:
+
+```bash
+pip install accelerate
+pip install bitsandbytes
+```
+
+Then you can run inference with int8 as follows:
+
 ```python
+import torch
+import transformers
 from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MIN_TRANSFORMERS_VERSION = '4.25.1'
+
+# check transformers version
+assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'
+
 # init
-tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Chat-INCITE-2.8B-v1")
-model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Chat-INCITE-2.8B-v1", device_map="auto", load_in_8bit=True)
+tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Instruct-INCITE-2.8B-v1")
+model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Instruct-INCITE-2.8B-v1", device_map='auto', torch_dtype=torch.float16, load_in_8bit=True)
+
 # infer
-inputs = tokenizer("Hello", return_tensors='pt').to(model.device)
-outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.8)
-output_str = tokenizer.decode(outputs[0])
+prompt = "Q: The capital of France is?\nA:"
+inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+input_length = inputs.input_ids.shape[1]
+outputs = model.generate(
+  **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
+)
+token = outputs.sequences[0, input_length:]
+output_str = tokenizer.decode(token)
 print(output_str)
+"""
+Paris
+"""
 ```
 
 ## CPU Inference
 
 ```python
+import torch
+import transformers
 from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MIN_TRANSFORMERS_VERSION = '4.25.1'
+
+# check transformers version
+assert transformers.__version__ >= MIN_TRANSFORMERS_VERSION, f'Please upgrade transformers to version {MIN_TRANSFORMERS_VERSION} or higher.'
+
 # init
-tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Chat-INCITE-2.8B-v1")
-model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Chat-INCITE-2.8B-v1", torch_dtype=torch.bfloat16)
+tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-Instruct-INCITE-2.8B-v1")
+model = AutoModelForCausalLM.from_pretrained("togethercomputer/RedPajama-Instruct-INCITE-2.8B-v1", torch_dtype=torch.bfloat16)
 # infer
-inputs = tokenizer("<human>: Hello!\n<bot>:", return_tensors='pt').to(model.device)
-outputs = model.generate(**inputs, max_new_tokens=10, do_sample=True, temperature=0.8)
-output_str = tokenizer.decode(outputs[0])
+prompt = "Q: The capital of France is?\nA:"
+inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
+input_length = inputs.input_ids.shape[1]
+outputs = model.generate(
+  **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
+)
+token = outputs.sequences[0, input_length:]
+output_str = tokenizer.decode(token)
 print(output_str)
+"""
+Paris
+"""
 ```
 
+Please note that since `LayerNormKernelImpl` is not implemented in fp16 for CPU, we use `bfloat16` for CPU inference.
+
 
 # Uses
 
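The CPU note added above points at a PyTorch limitation: on CPU there is no fp16 LayerNorm kernel in many builds, so half-precision inference fails while `bfloat16` works. A minimal sketch of that behaviour, assuming a PyTorch build without an fp16 LayerNorm CPU kernel (newer builds may accept fp16 here):

```python
import torch

# Illustration only: many PyTorch builds have no fp16 LayerNorm kernel on CPU
# ("LayerNormKernelImpl" not implemented for 'Half'), while bfloat16 works.
layer_norm = torch.nn.LayerNorm(8)
x = torch.randn(2, 8)

try:
    layer_norm.to(torch.float16)(x.to(torch.float16))
except RuntimeError as err:
    print(f"fp16 LayerNorm on CPU failed: {err}")

out = layer_norm.to(torch.bfloat16)(x.to(torch.bfloat16))
print(out.dtype)  # torch.bfloat16
```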
@@ -123,7 +185,7 @@ Please refer to [togethercomputer/RedPajama-Data-1T](https://huggingface.co/data
 - **Hardware:** 8 A100
 - **Optimizer:** Adam
 - **Gradient Accumulations**: 1
-- **Num of Tokens:** 1B Tokens
+- **Num of Tokens:** 131M tokens
 - **Learning rate:** 1e-5
 
 ## Community
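The fine-tuning hyperparameters listed in the change above (Adam, learning rate 1e-5, gradient accumulation 1) would correspond to an optimizer setup roughly like the sketch below; this is a hypothetical illustration in plain PyTorch, not the actual training script:

```python
import torch

# Hypothetical sketch of the listed hyperparameters (Adam, lr 1e-5,
# gradient accumulation 1, i.e. an optimizer step per batch); the real
# fine-tuning code is not part of this model card.
def configure_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    return torch.optim.Adam(model.parameters(), lr=1e-5)
```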