rskuzma committed
Commit f91e002
1 Parent(s): fa41f1d

Change pipeline_tag to text generation, add placeholders for paper links, incorporate SL change recommendations

Files changed (1)
  1. README.md +9 -7
README.md CHANGED
@@ -7,10 +7,11 @@ tags:
  license: apache-2.0
  datasets:
  - the_pile
-
+ pipeline_tag: text-generation
  ---

  # Cerebras-GPT 111M
+ [TODO: arXiv paper](https://www.cerebras.net), [TODO: Blog Post](https://www.cerebras.net)

  ## Model Description

@@ -18,7 +19,7 @@ The Cerebras-GPT family is released to facilitate research into LLM scaling laws

  The family includes 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B models.

- All models in the Cerebras-GPT family have been trained in accordance with [Chinchilla scaling laws](https://arxiv.org/abs/2203.15556) (20 tokens per model parameter) which yields improved performance at smaller model size.
+ All models in the Cerebras-GPT family have been trained in accordance with [Chinchilla scaling laws](https://arxiv.org/abs/2203.15556) (20 tokens per model parameter) which is compute-optimal.

  These models were trained on the [Andromeda](https://www.cerebras.net/andromeda/) AI supercomputer comprised of 16 CS-2 wafer scale systems. Cerebras' [weight streaming technology](https://www.cerebras.net/blog/linear-scaling-made-possible-with-weight-streaming) simplifies the training of LLMs by disaggregating compute from model storage. This allowed for efficient scaling of training across nodes using simple data parallelism.

@@ -28,7 +29,7 @@ Cerebras systems for pre-training and fine tuning are available in the cloud via
  * Developed by: [Cerebras Systems](https://www.cerebras.net/)
  * License: Apache 2.0
  * Model type: Transformer-based Language Model
- * Architecture: GPT-2 model architecture with hyperparameters more similar to GPT-3.
+ * Architecture: GPT-3 style architecture
  * Data set: The Pile
  * Tokenizer: Byte Pair Encoding
  * Vocabulary Size: 50257
@@ -109,7 +110,7 @@ Our tokenized version of the Pile has 371B tokens. We used byte-pair encoding, a

  ## Training procedure

- We use the GPT-2 model architecture with hyperparameters more similar to GPT-3. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1.
+ We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT-3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1.

  All models were trained to Chinchilla point: 20x more tokens than model parameters. Number of steps changed based on fixed batch size (2048) and sequence length (varied by model). See Training Table, below, for detail.

@@ -192,14 +193,15 @@ We evaluate our models on the PILE validation set comprising 380M tokens. We als
  ## Uses and Limitations

  ### Intended Use
- The models we train are being open-sourced to further research into LLM scaling laws but are not intended for use as production models. You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or the Hugging Face Transformers Library. We recommend assessing potential bias and harms prior to deployment of any LLM.
+ The models we train are being open-sourced to further research into LLM scaling laws, but we release these models with a fully permissive Apache license for the community to use freely.
+
+ You may fine-tune and adapt Cerebras-GPT models for deployment via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or the Hugging Face Transformers Library. We recommend assessing potential bias and harms prior to deployment of any LLM.

- The primary intended users of these models are AI researchers and practitioners interested in testing the behaviors, capabilities, and limitations of large-scale generative language models.

  ### Out of Scope Use
  Cerebras-GPT models are trained on the Pile, with English language only, and are not suitable for machine translation tasks.

- Cerebras-GPT models have not been tuned for human-facing dialog applications like chatbots and will not respond to prompts in a similar way to models that have received instruction tuning or Reinforcement Learning from Human Feedback (RLHF) like Flan-T5 or ChatGPT. Cerebras-GPT models can be tuned using those methods.
+ Cerebras-GPT models have not been tuned for human-facing dialog applications like chatbots and will not respond to prompts in a similar way to models that have received instruction tuning or reinforcement learning from human feedback (RLHF) like Flan-T5 or ChatGPT. Cerebras-GPT models can be tuned using those methods.

  ### Risk and Bias
  Like many large text corpora, the Pile contains offensive text. Cerebras-GPT models trained on this text may create offensive or undesirable text outputs regardless of whether the initial prompt is offensive. Human filtering of responses is recommended.
 
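For reference on the Chinchilla data budget mentioned in the diff above ("20 tokens per model parameter"), here is a minimal back-of-the-envelope sketch in Python. The parameter counts are the nominal figures implied by the model names, not exact counts, so the token totals are only approximate.

```python
# Rough Chinchilla-style token budgets implied by "20 tokens per model parameter".
# Nominal parameter counts are taken from the model names, not exact counts.
nominal_params = {
    "111M": 111e6, "256M": 256e6, "590M": 590e6,
    "1.3B": 1.3e9, "2.7B": 2.7e9, "6.7B": 6.7e9, "13B": 13e9,
}

for name, n_params in nominal_params.items():
    tokens = 20 * n_params  # compute-optimal budget at 20 tokens per parameter
    print(f"{name}: ~{tokens / 1e9:.1f}B training tokens")
```

For the 111M model this works out to roughly 2.2B training tokens, consistent with the "20x more tokens than model parameters" statement in the training procedure.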
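The training-procedure line above describes a 1500-step linear warmup followed by a cosine decay. A sketch of such a schedule is below, assuming "10x cosine decayed" means decay to one tenth of the peak rate; the peak learning rate and total step count are placeholders, since neither is given in this excerpt.

```python
import math

def lr_at_step(step, peak_lr, warmup_steps=1500, total_steps=10_000):
    """Linear warmup for `warmup_steps`, then cosine decay to peak_lr / 10."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    min_lr = peak_lr / 10  # reading "10x cosine decayed" as decay to a tenth of the peak
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Placeholder peak rate and step count, purely for illustration.
for s in (0, 750, 1500, 5_000, 10_000):
    print(s, round(lr_at_step(s, peak_lr=6e-4), 6))
```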
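Since the commit sets `pipeline_tag: text-generation` and the Intended Use section points to the Hugging Face Transformers Library, a minimal loading and generation sketch follows. The repo id `cerebras/Cerebras-GPT-111M` is assumed from the card title and may differ from the actual model id.

```python
# Minimal sketch of loading the model with Hugging Face Transformers.
# The repo id below is an assumption based on the card title.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "cerebras/Cerebras-GPT-111M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Generative AI is", max_new_tokens=50, do_sample=False)[0]["generated_text"])
```

The same `AutoModelForCausalLM` object can be fine-tuned with the standard Transformers training utilities before any deployment, in line with the bias and harms assessment recommended above.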