README.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ language:
+ - en
+ tags:
+ - pytorch
+ - causal-lm
+ - pythia
+ license: apache-2.0
+ datasets:
+ - the_pile
+ ---
+
+ The *Pythia Scaling Suite* is a collection of models developed to facilitate
+ interpretability research. It contains two sets of eight models of sizes
+ 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B. For each size, there are two
+ models: one trained on the Pile, and one trained on the Pile after the dataset
+ has been globally deduplicated. All eight model sizes are trained on the exact
+ same data, in the exact same order. All Pythia models are available
+ [on Hugging Face](https://huggingface.co/EleutherAI).
+
+ Some design choices were made for the sake of interpretability research and
+ to ensure consistency across all models. Even so, the Pythia models are
+ competitive with, or slightly outperform, models of similar size such as
+ OPT and the GPT-Neo suite.
+
+ Please note that all models in the *Pythia* suite were renamed in January
+ 2023. For clarity, a <a href="#naming-convention-and-parameter-count">table
+ comparing the old and new names</a> is provided in this model card, together
+ with exact model parameter counts.
+
+ ## Pythia-70M
+
+ ### Model Details
+
+ - Developed by: [EleutherAI](http://eleuther.ai)
+ - Model type: Transformer-based Language Model
+ - Language: English
+ - Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
+ for the training procedure, config files, and usage details.
+ - Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
+ - License: Apache 2.0
+ - Contact: to ask questions about this model, join the [EleutherAI
+ Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
+ Please read the existing *Pythia* documentation before asking about it in the
+ EleutherAI Discord. For general correspondence:
+
+ <figure>
+
+ | Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
+ | -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
+ | 70M | 18,915,328 | 6 | 512 | 8 | 2M | 1.0 x 10<sup>-3</sup> | — |
+ | 160M | 85,056,000 | 12 | 768 | 12 | 4M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
+ | 410M | 302,311,424 | 24 | 1024 | 16 | 4M | 3.0 x 10<sup>-4</sup> | OPT-350M |
+ | 1.0B | 805,736,448 | 16 | 2048 | 8 | 2M | 3.0 x 10<sup>-4</sup> | — |
+ | 1.4B | 1,208,602,624 | 24 | 2048 | 16 | 4M | 2.0 x 10<sup>-4</sup> | GPT-Neo 1.3B, OPT-1.3B |
+ | 2.8B | 2,517,652,480 | 32 | 2560 | 32 | 2M | 1.6 x 10<sup>-4</sup> | GPT-Neo 2.7B, OPT-2.7B |
+ | 6.9B | 6,444,163,072 | 32 | 4096 | 32 | 2M | 1.2 x 10<sup>-4</sup> | OPT-6.7B |
+ | 12B | 11,327,027,200 | 36 | 5120 | 40 | 2M | 1.2 x 10<sup>-4</sup> | — |
+ <figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
+ non-deduped models of a given size have the same hyperparameters. “Equivalent”
+ models have <b>exactly</b> the same architecture, and the same number of
+ non-embedding parameters.</figcaption>
+ </figure>
+
+ ### Uses and Limitations
+
+ #### Intended Use
+
+ All Pythia models were developed specifically for research purposes. This
+ suite is intended to provide a controlled setting for performing scientific
+ experiments. To enable the study of how language models change over the course
+ of training, we provide 143 evenly spaced intermediate checkpoints per model.
+ These checkpoints are hosted on Hugging Face as branches. Note that branch
+ `step143000` corresponds exactly to the model checkpoint on the `main` branch
+ of each model.
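+
As a rough sketch, the checkpoint branch names can be enumerated programmatically. This assumes the `step<N>` naming shown in the Quickstart (`step3000`, `step143000`) and a spacing of 1000 steps between the 143 checkpoints; `checkpoint_revisions` is a hypothetical helper, not part of any library.

```python
def checkpoint_revisions(num_checkpoints: int = 143, interval: int = 1000) -> list:
    """Return the assumed branch names: step1000, step2000, ..., step143000."""
    return [f"step{interval * i}" for i in range(1, num_checkpoints + 1)]


revisions = checkpoint_revisions()
print(len(revisions), revisions[0], revisions[-1])  # 143 step1000 step143000
```

Any of these names can then be passed as `revision=` when loading a model, as in the Quickstart.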
+
+ #### Out-of-scope use
+
+ Performance on NLP benchmarks is not a priority for *Pythia* models, although
+ their evaluation results are competitive with similarly sized language models,
+ such as those from the OPT and BLOOM suites.
+
+ Pythia-70M has not been fine-tuned for the downstream tasks for which
+ language models are commonly deployed, such as writing genre prose or
+ powering commercial chatbots. This means Pythia-70M will likely **not**
+ respond to a given prompt the way e.g. ChatGPT does. This is because, unlike
+ this model, ChatGPT was fine-tuned using Reinforcement Learning from Human
+ Feedback (RLHF) to better “understand” human instructions.
+
+ #### Limitations and biases
+
+ The core functionality of a large language model is to take a string of text
+ and predict the next token. The token deemed statistically most likely by the
+ model need not produce the most “accurate” text. Never rely on
+ Pythia-70M to produce factually accurate output.
+
+ This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
+ known to contain profanity and texts that are lewd or otherwise offensive.
+ See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
+ discussion of documented biases with regard to gender, religion, and race.
+ Pythia-70M may produce socially unacceptable or undesirable text,
+ *even if* the prompt itself does not include anything explicitly offensive.
+
+ If you plan on using text generated through, for example, the Hosted Inference
+ API, we recommend having a human curate the outputs of this language model
+ before presenting them to other people. Please inform your audience that the
+ text was generated by Pythia-70M.
+
+ ### Quickstart
+
+ Pythia models can be loaded and used via the following code, demonstrated here
+ for the third `pythia-70m-deduped` checkpoint:
+
+ ```python
+ from transformers import GPTNeoXForCausalLM, AutoTokenizer
+
+ # Load the weights saved on the `step3000` branch (the third checkpoint).
+ model = GPTNeoXForCausalLM.from_pretrained(
+     "EleutherAI/pythia-70m-deduped",
+     revision="step3000",
+     cache_dir="./pythia-70m-deduped/step3000",
+ )
+
+ # Load the tokenizer from the same revision.
+ tokenizer = AutoTokenizer.from_pretrained(
+     "EleutherAI/pythia-70m-deduped",
+     revision="step3000",
+     cache_dir="./pythia-70m-deduped/step3000",
+ )
+
+ # Tokenize a prompt, generate a continuation, and print the decoded text.
+ inputs = tokenizer("Hello, I am", return_tensors="pt")
+ tokens = model.generate(**inputs)
+ print(tokenizer.decode(tokens[0]))
+ ```
+
+ Revision/branch `step143000` corresponds exactly to the model checkpoint on
+ the `main` branch of each model.
+
+ For more information on how to use all Pythia models, see the [documentation on
+ GitHub](https://github.com/EleutherAI/pythia).
+
+ ### Training
+
+ #### Training data
+
+ [The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
+ English. It was created by EleutherAI specifically for training large language
+ models. It contains texts from 22 diverse sources, roughly broken down into
+ five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
+ prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
+ miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
+ paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
+ methodology, and a discussion of ethical implications. Consult [the
+ datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
+ about the Pile and its component datasets. The Pile can be downloaded from
+ the [official website](https://pile.eleuther.ai/), or from a [community
+ mirror](https://the-eye.eu/public/AI/pile/).
+
+ The Pile was **not** deduplicated before being used to train Pythia-70M.
+
+ #### Training procedure
+
+ All models were trained on the exact same data, in the exact same order. Each
+ model saw 299,892,736,000 tokens during training, and 143 checkpoints were
+ saved per model, one every 2,097,152,000 tokens, spaced evenly throughout
+ training. This corresponds to just under 1 epoch on the Pile for
+ non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.
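+
The checkpoint count follows directly from the two token figures in this paragraph. A small check using only those numbers (the helper name `tokens_seen` is illustrative):

```python
TOTAL_TOKENS = 299_892_736_000         # tokens seen by each model during training
TOKENS_PER_CHECKPOINT = 2_097_152_000  # tokens between consecutive checkpoints

num_checkpoints, remainder = divmod(TOTAL_TOKENS, TOKENS_PER_CHECKPOINT)


def tokens_seen(checkpoint_index: int) -> int:
    """Tokens seen when the k-th checkpoint (1-based) was saved."""
    if not 1 <= checkpoint_index <= num_checkpoints:
        raise ValueError("checkpoint_index out of range")
    return checkpoint_index * TOKENS_PER_CHECKPOINT


print(num_checkpoints, remainder)  # 143 0
print(tokens_seen(3))              # 6291456000
```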
+
+ All Pythia models were trained for the equivalent of 143000 steps at a batch
+ size of 2,097,152 tokens. Two batch sizes were used: 2M and 4M. Models with a
+ batch size of 4M tokens were originally trained for 71500 steps instead, with
+ checkpoints every 500 steps. The checkpoints on Hugging Face are renamed for
+ consistency with all 2M batch models, so `step1000` is the first checkpoint
+ for `pythia-1.4b` that was saved (corresponding to step 500 in training), and
+ `step1000` is likewise the first `pythia-6.9b` checkpoint that was saved
+ (corresponding to 1000 “actual” steps).
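+
A minimal sketch of the renaming rule described above. It assumes that "2M" means 2,097,152 tokens and that "4M" is exactly twice that, which is consistent with 71500 versus 143000 steps; `actual_training_step` is a hypothetical helper, not part of the Pythia codebase.

```python
TOKENS_2M = 2_097_152      # batch size stated in the text
TOKENS_4M = 2 * TOKENS_2M  # assumption: "4M" is exactly twice the 2M batch


def actual_training_step(revision: str, batch_size_tokens: int) -> int:
    """Map a revision name such as 'step1000' to the optimizer step at which
    that checkpoint was saved, for a model trained with the given batch size."""
    if not revision.startswith("step"):
        raise ValueError(f"unexpected revision name: {revision!r}")
    hub_step = int(revision[len("step"):])
    tokens = hub_step * TOKENS_2M  # revision names count 2M-token steps
    if tokens % batch_size_tokens:
        raise ValueError("revision does not fall on a step boundary")
    return tokens // batch_size_tokens


print(actual_training_step("step1000", TOKENS_4M))    # 500
print(actual_training_step("step1000", TOKENS_2M))    # 1000
print(actual_training_step("step143000", TOKENS_4M))  # 71500
```

Working in tokens rather than steps keeps checkpoints comparable across models trained with different batch sizes.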
+
+ See [GitHub](https://github.com/EleutherAI/pythia) for more details on the
+ training procedure, including [how to reproduce
+ it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).
+
+ ### Evaluations
+
+ All 16 *Pythia* models were evaluated using the [LM Evaluation
+ Harness](https://github.com/EleutherAI/lm-evaluation-harness). You can access
+ the results by model and step at `results/json/*` in the [GitHub
+ repository](https://github.com/EleutherAI/pythia/tree/main/results/json).
+
+ February 2023 note: selected evaluations and comparisons with OPT and BLOOM
+ models will be added here at a later date.
+
+ ### Naming convention and parameter count
+
+ Pythia models were renamed in January 2023. The old naming convention may
+ still persist in some documentation by accident. The
+ current naming convention (70M, 160M, etc.) is based on total parameter count.
+
+ <figure style="width:32em">
+
+ | current Pythia suffix | old suffix | total params | non-embedding params |
+ | --------------------: | ---------: | -------------: | -------------------: |
+ | 70M | 19M | 70,426,624 | 18,915,328 |
+ | 160M | 125M | 162,322,944 | 85,056,000 |
+ | 410M | 350M | 405,334,016 | 302,311,424 |
+ | 1B | 800M | 1,011,781,632 | 805,736,448 |
+ | 1.4B | 1.3B | 1,414,647,808 | 1,208,602,624 |
+ | 2.8B | 2.7B | 2,775,208,960 | 2,517,652,480 |
+ | 6.9B | 6.7B | 6,857,302,016 | 6,444,163,072 |
+ | 12B | 13B | 11,846,072,320 | 11,327,027,200 |
+ </figure>
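+
The gap between the two parameter columns is the count attributed to embeddings. The sketch below derives it, and its share of the total, from the table values alone; `PARAM_COUNTS` and `embedding_params` are illustrative names, and no architectural detail beyond the table is assumed.

```python
# (total params, non-embedding params), copied from the table above
PARAM_COUNTS = {
    "70M": (70_426_624, 18_915_328),
    "160M": (162_322_944, 85_056_000),
    "410M": (405_334_016, 302_311_424),
    "1B": (1_011_781_632, 805_736_448),
    "1.4B": (1_414_647_808, 1_208_602_624),
    "2.8B": (2_775_208_960, 2_517_652_480),
    "6.9B": (6_857_302_016, 6_444_163_072),
    "12B": (11_846_072_320, 11_327_027_200),
}


def embedding_params(suffix: str) -> int:
    """Parameters not counted as non-embedding for the given model suffix."""
    total, non_embedding = PARAM_COUNTS[suffix]
    return total - non_embedding


# Print the embedding count and its share of the total for each model size.
for suffix, (total, _) in PARAM_COUNTS.items():
    emb = embedding_params(suffix)
    print(f"{suffix:>5}  {emb:>13,}  {emb / total:6.1%}")
```

For the smallest model, most of the total parameter count comes from embeddings, which is one reason the total-based and non-embedding-based counts differ so much at that size.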