---
license: apache-2.0
datasets:
- bigcode/the-stack
- HuggingFaceFW/fineweb
---

# Model Details

The TinyCodeLM family of tiny language models (LMs) is a collection of pretrained and instruction-tuned generative code models in 150M and 400M parameter sizes. These models are pretrained on a mixture of open-source web text and Python code. The instruction-tuned TinyCodeLM models are optimized for Python code synthesis and are trained on [synthetic edit sequence data generated with the LintSeq algorithm](https://arxiv.org/abs/2410.02749).

Despite being trained on only 72 billion tokens of text, the models outperform many of the available open-source Python code synthesis models on HumanEval and MBPP. The TinyCodeLM-LintSeqInstruct models are state-of-the-art on Python synthesis for their size.
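A minimal way to load a checkpoint and generate code with the `transformers` library is sketched below. The Hub repository id is a placeholder (this card does not name the exact paths), the prompt is purely illustrative, and greedy decoding corresponds to the Temperature 0 setting reported in the benchmarks.

```python
# Minimal generation sketch using Hugging Face transformers.
# NOTE: the repo id below is a placeholder, not confirmed by this card;
# replace it with the actual Hub path of the TinyCodeLM checkpoint you want.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/TinyCodeLM-400M"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) mirrors the "Temperature 0" benchmark setting.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```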
**Model Developers** Ulyana Piterbarg, Lerrel Pinto, Rob Fergus (NYU)

**Variations** TinyCodeLM comes in two sizes (150M and 400M parameters), each in pretrained and edit sequence instruction-tuned variants.

**Input** Text only.

**Output** Models generate text and code. Instruction-tuned models generate code via sequences of "diffs".

**Model Architecture** TinyCodeLMs are autoregressive language models with architectures that mimic the two smallest versions of GPT-2 (Radford et al., 2019), while integrating the transformer architecture changes of the OLMo models.

**Instruction Tuning Data** TinyCodeLMs are instruction-tuned on paired instruction and Python edit sequence data. These edit sequences are generated with the LintSeq algorithm over a source dataset of paired instructions and Python programs drawn from the Magicoder and StarCoder2 OSS-Instruct datasets (Wei et al., 2024).
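As a rough illustration of what an "edit sequence" looks like, the sketch below turns a series of hypothetical intermediate program states into unified diffs with Python's `difflib`. This is not the LintSeq procedure itself (which, per the linked paper, uses a linter to sample the intermediate states); the program and its states here are invented for illustration only.

```python
import difflib

# Hypothetical intermediate states of a small Python program (illustration only).
states = [
    "",
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n\n\nprint(add(2, 3))\n",
]

edit_sequence = []
for before, after in zip(states, states[1:]):
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="before",
        tofile="after",
    )
    edit_sequence.append("".join(diff))

# Each element is one "diff"; the instruction-tuned models emit programs as a
# sequence of such edits rather than as a single block of code.
for i, edit in enumerate(edit_sequence):
    print(f"--- edit {i} ---\n{edit}")
```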
# Benchmarks

Scores are pass@k rates (%) on the HumanEval and MBPP(+) Python synthesis benchmarks.

**Pretrained (Temperature 0)**

| **Benchmark**      | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :----------------- | ------------------: | ------------------: |
| HumanEval, pass@1  |                 6.1 |                 6.7 |
| MBPP(+), pass@1    |                 5.4 |                 6.8 |

**Edit Sequence / Instruction Tuned (Temperature-Tuned)**

| **Benchmark**      | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :----------------- | ------------------: | ------------------: |
| HumanEval, pass@1  |                12.8 |                13.4 |
| HumanEval, pass@10 |                20.6 |                20.9 |
| MBPP(+), pass@1    |                13.6 |                24.4 |
| MBPP(+), pass@10   |                24.4 |                29.9 |
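For reference, pass@k scores like those above are conventionally computed with the unbiased estimator of Chen et al. (2021) from n samples per problem, of which c pass the unit tests. The card does not spell out its exact evaluation harness, so the sketch below is a generic illustration with made-up counts.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples
    drawn from n generations is correct, given c correct generations."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative counts (not from this card): 20 samples for a problem, 3 correct.
print(round(pass_at_k(20, 3, 1), 3))   # 0.15
print(round(pass_at_k(20, 3, 10), 3))  # 0.895
```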
# Citation

```
@misc{piterbarg2024training,
      title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis},
      author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
      year={2024},
      eprint={2410.02749},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```