---
license: apache-2.0
datasets:
- bigcode/the-stack
- HuggingFaceFW/fineweb
base_model:
- upiter/TinyCodeLM-400M
library_name: transformers
---



# Model Details

The TinyCodeLM family of tiny language models (LMs) is a collection of fully open-source pretrained and instruction tuned generative code models in 150M and 400M sizes. These models are pretrained on a mixture of open-source web text and Python code. The instruction tuned TinyCodeLM models are optimized for Python code synthesis, and are trained on [synthetic edit sequence data generated with the LintSeq algorithm](https://lintseq.github.io/).

Despite being trained on only 72 billion tokens of text, the models outperform many of the available open-source Python code synthesis models on HumanEval and MBPP. The TinyCodeLM-LintSeqInstruct models are state-of-the-art on Python synthesis for their size.
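
The checkpoints are standard `transformers` causal language models, so generation follows the usual `AutoModelForCausalLM` workflow. The snippet below is an illustrative sketch rather than an official usage example: the repo id `upiter/TinyCodeLM-400M` is the base model listed in the metadata above and should be swapped for whichever pretrained or instruction tuned checkpoint you intend to use, and a `transformers` release with OLMo support is assumed.

```python
# Hedged usage sketch: assumes the checkpoint loads through the standard
# transformers causal-LM API and that the installed transformers version
# supports the OLMo architecture. Swap the repo id for the checkpoint you need.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "upiter/TinyCodeLM-400M"  # base model from the metadata; replace as needed
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```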

**Model Developers** Ulyana Piterbarg, Lerrel Pinto, Rob Fergus (NYU)

**Variations** TinyCodeLM comes in two sizes (150M and 400M parameters), each available as a pretrained model and as an edit sequence instruction tuned variant.

**Input** Text only.

**Output** Models generate text and code. Instruction tuned models generate code via sequences of "diffs".

**Model Architecture** TinyCodeLMs are autoregressive language models with architectures that mimic the two smallest versions of GPT-2 (Radford et al., 2019), while integrating the transformer architecture changes of the OLMo models. 

**Instruction Tuning Data** TinyCodeLMs are instruction tuned on paired instruction and Python edit sequence data. These edit sequences are generated with the LintSeq algorithm over a source dataset of paired instruction and Python programs drawn from the Magicoder and StarCoder2 OSS-Instruct datasets (Wei et al., 2024).
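
To make the notion of an edit sequence concrete, the sketch below uses Python's standard `difflib` to compute diffs between successive states of a small program; this is the kind of supervision LintSeq constructs. The program states here are hypothetical, and the exact diff format emitted by the released instruction tuned checkpoints may differ.

```python
# Illustrative sketch only: builds a LintSeq-style "edit sequence" by diffing
# successive program states with the standard library's difflib.
import difflib

# Hypothetical sequence of program states (a file growing into a program).
states = [
    "",
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n\n\nprint(add(2, 3))\n",
]

edit_sequence = []
for prev, curr in zip(states, states[1:]):
    diff = "".join(
        difflib.unified_diff(
            prev.splitlines(keepends=True),
            curr.splitlines(keepends=True),
        )
    )
    edit_sequence.append(diff)

# Each element of `edit_sequence` is one edit; the instruction tuned models are
# trained to emit such edits one after another rather than a whole program at once.
for i, edit in enumerate(edit_sequence):
    print(f"--- edit {i} ---\n{edit}")
```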

# Training Details
TinyCodeLM models were pretrained from scratch on a single H100 node (four GPUs) for two epochs. Pretraining took about two days for the 150M model and six days for the 400M model. Instruction tuning was conducted on a single H100 GPU using DeepSpeed and took no more than a few hours.

# Benchmarks 

**Pretrained (Temperature 0)**
|**Benchmark**|**TinyCodeLM 150M** |**TinyCodeLM 400M** |
| :--------------------- | -----------------: | -----------------: |
|  HumanEval, pass@1 |    6.1   |      6.7  |
|  MBPP(+), pass@1 |        5.4   |         6.8  |


**Edit Sequence / Instruction Tuned (Temperature-Tuned)**
|**Benchmark** |**TinyCodeLM 150M** |**TinyCodeLM 400M** |
| :----------- | -----------------: | -----------------: |
|  HumanEval, pass@1   | 12.8   | 13.4  |
|  HumanEval, pass@10 |    20.6   |  20.9  |
|  MBPP(+), pass@1       | 13.6   |   19.4  |
|  MBPP(+), pass@10     |  24.4   |    29.9  |
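
The pass@k numbers above follow the standard convention for HumanEval and MBPP; assuming the unbiased estimator of Chen et al. (2021), which is the one typically used for these benchmarks, pass@k can be computed from n samples per problem (c of which pass the unit tests) as in the sketch below.

```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# Assumes n completions were sampled per problem and c of them passed the tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n generations passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 of 20 sampled completions pass the tests.
print(round(pass_at_k(n=20, c=5, k=10), 3))  # ~0.984
```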


# Citation

```
@misc{piterbarg2024editseq,
      title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis}, 
      author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
      year={2024},
      eprint={2410.02749},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

# Safety
This work explores data-driven mechanisms for improving the quality of language model-generated code. Our synthetic data generation method relies on open-source data and our experiments leverage open-source software and resources. It is important to acknowledge that all language models for code synthesis can be misused, whether intentionally or unintentionally, to generate code with vulnerabilities and/or malicious behaviors. Any model-generated code has the potential to be harmful and must not be executed without precautions.