---
license: apache-2.0
datasets:
- bigcode/the-stack
- HuggingFaceFW/fineweb
---

# Model Details

The TinyCodeLM family of tiny language models (LMs) is a collection of pretrained and instruction-tuned generative code models in 150M and 400M parameter sizes. These models are pretrained on a mixture of open-source web text and Python code. The instruction-tuned TinyCodeLM models are optimized for Python code synthesis and are trained on [synthetic edit sequence data generated with the LintSeq algorithm](https://arxiv.org/abs/2410.02749).

Despite being trained on only 72 billion tokens of text, the models outperform many of the available open-source Python code synthesis models on HumanEval and MBPP. The TinyCodeLM-LintSeqInstruct models are state-of-the-art on Python synthesis for their size.
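A minimal way to load a checkpoint and generate code with the `transformers` library is sketched below. The Hub repository id is a placeholder (this card does not name the exact paths), the prompt is purely illustrative, and greedy decoding corresponds to the Temperature 0 setting reported in the benchmarks.

```python
# Minimal generation sketch using Hugging Face transformers.
# NOTE: the repo id below is a placeholder, not confirmed by this card;
# replace it with the actual Hub path of the TinyCodeLM checkpoint you want.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/TinyCodeLM-400M"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) mirrors the "Temperature 0" benchmark setting.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```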
**Model Developers** Ulyana Piterbarg, Lerrel Pinto, Rob Fergus (NYU)

**Variations** TinyCodeLM comes in two sizes (150M and 400M parameters), each in pretrained and edit sequence instruction-tuned variants.

**Input** Text only.

**Output** Models generate text and code. Instruction-tuned models generate code via sequences of "diffs".

**Model Architecture** TinyCodeLMs are autoregressive language models with architectures that mimic the two smallest versions of GPT-2 (Radford et al., 2019), while integrating the transformer architecture changes of the OLMo models.

**Instruction Tuning Data** TinyCodeLMs are instruction-tuned on paired instruction and Python edit sequence data. These edit sequences are generated with the LintSeq algorithm over a source dataset of paired instructions and Python programs drawn from the Magicoder and StarCoder2 OSS-Instruct datasets (Wei et al., 2024).
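As a rough illustration of what an "edit sequence" looks like, the sketch below turns a series of hypothetical intermediate program states into unified diffs with Python's `difflib`. This is not the LintSeq procedure itself (which, per the linked paper, uses a linter to sample the intermediate states); the program and its states here are invented for illustration only.

```python
import difflib

# Hypothetical intermediate states of a small Python program (illustration only).
states = [
    "",
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n\n\nprint(add(2, 3))\n",
]

edit_sequence = []
for before, after in zip(states, states[1:]):
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="before",
        tofile="after",
    )
    edit_sequence.append("".join(diff))

# Each element is one "diff"; the instruction-tuned models emit programs as a
# sequence of such edits rather than as a single block of code.
for i, edit in enumerate(edit_sequence):
    print(f"--- edit {i} ---\n{edit}")
```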
# Benchmarks

Scores are pass@k rates (%) on the HumanEval and MBPP(+) Python synthesis benchmarks.

**Pretrained (Temperature 0)**

| **Benchmark**      | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :----------------- | ------------------: | ------------------: |
| HumanEval, pass@1  |                 6.1 |                 6.7 |
| MBPP(+), pass@1    |                 5.4 |                 6.8 |

**Edit Sequence / Instruction Tuned (Temperature-Tuned)**

| **Benchmark**      | **TinyCodeLM 150M** | **TinyCodeLM 400M** |
| :----------------- | ------------------: | ------------------: |
| HumanEval, pass@1  |                12.8 |                13.4 |
| HumanEval, pass@10 |                20.6 |                20.9 |
| MBPP(+), pass@1    |                13.6 |                24.4 |
| MBPP(+), pass@10   |                24.4 |                29.9 |
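For reference, pass@k scores like those above are conventionally computed with the unbiased estimator of Chen et al. (2021) from n samples per problem, of which c pass the unit tests. The card does not spell out its exact evaluation harness, so the sketch below is a generic illustration with made-up counts.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples
    drawn from n generations is correct, given c correct generations."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative counts (not from this card): 20 samples for a problem, 3 correct.
print(round(pass_at_k(20, 3, 1), 3))   # 0.15
print(round(pass_at_k(20, 3, 10), 3))  # 0.895
```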
# Citation

```
@misc{piterbarg2024training,
      title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis},
      author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
      year={2024},
      eprint={2410.02749},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```