---
license: apache-2.0
datasets:
- bigcode/the-stack
- HuggingFaceFW/fineweb
base_model:
- upiter/TinyCodeLM-400M
library_name: transformers
---
# Model Details
The TinyCodeLM family of tiny language models (LMs) is a collection of fully open-source pretrained and instruction tuned generative code models in 150M and 400M sizes. These models are pretrained on a mixture of open-source web text and Python code. The instruction tuned TinyCodeLM models are optimized for Python code synthesis, and are trained on [synthetic edit sequence data generated with the LintSeq algorithm](https://lintseq.github.io/).
Despite being trained on only 72 billion tokens of text, the models outperform many of the available open source Python code synthesis models on HumanEval and MBPP. The TinyCodeLM-LintSeqInstruct models are state-of-the-art on Python synthesis for their size.
**Model Developers** Ulyana Piterbarg, Lerrel Pinto, Rob Fergus (NYU)
**Variations** TinyCodeLM comes in two sizes (150M and 400M parameters) in pretrained and edit sequence instruction tuned variants.
**Input** Text only.
**Output** Models generate text and code. Instruction tuned models generate code via sequences of "diffs".
**Model Architecture** TinyCodeLMs are autoregressive language models with architectures that mimic the two smallest versions of GPT-2 (Radford et al., 2019), while integrating the transformer architecture changes of the OLMo models.
**Instruction Tuning Data** TinyCodeLMs are instruction tuned on paired instruction and Python edit sequence data. These edit sequences are generated with the LintSeq algorithm over a source dataset of paired instruction and Python programs drawn from the Magicoder and StarCoder2 OSS-Instruct datasets (Wei et al., 2024).
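## Usage
The snippet below is a minimal generation sketch using the standard Transformers API. The checkpoint id is the base model listed in this card's metadata; substitute this repository's id for the LintSeq-instruct variants. The prompt, sampling settings, and diff-resolution note are illustrative assumptions, not the authors' reference pipeline.
```python
# Minimal generation sketch (assumptions: checkpoint id taken from this card's
# metadata; prompt and sampling settings are illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "upiter/TinyCodeLM-400M"  # substitute this repo's id for the instruct variants
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Write a Python function that returns the n-th Fibonacci number."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Note: the LintSeq-instruct variants emit a sequence of "diffs" rather than a
# full program; resolve the edit sequence into a program as described in the
# LintSeq paper before executing anything.
```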
# Training Details
TinyCodeLM models were pretrained from scratch on a single H100 node (four GPUs) for two epochs. Pretraining took about two days for the 150M model and six days for the 400M model. Instruction tuning was conducted on a single H100 GPU using DeepSpeed and took no more than a few hours.
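For context, a minimal single-GPU instruction-tuning sketch with the Hugging Face Trainer and DeepSpeed is shown below. The dataset file, field names, hyperparameters, and DeepSpeed config path are all hypothetical placeholders, not the authors' actual configuration.
```python
# Minimal instruction-tuning sketch (all file names, field names, and
# hyperparameters below are hypothetical placeholders).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "upiter/TinyCodeLM-400M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure padding is defined
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical JSONL of (instruction, edit_sequence) pairs produced by LintSeq.
raw = load_dataset("json", data_files="lintseq_instruct.jsonl")["train"]

def tokenize(example):
    text = example["instruction"] + "\n" + example["edit_sequence"]
    return tokenizer(text, truncation=True, max_length=1024)

train = raw.map(tokenize, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="tinycodelm-lintseq-instruct",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    bf16=True,
    deepspeed="ds_config.json",  # single-GPU DeepSpeed, per the description above
)

Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
).train()
```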
# Benchmarks
**Pretrained (Temperature 0)**
|**Benchmark**|**TinyCodeLM 150M** |**TinyCodeLM 400M** |
| :--------------------- | -----------------: | -----------------: |
| HumanEval, pass@1 | 6.1 | 6.7 |
| MBPP(+), pass@1 | 5.4 | 6.8 |
**Edit Sequence / Instruction Tuned (Temperature-Tuned)**
|**Benchmark** |**TinyCodeLM 150M** |**TinyCodeLM 400M** |
| :----------- | -----------------: | -----------------: |
| HumanEval, pass@1 | 12.8 | 13.4 |
| HumanEval, pass@10 | 20.6 | 20.9 |
| MBPP(+), pass@1 | 13.6 | 19.4 |
| MBPP(+), pass@10 | 24.4 | 29.9 |
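For reference, pass@k is typically computed with the standard unbiased estimator of Chen et al. (2021); the short sketch below shows that calculation. The number of samples drawn per problem (n) is an evaluation choice that this card does not fix.
```python
# Standard unbiased pass@k estimator (Chen et al., 2021), shown for reference;
# n (samples generated per problem) is an evaluation choice, not fixed by this card.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n passed."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples per problem, 15 of which pass the unit tests.
print(round(pass_at_k(n=100, c=15, k=10), 3))  # pass@10 for this problem
```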
# Citation
```
@misc{piterbarg2024editseq,
title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis},
author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
year={2024},
eprint={2410.02749},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
# Safety
This work explores data-driven mechanisms for improving the quality of language model-generated code. Our synthetic data generation method relies on open-source data, and our experiments leverage open-source software and resources. It is important to acknowledge that all language models for code synthesis can be misused, whether intentionally or unintentionally, to generate code with vulnerabilities and/or malicious behaviors. Any model-generated code has the potential to be harmful and must not be executed without precautions.