---
license: mit
datasets:
  - OpenCoder-LLM/fineweb-code-corpus
  - OpenCoder-LLM/fineweb-math-corpus
  - OpenCoder-LLM/RefineCode-code-corpus-meta
  - OpenCoder-LLM/opc-annealing-corpus
language:
  - en
  - zh
---

🏠 Home Page | 🤗 Model | 📊 Dataset | 📄 Paper

## 1. Introduction

OpenCoder is an open and reproducible code LLM family that includes 1.5B and 8B base and chat models, supporting both English and Chinese. Trained from scratch, OpenCoder is pretrained on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, then supervised fine-tuned on over 4.5M high-quality SFT examples, finally reaching the performance of top-tier code LLMs.

This repository contains all the intermediate checkpoints of OpenCoder-1.5B-Base, saved in different branches. For the final results, please refer to 🤗 OpenCoder-8B-Base.
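To load a specific intermediate checkpoint, pass the branch name as the `revision` argument in `transformers`. A minimal sketch, assuming this repository's id is `OpenCoder-LLM/OpenCoder-1.5B-Base` and using `pretrain_iter_0100000` as an example branch (substitute any branch listed in the overview below):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id and example branch name -- replace with this repository's
# actual id and any branch from the "Branches Overview" section below.
repo_id = "OpenCoder-LLM/OpenCoder-1.5B-Base"
branch = "pretrain_iter_0100000"

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=branch)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=branch)
```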

## 2. Branches Overview

- pretrain_iter_0001000 - pretrain_iter_0300000: Intermediate checkpoints during the pretraining stage.

- anneal_iter_0001000 - anneal_iter_0011920: Intermediate checkpoints during the annealing stage.
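If you prefer to enumerate the available checkpoints programmatically rather than browsing the branch list in the Hub UI, the `huggingface_hub` client can list a repository's branches. A small sketch, again assuming the repo id `OpenCoder-LLM/OpenCoder-1.5B-Base`:

```python
from huggingface_hub import list_repo_refs

# Assumed repo id -- replace with this repository's actual id.
refs = list_repo_refs("OpenCoder-LLM/OpenCoder-1.5B-Base")

# Group checkpoint branches by training stage using their name prefixes.
pretrain = sorted(b.name for b in refs.branches if b.name.startswith("pretrain_iter_"))
anneal = sorted(b.name for b in refs.branches if b.name.startswith("anneal_iter_"))
print(f"{len(pretrain)} pretraining checkpoints, {len(anneal)} annealing checkpoints")
```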

The number in each branch name indicates the corresponding training step, where each step consumes 8,388,608 training tokens (2,048 batch size * 4,096 sequence length for pretrain_iter_0001000 - pretrain_iter_0130000; 1,024 batch size * 8,192 sequence length for pretrain_iter_0130000 - pretrain_iter_0300000 and the whole annealing phase).
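As a quick sanity check on the arithmetic above, both batch-size/sequence-length configurations yield the same number of tokens per step, and the last pretraining checkpoint lands close to the 2.5 trillion pretraining tokens mentioned in the introduction:

```python
# Both schedules yield the same tokens per step: 2,048 * 4,096 = 1,024 * 8,192.
tokens_per_step = 2_048 * 4_096
assert tokens_per_step == 1_024 * 8_192 == 8_388_608

# Step 300,000 (the last pretraining checkpoint) corresponds to roughly 2.5T tokens.
total_tokens = 300_000 * tokens_per_step
print(f"{total_tokens:,} tokens ~ {total_tokens / 1e12:.2f}T")  # 2,516,582,400,000 ~ 2.52T
```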

We use pretrain_iter_0300000 as the starting point for the annealing stage, and use anneal_iter_0010000 as the final base model.
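To fetch the final annealed checkpoint (or any other branch) as a local snapshot rather than loading it directly, `huggingface_hub`'s `snapshot_download` accepts the same branch name as `revision`. A sketch under the same repo-id assumption as above:

```python
from huggingface_hub import snapshot_download

# Assumed repo id; "anneal_iter_0010000" is the final base model mentioned above.
local_dir = snapshot_download(
    repo_id="OpenCoder-LLM/OpenCoder-1.5B-Base",
    revision="anneal_iter_0010000",
)
print(local_dir)  # path to the downloaded checkpoint files
```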