Update README.md
README.md CHANGED

```diff
@@ -29,7 +29,8 @@ DCLM-Baseline-7B is a 7 billion parameter language model trained on the DCLM-Bas
 
 ### Model Sources
 
-- **Repository:** https://github.com/
+- **Repository:** https://github.com/mlfoundations/dclm
+- **Dataset:** https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
 - **Paper:** [DataComp-LM: In search of the next generation of training sets for language models](https://arxiv.org/abs/2406.11794)
 
 ## Uses
@@ -54,17 +55,18 @@ print(tokenizer.decode(outputs[0]))
 
 The model was trained using the following setup:
 
-- **Architecture:** Decoder-only Transformer
+- **Architecture:** Decoder-only Transformer
 - **Framework:** PyTorch with OpenLM
 - **Optimizer:** AdamW
 - **Learning Rate:** 2e-3 (peak)
 - **Weight Decay:** 0.05
 - **Batch Size:** 2048 sequences
 - **Sequence Length:** 2048 tokens
-- **Total Training Tokens:** 2.
+- **Total Training Tokens:** 2.5T
 - **Hardware:** Trained on H100 GPUs
 
 For more detailed training information, please refer to Section 3.4 and Appendix F of the DCLM paper.
+To ensure our trained model is broadly useful, including for math and coding tasks, we combine our 3.8T [DCLM-BASELINE](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) with the [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) and [ProofPile2](https://huggingface.co/datasets/EleutherAI/proof-pile-2) data to arrive at a 4.1T token dataset.
 
 ## Evaluation
 
```
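As a rough illustration of how the training-setup hyperparameters in the diff above fit together, the sketch below maps them onto a plain PyTorch AdamW configuration. This is not the OpenLM training code: `ToyDecoder`, the model sizes, the cosine schedule, and the 10k-step horizon are placeholder assumptions, not values from the model card.

```python
# Illustrative sketch only; not the actual OpenLM training script.
# Only PEAK_LR, WEIGHT_DECAY, SEQ_LEN, and BATCH_SIZE come from the README diff;
# everything else (ToyDecoder, model sizes, schedule, step count) is assumed.
import torch
import torch.nn as nn

SEQ_LEN = 2048       # Sequence Length: 2048 tokens
BATCH_SIZE = 2048    # Batch Size: 2048 sequences (global batch)
PEAK_LR = 2e-3       # Learning Rate: 2e-3 (peak)
WEIGHT_DECAY = 0.05  # Weight Decay: 0.05


class ToyDecoder(nn.Module):
    """Tiny stand-in for the decoder-only Transformer; real architecture details omitted."""

    def __init__(self, vocab_size: int = 32_000, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # A causal attention mask would be applied here in a real decoder-only model.
        return self.lm_head(self.block(self.embed(tokens)))


model = ToyDecoder()

# AdamW with the listed peak learning rate and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)

# "2e-3 (peak)" implies a warmup/decay schedule; a cosine decay over an
# arbitrary 10k steps is assumed here purely for illustration.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
```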