vaishaal committed
Commit cc8c8a3
Parent: 1771f49

Update README.md

Files changed (1): README.md (+5, -3)
README.md CHANGED
@@ -29,7 +29,8 @@ DCLM-Baseline-7B is a 7 billion parameter language model trained on the DCLM-Bas
 
 ### Model Sources
 
-- **Repository:** https://github.com/datacomp-team/dclm
+- **Repository:** https://github.com/mlfoundations/dclm
+- **Dataset:** https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
 - **Paper:** [DataComp-LM: In search of the next generation of training sets for language models](https://arxiv.org/abs/2406.11794)
 
 ## Uses
@@ -54,17 +55,18 @@ print(tokenizer.decode(outputs[0]))
 
 The model was trained using the following setup:
 
-- **Architecture:** Decoder-only Transformer
+- **Architecture:** Decoder-only Transformer
 - **Framework:** PyTorch with OpenLM
 - **Optimizer:** AdamW
 - **Learning Rate:** 2e-3 (peak)
 - **Weight Decay:** 0.05
 - **Batch Size:** 2048 sequences
 - **Sequence Length:** 2048 tokens
-- **Total Training Tokens:** 2.6T
+- **Total Training Tokens:** 2.5T
 - **Hardware:** Trained on H100 GPUs
 
 For more detailed training information, please refer to Section 3.4 and Appendix F of the DCLM paper.
+To ensure our trained model is broadly useful, including for math and coding tasks, we combine our 3.8T [DCLM-BASELINE](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) with the [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) and [ProofPile2](https://huggingface.co/datasets/EleutherAI/proof-pile-2) data to arrive at a 4.1T token dataset.
 
 ## Evaluation
 
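To make the training-setup bullets in the diff above concrete, here is a minimal, hypothetical PyTorch sketch of the listed optimizer hyperparameters (AdamW, peak learning rate 2e-3, weight decay 0.05, batches of 2048 sequences of 2048 tokens). It is not code from the DCLM/OpenLM repository; the stand-in model, the cosine schedule, and the step count are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in module; the real 7B decoder-only Transformer is built
# with OpenLM (https://github.com/mlfoundations/dclm) and is not shown here.
model = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)

# Optimizer settings taken from the model card's training setup:
# AdamW, peak learning rate 2e-3, weight decay 0.05.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, weight_decay=0.05)

# "Peak" learning rate implies a schedule; cosine decay is assumed here purely
# for illustration (the exact schedule is described in the DCLM paper).
total_steps = 1_000  # placeholder, not the real step count
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

# Each optimizer step consumes 2048 sequences x 2048 tokens (~4.2M tokens),
# per the listed batch size and sequence length.
tokens_per_step = 2048 * 2048
print(f"tokens per optimizer step: {tokens_per_step:,}")
```

At roughly 4.2M tokens per step, the stated 2.5T total training tokens would correspond to on the order of 600k optimizer steps; this is only a back-of-the-envelope figure to make the listed numbers tangible, not a value from the paper.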