jacobfulano committed
Commit: ce11e47
Parent: fdbb682

Update README.md

Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -145,7 +145,7 @@ The learning rate schedule begins with a warmup to a maximum learning rate of 5.
 Warmup lasted for 6% of the full training duration. Global batch size was set to 4096, and microbatch size was 128; since global batch size was 4096, full pretraining consisted of 70,000 batches.
 We set the maximum sequence length during pretraining to 128, and we used the standard embedding dimension of 768.
 For MosaicBERT, we applied 0.1 dropout to the feedforward layers but no dropout to the FlashAttention module, as this was not possible with the OpenAI triton implementation.
-Full configuration details for pretraining MosaicBERT-Base can be found in the configuration yamls [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/bert/yamls/main).
+Full configuration details for pretraining MosaicBERT-Base can be found in the configuration yamls [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/main/examples/benchmarks/bert/yamls/main/mosaic-bert-base-uncased.yaml).
 
 
 ## Evaluation results