Update vocab and model size

#1
by terru3 - opened
Files changed (1)
  1. app.py +2 -2
app.py CHANGED
@@ -17,8 +17,8 @@ def main():
 
     st.markdown("""We used the dataset from the [TinyStories Research Paper](https://arxiv.org/pdf/2305.07759.pdf) (Ronen Eldan and Yuanzhi Li, Microsoft),
     which consists of 2.1 million synthetic short children's stories generated by GPT-4, to train a Transformer LLM that we built from scratch in PyTorch.""")
-    st.markdown("""Our final model uses EleutherAI's [gpt-neo-1.3B tokenizer](https://huggingface.co/EleutherAI/gpt-neo-1.3B) (vocab size 50,256) and consists of 8 transformer blocks,
-    16 attention heads, and an embedding dimension of 768, for a total of 133M parameters. The model was trained on 8 H100 GPUs for ~7 hours, and has a cross-entropy validation loss of 1.16,
+    st.markdown("""Our final model uses EleutherAI's [gpt-neo-1.3B tokenizer](https://huggingface.co/EleutherAI/gpt-neo-1.3B) (vocab size 50,257) and consists of 8 transformer blocks,
+    16 attention heads, and an embedding dimension of 768, for a total of ~56M non-embedding parameters. The model was trained on 8 H100 GPUs for ~7 hours, achieving a cross-entropy validation loss of 1.16,
     which is superior to any model in the TinyStories paper (likely due to a larger vocab size and far more compute).""")
     st.markdown("""Despite the simple themes and limited vocabulary present in the training data, the model is
     quite effective at generating new short stories. **Try it out below!**""")
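As a sanity check on the corrected figures, here is a rough back-of-the-envelope sketch. It assumes a standard GPT-style block (Q/K/V plus output projection ≈ 4·d², a 4× MLP ≈ 8·d²) with biases and LayerNorms ignored, and an untied token embedding and LM head; the actual layer layout isn't shown in this diff, so the numbers below are illustrative rather than the repo's own accounting.

```python
# Rough parameter-count check for the figures in this change.
# Assumptions (not taken from the repo): GPT-style blocks with
# Q/K/V + output projections (4 * d^2) and a 4x-expansion MLP (8 * d^2),
# untied token embedding and LM head, biases/LayerNorms/position embeddings ignored.

d_model = 768
n_layers = 8
vocab_size = 50_257  # gpt-neo tokenizer vocab

per_block = 12 * d_model**2            # 4*d^2 attention + 8*d^2 MLP
non_embedding = n_layers * per_block   # ~56.6M -> the "~56M non-embedding" figure
embedding = vocab_size * d_model       # ~38.6M token embedding
total = non_embedding + 2 * embedding  # + untied LM head -> ~133.8M, close to the old "133M"

print(f"non-embedding:       {non_embedding / 1e6:.1f}M")
print(f"token embedding:     {embedding / 1e6:.1f}M")
print(f"total (untied head): {total / 1e6:.1f}M")
```

Under these assumptions the old "133M" reads as the full parameter count including (untied) embedding/output matrices, while "~56M non-embedding" matches the transformer blocks alone, which also explains why the vocab size (50,257) only affects the embedding term.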