---
datasets:
- karpathy/tiny_shakespeare
library_name: tf-keras
license: mit
metrics:
- accuracy
pipeline_tag: text-generation
tags:
- lstm
---

## Model description

LSTM trained on Andrej Karpathy's [`tiny_shakespeare`](https://huggingface.co/datasets/karpathy/tiny_shakespeare) dataset, from his blog post, [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/).

Made to experiment with Hugging Face and W&B.

## Intended uses & limitations

The model predicts the next character from a variable-length input sequence. After `18` epochs of training, it generates somewhat coherent text.

```py
import tensorflow as tf

def generate_text(model, encoder, text, n):
    """Generate `n` characters, one at a time, starting from `text`."""
    vocab = encoder.get_vocabulary()
    generated_text = text
    for _ in range(n):
        # Encode the running text and predict the id of the next character.
        encoded = encoder([generated_text])
        pred = model.predict(encoded, verbose=0)
        pred = tf.squeeze(tf.argmax(pred, axis=-1)).numpy()
        # Map the predicted id back to a character and append it.
        generated_text += vocab[pred]
    return generated_text

sample = "M"
print(generate_text(model, encoder, sample, 100))
```

```
MQLUS:
I will be so that the street of the state,
And then the street of the street of the state,
And
```

## Training and evaluation data

[![Weights & Biases](https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg)](https://wandb.ai/adamelliotfields/shakespeare)

## Training procedure

The dataset is a selection of William Shakespeare's works concatenated into a single file, with individual speeches separated by `\n\n`.

The tokenizer is a Keras `TextVectorization` preprocessor that uses a simple character-based vocabulary.
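
For reference, here is a minimal sketch of a character-level `TextVectorization` layer. The `standardize=None` and `split="character"` settings and the corpus path are assumptions, since the actual encoder has not been uploaded yet (see the TODO below).

```py
import tensorflow as tf

# Load the raw corpus (path is illustrative).
shakespeare_text = open("input.txt", encoding="utf-8").read()

# Character-level vectorizer: keep the text as-is and split into characters.
# These settings are assumed, not taken from the actual (unreleased) encoder.
encoder = tf.keras.layers.TextVectorization(standardize=None, split="character")
encoder.adapt([shakespeare_text])

print(encoder.get_vocabulary()[:10])
```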

To construct the training set, a window of `100` characters is taken as the input, with the character that follows used as the target. Sliding this window one character at a time across the text yields **1,115,294** shuffled training examples.
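
Continuing the sketch above (reusing `encoder` and `shakespeare_text`), such windows could be built with `tf.data` roughly as follows; the shuffle buffer and pipeline structure are assumptions, while the window length and batch size come from this card.

```py
SEQ_LEN = 100  # input window length described above

# Encode the whole corpus into a flat sequence of character ids.
ids = tf.squeeze(encoder([shakespeare_text]))

# Slide a (SEQ_LEN + 1)-character window one step at a time, then split each
# window into the first 100 characters (input) and the final character (target).
dataset = (
    tf.data.Dataset.from_tensor_slices(ids)
    .window(SEQ_LEN + 1, shift=1, drop_remainder=True)
    .flat_map(lambda w: w.batch(SEQ_LEN + 1))
    .map(lambda w: (w[:SEQ_LEN], w[SEQ_LEN]))
    .shuffle(10_000)  # buffer size is illustrative
    .batch(1024)      # batch size from the hyperparameters below
    .prefetch(tf.data.AUTOTUNE)
)
```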

*TODO: upload encoder*

### Training hyperparameters

| Hyperparameter    | Value     |
| :---------------- | :-------- |
| `epochs`          | `18`      |
| `batch_size`      | `1024`    |
| `optimizer`       | `AdamW`   |
| `weight_decay`    | `0.001`   |
| `learning_rate`   | `0.00025` |
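
A rough sketch of how these settings map onto a Keras training run; the stand-in architecture and loss are assumptions (the real layer configuration is in the model plot below), while the optimizer, learning rate, weight decay, and epoch count come from the table above.

```py
import tensorflow as tf

vocab_size = len(encoder.get_vocabulary())  # encoder from the sketch above

# Stand-in architecture; the actual layers are shown in the model plot below.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

# Optimizer settings from the table above; AdamW ships with recent Keras releases.
model.compile(
    optimizer=tf.keras.optimizers.AdamW(learning_rate=0.00025, weight_decay=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

model.fit(dataset, epochs=18)  # `dataset` from the windowing sketch above
```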

## Model plot

<details>
<summary>View Model Plot</summary>

![Model Image](./model.png)

</details>