nilq committed on
Commit
e6ff7a1
1 Parent(s): 5961527

Create README.md

Files changed (1)
  1. README.md +47 -0
README.md ADDED
@@ -0,0 +1,47 @@
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - babylm
+ - tokenizer
+ datasets:
+ - nilq/babylm-100M
+ ---
+
+ ## Baby Tokenizer (Uncased)
+
+ A compact SentencePiece tokenizer for sample-efficient English language modeling. It tokenizes plain natural-language text.
+
+ ### Usage
+
+ #### Transformers
+
+ ```py
+ from transformers import AutoTokenizer
+
+ tokenizer_baby = AutoTokenizer.from_pretrained("nilq/baby-tokenizer")
+ ```
+
+ #### Tokenizers
+
+ ```py
+ from tokenizers import Tokenizer
+
+ tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
+ ```
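Once loaded, the tokenizer can be used to inspect how a sentence is split. The following is a minimal sketch, assuming the `tokenizers` package is installed and the Hub is reachable; the helper names `token_id_pairs` and `encode_demo` and the sample sentence are illustrative and not part of this repository.

```python
from typing import List, Tuple


def token_id_pairs(tokenizer, text: str) -> List[Tuple[str, int]]:
    """Pair each token string with its vocabulary id.

    `tokenizer` is expected to behave like `tokenizers.Tokenizer`:
    `encode(text)` returns an object exposing `.tokens` and `.ids`.
    """
    encoding = tokenizer.encode(text)
    return list(zip(encoding.tokens, encoding.ids))


def encode_demo() -> None:
    # Imported here so the helper above stays usable without the package.
    from tokenizers import Tokenizer

    tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
    # Hypothetical sample sentence, written in lowercase to suit an uncased tokenizer.
    for token, token_id in token_id_pairs(tokenizer_baby, "the baby sees a dog."):
        print(token_id, token)
```

Calling `encode_demo()` prints one id and token per line. With the Transformers loader, `tokenizer_baby.tokenize(text)` and `tokenizer_baby(text)["input_ids"]` return the corresponding token strings and ids.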
+
+ ### Data
+
+ This tokenizer is derived from the BabyLM 100M dataset of mixed-domain data, which consists of the following sources:
+ - CHILDES (child-directed speech)
+ - Subtitles (speech)
+ - BNC (speech)
+ - TED talks (speech)
+ - children's books (simple written language).
+
+ ### Specifications
+
+ - Vocabulary size: 20k
+ - Alphabet limit: 150
+ - Minimum token frequency: 100
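These settings appear to correspond to the `vocab_size`, `limit_alphabet`, and `min_frequency` training arguments in the `tokenizers` library. The training script is not part of this commit, so the following is only a sketch of how a tokenizer with these specifications might be trained. It assumes a SentencePiece-style BPE model (`SentencePieceBPETokenizer`), an `<unk>` special token, a `train` split, and a `text` column in the dataset; the helper names are hypothetical, and lowercasing for the uncased variant is not shown.

```python
from typing import Iterable, Iterator, List


def batched(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` texts for the trainer."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def train_baby_tokenizer(texts: Iterable[str]):
    # Imported lazily; requires `pip install tokenizers`.
    from tokenizers import SentencePieceBPETokenizer

    tokenizer = SentencePieceBPETokenizer()  # assumed model type
    tokenizer.train_from_iterator(
        batched(texts),
        vocab_size=20_000,         # "Vocabulary size: 20k"
        limit_alphabet=150,        # "Alphabet limit: 150"
        min_frequency=100,         # "Minimum token frequency: 100"
        special_tokens=["<unk>"],  # assumed, not stated in the README
        show_progress=False,
    )
    return tokenizer


def train_demo() -> None:
    # Requires `pip install datasets` and network access.
    from datasets import load_dataset

    # The split name and the "text" column are assumptions about the dataset layout.
    dataset = load_dataset("nilq/babylm-100M", split="train")
    tokenizer = train_baby_tokenizer(row["text"] for row in dataset)
    tokenizer.save("tokenizer.json")
```

Feeding the trainer batches rather than single strings keeps the iterator overhead low on a corpus of this size. The resulting `tokenizer.json` can be loaded with `Tokenizer.from_file`.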