Seunggu35 committed on
Commit
964c86a
1 Parent(s): 3d61b83

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +27 -0
README.md CHANGED
@@ -1,3 +1,30 @@
  ---
  license: apache-2.0
  ---
+
+ Upstage `solar-docvision-preview` tokenizer
+ - Vocab size: 32,064
+ - Language support: English only
+
+ Use this tokenizer to tokenize inputs for the Upstage `solar-docvision-preview` model.
+
+ You can load it with the `tokenizers` library like this:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("upstage/solar-docvision-preview-tokenizer")
+
+ text = "Hi, how are you?"
+ enc = tokenizer.encode(text)
+ print("Encoded input:")
+ print(enc)
+
+ inv_vocab = {v: k for k, v in tokenizer.get_vocab().items()}  # token id -> token string
+ tokens = [inv_vocab[token_id] for token_id in enc.ids]
+ print("Tokens:")
+ print(tokens)
+
+ number_of_tokens = len(enc.ids)
+ print("Number of tokens:", number_of_tokens)
+ ```
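
The README snippet counts tokens for a single string. As a minimal sketch of how the same count might be taken for several inputs at once, the helpers below use `encode_batch` and `encode` from the `tokenizers` library. The names `count_tokens` and `within_budget` and the `max_tokens` parameter are hypothetical and are not part of the committed README. The `tokenizer` argument is assumed to be a `tokenizers.Tokenizer` loaded as shown above.

```python
from typing import Iterable, List


def count_tokens(tokenizer, texts: Iterable[str]) -> List[int]:
    """Return the number of token ids produced for each text.

    `tokenizer` is assumed to be a `tokenizers.Tokenizer`; only its
    `encode_batch` method is used here.
    """
    encodings = tokenizer.encode_batch(list(texts))
    return [len(enc.ids) for enc in encodings]


def within_budget(tokenizer, text: str, max_tokens: int) -> bool:
    """Check whether `text` encodes to at most `max_tokens` token ids.

    `max_tokens` is a caller-chosen limit (hypothetical); the README
    does not state any limit for the model.
    """
    return len(tokenizer.encode(text).ids) <= max_tokens
```

With the tokenizer from the README snippet, `count_tokens(tokenizer, ["Hi, how are you?", "Fine, thanks."])` should return one count per input string. Both helpers count whatever ids the tokenizer emits, including any special tokens its post-processor adds.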