0xnu committed
Commit 9cb476d
1 Parent(s): 835a894

Update README.md

Files changed (1):
  1. README.md +33 -2
README.md CHANGED
@@ -4,6 +4,37 @@ datasets:
 - dmitva/human_ai_generated_text
 ---
 
-## 0xnu/AGTD-v0.1
-
-The "0xnu/AGTD-v0.1" model represents a significant breakthrough in distinguishing between human-generated and AI-generated text. It is rooted in sophisticated algorithms and offers exceptional accuracy and efficiency in text analysis and classification. Everything is detailed in the study and accessible [here](https://arxiv.org/abs/2311.15565).
+# 0xnu/AGTD-v0.1
+
+The "0xnu/AGTD-v0.1" model distinguishes between human-generated and AI-generated text. The approach and its accuracy and efficiency in text analysis and classification are detailed in the study, accessible [here](https://arxiv.org/abs/2311.15565).
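+
+A minimal usage sketch (assuming the checkpoint loads as a standard `transformers` text-classification model; the example text and printed output are illustrative):
+
+```python
+from transformers import pipeline
+
+# Load the checkpoint as a text-classification pipeline.
+classifier = pipeline("text-classification", model="0xnu/AGTD-v0.1")
+
+result = classifier("The quick brown fox jumps over the lazy dog.")
+print(result)  # e.g. [{"label": "...", "score": 0.98}]; labels come from the model config
+```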
+
+## Instruction Format
+
+```
+<BOS> [CLS] [INST] Instruction [/INST] Model answer [SEP] [INST] Follow-up instruction [/INST] [SEP] [EOS]
+```
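+
+For example, a single-turn exchange (the instruction and answer text here are illustrative) would be serialized as:
+
+```
+<BOS> [CLS] [INST] Is the following text AI-generated? [/INST] The text appears to be human-written. [SEP] [EOS]
+```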
+
+Pseudo-code for tokenizing instructions in this format (`tok` is the model's tokenizer, `BOS_ID`/`EOS_ID` are its BOS and EOS token ids, and the message variables are placeholders):
+
+```python
+def tokenize(text):
+    # Encode without automatically adding special tokens.
+    return tok.encode(text, add_special_tokens=False)
+
+input_ids = (
+    [BOS_ID]
+    + tokenize("[CLS]") + tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]")
+    + tokenize(BOT_MESSAGE_1) + tokenize("[SEP]")
+    # ... one [INST] ... [/INST] answer [SEP] block per turn ...
+    + tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]")
+    + tokenize(BOT_MESSAGE_N) + tokenize("[SEP]")
+    + [EOS_ID]
+)
+```
+
+Notes:
+
+- `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]` tokens are used as defined in the tokenizer configuration.
+- `[INST]` and `[/INST]` enclose instructions.
+- The `tokenize` method should not automatically add BOS or EOS tokens, but it should add a prefix space.
+- The `do_lower_case` parameter indicates that text is lowercased for consistent tokenization.
+- `clean_up_tokenization_spaces` removes unnecessary spaces in the tokenization process.
+- The `tokenize_chinese_chars` parameter enables special handling for Chinese characters.
+- The maximum model length is set to 512 tokens (see the sketch below).
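+
+These options correspond to standard Hugging Face tokenizer settings; a minimal sketch of loading the tokenizer and inspecting them (assuming a standard `tokenizer_config.json`):
+
+```python
+from transformers import AutoTokenizer
+
+tok = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")
+
+print(tok.model_max_length)                         # 512
+print(tok.cls_token, tok.sep_token, tok.pad_token)  # special tokens from the config
+
+# Lowercasing (do_lower_case) and Chinese-character handling
+# (tokenize_chinese_chars) are applied inside tok.tokenize().
+print(tok.tokenize("Hello WORLD"))  # lowercased word pieces
+```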