---
license: mit
datasets:
- dmitva/human_ai_generated_text
language:
- en
widget:
  - text: "This model trains on a diverse dataset and serves functions in applications requiring a mechanism for distinguishing between human and AI-generated text."
tags:
- nlp
- code
inference: false
---

# 0xnu/AGTD-v0.1

The **0xnu/AGTD-v0.1** model represents a significant breakthrough in distinguishing between text written by humans and text generated by Artificial Intelligence (AI). It is rooted in sophisticated algorithms and offers exceptional accuracy and efficiency in text analysis and classification. I detailed the findings in a study, which is accessible [here](https://arxiv.org/abs/2311.15565).

## Instruction Format

```
<BOS> [CLS] [INST] Instruction [/INST] Model answer [SEP] [INST] Follow-up instruction [/INST] [SEP] [EOS]
```

Pseudo-code for tokenizing instructions with the new format:

```Python
def tokenize(text):
    return tok.encode(text, add_special_tokens=False)

def build_input_ids(turns):
    # turns: a list of (user_message, bot_message) pairs
    ids = [BOS_ID] + tokenize("[CLS]")
    for user_message, bot_message in turns:
        ids += tokenize("[INST]") + tokenize(user_message) + tokenize("[/INST]")
        ids += tokenize(bot_message) + tokenize("[SEP]")
    return ids + [EOS_ID]
```
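
A minimal usage sketch of the helper above. Binding `tok`, `BOS_ID`, and `EOS_ID` to the released tokenizer's `[CLS]`/`[SEP]` IDs is an assumption made purely for illustration; the actual special-token mapping comes from the tokenizer configuration.

```Python
from transformers import AutoTokenizer

# Assumption: reuse the [CLS]/[SEP] token IDs as the <BOS>/<EOS> markers.
tok = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")
BOS_ID = tok.cls_token_id
EOS_ID = tok.sep_token_id

turns = [("Is the following passage AI-generated?", "It reads as human-written.")]
print(tok.decode(build_input_ids(turns)))
```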

Notes:

- `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, and `[MASK]` tokens are integrated based on their definitions in the tokenizer configuration.
- `[INST]` and `[/INST]` are utilized to encapsulate instructions.
- The tokenize method should not automatically add BOS or EOS tokens but should add a prefix space.
- The `do_lower_case` parameter indicates that text should be in lowercase for consistent tokenization.
- `clean_up_tokenization_spaces` removes unnecessary spaces in the tokenization process.
- The `tokenize_chinese_chars` parameter indicates special handling for Chinese characters.
- The maximum model length is set to 512 tokens (see the snippet after these notes for how to inspect the settings).
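
A short sketch for inspecting these settings on the loaded tokenizer. The attribute names are the standard `transformers` ones; the expected values in the comments follow the notes above rather than a guarantee about the shipped configuration.

```Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("0xnu/AGTD-v0.1")

# Special tokens defined in the tokenizer configuration.
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token,
      tokenizer.unk_token, tokenizer.mask_token)

print(tokenizer.model_max_length)                                 # expected: 512
print(getattr(tokenizer, "do_lower_case", None))                  # expected: True
print(getattr(tokenizer, "clean_up_tokenization_spaces", None))   # expected: True
```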

## Installing Libraries

```sh
pip install torch transformers
```

## Run the model

```Python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "0xnu/AGTD-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Input text
text = "This model trains on a diverse dataset and serves functions in applications requiring a mechanism for distinguishing between human and AI-generated text."

# Preprocess the text
inputs = tokenizer(text, return_tensors='pt')

# Run the model
outputs = model(**inputs)

# Interpret the output
logits = outputs.logits

# Apply softmax to convert logits to probabilities
probabilities = torch.softmax(logits, dim=1)

# Assuming the first class is 'human' and the second class is 'ai'
human_prob, ai_prob = probabilities.detach().numpy()[0]

# Print probabilities
print(f"Human Probability: {human_prob:.4f}")
print(f"AI Probability: {ai_prob:.4f}")

# Determine if the text is human or AI-generated
if human_prob > ai_prob:
    print("The text is likely human-generated.")
else:
    print("The text is likely AI-generated.")
```
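
For several texts at once, the same steps can be batched. A minimal sketch, keeping the class-order assumption (index 0 = human, index 1 = AI) from the example above:

```Python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "0xnu/AGTD-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

def classify(texts):
    # Pad and truncate each text to the 512-token model limit, then score the batch.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        probabilities = torch.softmax(model(**inputs).logits, dim=1)
    # Assumes class 0 is 'human' and class 1 is 'ai', as in the example above.
    return [{"human": p[0].item(), "ai": p[1].item()} for p in probabilities]

print(classify(["A first example sentence.", "A second example sentence."]))
```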

## Citation

Please cite the paper if you use this resource in your work.

```bibtex
@misc{abiodunfinbarrsoketunji-agtd2023,
  doi = {10.48550/arXiv.2311.15565},
  url = {https://arxiv.org/abs/2311.15565},
  author = {Abiodun Finbarrs Oketunji},
  title = {Evaluating the Efficacy of Hybrid Deep Learning Models in Distinguishing AI-Generated Text},
  publisher = {arXiv},
  year = {2023},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```