---
license: apache-2.0
datasets:
- synturk/turkish-sentence-elements
language:
- tr
metrics:
- accuracy
- f1
- precision
- recall
library_name: transformers
pipeline_tag: token-classification
tags:
- syntürk
- sentagram
- sentence
- elements
- mistake
- cümle
- cümlenin ögeleri
- hata
---

# SENTAGRAM Model

## Model Summary

The SENTAGRAM model is a BERT-based model fine-tuned on a custom Turkish grammar dataset. It analyzes and classifies the grammatical elements of Turkish sentences, such as subjects, predicates, objects, and adjuncts. The model is built on the BERTürk architecture and adapted to the intricacies of Turkish grammar.

For more information, visit the project's [GitHub repository](https://github.com/Syntax-Turkiye/sentagram).

## Model Description

- **Architecture**: BERT (BERTürk)
- **Language**: Turkish (tr)
- **Task**: Token classification, focusing on part-of-speech tagging and grammatical role identification.
- **Training Dataset**: The model was fine-tuned using the [turkish-sentence-elements](https://huggingface.co/synturk/turkish-sentence-elements) dataset, which contains annotated sentences from a variety of Turkish sources.

## Intended Use

### Applications

- **Educational Tools**: Can be used to build applications that help learners of Turkish understand and correct their grammar.
- **NLP Research**: Useful for research in Turkish natural language processing, especially on syntax and grammar.
- **Grammatical Analysis**: Can be integrated into text editors, language learning platforms, or automated proofreading tools to provide grammar suggestions and corrections.

### Limitations

- **Complex Sentences**: The model performs well on standard sentences, but its performance may degrade on more complex or ambiguous sentence structures.
- **Contextual Understanding**: The model uses context only as far as the token classification task requires, so it may not perform well on tasks that need deep semantic understanding.

## Performance

The model was evaluated on the SYNTÜRK SENTAGRAM dataset with the following results:

| Precision | Recall   | F1 Score | Accuracy |
|:---------:|:--------:|:--------:|:--------:|
| 0.911349  | 0.911826 | 0.911588 | 0.935395 |

These metrics indicate that the model identifies and classifies grammatical elements in Turkish sentences with roughly 91% precision and recall and about 93.5% accuracy.

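
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. This is only a sketch: it assumes the reported values are aggregate scores, and because the published precision and recall are rounded, the result agrees with the reported F1 only to about five decimal places.

```python
# Values copied from the results table.
precision = 0.911349
recall = 0.911826
reported_f1 = 0.911588

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed F1: {f1:.5f}, reported F1: {reported_f1:.5f}")
```
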
## How to Use

You can load and use the model with Hugging Face's `transformers` library:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("synturk/sentagram-berturk")
model = AutoModelForTokenClassification.from_pretrained("synturk/sentagram-berturk")

# Example sentence ("SYNTÜRK won the competition.")
sentence = "SYNTÜRK yarışmayı kazandı."

# Tokenize and predict (gradients are not needed for inference)
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Decode the predictions: one label per token, including special tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]

print(list(zip(tokens, predicted_labels)))
```

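
The snippet above prints one label per sub-token, including special tokens such as `[CLS]` and `[SEP]`. A minimal sketch of one way to collapse that output into word-level labels follows. It assumes the tokenizer marks continuation pieces with a `##` prefix, as BERT WordPiece tokenizers do, and it keeps the label of each word's first piece, which is a common convention rather than something this model card specifies. The helper name `merge_wordpieces` and the label names in the example are hypothetical; the real label set is in `model.config.id2label`.

```python
def merge_wordpieces(tokens, labels, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge WordPiece sub-tokens into words, keeping each word's first label."""
    words = []
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            continue  # skip markers added by the tokenizer
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word, keep that label.
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)
        else:
            words.append((token, label))
    return words


# Hypothetical tokens and labels, for illustration only.
tokens = ["[CLS]", "yarış", "##mayı", "kazandı", ".", "[SEP]"]
labels = ["O", "NESNE", "NESNE", "YUKLEM", "O", "O"]
print(merge_wordpieces(tokens, labels))
```

With the real model, `tokens` and `predicted_labels` from the snippet above can be passed in directly.
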
## Training Details

- Base Model: BERTürk
- Fine-tuning Data: [SYNTÜRK SENTAGRAM Dataset](https://huggingface.co/synturk/turkish-sentence-dataset)
- Hyperparameter Optimization: Optuna
- Hugging Face Trainer API: Used for training and evaluation.

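
Optuna is a hyperparameter search library, and the Trainer API can use it as a search backend. The sketch below shows what a search-space function for that setup might look like. The function name `hp_space`, the chosen hyperparameters, and every range are assumptions for illustration; the model card does not state which values the authors searched over.

```python
def hp_space(trial):
    """Hypothetical search space; the ranges are illustrative, not the authors'."""
    return {
        # Sample the learning rate on a log scale.
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
    }
```

A function like this can be passed to `Trainer.hyperparameter_search(hp_space=hp_space, backend="optuna", ...)`. That method requires the `Trainer` to be created with a `model_init` callable instead of a ready-made model, so that each trial starts from fresh weights.
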
## Limitations and Future Work

### Known Limitations

- Out-of-Distribution Data: The model's performance may not be reliable on sentences or text types that differ significantly from the training data.
- Ambiguity: The model might struggle with ambiguous grammatical structures where multiple interpretations are possible.

### Future Improvements

We plan to enhance the model by integrating additional grammatical features, such as semantic roles, and by covering more complex sentence structures. This should improve its ability to process the nuances of the Turkish language.

## Ethical Considerations

- Bias: The model was trained on a dataset that reflects specific sources and styles of Turkish. It may not generalize well to all varieties of the language.
- Fairness: Care was taken to ensure that the dataset is balanced in terms of sentence structures and grammatical elements, but there may still be biases present.

## License

This model is licensed under the Apache 2.0 License.

## Citation

If you use this model in your research or applications, please cite it as follows:

```bibtex
@misc{synturk-sentagram,
  author = {SYNTÜRK Team},
  title = {SENTAGRAM Model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/synturk/sentagram},
}
```

## Contact

For more information or questions, please contact the **SYNTÜRK Team** through [our GitHub repository](https://github.com/Syntax-Turkiye).

Follow the **SYNTÜRK Team** on:

- [GitHub](https://github.com/Syntax-Turkiye)
- [HuggingFace](https://huggingface.co/synturk)
- [Kaggle](https://www.kaggle.com/syntax-turkiye)
- [LinkedIn](https://www.linkedin.com/company/syntax-turkiye)