@nyuuzyou on Hugging Face: "🎓 Introducing Kompy.info Uzbek Educational Dataset -…"

Post

1252

🎓 Introducing Kompy.info Uzbek Educational Dataset - nyuuzyou/kompy

Dataset highlights:
- 584,648 pages of educational content extracted from kompy.info, a comprehensive educational resource website
- Content exclusively in Uzbek language, focusing on technical and scientific topics
- Each entry contains: URL, page title, and extracted main text content
- Data extracted using trafilatura HTML extraction tool
- Covers a wide range of academic and educational materials
- Released to the public domain under Creative Commons Zero (CC0) license

The dataset presents a valuable resource for natural language processing tasks in the Uzbek language, particularly in educational and technical domains. It can be used for text classification, topic modeling, and content analysis of educational materials. The large-scale collection of Uzbek-language academic content makes it especially useful for developing educational technology applications and studying pedagogical approaches in Uzbek-language instruction. The dataset's monolingual nature provides a focused corpus for understanding technical and scientific terminology in Uzbek educational contexts.

Join the conversation