nyuuzyou (nyuuzyou)

posted an update 3 days ago

Post

1252

🎓 Introducing Kompy.info Uzbek Educational Dataset - nyuuzyou/kompy

Dataset highlights:
- 584,648 pages of educational content extracted from kompy.info, a comprehensive educational resource website
- Content exclusively in Uzbek language, focusing on technical and scientific topics
- Each entry contains: URL, page title, and extracted main text content
- Data extracted using trafilatura HTML extraction tool
- Covers a wide range of academic and educational materials
- Released to the public domain under Creative Commons Zero (CC0) license

The dataset presents a valuable resource for natural language processing tasks in the Uzbek language, particularly in educational and technical domains. It can be used for text classification, topic modeling, and content analysis of educational materials. The large-scale collection of Uzbek-language academic content makes it especially useful for developing educational technology applications and studying pedagogical approaches in Uzbek-language instruction. The dataset's monolingual nature provides a focused corpus for understanding technical and scientific terminology in Uzbek educational contexts.
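As a rough illustration of the content analysis mentioned above, here is a minimal sketch that counts the most frequent terms across entries. The field names `url`, `title`, and `text` are assumptions based on the entry description, not the confirmed schema, and the sample records are invented.

```python
import re
from collections import Counter

def top_terms(records, n=2):
    """Return the n most frequent lowercase word tokens across all records."""
    counts = Counter()
    for rec in records:
        # "text" is an assumed field name for the extracted main text.
        counts.update(re.findall(r"\w+", rec["text"].lower()))
    return counts.most_common(n)

# Invented sample records, for illustration only.
sample = [
    {"url": "https://example.org/1", "title": "A", "text": "kompyuter tarmoq kompyuter"},
    {"url": "https://example.org/2", "title": "B", "text": "tarmoq dastur kompyuter"},
]

print(top_terms(sample))
```

A plain ASCII apostrophe, often typed in Uzbek Latin letters such as o' and g', is not matched by `\w`, so a real pipeline would likely need a tokenizer that handles it.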

reacted to m-ric's post with 🔥 6 days ago

Post

2147

> Oasis: First Real-Time Video Game Without a Game Engine! 🎮

DecartAI & Etched just released Oasis - a fully AI-generated video game running at 20 FPS (frames per second). The model takes keyboard inputs and generates everything - physics, rules, graphics - on the fly, without any game engine.

⚡️ What makes this special? Current text-to-video models (Mochi-1, Sora, Kling) generate about 1 frame every 10-20 seconds (that's the kind of device I had to play LoL back in the day, thus my low rankings). Oasis is 200 times faster, making it the first playable AI-generated game.

⚙️ Under the hood, it uses a vision transformer to encode space and a diffusion model to generate frames. The secret sauce is "dynamic noising" - a technique that keeps the video stable between frames.

Key insights:
⚡️ Generates 20 FPS, vs 0.2 FPS for other DiT-based video models
‣ The specialized Sohu hardware developed by Etched can handle 10x more players than an H100

🎮 Features real game mechanics
‣ Movement, jumping, item management
‣ Physics and lighting
‣ Procedurally generated worlds

⚠️ Current limitations
‣ Blurry graphics at a distance
‣ Objects sometimes change appearance
‣ Memory issues in long sessions

Try it yourself, the playable demo is impressive! 👉 https://oasis.decart.ai/welcome
Code 👉 https://github.com/etched-ai/open-oasis
Read it in full 👉 https://oasis-model.github.io/
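The throughput figures quoted in this post can be sanity-checked with a few lines of arithmetic. This is only a sketch using the numbers stated above (20 FPS, one frame every 10-20 seconds, 0.2 FPS); the function names are made up for illustration.

```python
OASIS_FPS = 20.0  # frame rate quoted in the post

def speedup_vs_seconds_per_frame(seconds_per_frame: float) -> float:
    """Speedup over a model that needs `seconds_per_frame` seconds per frame."""
    return OASIS_FPS * seconds_per_frame

def speedup_vs_fps(baseline_fps: float) -> float:
    """Speedup over a model that runs at `baseline_fps` frames per second."""
    return OASIS_FPS / baseline_fps

# "1 frame every 10-20 seconds" gives the lower and upper end of the range.
low = speedup_vs_seconds_per_frame(10)
high = speedup_vs_seconds_per_frame(20)

# "0.2 FPS for other DiT-based video models" gives a third figure.
vs_dit = speedup_vs_fps(0.2)

print(low, high, vs_dit)
```

The quoted figures imply ratios between roughly 100x and 400x, so "200 times faster" is best read as an order-of-magnitude estimate.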

reacted to Muhammadreza's post with ❤️ 6 days ago

Post

2506

Hey guys.
This is my first post here on Hugging Face. I'm glad to be a part of this amazing community!

posted an update 9 days ago

Post

2701

🎓 Introducing PPT4Web Educational Materials Dataset - nyuuzyou/ppt4web

Dataset highlights:
- 182,405 presentations from ppt4web.ru, a platform for storing and viewing presentations covering a wide range of educational materials
- Primarily in Russian, with content in English, Kazakh, Ukrainian, and Belarusian
- Each entry includes: URL, title, download URL, and filepath
- Contains original PPTX files (converted from PPT for consistency) in addition to metadata
- Data covers a broad spectrum of educational topics and subjects
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content across various subjects in multiple languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in education, teaching methodologies, and presentation materials used across different academic disciplines. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in educational settings, providing insights into the diverse range of subjects and teaching approaches.

posted an update 19 days ago

Post

1386

🌐 Introducing Websim.ai User Projects Dataset - nyuuzyou/websim

Dataset highlights:
- 137,452 user projects from Websim.ai, a service for creating small sites using Large Language Models (LLMs)
- Primarily in English, with potential for multilingual content in generated websites
- Each entry includes: project metadata, user information, and generated HTML content
- Contains detailed information about project revisions, site generation, and user interactions
- Data covers a wide range of user-generated website projects created through AI assistance
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing AI-assisted web development trends, studying user behavior in LLM-powered creative tools, and exploring the capabilities of language models in web design.

posted an update 22 days ago

Post

420

🎓 Introducing Ukr-lit.com.ua Presentations Dataset - nyuuzyou/ukr-lit

Dataset highlights:
- 18,001 presentations from ukr-lit.com.ua, a platform for storing and viewing presentations covering a wide range of subjects in Ukrainian school education
- Primarily in Ukrainian, with some Russian and English content
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a broad spectrum of educational topics and subjects taught in Ukrainian schools
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content across various subjects in Ukrainian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in Ukrainian school education, teaching methodologies, and presentation materials used across different academic disciplines. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in Ukrainian educational settings, providing insights into the diverse range of subjects and teaching approaches in the Ukrainian school system.

reacted to erinys's post with 🚀 22 days ago

Post

2099

🌍 Super cool visualization of global PUT requests to Hugging Face over 24 hours, coded by object size, thanks to @port8080 !

We're putting this analysis to work to help us architect a more geo-distributed system for the HF storage backend.

Originally shared on LinkedIn: https://www.linkedin.com/posts/ajitbanerjee_one-of-the-joys-of-working-on-the-xethub-activity-7252688424732614656-tFGD

reacted to davidberenstein1957's post with ➕ 22 days ago

Post

1660

You can now build a custom text classifier without days of human labeling!

👍 LLMs work reasonably well as text classifiers.
👎 They are expensive to run at scale and their performance drops in specialized domains.

👍 Purpose-built classifiers have low latency and can potentially run on CPU.
👎 They require labeled training data.

Combine the best of both worlds: the automatic labeling capabilities of LLMs and the high-quality annotations from human experts to train and deploy a specialized model.

Blog: https://huggingface.co/blog/sdiazlor/custom-text-classifier-ai-human-feedback
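One way to picture the "best of both worlds" step is a merge in which human corrections override LLM suggestions before training. The sketch below is an assumption about how such a merge could look, not the workflow from the linked blog; `merge_labels` and the sample data are hypothetical.

```python
from typing import Dict, Optional

def merge_labels(
    llm_labels: Dict[str, str],
    human_labels: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Combine labels: a human label wins; None means 'not reviewed yet'."""
    merged = dict(llm_labels)
    for text_id, label in human_labels.items():
        if label is not None:
            merged[text_id] = label
    return merged

# Invented example: the LLM labels three texts, a human corrects one.
llm = {"t1": "positive", "t2": "negative", "t3": "positive"}
human = {"t2": "neutral", "t3": None}

training_labels = merge_labels(llm, human)
print(training_labels)
```

The merged dictionary would then serve as training data for the smaller, purpose-built classifier.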

posted an update 23 days ago

Post

828

Today I found out about utter-project/EuroLLM-1.7B-Instruct, and it is unexpectedly good. I think it's a very underrated model. Give it a try: nyuuzyou/EuroLLM-1.7B-Instruct

replied to clem's post 24 days ago

So why isn't OpenAI on this list? Are they not supporting open AI? ¯\_(ツ)_/¯

posted an update 25 days ago

Post

1555

🎙 Introducing LiveATC Recordings (Partial 2024-08-26) Dataset - nyuuzyou/liveatc

Dataset highlights:
- 21,172 air traffic control audio recordings from LiveATC.net for August 26, 2024
- Multilingual content, primarily in English with potential for other languages
- Each entry includes: audio file, ICAO airport code, facility type, date, and time
- Contains original MP3 files stored in .tar.zst archives, organized by ICAO airport code
- Data covers various airports and ATC facilities worldwide
- Subject to LiveATC.net's Terms of Use for personal, non-commercial use only

The dataset can be used for audio classification, automatic speech recognition, and analysis of air traffic control communications. The inclusion of recordings from multiple airports allows for comparative analysis across different locations and facility types.

posted an update 26 days ago

Post

480

🎓 Introducing Svitppt.com.ua Presentations Dataset - nyuuzyou/svitppt

Dataset highlights:
- 18,001 presentations from svitppt.com.ua, a platform for storing and viewing presentations for Ukrainian school students
- Primarily in Ukrainian, with some Russian and English content
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a wide range of educational topics and presentation materials
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content in Ukrainian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in educational presentation materials and sharing practices in the Ukrainian-speaking student community. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in Ukrainian educational settings.

reacted to takeraparterer's post with 🚀 26 days ago

Post

2193

Check this out: I trained an AI on Hugging Face posts! All of these are AI-generated:
----------
Hello!

I'm excited to share that my colleague @felipeebert and I have released the largest Spanish LLM benchmark to date.

We've developed the Spanish LLM Evaluation Benchmark (SLAB), a set of benchmarks designed to evaluate the ability of language models to understand, generate and translate in Spanish.

SLAB includes five different benchmarks:
- Sentiment Analysis: evaluate models' ability to detect and describe sentiment in natural language
- Fact Checking: evaluate models' ability to detect and refute factual errors in text
- Question Answering: evaluate models' ability to answer questions in Spanish
- Open-ended Questions: evaluate models' ability to generate coherent responses in Spanish
- Translation: evaluate models' ability to translate in Spanish

SLAB is aligned with the latest Spanish LLM industry developments and includes the most recent models available on the market. We aim to keep our benchmarks up-to-date and relevant to the Spanish language ecosystem.

SLAB is available at: https://huggingface.co/datasets/argilla/SLAB.

If you would like to collaborate on building additional Spanish LLM benchmarks, let's discuss in the comments.

🔗 SLAB Blog Post: https://argilla.com/blog/slab
----------
Hello everyone,

I'm thrilled to announce the release of

https://huggingface.co/01-AI/01AI-GPT-4o -

A new family of models that brings the power of transformer AI to the masses.

This model is designed to be accessible and easy to use, while still offering high-quality results.

Key features:
- Small model size: only 23M parameters
- Supports text generation, image generation, and text-to-image tasks
- Data-efficient training with a lightweight tokenizer
- Optimized for efficient on-device usage
- Uses the powerful transformer architecture to deliver high-quality results

Excited to see what you all think!

https://huggingface.co/01-AI/01AI-GPT-4o

reacted to huggingface0's post with 🤯 26 days ago

Post

3938

1+2=3

posted an update 27 days ago

Post

622

🎓 Introducing Bigslide.ru Presentations Dataset - nyuuzyou/bigslide

Dataset highlights:
- 50,872 presentations from bigslide.ru, a platform for storing and viewing presentations for school students
- Primarily in Russian, with some English and potentially other languages
- Each entry includes: URL, title, download URL, filepath, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a wide range of educational topics and presentation materials
- Dedicated to the public domain under Creative Commons Zero (CC0) license

The dataset can be used for analyzing educational presentation content in Russian and other languages, text classification tasks, and information retrieval systems. It's particularly valuable for examining trends in educational presentation materials and sharing practices in the Russian-speaking student community. The inclusion of original files allows for in-depth analysis of presentation formats and structures commonly used in educational settings.

posted an update 28 days ago

Post

334

🎓 Introducing Lusana.ru Presentations Dataset - nyuuzyou/lusana

Dataset highlights:
- 38,953 presentations from lusana.ru, a platform for storing presentations, reports, templates, and backgrounds
- Primarily in Russian, with some English and potentially other languages
- Each entry includes: ID, title, download URL, uniqueness score, number of slides, views, downloads, file size, file path, and extracted text content (where available)
- Contains original PPT/PPTX files in addition to metadata
- Data covers a wide range of topics and presentation materials
- Licensed under Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)

The dataset can be used for analyzing presentation content in Russian and other languages, text classification tasks, and information retrieval systems. It's also valuable for examining trends in educational and professional presentation materials and sharing practices in the Russian-speaking community. The inclusion of original files allows for in-depth analysis of presentation formats and structures.

posted an update 29 days ago

Post

1918

🎓 Introducing Doc4web.ru Documents Dataset - nyuuzyou/doc4web

Dataset highlights:
- 223,739 documents from doc4web.ru, a document hosting platform for students and teachers
- Primarily in Russian, with some English and potentially other languages
- Each entry includes: URL, title, download link, file path, and content (where available)
- Contains original document files in addition to metadata
- Data reflects a wide range of educational topics and materials
- Licensed under Creative Commons Zero (CC0) for unrestricted use

The dataset can be used for analyzing educational content in Russian, text classification tasks, and information retrieval systems. It's also valuable for examining trends in educational materials and document sharing practices in the Russian-speaking academic community. The inclusion of original files allows for in-depth analysis of various document formats and structures.

posted an update about 1 month ago

Post

1412

🌐 Subdomain Dataset Update: September 2024 Data Now Available

I have updated the nyuuzyou/subdomains dataset with fresh data for September 2024. This addition further expands what is currently the largest available collection of subdomain statistics, giving researchers and analysts more insight into web infrastructure and domain patterns.

Latest Update Highlights:
- New File: subdomains_2024_09.csv
- Unique Subdomains: 19,191,867
- Total Occurrences: 170,792,927
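The two totals above can be reproduced from a per-subdomain count file with a short script. This is a sketch only: the column names `subdomain` and `count` are assumptions about the layout of `subdomains_2024_09.csv`, and the sample rows are invented.

```python
import csv
import io

def summarize(csv_text: str) -> tuple:
    """Return (unique_subdomains, total_occurrences) from CSV text.

    Assumes one row per unique subdomain with a numeric "count" column.
    """
    unique = 0
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        unique += 1
        total += int(row["count"])
    return unique, total

# Invented sample in the assumed layout.
SAMPLE = "subdomain,count\nwww,5\nmail,3\napi,1\n"
print(summarize(SAMPLE))

# Figures reported in the update.
REPORTED_UNIQUE = 19_191_867
REPORTED_TOTAL = 170_792_927
average = REPORTED_TOTAL / REPORTED_UNIQUE
print(round(average, 1))
```

With the reported figures, each unique subdomain appears about 8.9 times on average.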

posted an update about 2 months ago

Post

363

🎓 Introducing ruschatgpt Q&A Dataset - nyuuzyou/ruschatgpt-qa

Dataset highlights:
- 190,281 question-answer pairs from ruschatgpt.ru, a Russian question-answering website
- Monolingual content in Russian
- Each entry includes: URL, question, and response
- Data reflects user-generated questions and language model-generated answers
- Licensed under Creative Commons Zero (CC0) for unrestricted use

The dataset can be used for analyzing trends in AI-powered question answering in Russia. It's also valuable for examining Russian language patterns and topic distributions in user queries and AI responses.

posted an update about 2 months ago

Post

1552

🎓 Introducing чатгпт-в-россии.рф (roughly "chatgpt-in-russia[.]rf" in English) Q&A Dataset - nyuuzyou/chatgpt-in-russia-qa

Dataset highlights:
- 628,186 question-answer pairs from чатгпт-в-россии.рф, a Russian question-answering website
- Monolingual content in Russian
- Each entry includes: URL, question, and response
- Data reflects user-generated questions and language model-generated answers
- Licensed under Creative Commons Zero (CC0) for unrestricted use

The dataset can be used to analyze trends in AI-powered question answering in Russia. It is also useful for examining language patterns and topic distributions.
