Salama1429 (Mohamed Salama)

posted an update 4 months ago

📺 Introducing the YouTube-Commons Dataset 📺

🌐 Overview: The YouTube Commons Dataset is a comprehensive collection of 30 billion words from 15,112,121 original and automatically translated transcripts, drawn from 2,063,066 videos on YouTube.

🔗 License: All videos are shared under the CC-BY license. The majority (71%) are in English.

🤖 Applications: This dataset is well suited to training speech-to-text (ASR) models and translation models.

📊 Utilization: The text can be used for model training and is republishable for reproducibility purposes.

🤝 Collaboration: This dataset is the result of a collaboration between the state start-up LANGU:IA, the French Ministry of Culture, and DINUM. It will be expanded in the coming months.

🔗 Explore the dataset here: https://lnkd.in/d_paWKFE

#YouTubeCommons #AIResearch #MachineLearning #OpenData #ArtificialIntelligence #NLP #Dataset #TechCollaboration #Innovation #DigitalTransformation

replied to their post 4 months ago

https://huggingface.co/collections/CohereForAI/c4ai-aya-23-664f4cda3fa1a30553b221dc

posted an update 4 months ago

Cohere's Aya 8B & 35B 🔥
> Multilingual (23 languages), beats Mistral 7B and Llama3 8B in preference, with open weights.

Capabilities:

🌍 **Multilingual Mastery**: Supporting 23 languages, including Arabic!

🏆 **Top Performer**: Outperforms Mistral 7B and Llama3 8B in user preference.

🔍 **Open Weights**: Access open weights for your research and projects.

🔗 **License**: CC-BY-NC with adherence to C4AI's Acceptable Use Policy.

💼 **Developed by**: Cohere For AI and Cohere.

Check out Aya 23 on Hugging Face; the link is in the comments.

#AI #MachineLearning #NLP #Multilingual #Arabic #TechInnovation #OpenSource #CohereAI #AyaModel

posted an update 4 months ago

Loving the new ChatGPT Mac app.

You can now turn a drawing into a working app in less than a minute!

posted an update 4 months ago

Free Guide: How to Fine-Tune and Prompt Engineer LLMs

While some of the most forward-thinking companies in the world are already using LLMs, few organizations have the bandwidth, compute, or money to train foundational models in-house. It’s become much more common to either fine-tune or prompt engineer existing LLMs for unique business needs. In this guide, you’ll learn:

• How to choose between fine-tuning and prompting
• Popular fine-tuning strategies and their trade-offs
• Tasks where fine-tuning excels vs. ones where it doesn’t
• Tips and current best practices for prompt engineering
• And a whole lot more!

Link: https://wandb.ai/site/resources/whitepapers/llm-fine-tuning

posted an update 4 months ago

📚 Introducing the 101 Billion Arabic Words Dataset

🌐 Exciting Milestone in Arabic Language Technology! #NLP #ArabicLLM #LanguageModels

🚀 Why It Matters:
1. 🌟 Large Language Models (LLMs) have brought transformative changes, primarily in English. It's time for Arabic to shine!
2. 🎯 This project addresses the critical challenge of bias in Arabic LLMs due to reliance on translated datasets.

🔍 Approach:
1. 💪 Undertook a massive data mining initiative focusing exclusively on Arabic from Common Crawl WET files.
2. 🧹 Employed state-of-the-art cleaning and deduplication processes to maintain data quality and uniqueness.
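
The post does not say which cleaning or deduplication methods were used, so the following is only a minimal sketch of the general idea: keep documents that are mostly Arabic, then drop exact duplicates by hashing normalized text. The function names, the 0.5 threshold, and the use of the basic Arabic Unicode block are all assumptions made for illustration, not the project's actual pipeline.

```python
import hashlib
import re
import unicodedata

# Basic Arabic Unicode block (an assumption for this sketch; real pipelines
# may also consider the supplement and presentation-form blocks).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Arabic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ARABIC_CHAR.match(c)) / len(chars)


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace so trivial variants hash alike."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def clean_and_dedup(docs, min_ratio=0.5):
    """Keep mostly-Arabic documents and drop exact duplicates (hypothetical helper)."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if arabic_ratio(norm) < min_ratio:
            continue  # not Arabic enough: filtered out
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(norm)
    return kept
```

Hashing only catches exact duplicates after normalization; near-duplicate detection (for example MinHash) would be a separate step and is not shown here.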

📈 Impact:
1. 🏆 Created the largest Arabic dataset to date with 101 billion words.
2. 📝 Enables the development of Arabic LLMs that are linguistically and culturally accurate.
3. 🌍 Sets a global benchmark for future Arabic language research.

🔗 Paper: https://lnkd.in/dGAiaygn
🔗 Dataset: https://lnkd.in/dGTMe5QV

🔄 Share your thoughts and let's drive the future of Arabic NLP together!

#DataScience #MachineLearning #ArtificialIntelligence #Innovation #ArabicData
