OpenCoder Collection OpenCoder is an open and reproducible code LLM family which matches the performance of top-tier code LLMs. • 9 items • Updated 4 days ago • 70
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages Paper • 2410.23825 • Published 21 days ago • 3
LLM Reasoning Papers Collection Papers to improve reasoning capabilities of LLMs • 15 items • Updated 19 days ago • 76
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment Paper • 2410.05873 • Published Oct 8 • 3
MaskLID: Code-Switching Language Identification through Iterative Masking Paper • 2406.06263 • Published Jun 10 • 5
Article DuckDB: run SQL queries on 50,000+ datasets on the Hugging Face Hub Jun 7, 2023 • 4
CommonCatalog Collection Common Catalog, a dataset with Creative Commons licensed images and machine-generated caption pairs • 8 items • Updated May 16 • 14
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only Paper • 2306.01116 • Published Jun 1, 2023 • 31
LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons Paper • 2402.14086 • Published Feb 21 • 9
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model Paper • 2402.07827 • Published Feb 12 • 45
GIRT-Model: Automated Generation of Issue Report Templates Paper • 2402.02632 • Published Feb 4 • 1
GlotLID: Language Identification for Low-Resource Languages Paper • 2310.16248 • Published Oct 24, 2023 • 1
GlotScript: A Resource and Tool for Low Resource Writing System Identification Paper • 2309.13320 • Published Sep 23, 2023 • 1
Analytical Derivation and Comparison of Alarm Similarity Measures Paper • 2003.10600 • Published Mar 24, 2020 • 1
GIRT-Data: Sampling GitHub Issue Report Templates Paper • 2303.09236 • Published Mar 16, 2023 • 1