Amir Hossein Kargaran

kargaranamir

https://kargaranamir.github.io

AI & ML interests

#NLP, check out https://huggingface.co/cis-lmu

Recent Activity

New activity 1 day ago

kargaranamir/selenium-screenshot-gradio

upvoted a collection 12 days ago

OpenCoder

authored a paper 17 days ago

Posts 2

Post

Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.

🤗 corpus v1: cis-lmu/GlotCC-V1
🐱 pipeline v3: https://github.com/cisnlp/GlotCC

More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.

Limitation: Because low-resource languages appear far less frequently than high-resource ones, some very low-resource languages are represented by only a few sentences. English alone accounts for 750GB in this version, and the top 200 languages still have a strong presence in our data (see the attached plot; it labels one index for every 20 languages, so index 10 corresponds to the 200th language).
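To read the plot's index labels, the mapping described above can be sketched in a few lines. This is only an illustration of the arithmetic in the post: the function name `index_to_language_rank` is hypothetical, and the sketch assumes indices start at 1.

```python
# Minimal sketch of the plot's index convention described in the post:
# one index is written for every 20 languages.
LANGUAGES_PER_INDEX = 20  # stated in the post


def index_to_language_rank(index: int) -> int:
    """Return the language rank that a plot index points at.

    Assumption: indices start at 1, so index 1 is the 20th language.
    """
    if index < 1:
        raise ValueError("index must be >= 1")
    return index * LANGUAGES_PER_INDEX


# Index 10 corresponds to the 200th language, as the post states.
print(index_to_language_rank(10))
```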

Post

A Text Language Identification Model with Support for 2000+ Labels:

space: cis-lmu/glotlid-space
model: cis-lmu/glotlid
github: https://github.com/cisnlp/GlotLID
paper: GlotLID: Language Identification for Low-Resource Languages (2310.16248)

Collections 1

cis-lmu/GlotCC-V1

HuggingFaceFW/fineweb

allenai/c4

oscar-corpus/OSCAR-2301

spaces 8

Language identification comparison

Selenium Screenshot Gradio

QR CODE

Visual Clutter

ColorHarmonization

LangID-LIME

models 2

kargaranamir/T5R-base

Text2Text Generation • Updated Oct 24, 2023 • 9 • 1

kargaranamir/Hengam

datasets 1

kargaranamir/HengamCorpus
