Sentdex
posted an update Feb 19
Hi, welcome to my first post here!

I am slowly wrangling about 5 years of Reddit comments (2015-2020). In total it's billions of samples that can be organized as comment-reply pairs or chains of discussion, and filtered by subreddit, up/down votes, controversy, sentiment, and more.
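As a rough illustration of the comment-reply pairing described above, here is a minimal sketch. It assumes each comment is a dict with the `id`, `parent_id`, `subreddit`, `score`, and `body` fields found in Reddit's comment JSON; the function name, the `min_score` threshold, and the sample rows are hypothetical, not the actual pipeline.

```python
def comment_reply_pairs(comments, min_score=1, subreddit=None):
    """Yield (parent_body, reply_body) tuples from an iterable of comment dicts.

    Assumption: everything fits in memory. For a multi-terabyte dump this
    would have to run per chunk or against an on-disk index instead.
    """
    comments = list(comments)
    by_id = {c["id"]: c for c in comments}
    for reply in comments:
        parent_id = reply["parent_id"]
        # "t1_" marks a reply to another comment; "t3_" is a reply to the submission.
        if not parent_id.startswith("t1_"):
            continue
        parent = by_id.get(parent_id[3:])
        if parent is None:
            continue  # parent comment is not in this chunk
        if subreddit is not None and reply["subreddit"] != subreddit:
            continue
        # Keep the pair only if both sides meet the (hypothetical) score threshold.
        if reply["score"] < min_score or parent["score"] < min_score:
            continue
        yield parent["body"], reply["body"]


# Made-up sample rows, for illustration only.
sample_comments = [
    {"id": "a", "parent_id": "t3_x", "subreddit": "python", "score": 5,
     "body": "How do I read a file?"},
    {"id": "b", "parent_id": "t1_a", "subreddit": "python", "score": 3,
     "body": "Use open()."},
    {"id": "c", "parent_id": "t1_a", "subreddit": "python", "score": -2,
     "body": "Google it."},
]

pairs = list(comment_reply_pairs(sample_comments))
```

Filters for controversy or sentiment could be added the same way, as extra conditions before the `yield`.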

Any requests or ideas for curated datasets from here? I'll also tinker with uploading the entire dataset, potentially in chunks, but it's quite a few terabytes in total, so I'll still need to break it up. I have some ideas for datasets I personally want, but I'm curious whether anyone has something they'd really like to see.

Could you add a column describing the tone of the text, e.g. sarcastic, angry, happy? It would come in handy when anybody tries to fine-tune an LLM to generate sarcastic or more human-like content.

Hi,
How are you getting the comments? Have they previously been scraped, or are you using the Reddit API, or is this in partnership with Reddit?
Thanks!

You can find datasets here:
https://academictorrents.com/details/9c263fc85366c1ef8f5bb9da0203f4c8c8db75f4

Godot would be useful 🤖

More story writing and multi-turn conversation datasets would be very nice!

I actually came to the realization that not only could this dataset cover multi-turn, it could handle multiple speakers.

So far we only have instruct pairs like bot/computer, but instead we could have 3, 5, 10, or any number of entities in the discussion.
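To make the multi-speaker idea concrete, here is a small sketch that walks one reply chain from a leaf comment back to its top-level comment and emits one turn per comment, labeled by author. It assumes the `id`, `parent_id`, `author`, and `body` fields of Reddit's comment JSON; the function name, the output layout, and the sample thread are hypothetical.

```python
def thread_to_turns(comments, leaf_id):
    """Return the chain ending at `leaf_id` as a list of speaker/text turns,
    ordered from the top-level comment down to the leaf."""
    by_id = {c["id"]: c for c in comments}
    turns = []
    current = by_id.get(leaf_id)
    while current is not None:
        turns.append({"speaker": current["author"], "text": current["body"]})
        parent_id = current["parent_id"]
        # Follow "t1_" (comment) parents; stop at the submission ("t3_")
        # or when the parent is missing from this chunk.
        if parent_id.startswith("t1_"):
            current = by_id.get(parent_id[3:])
        else:
            current = None
    turns.reverse()
    return turns


# Made-up thread with three distinct speakers, for illustration only.
sample_thread = [
    {"id": "a", "parent_id": "t3_x", "author": "alice", "body": "Which engine should I pick?"},
    {"id": "b", "parent_id": "t1_a", "author": "bob", "body": "Godot is great for 2D."},
    {"id": "c", "parent_id": "t1_b", "author": "carol", "body": "It handles 3D fine too."},
    {"id": "d", "parent_id": "t1_c", "author": "alice", "body": "Thanks, both!"},
]

turns = thread_to_turns(sample_thread, "d")
speakers = [t["speaker"] for t in turns]
```

Because each turn keeps its author, the same chain can be used as a plain multi-turn sample or as a conversation with however many distinct speakers took part.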