	1. Title: Online News Popularity

	2. Source Information
	-- Creators: Kelwin Fernandes (kafc '@' inesctec.pt, kelwinfc '@' gmail.com),
	Pedro Vinagre (pedro.vinagre.sousa '@' gmail.com) and
	Pedro Sernadela
	-- Donor: Kelwin Fernandes (kafc '@' inesctec.pt, kelwinfc '@' gmail.com)
	-- Date: May, 2015

	3. Past Usage:
	1. K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision
	Support System for Predicting the Popularity of Online News. Proceedings
	of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence,
	September, Coimbra, Portugal.

	-- Results:
	-- Binary classification as popular vs unpopular using a decision
	threshold of 1400 social interactions.
	-- Experiments with different models: Random Forest (best model),
	Adaboost, SVM, KNN and Naïve Bayes.
	-- Recorded 67% accuracy and 0.73 AUC.
	- Predicted attribute: online news popularity (boolean)

	4. Relevant Information:
	-- The articles were published by Mashable (www.mashable.com); their
	content and the rights to reproduce it belong to Mashable. Hence, this
	dataset does not share the original content but some statistics
	associated with it. The original content can be publicly accessed and
	retrieved using the provided urls.
	-- Acquisition date: January 8, 2015
	-- The estimated relative performance values were computed by the authors
	using a Random Forest classifier with a rolling window as the assessment
	method. See their article for more details on how the relative
	performance values were set.
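
	The rolling-window assessment mentioned above can be sketched as follows.
	This is a minimal illustration under stated assumptions, not the authors'
	procedure: the function names, the window sizes and the choice to order
	articles by timedelta are all hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def chronological_order(timedeltas: Sequence[float]) -> List[int]:
    """Return row indices sorted oldest article first.

    timedelta counts days between publication and dataset acquisition,
    so a larger value means an older article.
    """
    return sorted(range(len(timedeltas)), key=lambda i: -timedeltas[i])


def rolling_windows(
    n_samples: int, train_size: int, test_size: int, step: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_positions, test_positions) over ordered samples.

    Each window trains on train_size consecutive samples and tests on the
    test_size samples that immediately follow, then slides forward by step.
    All sizes are hypothetical parameters.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        yield (
            list(range(start, train_end)),
            list(range(train_end, train_end + test_size)),
        )
        start += step


# Hypothetical timedelta values, not taken from the dataset.
order = chronological_order([10.0, 300.0, 150.0, 700.0, 8.0, 420.0])
windows = list(rolling_windows(len(order), train_size=3, test_size=1, step=1))
```

	The positions yielded by rolling_windows index into order, so a model is
	always trained on older articles and assessed on newer ones.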

	5. Number of Instances: 39797 (the class counts under item 9 sum to 39644)

	6. Number of Attributes: 61 (58 predictive attributes, 2 non-predictive,
	1 goal field)

	7. Attribute Information:
	0. url: URL of the article
	1. timedelta: Days between the article publication and
	the dataset acquisition
	2. n_tokens_title: Number of words in the title
	3. n_tokens_content: Number of words in the content
	4. n_unique_tokens: Rate of unique words in the content
	5. n_non_stop_words: Rate of non-stop words in the content
	6. n_non_stop_unique_tokens: Rate of unique non-stop words in the
	content
	7. num_hrefs: Number of links
	8. num_self_hrefs: Number of links to other articles
	published by Mashable
	9. num_imgs: Number of images
	10. num_videos: Number of videos
	11. average_token_length: Average length of the words in the
	content
	12. num_keywords: Number of keywords in the metadata
	13. data_channel_is_lifestyle: Is data channel 'Lifestyle'?
	14. data_channel_is_entertainment: Is data channel 'Entertainment'?
	15. data_channel_is_bus: Is data channel 'Business'?
	16. data_channel_is_socmed: Is data channel 'Social Media'?
	17. data_channel_is_tech: Is data channel 'Tech'?
	18. data_channel_is_world: Is data channel 'World'?
	19. kw_min_min: Worst keyword (min. shares)
	20. kw_max_min: Worst keyword (max. shares)
	21. kw_avg_min: Worst keyword (avg. shares)
	22. kw_min_max: Best keyword (min. shares)
	23. kw_max_max: Best keyword (max. shares)
	24. kw_avg_max: Best keyword (avg. shares)
	25. kw_min_avg: Avg. keyword (min. shares)
	26. kw_max_avg: Avg. keyword (max. shares)
	27. kw_avg_avg: Avg. keyword (avg. shares)
	28. self_reference_min_shares: Min. shares of referenced articles in
	Mashable
	29. self_reference_max_shares: Max. shares of referenced articles in
	Mashable
	30. self_reference_avg_sharess: Avg. shares of referenced articles in
	Mashable
	31. weekday_is_monday: Was the article published on a Monday?
	32. weekday_is_tuesday: Was the article published on a Tuesday?
	33. weekday_is_wednesday: Was the article published on a Wednesday?
	34. weekday_is_thursday: Was the article published on a Thursday?
	35. weekday_is_friday: Was the article published on a Friday?
	36. weekday_is_saturday: Was the article published on a Saturday?
	37. weekday_is_sunday: Was the article published on a Sunday?
	38. is_weekend: Was the article published on the weekend?
	39. LDA_00: Closeness to LDA topic 0
	40. LDA_01: Closeness to LDA topic 1
	41. LDA_02: Closeness to LDA topic 2
	42. LDA_03: Closeness to LDA topic 3
	43. LDA_04: Closeness to LDA topic 4
	44. global_subjectivity: Text subjectivity
	45. global_sentiment_polarity: Text sentiment polarity
	46. global_rate_positive_words: Rate of positive words in the content
	47. global_rate_negative_words: Rate of negative words in the content
	48. rate_positive_words: Rate of positive words among non-neutral
	tokens
	49. rate_negative_words: Rate of negative words among non-neutral
	tokens
	50. avg_positive_polarity: Avg. polarity of positive words
	51. min_positive_polarity: Min. polarity of positive words
	52. max_positive_polarity: Max. polarity of positive words
	53. avg_negative_polarity: Avg. polarity of negative words
	54. min_negative_polarity: Min. polarity of negative words
	55. max_negative_polarity: Max. polarity of negative words
	56. title_subjectivity: Title subjectivity
	57. title_sentiment_polarity: Title polarity
	58. abs_title_subjectivity: Absolute subjectivity level
	59. abs_title_sentiment_polarity: Absolute polarity level
	60. shares: Number of shares (target)
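
	A minimal sketch of how the attribute roles above might be separated when
	reading the data: the 2 non-predictive attributes (url, timedelta) are
	dropped and shares is kept as the target. The inline sample rows, the
	example.com urls and the helper name split_columns are hypothetical;
	stripping whitespace from header names is a precaution, assuming the file
	may pad them.

```python
import csv
import io
from typing import Dict, List, Tuple

NON_PREDICTIVE = ("url", "timedelta")  # attributes 0 and 1
TARGET = "shares"                      # attribute 60


def split_columns(
    rows: List[Dict[str, str]]
) -> Tuple[List[Dict[str, float]], List[int]]:
    """Separate the predictive features from the target value."""
    features: List[Dict[str, float]] = []
    targets: List[int] = []
    for row in rows:
        # Strip whitespace in case header names or values are padded.
        clean = {name.strip(): value.strip() for name, value in row.items()}
        targets.append(int(float(clean[TARGET])))
        features.append(
            {
                name: float(value)
                for name, value in clean.items()
                if name not in NON_PREDICTIVE and name != TARGET
            }
        )
    return features, targets


# Hypothetical two-row excerpt with only a few of the 61 columns.
sample = io.StringIO(
    "url, timedelta, n_tokens_title, num_imgs, shares\n"
    "http://example.com/a, 731.0, 12.0, 1.0, 593\n"
    "http://example.com/b, 8.0, 9.0, 3.0, 2100\n"
)
features, targets = split_columns(list(csv.DictReader(sample)))
```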

	8. Missing Attribute Values: None

	9. Class Distribution: the class value (shares) is continuously valued. We
	transformed the task into a binary task using a decision
	threshold of 1400.

	Shares Value Range:   Number of Instances in Range:
	< 1400                18490
	>= 1400               21154
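
	The binary transformation described above can be sketched as follows,
	treating an article as popular when shares >= 1400, as in the table. The
	function name and the example share counts are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

POPULARITY_THRESHOLD = 1400  # decision threshold stated in this description


def binarize_shares(
    shares: Iterable[int], threshold: int = POPULARITY_THRESHOLD
) -> List[bool]:
    """Label an article as popular (True) when shares >= threshold."""
    return [s >= threshold for s in shares]


# Hypothetical share counts, not taken from the dataset.
example_shares = [593, 1400, 1399, 2100, 843300]
labels = binarize_shares(example_shares)
counts = Counter(labels)  # number of popular vs unpopular articles
```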


	Summary Statistics:
	Feature                                Min         Max        Mean          SD
	timedelta                           8.0000    731.0000    354.5305    214.1611
	n_tokens_title                      2.0000     23.0000     10.3987      2.1140
	n_tokens_content                    0.0000   8474.0000    546.5147    471.1016
	n_unique_tokens                     0.0000    701.0000      0.5482      3.5207
	n_non_stop_words                    0.0000   1042.0000      0.9965      5.2312
	n_non_stop_unique_tokens            0.0000    650.0000      0.6892      3.2648
	num_hrefs                           0.0000    304.0000     10.8837     11.3319
	num_self_hrefs                      0.0000    116.0000      3.2936      3.8551
	num_imgs                            0.0000    128.0000      4.5441      8.3093
	num_videos                          0.0000     91.0000      1.2499      4.1078
	average_token_length                0.0000      8.0415      4.5482      0.8444
	num_keywords                        1.0000     10.0000      7.2238      1.9091
	data_channel_is_lifestyle           0.0000      1.0000      0.0529      0.2239
	data_channel_is_entertainment       0.0000      1.0000      0.1780      0.3825
	data_channel_is_bus                 0.0000      1.0000      0.1579      0.3646
	data_channel_is_socmed              0.0000      1.0000      0.0586      0.2349
	data_channel_is_tech                0.0000      1.0000      0.1853      0.3885
	data_channel_is_world               0.0000      1.0000      0.2126      0.4091
	kw_min_min                         -1.0000    377.0000     26.1068     69.6323
	kw_max_min                          0.0000 298400.0000   1153.9517   3857.9422
	kw_avg_min                         -1.0000  42827.8571    312.3670    620.7761
	kw_min_max                          0.0000 843300.0000  13612.3541  57985.2980
	kw_max_max                          0.0000 843300.0000 752324.0667 214499.4242
	kw_avg_max                          0.0000 843300.0000 259281.9381 135100.5433
	kw_min_avg                         -1.0000   3613.0398   1117.1466   1137.4426
	kw_max_avg                          0.0000 298400.0000   5657.2112   6098.7950
	kw_avg_avg                          0.0000  43567.6599   3135.8586   1318.1338
	self_reference_min_shares           0.0000 843300.0000   3998.7554  19738.4216
	self_reference_max_shares           0.0000 843300.0000  10329.2127  41027.0592
	self_reference_avg_sharess          0.0000 843300.0000   6401.6976  24211.0269
	weekday_is_monday                   0.0000      1.0000      0.1680      0.3739
	weekday_is_tuesday                  0.0000      1.0000      0.1864      0.3894
	weekday_is_wednesday                0.0000      1.0000      0.1875      0.3903
	weekday_is_thursday                 0.0000      1.0000      0.1833      0.3869
	weekday_is_friday                   0.0000      1.0000      0.1438      0.3509
	weekday_is_saturday                 0.0000      1.0000      0.0619      0.2409
	weekday_is_sunday                   0.0000      1.0000      0.0690      0.2535
	is_weekend                          0.0000      1.0000      0.1309      0.3373
	LDA_00                              0.0000      0.9270      0.1846      0.2630
	LDA_01                              0.0000      0.9259      0.1413      0.2197
	LDA_02                              0.0000      0.9200      0.2163      0.2821
	LDA_03                              0.0000      0.9265      0.2238      0.2952
	LDA_04                              0.0000      0.9272      0.2340      0.2892
	global_subjectivity                 0.0000      1.0000      0.4434      0.1167
	global_sentiment_polarity          -0.3937      0.7278      0.1193      0.0969
	global_rate_positive_words          0.0000      0.1555      0.0396      0.0174
	global_rate_negative_words          0.0000      0.1849      0.0166      0.0108
	rate_positive_words                 0.0000      1.0000      0.6822      0.1902
	rate_negative_words                 0.0000      1.0000      0.2879      0.1562
	avg_positive_polarity               0.0000      1.0000      0.3538      0.1045
	min_positive_polarity               0.0000      1.0000      0.0954      0.0713
	max_positive_polarity               0.0000      1.0000      0.7567      0.2478
	avg_negative_polarity              -1.0000      0.0000     -0.2595      0.1277
	min_negative_polarity              -1.0000      0.0000     -0.5219      0.2903
	max_negative_polarity              -1.0000      0.0000     -0.1075      0.0954
	title_subjectivity                  0.0000      1.0000      0.2824      0.3242
	title_sentiment_polarity           -1.0000      1.0000      0.0714      0.2654
	abs_title_subjectivity              0.0000      0.5000      0.3418      0.1888
	abs_title_sentiment_polarity        0.0000      1.0000      0.1561      0.2263
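
	A minimal sketch of how one row of the summary table (Min, Max, Mean, SD)
	could be computed for a single feature column. The description does not
	say whether SD is the sample or the population standard deviation; the
	sketch assumes the sample form. The function name and the example values
	are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Compute the four statistics reported per feature."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        # Assumption: sample standard deviation (n - 1 denominator).
        # Use statistics.pstdev for the population form instead.
        "sd": statistics.stdev(values),
    }


# Hypothetical column values, not taken from the dataset.
example_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = summarize(example_column)
```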


	Citation Request:

	Please include this citation if you plan to use this database:

	K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision
	Support System for Predicting the Popularity of Online News. Proceedings
	of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence,
	September, Coimbra, Portugal.