Abstract | ||
---|---|---|
The presence of spam tweets in a dataset may affect choices in feature selection, algorithm formulation, and system evaluation for many applications. However, most existing studies have not considered the impact of spam tweets. In this paper, we study the impact of spam tweets on hashtag recommendation for hyperlinked tweets (i.e., tweets containing URLs) in the HSpam14 dataset. HSpam14 is a collection of 14 million tweets annotated as spam or ham (i.e., non-spam). In our experiments, we observe that it is much easier to recommend "correct" hashtags for spam tweets than for ham tweets, because of the near-duplicates among spam tweets. Simple approaches such as recommending the most popular hashtags achieve very good accuracy on spam tweets. On the other hand, features that are highly effective on ham tweets may not be effective on spam tweets. Our findings suggest that without removing spam tweets from the data collection (as in most studies), the results obtained could be misleading for hashtag recommendation tasks. |
Year | DOI | Venue |
---|---|---|
2016 | 10.1145/2872518.2889404 | WWW '16: 25th International World Wide Web Conference, Montréal, Québec, Canada, April 2016 |
Field | DocType | ISBN |
Data mining, Data collection, World Wide Web, Social media, Feature selection, Computer science, System evaluation, Microblogging | Conference | 978-1-4503-4144-8 |
Citations | PageRank | References |
1 | 0.36 | 4 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Surendra Sedhai | 1 | 54 | 2.83 |
Aixin Sun | 2 | 3071 | 156.89 |