Title
A survey of multilingual human-tagged short message datasets for sentiment analysis tasks.
Abstract
Today, electronic word-of-mouth (eWOM) statements expressed on blogs, social media or shopping platforms are very frequent and enable customers to share their points of view about acquired products or services. These eWOM statements can be used by industry to improve its products and services and by customers to make better purchase decisions. Sentiment analysis (SA) techniques can be used to extract and analyze these eWOM statements. Research on SA has advanced considerably in recent years, and its applications in business management have grown exponentially. Automatic techniques (such as machine learning, deep learning and statistical approaches) have been used for this purpose. However, training a machine to process or analyze sentiments is a hard task, mainly due to the complexity of natural language. This task is more complicated in multilingual environments. There is still a great paucity of training datasets, one of the key resources for achieving more favorable results. Training datasets, in fact, are a reservoir of information serving to teach and refine the skills of automatic techniques. Hence, the higher the quality of the training datasets, the better the predictive power of sentiment analysis tasks. English datasets are relatively easy to find in the literature; however, datasets in other languages are very scarce. This paper therefore describes and compiles information concerning 25 datasets gleaned from short messages (statements expressed in social media and shopping platforms) in seven different languages, for the most part from Twitter. To ensure quality, all the resources were human-tagged, and they are currently available to the scientific community. A new sentiment dataset in English extracted from Twitter has also been drawn up, with each message evaluated subjectively.
The current survey therefore aims to provide essential quality information for future research related to automatic sentiment analysis in monolingual or multilingual scenarios.
Year
2018
DOI
10.1007/s00500-017-2766-5
Venue
Soft Comput.
Keywords
Sentiment analysis, Dataset, Corpus, Short messages, Multilingual, Twitter, Human-tagged
Field
Data science, Social media, Statistic, Predictive power, Sentiment analysis, Computer science, Natural language, Business management, Artificial intelligence, Deep learning, Machine learning
DocType
Journal
Volume
22
Issue
24
ISSN
1432-7643
Citations
0
PageRank
0.34
References
80
Authors
3
Name | Order | Citations | PageRank
F. Steiner-Correa | 1 | 0 | 0.34
María I. Viedma-del Jesús | 2 | 0 | 0.34
A. G. Lopez-Herrera | 3 | 0 | 0.68