Title
Lightning Talk - Think Outside the Dataset: Finding Fraudulent Reviews using Cross-Dataset Analysis
Abstract
Many crowd-sourced review platforms, such as Yelp, TripAdvisor, and Foursquare, have sprung up to provide a shared space where people can review and rate local businesses. Given the substantial impact of businesses' online ratings on their sales [2], many businesses list themselves on multiple websites to be discovered more easily. Some also engage in reputation management, which can range from rewarding customers for favorable reviews to running complex review campaigns, where armies of accounts post reviews to sway a business's average review score. Most previous work uses supervised machine learning and focuses only on textual and stylometric features [1, 3, 4, 7], and the ground-truth data these studies rely on is neither large nor comprehensive [4, 5, 6, 7, 8, 10]. These works also assume a limited threat model; e.g., an adversary's activity is assumed to appear near sudden shifts in the data [8], or only positive campaigns are considered. We propose OneReview, a system for finding fraudulent content on a crowd-sourced review site that leverages correlations with other, independent review sites together with textual and contextual features. We assume that an attacker cannot easily exert the same influence over a business's reputation on several websites, due to the increased cost. OneReview isolates anomalous changes in a business's reputation across multiple review sites to locate malicious activity without relying on specific patterns. Our intuition is that a business's reputation should not differ greatly across review sites; e.g., if a restaurant changes its chef or manager, the impact of that change should appear in reviews on all of the websites. OneReview applies change-point analysis to each business's reviews independently on each website, then uses our proposed Change Point Analyzer to evaluate the change points, detect those that do not match across the websites, and flag them as suspicious. It then uses supervised machine learning, with a combination of textual and metadata features, to locate fraudulent reviews among the suspicious ones. We evaluated our approach using data from two review websites, Yelp and TripAdvisor, to find fraudulent activity on Yelp. We obtained Yelp reviews through the Yelp Data Challenge [9] and used our Change Point Analyzer to correlate them with data crawled from TripAdvisor. Since realistic and varied ground-truth data is not currently available, we combined our change-point analysis with crowd-labeling to create a set of 5,655 labeled reviews. Using k-fold cross-validation (k=5) on this ground truth, we obtained 97% (+/- 0.01) accuracy, 91% (+/- 0.03) precision, and 90% (+/- 0.06) recall. Applying the model to the suspicious reviews classified 61,983 reviews, about 8% of all reviews, as fraudulent. We further detected fraudulent campaigns initiated by, or targeted at, specific businesses: we identified 3,980 businesses with fraudulent reviews, as well as 14,910 suspected spam accounts, at least 40% of whose reviews were classified as fraudulent. We also used community detection algorithms to locate several large astroturfing campaigns. These results demonstrate the effectiveness of OneReview in detecting fraudulent campaigns.
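The abstract does not spell out the Change Point Analyzer's internals, so the sketch below only illustrates the general idea in Python: build a monthly mean-rating series per site, detect change points, and flag site-A change points that have no site-B counterpart within a tolerance window. The function names, the crude mean-shift detector, and all thresholds here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of cross-site change-point comparison, assuming reviews
# arrive as (timestamp, rating) pairs per business per site. The detector
# and thresholds are stand-ins for the paper's change-point analysis.
from collections import defaultdict

import numpy as np


def monthly_means(reviews):
    """Aggregate (timestamp, rating) pairs into a sorted monthly-mean series;
    timestamps only need .year and .month attributes (e.g., datetime.date)."""
    buckets = defaultdict(list)
    for ts, rating in reviews:
        buckets[(ts.year, ts.month)].append(rating)
    months = sorted(buckets)
    return months, np.array([np.mean(buckets[m]) for m in months])


def change_points(series, min_shift=0.5, window=3):
    """Flag indices where the mean rating shifts by at least `min_shift`
    between the `window` months before and after -- a crude mean-shift test
    standing in for a full change-point analysis."""
    return [i for i in range(window, len(series) - window)
            if abs(series[i:i + window].mean()
                   - series[i - window:i].mean()) >= min_shift]


def unmatched(points_a, months_a, points_b, months_b, tol_months=2):
    """Return site-A change points with no site-B counterpart within
    `tol_months`; OneReview treats such unmatched points as suspicious
    and examines the surrounding reviews for fraud."""
    def to_ordinal(ym):
        return ym[0] * 12 + ym[1]

    b_times = [to_ordinal(months_b[i]) for i in points_b]
    return [months_a[i] for i in points_a
            if not any(abs(to_ordinal(months_a[i]) - bt) <= tol_months
                       for bt in b_times)]
```

For a business tracked on both sites, calling `unmatched` on the two sites' change points yields the months where one site's rating shifted while the other's did not; per the abstract, reviews around such points are what the classifier then inspects.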
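For the supervised step, the abstract reports 5-fold cross-validation over combined textual and metadata features. The following scikit-learn sketch shows one plausible wiring on synthetic data; the specific metadata columns (`rating`, `reviewer_review_count`), the random-forest model, and the toy corpus are all assumptions, since the paper's exact feature set and classifier are not given here.

```python
# Hedged sketch of a text+metadata fraud classifier evaluated with 5-fold
# cross-validation, mirroring the evaluation setup in the abstract.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n = 200  # toy corpus standing in for the 5,655 labeled reviews
labels = rng.integers(0, 2, n)  # 1 = fraudulent in this toy example
data = pd.DataFrame({
    "text": ["best place ever five stars" if y else "decent food slow service"
             for y in labels],
    "rating": np.where(labels == 1, 5, rng.integers(2, 5, n)),
    "reviewer_review_count": rng.integers(1, 50, n),
})

features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),                          # textual
    ("meta", "passthrough", ["rating", "reviewer_review_count"]),  # metadata
])
model = Pipeline([("features", features),
                  ("clf", RandomForestClassifier(random_state=0))])

scores = cross_validate(model, data, labels, cv=5,
                        scoring=("accuracy", "precision", "recall"))
for metric in ("accuracy", "precision", "recall"):
    vals = scores[f"test_{metric}"]
    print(f"{metric}: {vals.mean():.2f} +/- {vals.std():.2f}")
```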
Year
2019
DOI
10.1145/3308560.3316477
Venue
Companion Proceedings of The 2019 World Wide Web Conference
Keywords
Cross-Dataset Change-Point Analysis, Fraudulent Reviews
Field
Data science, Metadata, World Wide Web, Shared space, Threat model, Computer science, Intuition, Ground truth, Stylometry, Adversary, Reputation
DocType
Conference
ISBN
978-1-4503-6675-5
Citations
0
PageRank
0.34
References
0
Authors
5
Name                 Order  Citations  PageRank
Shirin Nilizadeh     1      133        6.92
Hojjat Aghakhani     2      2          0.71
Eric Gustafson       3      17         4.15
Christopher Kruegel  4      8799       516.05
Giovanni Vigna       5      7121       507.72