Title
Investigation of biases in identity linkage datasets
Abstract
In social networks, the identity linkage problem is to determine whether a pair of user identities on two social networks belongs to the same individual. Prior work typically first collects ground-truth datasets of user identities across social networks that belong to the same individuals and then builds a machine learning model driven by features of those identities. User behaviors on different social networks drive the construction of these datasets, and as a consequence, behavioral biases manifest in them. Our work performs a detailed investigation of these dataset biases, a topic that has remained largely under-explored in identity linkage research. More specifically, we characterize, detect, and quantify behavioral biases in the dataset that manifest as lexical differences in user-generated content, particularly in the usernames and display names configured by users. We study these biases on more than 1 million user identity pairs obtained by leveraging two user behaviors, namely cross-posting and self-disclosure. We find that users who self-disclose their usernames and display names on different social networks show higher lexical similarity than users who cross-post. These behavioral biases lower the performance (precision and recall) of learning models by 5-20%. Inspired by discrimination measurement metrics, we propose and implement a framework to quantify the extent of these biases and find that 15-20% of the test data is affected.
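The lexical comparison described in the abstract can be illustrated with a minimal sketch. It assumes `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the abstract does not name the metric the authors used. The function names and the username pairs are hypothetical and invented purely for illustration, and they are not data from the paper.

```python
from difflib import SequenceMatcher
from statistics import mean


def lexical_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two strings.

    SequenceMatcher is an assumed stand-in, not the paper's metric.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def mean_similarity(pairs):
    """Average lexical similarity over a list of (name_a, name_b) pairs."""
    return mean(lexical_similarity(a, b) for a, b in pairs)


# Hypothetical username pairs, grouped by how the linkage was obtained.
self_disclosure_pairs = [("alice_k", "alice_k"), ("bob.smith", "bobsmith")]
cross_posting_pairs = [("alice_k", "ak_photos"), ("bob.smith", "bsmith_1990")]

# A positive gap means the self-disclosure group is lexically more similar.
gap = mean_similarity(self_disclosure_pairs) - mean_similarity(cross_posting_pairs)
print(f"similarity gap between groups: {gap:.2f}")
```

A gap of this kind between the two collection behaviors is one simple way to see why a model trained on pairs from one behavior might perform differently on pairs from the other.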
Year
2020
DOI
10.1145/3341105.3374015
Venue
SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing, Brno, Czech Republic, March 2020
Keywords
Bias Detection, Online Social Networks, Data Mining
DocType
Conference
ISBN
978-1-4503-6866-7
Citations
0
PageRank
0.34
References
0
Authors
3
Name                    Order  Citations  PageRank
Rishabh Kaushal         1      0          1.35
Shubham Gupta           2      0          0.34
Ponnurangam Kumaraguru  3      192        16.59