Abstract |
---|
In social networks, the identity linkage problem is to determine whether a pair of user identities on two social networks belongs to the same individual. Prior works typically first collect ground-truth datasets of user identities across social networks that belong to the same individuals, and then build machine learning models driven by features extracted from those identities. User behaviors on different social networks drive the construction of these datasets, and as a consequence, behavioral biases become manifest in them. Our work performs a detailed investigation of these dataset biases, an aspect that has mostly remained under-explored in identity linkage research. More specifically, we characterize, detect, and quantify behavioral biases in the dataset that manifest as lexical differences in user-generated content, particularly in the usernames and display names configured by users. We study these biases on more than 1 million user identity pairs obtained by leveraging two user behaviors, namely cross-posting and self-disclosure. We find that users who self-disclose their usernames and display names on different social networks show higher lexical similarity than users who cross-post. These behavioral biases lower the performance (precision and recall) of learning models by 5-20%. Inspired by discrimination measurement metrics, we propose and implement a framework to quantify the extent of these biases and find that 15-20% of test data are affected. |
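The abstract's central measurement is lexical similarity between usernames (or display names) of a candidate identity pair. The paper does not specify its similarity metric here, so the sketch below is only an illustration using a normalized matching-ratio from Python's standard library; the function name, the metric choice, and the example name pairs are all assumptions, not the authors' method.

```python
from difflib import SequenceMatcher

def lexical_similarity(name_a: str, name_b: str) -> float:
    """Normalized matching-block ratio (0.0-1.0) between two names,
    case-insensitive. Hypothetical stand-in for the paper's metric."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

# Hypothetical pairs: a self-disclosed pair is often near-identical,
# while a cross-posted pair can differ substantially.
self_disclosed = lexical_similarity("john_doe", "johndoe")
cross_posted = lexical_similarity("john_doe", "jd_traveller")
print(f"self-disclosed pair: {self_disclosed:.2f}")
print(f"cross-posted pair:   {cross_posted:.2f}")
```

Under such a metric, the behavioral bias the paper reports would show up as a systematically higher similarity distribution for self-disclosure pairs than for cross-posting pairs.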
Year | DOI | Venue |
---|---|---|
2020 | 10.1145/3341105.3374015 | SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing, Brno, Czech Republic, March 2020 |
Keywords | DocType | ISBN |
---|---|---|
Bias Detection, Online Social Networks, Data Mining | Conference | 978-1-4503-6866-7 |
Citations | PageRank | References |
---|---|---|
0 | 0.34 | 0 |
Authors |
---|
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Rishabh Kaushal | 1 | 0 | 1.35 |
Shubham Gupta | 2 | 0 | 0.34 |
Ponnurangam Kumaraguru | 3 | 192 | 16.59 |