Title
Generating Labeled Datasets of Twitter Users.
Abstract
In this paper we present a simple yet powerful approach to generating labeled datasets of Twitter users. We focus on sensitive personal details that are shared as background information in tweets. Such tweets do not make the detail the focus of the user's attention, and they tend to be less affected by the humor, wishes, and hypothetical thinking typical of tweets. Our approach combines the selection of search queries with a subsequent semi-supervised filtering of indicative messages. We create datasets in several unrelated domains and show that a wide range of target groups can be built with minimal manual annotation effort. The generated datasets include separate groups of users with specific characteristics: pet ownership, blood pressure, diabetes, and psychotropic medicine usage, for which, to our knowledge, manually labeled data was not previously available. We also use our search-based approach to generate a cross-domain corpus that matches Twitter users with their Yelp profiles.
Year
2017
DOI
10.1145/3099023.3099048
Venue
UMAP (Adjunct Publication)
Field
Personal details, World Wide Web, Computer science, Labeled data
DocType
Conference
Citations
0
PageRank
0.34
References
10
Authors
3
Name            Order  Citations  PageRank
Yasen Kiprov    1      15         4.94
Pepa Gencheva   2      29         8.87
Ivan Koychev    3      57         6.16