Title
Out-of-Category Document Identification Using Target-Category Names as Weak Supervision
Abstract
Identifying outlier documents, whose content differs from that of the majority of documents in a corpus, plays an important role in managing large text collections. However, because explicit information about the inlier (or target) distribution is absent, existing unsupervised outlier detectors tend to produce unreliable results that depend on the density or diversity of the outliers in the corpus. To address this challenge, we introduce a new task, out-of-category detection, which aims to distinguish documents according to their semantic relevance to the inlier (or target) categories, using the category names as weak supervision. In practice, this task is widely applicable: the scope of the target categories can be flexibly designated according to users' interests, and only the target-category names are required as minimal guidance. In this paper, we present an out-of-category detection framework that effectively measures how confidently each document belongs to one of the target categories. Our framework adopts a two-step approach to take advantage of both (i) a discriminative text embedding and (ii) a neural text classifier. Experiments on real-world datasets demonstrate that our framework outperforms all baseline methods in various scenarios that specify different target categories.
Year
2021
DOI
10.1109/ICDM51629.2021.00041
Venue
2021 21st IEEE International Conference on Data Mining (ICDM 2021)
Keywords
Text outlier detection, Out-of-category detection, Discriminative text embedding, Weakly supervised classification
DocType
Conference
ISSN
1550-4786
Citations
0
PageRank
0.34
References
6
Authors
4
Name            Order   Citations   PageRank
Dongha Lee      1       14          6.77
Dongmin Hyun    2       0           0.34
Jiawei Han      3       43085       3824.48
Hwanjo Yu       4       1715        114.02