Title
k-means clustering and kNN classification based on negative databases
Abstract
Nowadays, privacy protection has become an important issue in data mining. k-means clustering and kNN classification are two popular data mining algorithms, which have been widely studied in the past decade. In this paper, we mainly study the problem of privacy protection during k-means clustering and kNN classification. Negative database (NDB) is a new type of data representation which can protect privacy as well as support distance estimation, so it is promising to apply NDBs to privacy-preserving k-means clustering and kNN classification. Existing privacy-preserving k-means clustering and kNN classification algorithms based on NDBs could effectively protect data privacy, but their clustering or classification performance has a non-negligible degradation. To alleviate this problem, we propose a new NDB generation algorithm (named QK-hidden algorithm). Compared to existing NDB generation algorithms, the QK-hidden algorithm employs a new set of parameters to control the selection of different bits when generating NDB records, and this enables a fine-grained control of the accuracy of distance estimation. By the QK-hidden algorithm, private data are converted into NDBs before inputting to k-means and kNN algorithms. Then, we propose an approach specialized for estimating Euclidean distance from the NDBs generated by the QK-hidden algorithm, and this approach is used instead of calculating Euclidean distance on plain text. Moreover, we design a model for estimating the cluster centers from NDBs for the k-means algorithm. The remaining steps of k-means and kNN algorithms keep the same to original versions. Experimental results demonstrate that the proposed algorithms could achieve more than 30% improvement on the clustering performance (in terms of average Davies- Bouldin Index) and 9% improvement on the classification performance (in terms of macro precision, recall and F1 score) in comparison with some existing works. (C) 2021 Elsevier B.V. All rights reserved.
Year
DOI
Venue
2021
10.1016/j.asoc.2021.107732
APPLIED SOFT COMPUTING
Keywords
DocType
Volume
Privacy protection, Negative database, kNN classification, Euclidean distance
Journal
110
ISSN
Citations 
PageRank 
1568-4946
0
0.34
References 
Authors
0
7
Name
Order
Citations
PageRank
Dongdong Zhao13420.62
Xiaoyi Hu2105.66
Shengwu Xiong318953.59
Jing Tian4426.49
Jianwen Xiang511323.36
Jing Zhou632754.75
Huanhuan Li711.12