Title
An efficient document clustering algorithm and its application to a document browser
Abstract
We present an efficient document clustering algorithm that represents each document by a term frequency vector instead of relying on a huge proximity matrix. The algorithm has the following features: (1) it requires relatively little memory and runs fast, (2) it produces a hierarchy in the form of a document classification tree, and (3) the resulting hierarchy explicitly reveals the structure of the collection. We confirm these features, and thus show the algorithm's feasibility, through clustering experiments on two collections of Japanese documents containing 83,099 and 14,701 documents. We also introduce an application of the algorithm to a document browser used in our Japanese-to-English translation aid system. The browsing module of the system consists of a huge database of Japanese news articles and their English translations. The Japanese article collection is clustered into a hierarchy by our method. Since each node in the hierarchy corresponds to a topic in the collection, the hierarchy can be used to access articles directly by topic. A user can learn general translation knowledge for each topic by browsing the Japanese articles and their English translations. We also discuss techniques for presenting a large tree-structured hierarchy on a computer screen.
Year
1999
DOI
10.1016/S0306-4573(98)00056-9
Venue
Information Processing & Management
Keywords
Document clustering, Document retrieval, Automatic document organization
Field
Document classification, Data mining, Information retrieval, Computer science, Document clustering, Information access, Algorithm, Cluster grouping, Document retrieval, Hierarchy, Cluster analysis
DocType
Journal
Volume
35
Issue
4
ISSN
0306-4573
Citations
3
PageRank
0.38
References
11
Authors
4
Name                Order   Citations   PageRank
Hideki Tanaka       1       80          15.07
Tadashi Kumano      2       20          4.23
Noriyoshi Uratani   3       63          15.55
Terumasa Ehara      4       97          17.21