An efficient document clustering algorithm and its application to a document browser - Citegraph

Paper Info

Title
An efficient document clustering algorithm and its application to a document browser

Abstract
We present an efficient document clustering algorithm that uses a term frequency vector for each document instead of a huge proximity matrix. The algorithm has the following features: (1) it requires a relatively small amount of memory and runs fast, (2) it produces a hierarchy in the form of a document classification tree, and (3) the hierarchy obtained by the algorithm explicitly reveals the structure of the collection. We confirm these features, and thus show the algorithm's feasibility, through clustering experiments on two collections of Japanese documents containing 83,099 and 14,701 documents. We also introduce an application of this algorithm to a document browser. This browser is used in our Japanese-to-English translation aid system. The browsing module of the system consists of a huge database of Japanese news articles and their English translations. The Japanese article collection is clustered into a hierarchy by our method. Since each node in the hierarchy corresponds to a topic in the collection, we can use the hierarchy to directly access articles by topic. A user can learn general translation knowledge for each topic by browsing the Japanese articles and their English translations. We also discuss techniques for presenting a large tree-structured hierarchy on a computer screen.
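
The abstract does not spell out the steps of the algorithm, so the following is only an illustrative sketch and not the authors' method. It shows the general idea the abstract names: building a cluster hierarchy from per-document term frequency vectors, comparing documents only against cluster centroids so that no pairwise proximity matrix is ever stored. The splitting strategy (a deterministic two-way split), the whitespace tokenizer, and all function names (`tf_vector`, `bisect`, `build_tree`) are assumptions made for this example.

```python
from collections import Counter
import math

def tf_vector(text):
    """Term frequency vector for one document (whitespace tokens; an assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term frequency vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Sum of vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def bisect(ids, vecs, rounds=5):
    """Split ids into two groups using a deterministic two-centroid pass."""
    # Seeds: the first document and the document least similar to it.
    first = ids[0]
    far = min(ids[1:], key=lambda i: cosine(vecs[first], vecs[i]))
    seeds = [vecs[first], vecs[far]]
    left, right = [], []
    for _ in range(rounds):
        left, right = [], []
        for i in ids:
            closer_to_left = cosine(vecs[i], seeds[0]) >= cosine(vecs[i], seeds[1])
            (left if closer_to_left else right).append(i)
        if not left or not right:
            break
        seeds = [centroid([vecs[i] for i in left]),
                 centroid([vecs[i] for i in right])]
    return left, right

def build_tree(ids, vecs, min_size=2):
    """Return a nested tuple tree; leaves are lists of document ids."""
    if len(ids) <= max(min_size, 1):
        return ids
    left, right = bisect(ids, vecs)
    if not left or not right:
        return ids
    return (build_tree(left, vecs, min_size), build_tree(right, vecs, min_size))

# Toy collection (hypothetical data, not from the paper).
docs = [
    "stock market shares fall",
    "market shares rise stock",
    "soccer match goal win",
    "goal soccer team match",
]
vecs = [tf_vector(d) for d in docs]
tree = build_tree(list(range(len(docs))), vecs)
print(tree)  # ([0, 1], [2, 3])
```

Memory in this sketch grows with the number of documents times their vocabulary size, whereas a full proximity matrix for 83,099 documents would need on the order of 83,099² entries.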

Year	DOI	Venue
1999	10.1016/S0306-4573(98)00056-9	Information Processing & Management
Keywords	Field	DocType
Document clustering, Document retrieval, Automatic document organization	Document classification, Data mining, Information retrieval, Computer science, Document clustering, Information access, Algorithm, Cluster grouping, Document retrieval, Hierarchy, Cluster analysis	Journal
Volume	Issue	ISSN
35	4	0306-4573
Citations	PageRank	References
3	0.38	11
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Hideki Tanaka	1	80	15.07
Tadashi Kumano	2	20	4.23
Noriyoshi Uratani	3	63	15.55
Terumasa Ehara	4	97	17.21
