Title
Balanced Word Clusters for Interpretable Document Representation
Abstract
We present Bag-of-Balanced-Concepts (BOBC), a document representation method for fuzzy and interpretable similarity estimation based on word clusters. For this purpose, a k-medoid variant is proposed, which iteratively resamples small clusters to introduce a tendency towards balanced cluster sizes. The necessary inter-word similarities for clustering are computed using GloVe or word2vec word embeddings. In this way, words that often share contexts tend to appear in the same clusters. Those clusters are used to represent documents as normalized probability distributions. Various distance measures acting as document dissimilarity estimators have been evaluated on five datasets. The impact of clustering parameters, input word vectors, and inverse document frequency weighting has been examined in our experiments. Furthermore, a comparison with document similarity estimation baselines has been performed. We demonstrate that, on average, our approach outperforms cosine similarity of both weighted Bag-of-Words vectors (TF-IDF and BM25) and word embedding centroids (Word Centroid Distance).
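The two steps the abstract describes, clustering words with a tendency towards balanced cluster sizes and representing a document as a normalized distribution over those clusters, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`balanced_k_medoids`, `bobc_vector`), the `min_size` threshold, the fixed iteration count, and the rule of resampling the medoid of any undersized cluster uniformly at random are all assumptions made for illustration. Cosine similarity on the embedding vectors stands in for the inter-word similarity, and IDF weighting is omitted.

```python
import numpy as np


def balanced_k_medoids(vectors, k, min_size, n_iter=10, seed=0):
    """Toy k-medoids on cosine similarity (requires k < number of words).

    Assumption: clusters with fewer than `min_size` members get a freshly
    sampled random medoid; this only nudges cluster sizes towards balance.
    """
    rng = np.random.default_rng(seed)
    # Normalize rows so the dot product equals cosine similarity
    # (assumes no all-zero embedding vectors).
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    n = len(unit)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every word to its most similar medoid.
        labels = np.argmax(sim[:, medoids], axis=1)
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members) < min_size:
                # Resample step (hypothetical rule): pick a new medoid
                # among the words that are not currently medoids.
                candidates = np.setdiff1d(np.arange(n), medoids)
                medoids[c] = rng.choice(candidates)
            else:
                # Standard medoid update: the member with the highest
                # total similarity to the members of its cluster.
                within = sim[np.ix_(members, members)].sum(axis=1)
                medoids[c] = members[np.argmax(within)]
    # Final assignment against the last set of medoids.
    return np.argmax(sim[:, medoids], axis=1)


def bobc_vector(tokens, word_to_cluster, k):
    """Represent a document as a normalized distribution over k clusters."""
    counts = np.zeros(k)
    for token in tokens:
        cluster = word_to_cluster.get(token)
        if cluster is not None:  # skip out-of-vocabulary words
            counts[cluster] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```

Two documents represented this way can then be compared with any distance measure defined on probability distributions. Which measure works best is an empirical question that the abstract says was evaluated on five datasets.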
Year
2019
DOI
10.1109/ICDARW.2019.40089
Venue
2019 International Conference on Document Analysis and Recognition Workshops (ICDARW)
Keywords
Document Representation, Word Clustering, Inter Document Similarity, Bag-of-Concepts, Word Embeddings
Field
Weighting, Cosine similarity, tf–idf, Pattern recognition, Computer science, Artificial intelligence, Word2vec, Word embedding, Cluster analysis, Centroid, Distance measures
DocType
Conference
Volume
5
ISSN
1520-5363
ISBN
978-1-7281-5055-0
Citations
0
PageRank
0.34
References
5
Authors
2
Name
Order
Citations
PageRank
Marco Wrzalik101.35
Dirk Krechel24413.19