Title
Tensor-based Graph Modularity for Text Data Clustering
Abstract
Graphs are used in several applications to represent similarities between instances. For text data, we can represent texts by different features such as bag-of-words, static embeddings (Word2vec, GloVe, etc.), and contextual embeddings (BERT, RoBERTa, etc.), yielding one similarity graph per representation. We posit that incorporating the local invariance within every graph and the consistency across different graphs leads to a consensus clustering that improves document clustering. This problem is complex and is made harder by the sparsity and noise in each graph. To this end, we rely on the modularity metric, which effectively evaluates graph clustering in such circumstances. We therefore present a novel approach for text clustering based on both a sparse tensor representation and graph modularity. This makes it possible to cluster texts (nodes) while capturing information arising from the different graphs. We iteratively maximize a Tensor-based Graph Modularity criterion. Extensive experiments on benchmark text clustering datasets show that the proposed algorithm, referred to as Tensor Graph Modularity (TGM), outperforms baseline methods on the clustering task. The source code is available at https://github.com/TGMclustering/TGMclustering.
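To make the idea of scoring one partition against several graphs concrete, here is a minimal sketch. It assumes the multi-graph criterion can be approximated by summing the standard Newman modularity of each graph for a shared set of cluster labels. This is an illustrative simplification, not the TGM criterion or optimization procedure from the paper, and the function names `modularity` and `multi_graph_modularity` as well as the toy graphs are hypothetical.

```python
def modularity(adj, labels):
    """Standard Newman modularity of one undirected graph.

    adj    -- symmetric adjacency matrix as a list of lists
    labels -- cluster label of each node
    """
    n = len(adj)
    degrees = [sum(row) for row in adj]
    two_m = sum(degrees)  # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # observed weight minus the weight expected under a random model
                q += adj[i][j] - degrees[i] * degrees[j] / two_m
    return q / two_m


def multi_graph_modularity(graphs, labels):
    """Sum of per-graph modularities for one shared partition.

    Assumption for illustration only: each graph contributes equally.
    """
    return sum(modularity(adj, labels) for adj in graphs)


# Two toy similarity graphs over the same four documents, standing in for
# graphs built from two different text representations.
bow_graph = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
embedding_graph = [
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
graphs = [bow_graph, embedding_graph]

# A partition that agrees with both graphs versus one that cuts across them.
good = multi_graph_modularity(graphs, [0, 0, 1, 1])
bad = multi_graph_modularity(graphs, [0, 1, 0, 1])
print(good > bad)
```

In this toy setting, the partition that groups documents 0 and 1 together and documents 2 and 3 together scores higher than the one that splits those pairs, which is the behaviour a consensus clustering over several graphs would be expected to reward.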
Year
2022
DOI
10.1145/3477495.3531834
Venue
SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Keywords
Text clustering, Tensor, Graphs, NLP, Word embedding
DocType
Conference
Citations
0
PageRank
0.34
References
11
Authors
5
Name | Order | Citations | PageRank
Rafika Boutalbi | 1 | 3 | 1.76
Mira Ait-Saada | 2 | 0 | 0.34
Anastasiia Iurshina | 3 | 0 | 0.34
Steffen Staab | 4 | 6658 | 593.89
Mohamed Nadif | 5 | 364 | 53.19