Title
A Meaningful Information Extraction System for Interactive Analysis of Documents
Abstract
This paper is part of a project that aims to discover weak signals in different streams of information, possibly sent by whistleblowers. The study tackles the particular problem of clustering topics at multiple levels from multiple documents and then extracting meaningful descriptors, such as weighted lists of words, to represent documents in a multidimensional space. In this context, we present a novel idea that combines Latent Dirichlet Allocation and Word2vec (which provides a consistency metric for the partitioned topics) as a potential method for limiting the need to fix "a priori" the number of clusters K, as classical partitioning approaches usually require. We propose two implementations of this idea, which respectively (1) find the best K for LDA in terms of topic consistency and (2) gather the optimal clusters from different levels of clustering. We also propose a non-traditional visualization approach based on a multi-agent system that combines dimension reduction and interactivity.
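The idea of choosing K by topic consistency can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the top words of each topic have already been extracted (for example by LDA) for every candidate K, that word vectors are available (for example from Word2vec), and that consistency is measured as the mean pairwise cosine similarity of a topic's top words. The function names, the metric, and the toy data are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_consistency(top_words, vectors):
    """Mean pairwise cosine similarity of a topic's top words.

    This metric is an assumption made for illustration; words
    without a vector are ignored.
    """
    known = [vectors[w] for w in top_words if w in vectors]
    pairs = list(combinations(known, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def best_k(topics_by_k, vectors):
    """Return the K with the highest average topic consistency, plus all scores.

    Each candidate K is assumed to map to a non-empty list of topics.
    """
    scores = {
        k: sum(topic_consistency(t, vectors) for t in topics) / len(topics)
        for k, topics in topics_by_k.items()
    }
    return max(scores, key=scores.get), scores


# Toy word vectors standing in for Word2vec embeddings (hypothetical values).
vectors = {
    "leak": [1.0, 0.0], "memo": [0.9, 0.1],
    "river": [0.0, 1.0], "flood": [0.1, 0.9],
}

# Toy topics standing in for LDA output at K=1 and K=2 (hypothetical values).
topics_by_k = {
    1: [["leak", "memo", "river", "flood"]],
    2: [["leak", "memo"], ["river", "flood"]],
}

k, scores = best_k(topics_by_k, vectors)
```

On this toy data the two-topic partition scores higher, because each of its topics groups words whose vectors point in nearly the same direction, whereas the single topic mixes unrelated words.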
Year: 2019
DOI: 10.1109/ICDAR.2019.00024
Venue: 2019 International Conference on Document Analysis and Recognition (ICDAR)
Keywords: weak signal, clustering topics, word embedding, multi-agent system, visualization
Field: Latent Dirichlet allocation, Dimensionality reduction, Information retrieval, Pattern recognition, Computer science, Visualization, Multi-agent system, Information extraction, Artificial intelligence, Word2vec, Word embedding, Cluster analysis
DocType: Conference
ISSN: 1520-5363
ISBN: 978-1-7281-3015-6
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name               Order   Citations   PageRank
Julien Maitre      1       0           0.34
Michel Ménard      2       0           0.34
Guillaume Chiron   3       8           2.27
Alain Bouju        4       93          15.32
Nicolas Sidere     5       24          7.00