A Hidden Topic-Based Framework toward Building Applications with Short Web Documents - Citegraph

Paper Info

Title
A Hidden Topic-Based Framework toward Building Applications with Short Web Documents

Abstract
This paper introduces a hidden topic-based framework for processing short and sparse documents (e.g., search result snippets, product descriptions, book/movie summaries, and advertising messages) on the Web. The framework focuses on solving two main challenges posed by these kinds of documents: 1) data sparseness and 2) synonyms/homonyms. The former leads to the lack of shared words and contexts among documents while the latter are big linguistic obstacles in natural language processing (NLP) and information retrieval (IR). The underlying idea of the framework is that common hidden topics discovered from large external data sets (universal data sets), when included, can make short documents less sparse and more topic-oriented. Furthermore, hidden topics from universal data sets help handle unseen data better. The proposed framework can also be applied for different natural languages and data domains. We carefully evaluated the framework by carrying out two experiments for two important online applications (Web search result classification and matching/ranking for contextual advertising) with large-scale universal data sets and we achieved significant results.
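The core idea in the abstract, enriching a short document with hidden topics learned from external data, can be illustrated with a minimal sketch. It is not the authors' implementation: the `WORD_TOPIC` table, the topic names, the `enrich` function, and the 0.5 threshold are all hypothetical stand-ins for a topic model estimated on a universal data set.

```python
from collections import defaultdict

# Hypothetical word-to-topic weights. In the framework these would come from
# a topic model estimated on a large external ("universal") data set; here
# they are hand-written purely for illustration.
WORD_TOPIC = {
    "apple":  {"tech": 0.6, "food": 0.4},   # ambiguous word
    "iphone": {"tech": 1.0},
    "laptop": {"tech": 1.0},
    "pie":    {"food": 1.0},
    "recipe": {"food": 1.0},
}


def topic_distribution(tokens):
    """Average the topic weights of known tokens; unknown tokens are skipped."""
    totals = defaultdict(float)
    known = 0
    for tok in tokens:
        weights = WORD_TOPIC.get(tok)
        if weights is None:
            continue
        known += 1
        for topic, weight in weights.items():
            totals[topic] += weight
    if known == 0:
        return {}
    return {topic: total / known for topic, total in totals.items()}


def enrich(snippet, threshold=0.5):
    """Return the snippet's tokens plus one pseudo-token per dominant topic."""
    tokens = snippet.lower().split()
    dist = topic_distribution(tokens)
    topic_tokens = sorted(f"topic:{t}" for t, p in dist.items() if p >= threshold)
    return tokens + topic_tokens


if __name__ == "__main__":
    for text in ("apple pie", "recipe", "Apple iPhone"):
        print(text, "->", enrich(text))
```

In this toy setup, "apple pie" and "recipe" share no words but both gain the `topic:food` token, which is the sparseness effect the abstract describes, while "Apple iPhone" gains `topic:tech` instead, which hints at how topic context can separate senses of an ambiguous word.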

Year
2011

DOI
10.1109/TKDE.2010.27

Venue
IEEE Trans. Knowl. Data Eng.

Keywords
product descriptions,ranking,search result snippets,hidden topic-based framework,sparse data,hidden topic based framework,short web documents,web mining,proposed framework,data sparseness,information retrieval,large external data set,building applications,unseen data,hidden topic,internet,matching,large-scale universal data set,data domain,universal data set,classification,natural language processing,advertising messages,sparse documents,document handling,common hidden topic,hidden topic analysis,advertising,predictive models,information security,contextual advertising,prediction model,natural language,data mining

Field
Data mining,Contextual advertising,Web mining,Data domain,Ranking,Information retrieval,Computer science,Natural language,Homonym,Sparse matrix,The Internet

DocType
Journal

Volume
23

Issue
7

ISSN
1041-4347

Citations
50

PageRank
1.72

References
24
Authors
6

Authors (6 rows)


Name	Order	Citations	PageRank
Xuan-Hieu Phan	1	322	21.37
Cam-Tu Nguyen	2	139	12.40
Dieu-Thu Le	3	68	4.85
Le-Minh Nguyen	4	287	22.43
Susumu Horiguchi	5	1002	113.41
Quang-Thuy Ha	6	219	27.89
