Title
PARADISE Based Search Engine at TREC 2009 Web Track
Abstract
In this paper, we introduce the PARADISE search engine used in the TREC 2009 Web Track. PARADISE stands for Platform for Applying, Research and Developing Intelligent Search Engine, a search engine platform developed by the SEWM group at Peking University. The system is designed to support both English and Chinese information retrieval. It preprocessed and indexed the five hundred million web pages of this year's Web Track. In the preprocessing stage, templates were removed, encodings were identified and unified, and anchor texts and in-link information were extracted with the MapReduce framework (using Hadoop in this system). In retrieval, our runs used an extension of BM25. This model distinguishes terms from different fields and integrates both term counts and position information. Furthermore, some web-based features are also considered.
Year
2009
Venue
TREC
Keywords
term proximity, information retrieval, system design, search engine, internet, preprocessing, anchor text, social communication, templates, web pages, indexation, china, information sciences
Field
Web search engine, Data mining, Web page, Computer science, Natural language processing, Artificial intelligence, Web application, Text processing, The Internet, Search engine, Information retrieval, Information science, Preprocessor
DocType
Conference
Citations
0
PageRank
0.34
References
6
Authors
4
Name             Order   Citations   PageRank
Dongdong Shan    1       128         6.11
Dongsheng Zhao   2       64          6.88
Jing He          3       537         19.00
Hongfei Yan      4       763         35.67