Title
Web crawler middleware for search engine digital libraries: a case study for CiteSeerX
Abstract
Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and their associated metadata into the crawl repository and database of the CiteSeerX digital library. The middleware is designed to be extensible: it provides a universal interface to the crawl database and supports input from multiple open source crawlers and archival formats such as ARC and WARC. It can also import files downloaded via FTP. To use the middleware with another crawler, the user only needs to write a new log parser that returns a resource object with the standard metadata attributes and tells the middleware how to access the downloaded files. When importing documents, users can specify document MIME types and obtain the text extracted from PDF/PostScript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface through which the user can submit import jobs. The middleware package can also handle supplemental jobs related to the crawl database and repository. Though designed for the CiteSeerX search engine, we feel this design would be appropriate for many search engine web crawling systems.
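The log-parser extension point described in the abstract can be illustrated with a minimal sketch. Nothing below is taken from the CDI code base: the `Resource` class, its attribute names, the tab-separated log layout, and the functions `parse_log` and `select_for_import` are all hypothetical, chosen only to show the idea of a parser that yields resource objects carrying metadata and a file location, followed by a MIME type filter.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Set


@dataclass
class Resource:
    """Hypothetical resource object; attribute names are assumptions."""
    url: str
    mime_type: str
    local_path: str                  # where the crawler saved the file
    parent_url: Optional[str] = None


def parse_log(lines: Iterable[str]) -> Iterator[Resource]:
    """Parse a made-up tab-separated crawl log.

    Assumed columns: url, MIME type, local path, optional parent URL.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip malformed entries
        parent = fields[3] if len(fields) > 3 and fields[3] else None
        yield Resource(fields[0], fields[1].lower(), fields[2], parent)


def select_for_import(resources: Iterable[Resource],
                      accepted_mime_types: Set[str]) -> List[Resource]:
    """Keep only resources whose MIME type the user asked to import."""
    return [r for r in resources if r.mime_type in accepted_mime_types]
```

Under these assumptions, supporting a different crawler would mean replacing only `parse_log`, while the filtering step that consumes `Resource` objects stays unchanged.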
Year: 2012
DOI: 10.1145/2389936.2389949
Venue: WIDM
Keywords: digital library citeseerx crawl, document context feature, middleware package, document mime type, postscript document, web user interface, web crawler middleware, search engine web, citeseerx search engine, search engine digital library, crawl database, case study, associated metadata, web crawling, middleware, search engine, ingestion, information retrieval
Field: Middleware, Data mining, Computer science, Digital library, File Transfer Protocol, Metadata, Middleware (distributed applications), World Wide Web, Information retrieval, Parsing, User interface, Web crawler, Database
DocType: Conference
Citations: 3
PageRank: 0.39
References: 2
Authors: 9
Name                        Order  Citations  PageRank
Jian Wu                     1      22         6.11
Pradeep Teregowda           2      57         3.94
Madian Khabsa               3      237        18.81
Stephen Carman              4      15         1.08
Douglas Jordan              5      26         1.66
Jose San Pedro Wandelmer    6      3          0.39
Xin Lu                      7      586        27.15
Prasenjit Mitra             8      2439       167.89
C. Lee Giles                9      11154      1549.48