Title
Optimizing Apache Nutch for Domain-Specific Crawling at Large Scale
Abstract
Focused crawls are key to acquiring data at large scale in order to implement systems like domain search engines and knowledge databases. Focused crawls introduce non-trivial problems on top of the already difficult problem of web-scale crawling. To address some of these issues, BCube, a building block of the National Science Foundation's EarthCube program, has developed a tailored version of Apache Nutch for data and web services discovery at scale. We describe how we started with a vanilla version of Apache Nutch and how we optimized and scaled it to reach gigabytes of discovered links and almost half a billion documents of interest crawled so far.
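The abstract does not spell out the tuning steps; purely as an illustrative sketch, and not the BCube configuration described in the paper, restricting a vanilla Apache Nutch crawl to a set of target domains typically involves conf/regex-urlfilter.txt, the db.ignore.external.links property, and the standard inject/generate/fetch/parse/updatedb cycle. The domain names, paths, and -topN value below are hypothetical placeholders.

    # conf/regex-urlfilter.txt -- accept only URLs on hypothetical target domains,
    # and reject everything else instead of keeping Nutch's default catch-all "+."
    +^https?://([a-z0-9-]+\.)*example-data-portal\.org/
    +^https?://([a-z0-9-]+\.)*example-geo-catalog\.edu/
    -.

    # conf/nutch-site.xml (fragment) -- do not follow links that leave the seed hosts
    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>

    # One iteration of the standard Nutch crawl cycle over a seed URL directory
    bin/nutch inject crawl/crawldb urls/
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    bin/nutch fetch crawl/segments/<segment>
    bin/nutch parse crawl/segments/<segment>
    bin/nutch updatedb crawl/crawldb crawl/segments/<segment>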
Year
2015
DOI
10.1109/BigData.2015.7363976
Venue
Big Data
Keywords
focused crawl, big data, Apache Nutch, data discovery
Field
Data mining, Data discovery, World Wide Web, Crawling, Web Services Discovery, Computer science, Gigabyte, Big data, Distributed web crawling
DocType
Conference
Citations
0
PageRank
0.34
References
6
Authors
3
Name                      Order  Citations  PageRank
Luis A. Lopez             1      0          0.34
Ruth E. Duerr             2      91         12.97
Siri Jodha Singh Khalsa   3      42         8.59