Title
Optimizing Apache Nutch for Domain-Specific Crawling at Large Scale
Abstract
Focused crawls are key to acquiring data at large scale in order to implement systems like domain search engines and knowledge databases. Focused crawls introduce non-trivial problems on top of the already difficult problem of web-scale crawling. To address some of these issues, BCube, a building block of the National Science Foundation's EarthCube program, has developed a tailored version of Apache Nutch for data and web services discovery at scale. We describe how we started with a vanilla version of Apache Nutch and how we optimized and scaled it to reach gigabytes of discovered links and almost half a billion documents of interest crawled so far.
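The abstract does not spell out the tuning steps; purely as an illustrative sketch, and not the BCube configuration described in the paper, restricting a vanilla Apache Nutch crawl to a set of target domains typically involves conf/regex-urlfilter.txt, the db.ignore.external.links property, and the standard inject/generate/fetch/parse/updatedb cycle. The domain names, paths, and -topN value below are hypothetical placeholders.

    # conf/regex-urlfilter.txt -- accept only URLs on hypothetical target domains,
    # and reject everything else instead of keeping Nutch's default catch-all "+."
    +^https?://([a-z0-9-]+\.)*example-data-portal\.org/
    +^https?://([a-z0-9-]+\.)*example-geo-catalog\.edu/
    -.

    # conf/nutch-site.xml (fragment) -- do not follow links that leave the seed hosts
    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
    </property>

    # One iteration of the standard Nutch crawl cycle over a seed URL directory
    bin/nutch inject crawl/crawldb urls/
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000
    bin/nutch fetch crawl/segments/<segment>
    bin/nutch parse crawl/segments/<segment>
    bin/nutch updatedb crawl/crawldb crawl/segments/<segment>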
Year
2015
DOI
10.1109/BigData.2015.7363976
Venue
Big Data
Keywords
focused crawl, big data, Apache Nutch, data discovery
Field
Data mining, Data discovery, World Wide Web, Crawling, Web Services Discovery, Computer science, Gigabyte, Big data, Distributed web crawling
DocType
Conference
Citations
0
PageRank
0.34
References
6
Authors
3
Name                      Order  Citations  PageRank
Luis A. Lopez             1      0          0.34
Ruth E. Duerr             2      91         12.97
Siri Jodha Singh Khalsa   3      42         8.59