Title
An Intelligent System For Focused Crawling From Big Data Sources
Abstract
Nowadays, the proper management of data is a key business enabler for companies seeking to increase their competitiveness. Typically, companies hold massive amounts of data on their servers, which might include previously offered services, proposals, bids, and so on. They rely on their expert managers to analyse these data manually in order to make strategic decisions. However, given the huge amount of information to be analysed and the need to make timely decisions, managers often exploit only a small part of the available data, which often does not yield effective choices. For instance, this happens in the e-procurement domain, where bids for new calls for tender are often formulated by looking at only some of a company's past proposals. Driven by extensive experience in the e-procurement domain, in this paper we propose an intelligent system to support organisations in the focused crawling of artefacts of interest (calls for tender, BIMs, equipment, policies, market trends, and so on) from the web, semantically matching them against internal Big Data and knowledge sources, so as to let company analysts make better strategic decisions. The novel contribution consists of an extension of the K-means algorithm used by a web crawler within the proposed system, and a semantic module that exploits search patterns to find relevant data within the crawled artefacts. The proposed solution has been implemented and extensively assessed in the e-procurement domain. It has subsequently been extended to other domains, such as robot programming and cloud providing, among others. Since, to the best of our knowledge, no similar systems exist in the literature, we have compared its crawling component against similar crawlers, by plugging them into our system, in order to demonstrate its effectiveness.
Year
2021
DOI
10.1016/j.eswa.2021.115560
Venue
EXPERT SYSTEMS WITH APPLICATIONS
Keywords
Big Data analytics, Focused crawling, Intelligent system, Natural language processing, Data clustering, Big Data visualisation
DocType
Journal
Volume
184
ISSN
0957-4174
Citations
0
PageRank
0.34
References
0
Authors
5
Name                Order  Citations  PageRank
I Bifulco           1      0          0.34
Stefano Cirillo     2      0          2.37
Christian Esposito  3      569        54.78
R Guadagni          4      0          0.34
Giuseppe Polese     5      263        38.68