Title
Implementation of hybrid P2P networking distributed web crawler using AWS for smart work news big data
Abstract
Web crawlers collect and index the vast amount of data available online to gather specific types of objective data, such as news, that researchers or practitioners need. As big data are increasingly used in a variety of fields and web data grow exponentially each year, the importance of web crawlers is growing as well. Web servers that handle high traffic, such as portal news servers, have safeguards against security threats such as distributed denial-of-service (DDoS) attacks. Because a crawler generates a large amount of traffic to a web server, its behavior closely resembles a DDoS attack, so web servers tend to block crawler activity. A peer-to-peer (P2P) crawler can be used to solve this problem. However, a limitation of the pure P2P crawler is that the entire system is difficult to maintain when network traffic increases or errors occur. To overcome this limitation, we propose a hybrid P2P crawler that collects web data using the cloud service platform provided by Amazon Web Services (AWS). The hybrid P2P networking distributed web crawler using AWS (HP2PNC-AWS) is applied to collecting news on Korea’s current smart work lifestyle from three portal sites. On Portal A, where the target server does not block crawling, the HP2PNC-AWS is faster than the general web crawler (GWC) and slightly slower than the server/client distributed web crawler (SC-DWC), although its performance is similar to that of the SC-DWC. On Portals B and C, where the target server blocks crawling, the HP2PNC-AWS outperforms the other methods in both the collection rate and the number of data items collected. The results also confirm that the hybrid P2P networking system can work efficiently in web crawler architectures.
Year
2020
DOI
10.1007/s12083-019-00841-0
Venue
Peer-to-Peer Networking and Applications
Keywords
Hybrid P2P networking, Distributed web crawler, Amazon web server, Big data, Smart work
Field
Crawling, Denial-of-service attack, Computer science, Server, Computer network, Amazon web services, Web crawler, Big data, Web server, Cloud computing
DocType
Journal
Volume
13
Issue
2
ISSN
1936-6450
Citations
1
PageRank
0.35
References
0
Authors
4
Name            Order  Citations  PageRank
Yong-Young Kim  1      1          0.35
Yong-Ki Kim     2      2          2.12
Dae-Sik Kim     3      1          0.35
Mihye Kim       4      74         14.31