Title
Integrating product data from websites offering microdata markup
Abstract
Large numbers of websites have started to mark up their content using standards such as Microdata, Microformats, and RDFa. The marked-up content elements comprise descriptions of people, organizations, places, events, products, ratings, and reviews. This development has accelerated in recent years as major search engines such as Google, Bing, and Yahoo! use the markup to improve their search results. Embedding semantic markup facilitates identifying content elements on webpages. However, the markup is mostly not as fine-grained as desirable for applications that aim to integrate data from large numbers of websites. This paper discusses the challenges that arise in the task of integrating descriptions of electronic products from several thousand e-shops that offer Microdata markup. We present a solution for each step of the data integration process, including Microdata extraction, product classification, product feature extraction, identity resolution, and data fusion. We evaluate our processing pipeline using 1.9 million product offers from 9240 e-shops, which we extracted from the Common Crawl 2012, a large public Web corpus.
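To illustrate the first pipeline step named in the abstract (Microdata extraction), the following is a minimal sketch and not the authors' implementation. It assumes pages annotate products with a schema.org `Product` item; the class `ProductMicrodataParser`, the helper `extract_products`, and the sample HTML are hypothetical names introduced here. Nested items are flattened into the enclosing product, which a real extractor would handle separately.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class ProductMicrodataParser(HTMLParser):
    """Collects itemprop values found inside schema.org Product items (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.products = []     # one dict of property -> value per product item
        self._depth = 0        # nesting depth inside the current product; 0 means outside
        self._prop = None      # property whose text content is being collected
        self._prop_depth = 0   # depth of the element that opened self._prop

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self._depth == 0:
            # Assumption: a product is any itemscope whose itemtype ends in "/Product".
            if "itemscope" in a and (a.get("itemtype") or "").endswith("/Product"):
                self.products.append({})
                self._depth = 1
            return
        if tag not in VOID_TAGS:
            self._depth += 1
        prop = a.get("itemprop")
        if prop is None:
            return
        if "content" in a:
            # e.g. <meta itemprop="brand" content="Acme">
            self.products[-1][prop] = a["content"]
        elif tag not in VOID_TAGS:
            self._prop, self._prop_depth = prop, self._depth
            self.products[-1][prop] = ""

    def handle_data(self, data):
        if self._prop is not None:
            self.products[-1][self._prop] += data

    def handle_endtag(self, tag):
        if self._depth == 0 or tag in VOID_TAGS:
            return
        if self._prop is not None and self._depth == self._prop_depth:
            self.products[-1][self._prop] = self.products[-1][self._prop].strip()
            self._prop = None
        self._depth -= 1


def extract_products(html):
    """Return a list of property dicts, one per Product item found in the HTML string."""
    parser = ProductMicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.products


# Hypothetical sample offer, for illustration only.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Acme Phone X</span>
  <meta itemprop="brand" content="Acme">
  <span itemprop="price">199.00</span>
</div>
"""

if __name__ == "__main__":
    print(extract_products(SAMPLE_HTML))
```

The sketch also hints at why the abstract calls the markup "not as fine-grained as desirable": attributes such as screen size or storage capacity typically sit inside a free-text name or description property, so later steps still have to extract product features from text.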
Year: 2014
DOI: 10.1145/2567948.2579704
Venue: WWW (Companion Volume)
Keywords: content element,semantic markup,data fusion,microdata markup,electronic product,Microdata markup,marked-up content element,data integration process,Microdata extraction,large number,Integrating product data,large public Web corpus
Field: Data integration,Data mining,World Wide Web,Information retrieval,Web page,Computer science,Feature extraction,Information extraction,Semantic HTML,Microdata (HTML),Product classification,Markup language
DocType: Conference
Citations: 9
PageRank: 0.90
References: 7
Authors: 3
Name             Order  Citations  PageRank
Petar Petrovski  1      54         4.00
Volha Bryl       2      180        14.46
Christian Bizer  3      8448       524.93