Title
Lossless Separation of Web Pages into Layout Code and Data
Abstract
A modern web page is often served by running layout code on data, producing an HTML document that enhances the data with front and back matter and with layout/style operations. In this paper, we consider the opposite task: separating a given web page into a data component and a layout program. This separation has various important applications: page encoding may be significantly more compact (reducing web traffic), data representation is normalized across web designs (facilitating wrapping, retrieval and extraction), and repetitions are diminished (expediting site updates and redesign). We present a framework for defining the separation task, and devise an algorithm for synthesizing layout code from a web page while distilling its data in a lossless manner. The main idea is to synthesize layout code hierarchically for parts of the page, and use a combined program-data representation cost to decide whether to align intermediate programs. When intermediate programs are aligned, they are transformed into a single program, possibly with loops and conditionals. At the same time, differences between the aligned programs are captured by the data component such that executing the layout code on the data results in the original page. We have implemented our approach and conducted a thorough experimental study of its effectiveness. Our experiments show that our approach achieves state-of-the-art or better performance in both size compression and record extraction.
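The lossless property described in the abstract (executing the layout code on the data reproduces the original page) can be illustrated with a minimal sketch. This is not the paper's synthesis algorithm: the page, the data records, and the names PAGE, DATA, layout and is_lossless are all hypothetical, and the separation is written by hand rather than synthesized.

```python
# Hypothetical original page: three list items that share one structure.
PAGE = (
    "<ul>"
    '<li><a href="/p/1">Ada</a></li>'
    '<li><a href="/p/2">Bob</a></li>'
    '<li><a href="/p/3">Cy</a></li>'
    "</ul>"
)

# Data component (assumed): only the parts that differ between the items.
DATA = [
    {"id": "1", "name": "Ada"},
    {"id": "2", "name": "Bob"},
    {"id": "3", "name": "Cy"},
]


def layout(records):
    """Layout program (assumed): one loop over records plus the shared markup."""
    items = "".join(
        '<li><a href="/p/{id}">{name}</a></li>'.format(**record)
        for record in records
    )
    return "<ul>" + items + "</ul>"


def is_lossless(page, program, data):
    """Check that running the layout program on the data rebuilds the page exactly."""
    return program(data) == page
```

Calling is_lossless(PAGE, layout, DATA) checks the round trip for this toy page; the repeated item markup appears once in the loop body, while the per-item differences live only in DATA.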
Year: 2016
Venue: KDD
Field: Data mining, Web traffic, External Data Representation, Program synthesis, Web page, Computer science, Artificial intelligence, Data extraction, Comprehensive layout, JSON, Machine learning, Lossless compression
DocType: Conference
Citations: 3
PageRank: 0.40
References: 30
Authors: 4
Name              Order  Citations  PageRank
Adi Omari         1      21         1.66
Benny Kimelfeld   2      1034       71.63
Sharon Shoham     3      342        26.67
Eran Yahav        4      1706       79.49