Title
Web Page Representations and Data Extraction with BERyL.
Abstract
The web contains a huge amount of data, most of which can be accessed only with web data extraction technology. With the increasing complexity of the web development stack and of page source code, the visual representation of a web page rendered by the browser is often the only source that reflects the semantics, functional role, and logical structure of its elements. Thus, modern automatic approaches typically target visual cues and structures (e.g., DOM and CSSOM) constructed by the web browser. In this paper, we briefly analyse different representations of web pages and generic extraction approaches, and introduce BERyL, a novel framework and language that consolidates two "worlds": the rule-based approach and machine learning. The rule-based approach is used for feature engineering and pattern recognition, whilst machine learning is used for classification based on the inferred features. This is achieved through three stages: (1) feature computation, pattern construction, and application, (2) machine learning, and (3) refinement.
Year
2018
DOI
10.1007/978-3-030-03056-8_3
Venue
ICWE Workshops
Field
Sensory cue, World Wide Web, Web page, Information retrieval, Computer science, Source code, Feature engineering, Structure (mathematical logic), Data extraction, Semantics, Computation
DocType
Conference
Citations
1
PageRank
0.35
References
15
Authors
3
Name                      Order   Citations   PageRank
Andrey Kravchenko         1       33          2.78
Ruslan R. Fayzrakhmanov   2       1           1.36
Emanuel Sallinger         3       71          20.76