Title
Automatic Wrapper Generation for Multilingual Web Resources
Abstract
We present a wrapper generation system that extracts the contents of semi-structured documents containing instances of a record. Generation is automatic and relies only on general assumptions about the structure of instances. The system outputs a set of pairs of left and right delimiters that surround the instances of each field. In addition to the input documents, the system receives a set of symbols with which a delimiter must begin or end. Because it treats semi-structured documents simply as strings, it does not depend on the markup language or the natural language, and it requires no training examples showing where instances are located. We report experimental results on both static and dynamic pages gathered from 13 Web sites, marked up in HTML or XML and written in four natural languages. Besides the usual contents, the generated wrappers extract useful information hidden in comments or tags, which other wrapper generation algorithms ignore. Some generated delimiters contain whitespace or multibyte characters.
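To make the output format concrete, the following minimal sketch shows how a wrapper of the kind described in the abstract (one pair of left and right delimiters per field) could be applied to a document treated purely as a string. It illustrates only the application of a given wrapper, not the generation algorithm of the paper. The function name `extract_records`, the sample page, and the sample delimiters are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# A wrapper is assumed here to be an ordered list of (left, right)
# delimiter pairs, one pair per field of a record.
Wrapper = List[Tuple[str, str]]


def extract_records(document: str, wrapper: Wrapper) -> List[Tuple[str, ...]]:
    """Scan the document as a plain string and return one tuple per record.

    No HTML/XML parsing is performed, so delimiters may sit inside
    comments or tags and may contain whitespace or multibyte characters.
    """
    if not wrapper:
        return []
    if any(left == "" or right == "" for left, right in wrapper):
        raise ValueError("delimiters must be non-empty strings")

    records: List[Tuple[str, ...]] = []
    pos = 0
    while True:
        fields = []
        for left, right in wrapper:
            start = document.find(left, pos)
            if start == -1:
                return records  # no further complete record
            start += len(left)
            end = document.find(right, start)
            if end == -1:
                return records  # field is not closed; stop here
            fields.append(document[start:end])
            pos = end + len(right)
        records.append(tuple(fields))


# Hypothetical sample page: one field is hidden in a comment, and one
# delimiter pair uses multibyte characters and whitespace.
page = (
    "<li><!-- id:17 --><b>山田</b> 価格： 1200 円</li>\n"
    "<li><!-- id:42 --><b>Ikeda</b> 価格： 980 円</li>\n"
)
wrapper = [("<!-- id:", " -->"), ("<b>", "</b>"), ("価格： ", " 円")]

print(extract_records(page, wrapper))
```

Because the scan relies only on substring search, the same function works unchanged for HTML, XML, or any other text, which mirrors the language independence claimed in the abstract.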
Year: 2002
DOI: 10.1007/3-540-36182-0_33
Venue: Discovery Science
Keywords: generic algorithm, natural language
DocType: Conference
Volume: 2534
ISSN: 0302-9743
ISBN: 3-540-00188-3
Citations: 6
PageRank: 0.53
References: 8
Authors: 6
Name, Order, Citations and PageRank
Yasuhiro Yamada, 1, 5210.97
Daisuke Ikeda, 2, 527.95
Sachio Hirokawa, 3, 21658.68
泰寛 山田, 4, 111.40
大輔 池田, 5, 111.40
佐千男 廣川, 6, 111.40