TEX: An efficient and effective unsupervised Web information extractor - Citegraph

Paper Info

Title
TEX: An efficient and effective unsupervised Web information extractor

Abstract
The World Wide Web is an immense information resource. Web information extraction is the task of transforming human-friendly Web information into structured information that can be consumed by automated business processes. In this article, we propose an unsupervised information extractor that works on two or more web documents generated by the same server-side template. It finds and removes shared token sequences amongst these web documents until it finds the relevant information that should be extracted from them. The technique is completely unsupervised and does not require maintenance; it can work on malformed web documents and does not require the relevant information to be formatted using repetitive patterns. Our complexity analysis reveals that our proposal is computationally tractable, and our empirical study on real-world web documents demonstrates that it runs very fast and achieves very high precision and recall.
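
The core idea in the abstract, discarding token sequences shared by documents generated from the same template and keeping what is left, can be illustrated with a small sketch. This is not the TEX algorithm from the paper: it swaps in Python's standard `difflib.SequenceMatcher` to locate the shared runs, handles only two documents, and uses a deliberately naive tokenizer. The names `tokenize` and `unshared_fragments` and the sample documents are hypothetical, chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Naive tokenizer (an assumption): each tag or whitespace-free text run is one token."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def unshared_fragments(doc_a, doc_b):
    """Return the token runs of each document that are NOT shared with the other.

    Shared runs are treated as template text and dropped; the leftover runs
    are candidates for the relevant information.
    """
    a, b = tokenize(doc_a), tokenize(doc_b)
    # autojunk=False disables difflib's popularity heuristic on long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    only_a, only_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # shared token sequence: assumed to come from the template
        if i2 > i1:
            only_a.append(" ".join(a[i1:i2]))
        if j2 > j1:
            only_b.append(" ".join(b[j1:j2]))
    return only_a, only_b


# Two hypothetical pages produced by the same template.
page_1 = "<html><body><h1>Dune</h1><p>Price: 9.99</p></body></html>"
page_2 = "<html><body><h1>Emma</h1><p>Price: 4.50</p></body></html>"

print(unshared_fragments(page_1, page_2))
# → (['Dune', '9.99'], ['Emma', '4.50'])
```

Because the tokenizer is regex-based and never builds a DOM tree, the sketch also runs on malformed markup, which loosely mirrors one of the properties claimed in the abstract.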

Year	DOI	Venue
2013	10.1016/j.knosys.2012.10.009	Knowl.-Based Syst.
Keywords	Field	DocType
malformed web document,immense information resource,human friendly web information,real-world web document,effective unsupervised web information,relevant information,web information extraction,unsupervised information extractor,structured information,web document,world wide web,information extraction	Data mining,Computer science,Artificial intelligence,Social Semantic Web,Server-side,Business process,Information retrieval,Web mapping,Precision and recall,Data Web,Information extraction,Security token,Machine learning	Journal
Volume	ISSN	Citations
39	0950-7051	15
PageRank	References	Authors
0.53	60	2

Authors (2 rows)

Name	Order	Citations	PageRank
Hassan A. Sleiman	1	103	8.33
Rafael Corchuelo	2	389	49.87