Unstructured data extraction of Chinese expert web page - Citegraph

Paper Info

Title
Unstructured data extraction of Chinese expert web page

Abstract
Traditional methods for extracting unstructured data from expert pages require a lot of human intervention. To address this problem, this paper proposes a method that automatically detects the data template based on similarities and differences between HTML tags and strings, and uses lattice theory to locate the data grid region that stores the unstructured expert data, thereby gaining access to that data. First, with the help of a classifier for Chinese expert entity homepages, an expert web crawler acquires a large number of expert pages. Second, the expert pages are divided into two types, list type and document type, and the unstructured data is extracted from each type separately. Finally, extraction experiments are conducted on the different types of web pages using an improved version of the open-source Roadrunner code. The experimental results show that, without supervision, the method effectively extracts unstructured web data from Chinese expert pages.
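
The template-detection idea in the abstract (matching HTML tags form the template, mismatching strings mark data fields) can be illustrated with a minimal sketch. This is not the paper's implementation and not Roadrunner's code: it assumes two pages with identical tag structure, uses Python's `difflib.SequenceMatcher` as a stand-in alignment step, and omits the lattice-based grid location entirely. The names `tokenize`, `infer_template`, `extract`, and `FIELD`, and the sample pages, are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Placeholder token marking a position where pages differ only in text.
FIELD = ("field", "#PCDATA")


class _Tokenizer(HTMLParser):
    """Flattens a page into ("tag", ...) and ("text", ...) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("text", text))


def tokenize(html):
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def infer_template(page_a, page_b):
    """Align two pages; equal tokens become template, string mismatches become fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    template = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])
        elif (op == "replace" and i2 - i1 == 1 and j2 - j1 == 1
              and a[i1][0] == "text" and b[j1][0] == "text"):
            template.append(FIELD)  # string mismatch: treat as a data field
        else:
            # Tag mismatches (optional or repeated blocks) are out of scope here.
            raise ValueError("tag mismatch is not handled in this sketch")
    return template


def extract(template, page):
    """Return the text values found at the template's field positions."""
    tokens = tokenize(page)
    if len(tokens) != len(template):
        raise ValueError("page does not fit the template")
    values = []
    for slot, token in zip(template, tokens):
        if slot == FIELD and token[0] == "text":
            values.append(token[1])
        elif slot != token:
            raise ValueError("page does not fit the template")
    return values


# Hypothetical expert pages that share one layout.
page_a = "<table><tr><td>Name</td><td>Zhang Wei</td></tr><tr><td>Field</td><td>Data mining</td></tr></table>"
page_b = "<table><tr><td>Name</td><td>Li Na</td></tr><tr><td>Field</td><td>Information retrieval</td></tr></table>"
page_c = "<table><tr><td>Name</td><td>Wang Fang</td></tr><tr><td>Field</td><td>Web crawler</td></tr></table>"

template = infer_template(page_a, page_b)
print(extract(template, page_c))
```

Labels such as "Name" stay in the template because they are identical on both sample pages, while the values next to them differ and are therefore treated as data. A real system would also need to handle tag mismatches, which is where this sketch simply gives up.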

Year	2014
DOI	10.1504/IJWMC.2014.059709
Venue	IJWMC
Keywords	chinese expert web page, extraction experiment, chinese expert page, expert page, unstructured data, different type, unstructured web data, expert web crawler, data grid region, unstructured expert data, unstructured data extraction, lattice theory
Field	HTML element, Data mining, Information retrieval, Web page, Computer science, Data grid, Unstructured data, Roadrunner, Classifier (linguistics), Web crawler, Document type definition, Distributed computing
DocType	Journal
Volume	7
Issue	2
Citations	0
PageRank	0.34
References	5
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
Xudong Hong	1	6	3.53
Tao Shen	2	0	0.34
Longhua Shen	3	0	0.68
Zhengtao Yu	4	460	69.08
Jianyi Guo	5	20	10.99
