A Template Independent Method for Large Online News Content Extraction - Citegraph

Paper Info

Title
A Template Independent Method for Large Online News Content Extraction

Abstract
Online news gives users a convenient way to read the latest news. Building an online news corpus is important for many text mining and data mining tasks. Creating web news data requires constructing a set of HTML parsing rules to identify the content text. When a website changes its layout, the parsing rules (the wrapper) must be reconstructed. In this paper, we address this issue and propose a news content recognition algorithm that is portable across languages and domains. Our method first scans the entire HTML document and detects a set of candidate blocks. Second, a weighted scoring function that combines stop word language models and HTML penalty functions ranks the importance of each candidate. We then check the highest-scoring block against a predefined threshold value. To validate the approach, we conduct experiments on 533 online news HTML files from 24 websites. The empirical study shows that our method achieves a macro F-measure of about 95% in recognizing news content.
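
The abstract gives no formulas, so the following is only a minimal Python sketch of the three steps it names: candidate block detection, weighted scoring, and a threshold check. The stop word list, the set of block tags, the link-word penalty, the weights `alpha` and `beta`, and the threshold value are all assumptions made for illustration. In particular, a plain stop word count stands in for the paper's stop word language model, and the paper's actual HTML penalty functions are not reproduced here.

```python
from html.parser import HTMLParser

# Assumption: a tiny English stop word list. A real system would estimate
# a stop word language model from a corpus for each target language.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that",
              "for", "on", "with", "as", "it", "was", "by", "be"}
BLOCK_TAGS = {"div", "td", "article", "section"}  # assumed candidate containers
SKIP_TAGS = {"script", "style"}                   # never counted as text


class BlockCollector(HTMLParser):
    """Step 1: scan the whole document and collect candidate blocks."""

    def __init__(self):
        super().__init__()
        self.stack = []    # candidate blocks that are still open
        self.blocks = []   # candidate blocks that have been closed
        self.in_link = 0   # depth of open <a> elements
        self.skip = 0      # depth of open <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "words": [], "link_words": 0})

    def handle_endtag(self, tag):
        # Assumes reasonably well-nested markup.
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        if self.skip:
            return
        words = data.split()
        # Text counts toward every enclosing candidate block.
        for block in self.stack:
            block["words"].extend(words)
            if self.in_link:
                block["link_words"] += len(words)


def score(block, alpha=1.0, beta=1.0):
    """Step 2: weighted score = alpha * stop word hits - beta * link words."""
    hits = sum(1 for w in block["words"]
               if w.strip(".,;:!?\"'()").lower() in STOP_WORDS)
    return alpha * hits - beta * block["link_words"]


def extract_content(html, alpha=1.0, beta=1.0, threshold=5.0):
    """Step 3: return the best block's text, or None if it scores below threshold."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    candidates = collector.blocks + collector.stack  # include unclosed blocks
    if not candidates:
        return None
    best = max(candidates, key=lambda b: score(b, alpha, beta))
    if score(best, alpha, beta) < threshold:
        return None
    return " ".join(best["words"])
```

Calling `extract_content(page_source)` on a page returns the text of the highest-scoring block, or `None` when nothing clears the threshold (for example, a page that consists only of navigation links). Because the score relies on stop words and link counts rather than site-specific tag paths, nothing in the sketch depends on a particular template, which is the property the abstract emphasizes.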

Year	DOI	Venue
2012	10.1109/IIAI-AAI.2012.58	IIAI-AAI
Keywords	Field	DocType
online news html file,html penalty function,news content,entire html document,html parsing rule,web news data,news content recognition algorithm,online news,online news corpus,template independent method,large online news content,novel news,grammars,mathematical model,language model,html,data mining,testing,text analysis,text mining,information extraction	Rule-based machine translation,Text mining,Information retrieval,Computer science,Information extraction,Parsing,Macro,Language model,Stop words,Empirical research	Conference
Citations	PageRank	References
0	0.34	13
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Yu-Chieh Wu	1	247	23.16
Jie-Chi Yang	2	350	43.91
