Title
Constructing Specialised Corpora through Analysing Domain Representativeness of Websites
Abstract
The role of the Web in text corpus construction is becoming increasingly significant. However, the contribution of the Web is largely confined to building a general virtual corpus or low-quality specialised corpora. In this paper, we introduce a new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engines employed. In particular, SPARTAN-derived corpora outperform all corpora based on existing techniques for the task of term recognition.
Year: 2011
DOI: 10.1007/s10579-011-9141-4
Venue: Language Resources and Evaluation
Keywords: Corpus construction, Specialised corpus, Web-derived corpus, Virtual corpus, Website ranking, Boilerplate removal, Term recognition
Field: Spartan, Language for specific purposes, Information retrieval, Computer science, Representativeness heuristic, Word recognition, Text corpus, Speech recognition, Artificial intelligence, Natural language processing, Corpus linguistics, The Internet
DocType: Journal
Volume: 45
Issue: 2
ISSN: 1574-020X
Citations: 2
PageRank: 0.41
References: 23
Authors: 3
Name, Order, Citations, PageRank
W. Wong, 1, 2, 0.41
W. Liu, 2, 2, 0.75
M. Bennamoun, 3, 3197, 167.23