Paper Info

Title
Constructing Specialised Corpora through Analysing Domain Representativeness of Websites

Abstract
The role of the Web for text corpus construction is becoming increasingly significant. However, the contribution of the Web is largely confined to building a general virtual corpus or low quality specialised corpora. In this paper, we introduce a new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents. Our evaluations show that the corpora constructed using our technique are independent of the search engines employed. In particular, SPARTAN-derived corpora outperform all corpora based on existing techniques for the task of term recognition.

Year	DOI	Venue
2011	10.1007/s10579-011-9141-4	Language Resources and Evaluation
Keywords	Field	DocType
Corpus construction,Specialised corpus,Web-derived corpus,Virtual corpus,Website ranking,Boilerplate removal,Term recognition	Spartan,Language for specific purposes,Information retrieval,Computer science,Representativeness heuristic,Word recognition,Text corpus,Speech recognition,Artificial intelligence,Natural language processing,Corpus linguistics,The Internet	Journal
Volume	Issue	ISSN
45	2	1574-020X
Citations	PageRank	References
2	0.41	23
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
W. Wong	1	2	0.41
W. Liu	2	2	0.75
M. Bennamoun	3	3197	167.23