Title | ||
---|---|---|
Constructing Specialised Corpora through Analysing Domain Representativeness of Websites |
Abstract | ||
---|---|---|
The role of the Web for text corpus construction is becoming increasingly significant. However, the contribution of the Web
is largely confined to building a general virtual corpus or low-quality specialised corpora. In this paper, we introduce a
new technique called SPARTAN for constructing specialised corpora from the Web by systematically analysing website contents.
Our evaluations show that the corpora constructed using our technique are independent of the search engines employed. In particular,
SPARTAN-derived corpora outperform all corpora based on existing techniques for the task of term recognition. |
Year | DOI | Venue |
---|---|---|
2011 | 10.1007/s10579-011-9141-4 | Language Resources and Evaluation |
Keywords | Field | DocType |
Corpus construction, Specialised corpus, Web-derived corpus, Virtual corpus, Website ranking, Boilerplate removal, Term recognition | Spartan, Language for specific purposes, Information retrieval, Computer science, Representativeness heuristic, Word recognition, Text corpus, Speech recognition, Artificial intelligence, Natural language processing, Corpus linguistics, The Internet | Journal |
Volume | Issue | ISSN |
45 | 2 | 1574-020X |
Citations | PageRank | References |
2 | 0.41 | 23 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
W. Wong | 1 | 2 | 0.41 |
W. Liu | 2 | 2 | 0.75 |
M. Bennamoun | 3 | 3197 | 167.23 |