Abstract | ||
---|---|---|
In order to exploit the huge volume of information being published in the blogosphere, it is essential to provide techniques
such as clustering, which can automatically analyze and classify their contents. However, these techniques typically produce better
results when dealing with wide-domain, full-text documents. In most cases, however, blogs can be considered to be “short texts”,
i.e., they are not extensive documents and exhibit undesirable characteristics from a clustering perspective, such as low-frequency
terms, short vocabulary size and vocabulary overlap between some domains. Furthermore, their characteristics vary widely depending
on the specific interests of the writer, their linguistic style, and the volume of texts that they produce.
|
Year | DOI | Venue |
---|---|---|
2009 | 10.1007/978-3-642-12550-8_28 | Applications of Natural Language to Data Bases |
Keywords | Field | DocType |
low frequency term, short text, short vocabulary size, clustering perspective, better result, undesirable characteristic, linguistic style, specific interest, weblog corpus, extensive document, huge volume, low frequency | Computer science, Exploit, Artificial intelligence, Natural language processing, Blogosphere, Cluster analysis, Vocabulary | Conference |
Volume | ISSN | ISBN |
5723 | 0302-9743 | 3-642-12549-2 |
Citations | PageRank | References |
3 | 0.48 | 2 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Fernando Perez-Tellez | 1 | 29 | 5.00 |
David Pinto | 2 | 26 | 7.99 |
John Cardiff | 3 | 3 | 0.48 |
Paolo Rosso | 4 | 1831 | 188.74 |