Title | ||
---|---|---|
A model for estimating the occurrence of same-frequency words and the boundary between high- and low-frequency words in texts |
Abstract | ||
---|---|---|
A simpler model is proposed for estimating the frequency of any same-frequency words and identifying the boundary point between high-frequency and low-frequency words in a text. The model, based on a "maximum ranking method," assigns ranks to the words and estimates word frequency by the formula: $\mathrm{Int}\left[\left(-1 + (1 + 4D/I_{n+1})^{1/2}\right)/2\right] > n^{*}\,\mathrm{Int}\left[\left(-1 + (1 + 4D/I_{n})^{1/2}\right)/2\right]$. The boundary value between high-frequency and low-frequency words is obtained by taking the square root of the number of different words in the text: $n^{*} = D^{1/2}$. This straightforward model was used successfully with both English and Chinese texts, demonstrating that the frequency of words and the number of same-frequency words depend only on the vocabulary of a text (the number of different words), not on its length. Like Zipf's Law, the model may be universally applicable. |
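
The two expressions in the abstract can be evaluated numerically. The following is a minimal Python sketch, not the authors' implementation. It assumes that D is the number of different words in the text (as the abstract states) and that I is the number of words sharing one frequency, which the abstract does not define explicitly. The function names and the naive English-only tokenizer are hypothetical and used only for illustration.

```python
import math
import re


def int_term(distinct_words: int, same_freq_count: int) -> int:
    """Evaluate Int[(-1 + (1 + 4D/I)^(1/2)) / 2] for a given D and I.

    Assumption: I is the number of words that share one frequency.
    """
    if distinct_words < 0 or same_freq_count <= 0:
        raise ValueError("D must be non-negative and I must be positive")
    root = math.sqrt(1 + 4 * distinct_words / same_freq_count)
    return math.floor((-1 + root) / 2)


def boundary(distinct_words: int) -> float:
    """Boundary between high- and low-frequency words: n* = D^(1/2)."""
    return math.sqrt(distinct_words)


def distinct_word_count(text: str) -> int:
    """Count different words with a naive lowercase tokenizer.

    Assumption: words are runs of ASCII letters and apostrophes, which
    is only a rough stand-in for English and does not segment Chinese.
    """
    return len(set(re.findall(r"[a-z']+", text.lower())))
```

For instance, a text with 100 different words gives `boundary(100) == 10.0`, so under this reading words occurring more than about 10 times would count as high-frequency.
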
Year | DOI | Venue |
---|---|---|
1999 | 10.1002/(SICI)1097-4571(1999)50:33.3.CO;2-8 | JASIS |
Keywords | DocType | Volume |
---|---|---|
low frequency,chinese,word frequency,english,research methodology,information science | Journal | 50 |
Issue | ISSN | Citations |
---|---|---|
3 | 0002-8231 | 2 |
PageRank | References | Authors |
---|---|---|
0.57 | 4 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Qinglan Sun | 1 | 4 | 0.96 |
Debora Shaw | 2 | 57 | 7.37 |
Charles H. Davis | 3 | 5 | 1.49 |