Title
Statistical digram and trigram analysis of Turkish in terms of coverage and entropy for possible language and speech based applications
Abstract
In this study, two frameworks, built from digrams and trigrams respectively, are constructed for complete coverage of the Turkish language. In addition, character, digram and trigram entropy values for Turkish, English and Spanish are compared. An examination of meaningful Turkish texts shows that three major digram clusters constitute slightly more than 60% of Turkish texts. Similarly, three major trigram clusters cover almost 40% of Turkish texts. The statistics show that 391 (of 841 theoretical) digrams and 3,396 (of 24,389 theoretical) trigrams are sufficient for 99% coverage of Turkish. These results offer a general roadmap for rapid coverage to researchers working on Turkish language and speech based applications. As an application, the results could lead to a general framework for setting the prioritization rules for duration modeling in concatenative text-to-speech synthesis systems.
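The coverage and entropy figures quoted in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function names are hypothetical, the input is assumed to be already lowercased, and n-grams are assumed to be counted inside words only (the record does not say how word boundaries were handled). The theoretical totals of 841 and 24,389 correspond to 29² and 29³ for the 29-letter Turkish alphabet.

```python
from collections import Counter
from math import log2


def ngram_counts(text, n):
    """Count overlapping character n-grams inside each word.

    Assumption: text is already lowercased; non-letters are dropped.
    """
    counts = Counter()
    for word in text.split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            counts[letters[i:i + n]] += 1
    return counts


def ngrams_for_coverage(counts, target):
    """Number of most frequent n-grams needed to reach the target share."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(counts)


def entropy_bits(counts):
    """Shannon entropy (bits per n-gram) of the empirical distribution."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "bir dil bir insan iki dil iki insan"  # toy text, not a corpus
    digrams = ngram_counts(sample, 2)
    print(len(digrams), "distinct digrams")
    print(ngrams_for_coverage(digrams, 0.99), "digrams for 99% coverage")
    print(round(entropy_bits(digrams), 3), "bits per digram")
```

On a real corpus, `ngrams_for_coverage(counts, 0.99)` would be the quantity to compare with the 391 digrams and 3,396 trigrams reported above; the toy sentence here only demonstrates the mechanics.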
Year
2010
Venue
Aalborg
Keywords
entropy, natural language processing, speech processing, speech synthesis, text analysis, english, spanish, turkish language based applications, turkish texts, concatenative text-to-speech synthesis systems, coverage, digram distributions, digram entropy values, duration modeling, speech based applications, statistical digram analysis, statistical trigram analysis, trigram entropy values, electronic publishing, signal processing
Field
Turkish, Computer science, Trigram, Prioritization, Natural language processing, Artificial intelligence, Electronic publishing
DocType
Conference
ISSN
2219-5491
Citations
0
PageRank
0.34
References
3
Authors
3
Name, Order, Citations, PageRank
Ibrahim Baran Uslu, 1, 0, 0.34
Asim Egemen Yilmaz, 2, 6, 2.86
H. G. Ilk, 3, 17, 6.13