Title
Investigating Efficient Feature Representation Methods and Training Objective for BLSTM-Based Phone Duration Prediction
Abstract
Accurate modeling and prediction of speech-sound durations are important for generating natural synthetic speech. This paper addresses both the features and the training objective to improve the phone duration model of a speech synthesis system. On the feature side, we combine the feature representations from a gradient boosting decision tree (GBDT) and a phoneme identity embedding model (realized by jointly training a phoneme embedded vector (PEV) and a word embedded vector (WEV)) as input to a BLSTM that predicts phone duration. The PEV replaces the one-hot phoneme identity, and the GBDT transforms the traditional contextual features. On the training objective side, we propose a new objective function that takes into account the correlation and consistency between the predicted utterance and the natural utterance. Perceptual tests indicate that the proposed methods can improve the naturalness of the synthetic speech: the proposed feature representations capture more precise contextual features, and the proposed training objective alleviates the over-averaging problem in the generated phone durations.
Year: 2017
DOI: 10.21437/Interspeech.2017-1086
Venue: 18th Annual Conference of the International Speech Communication Association (INTERSPEECH 2017), Vols 1-6: Situated Interaction
Keywords: Phone duration modeling, BLSTM, feature representation methods, training objective, speech synthesis
Field: Pattern recognition, Computer science, Speech recognition, Phone, Artificial intelligence, Machine learning
DocType: Conference
ISSN: 2308-457X
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name, Order, Citations, PageRank
Yibin Zheng, 1, 38, 15.13
Jianhua Tao, 2, 848, 138.00
Zhengqi Wen, 3, 86, 24.41
Ya Li, 4, 36, 11.21
Bin Liu, 5, 191, 35.02