On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis - Citegraph

Paper Info

Title
On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis

Abstract
Deep Neural Network (DNN), which can model a long-span, intricate transform compactly with a deep-layered structure, has recently been investigated for parametric TTS synthesis with a fairly large corpus (33,000 utterances) [6]. In this paper, we examine DNN TTS synthesis with a moderate-size corpus of 5 hours, which is more commonly used for parametric TTS training. The DNN is used to map input text features into output acoustic features (LSP, F0 and V/U). Experimental results show that the DNN can outperform the conventional HMM, which is trained with ML first and then refined by MGE. Both objective and subjective measures indicate that the DNN can synthesize speech better than the HMM-based baseline. The improvement is mainly in prosody: the RMSE between natural and DNN-generated F0 trajectories is improved by 2 Hz. This benefit likely comes from a key characteristic of the DNN, which can exploit feature correlations, e.g., between F0 and spectrum, without resorting to a more restricted probability family such as the diagonal Gaussian. Our experimental results also show that layer-wise BP pre-training can drive weights to a better starting point than random initialization and result in a more effective DNN; that state boundary information is important for training a DNN to yield better synthesized speech; and that a hyperbolic tangent activation function in the DNN hidden layers yields faster convergence than a sigmoidal one.

Year	2014
DOI	10.1109/ICASSP.2014.6854318
Venue	ICASSP
Keywords	subjective measure, deep neural network, objective measure, tts, acoustic features, hmm, backpropagation, layer-wise bp pretraining, speech synthesis, diagonal gaussian probability family, feature extraction, dnn training, parametric tts synthesis, text features, dnn, sigmoidal function, text-to-speech synthesis, hyperbolic tangent activation function, neural nets, hidden markov model, speech processing, speech, acoustics, decision support systems
Field	Convergence (routing), Pattern recognition, Computer science, Activation function, Speech recognition, Parametric statistics, Gaussian, Artificial intelligence, Initialization, Artificial neural network, Hidden Markov model, Sigmoid function
DocType	Conference
ISSN	1520-6149
Citations	48
PageRank	1.65
References	14
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Qian Yao	1	527	51.55
Yuchen Fan	2	332	17.14
Wenping Hu	3	82	6.77
Frank K. Soong	4	1395	268.29
