Popis: |
In this work, we implement a deep neural network for the text-to-speech system. We have tried different parameter settings for the DNN layers and units, and find that the three-layer DNN works better than the four-layer ones. We also pre-trained the best three-layer system (1000-1000-1000), and both objective and subjective test results show significant improvement in the synthesizing quality after pre-training. The final pre-trained system obtains an average linear spectral pair (LSP) root mean square error (RMSE) of 0.179096, beating the DNN-TTS benchmark of 0.187225. |