Designing target cost function based on prosody of speech database

Author:	Hiroshi Saruwatari, Kazuki Adachi, Hiromichi Kawanami, Tomoki Toda, Kiyohiro Shikano
Language:	English
Year of publication:	2005
Subject:	Relation (database); Computer science; Speech recognition; prosody modification; Artificial Intelligence; Quality (business); Electrical and Electronic Engineering; Prosody; Set (psychology); speech database; Target costing; STRAIGHT; Voice activity detection; Database; unit selection; speech synthesis; PSQM; Speech processing; Hardware and Architecture; Computer Vision and Pattern Recognition; perceptual evaluation; Software; Natural language processing
Source:	IEICE Transactions on Information and Systems. (3):519-524
ISSN:	0916-8532
Description:	This research aims to construct a high-quality Japanese text-to-speech (TTS) system that is highly flexible in its handling of prosody. Many TTS systems implement prosody control, but they are fundamentally designed to output speech with a standard pitch and speech rate. In this study, we employ a unit selection and concatenation method and introduce an analysis-synthesis process to provide precisely controlled prosody in the output speech. Because speech quality degrades in proportion to the amount of prosody modification, such a unit selection system uses a target cost for prosody to evaluate the prosodic difference between the target prosody and the candidate speech segments. However, the conventional cost ignores the original prosody of the speech segments, although the tendency of quality deterioration is assumed to vary with the pitch or speech rate of the original speech. In this paper, we propose a novel cost function design based on the prosody of the speech segments. First, we recorded nine databases of Japanese speech with different prosodic characteristics. Then, for each speech database, we investigated the relationship between the amount of prosody modification and the perceptual degradation. The results indicate that the tendency of perceptual degradation differs according to the prosodic features of the original speech. On the basis of these results, we propose a new cost function design that changes the cost function according to the prosody of the speech database. Preference tests on synthetic speech show that the proposed cost functions generate speech of higher quality than the conventional method.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::569374349bf69d0fc095f85eee018ab7 http://hdl.handle.net/10061/7802
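
Illustrative sketch (not part of the record): the description contrasts a conventional prosody target cost, which ignores the original prosody of the speech segments, with a cost function that changes according to the prosody of the speech database. The abstract gives no formula, so everything below is an assumption made for illustration only: the log-ratio measure of modification, the piecewise-linear penalty, the database labels, and all numeric weights are hypothetical placeholders, not values or methods from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyCostWeights:
    """Hypothetical penalty slopes for one speech database (placeholders)."""
    raise_f0: float   # penalty per octave when pitch must be raised
    lower_f0: float   # penalty per octave when pitch must be lowered
    stretch: float    # penalty per log2 unit when a segment is lengthened
    compress: float   # penalty per log2 unit when a segment is shortened


def prosody_target_cost(target_f0, target_dur, unit_f0, unit_dur, weights):
    """Return a cost that grows with the prosody modification a unit needs.

    Modification is measured as a signed log2 ratio (an assumption), so a
    unit that already matches the target costs exactly zero.
    """
    f0_shift = math.log2(target_f0 / unit_f0)
    dur_shift = math.log2(target_dur / unit_dur)
    f0_slope = weights.raise_f0 if f0_shift > 0 else weights.lower_f0
    dur_slope = weights.stretch if dur_shift > 0 else weights.compress
    return f0_slope * abs(f0_shift) + dur_slope * abs(dur_shift)


# Conventional design: one symmetric weight set, whatever the database.
CONVENTIONAL = ProsodyCostWeights(1.0, 1.0, 1.0, 1.0)

# Database-dependent design: weights looked up by the prosodic character of
# the database the unit came from. Labels and numbers are made up.
WEIGHTS_BY_DATABASE = {
    "low_pitch_slow": ProsodyCostWeights(0.8, 1.5, 1.4, 0.7),
    "high_pitch_fast": ProsodyCostWeights(1.5, 0.8, 0.7, 1.4),
}

if __name__ == "__main__":
    # Same unit and target, scored under both designs.
    args = (240.0, 0.12, 200.0, 0.10)  # target F0 (Hz), target dur (s), unit F0, unit dur
    print("conventional:", prosody_target_cost(*args, CONVENTIONAL))
    for name, w in WEIGHTS_BY_DATABASE.items():
        print(name + ":", prosody_target_cost(*args, w))
```

In a real system the per-database slopes would have to be derived from perceptual measurements such as those the description mentions; the sketch only shows where a database-dependent lookup would replace a single fixed cost.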