Dimensions of quality for state of the art synthetic speech

Author: Seebauer, Fritz Michael; Wagner, Petra; Bruggeman, Anna; Ludusan, Bogdan
Language: English
Year of publication: 2022
Subject:
Source: Phonetik und Phonologie im deutschsprachigen Raum
Description: Synthetic speech has a long-standing tradition of being employed in experiments in phonetics and laboratory phonology. The choice of synthesis method and system is commonly made by the researcher(s) to fit the specific quality criteria and study design. The overall quality of a given system, however, remains a confound that is difficult to control for [1]. In speech technology, newly proposed systems are usually compared along specific dimensions, e.g., ‘Intelligibility’ and ‘Naturalness’. These dimensions have already been extensively studied and evaluated in the context of older diphone and formant synthesis networks [2]. We contend, however, that these traditional dimensions need to be re-examined in the context of state-of-the-art text-to-speech (TTS) systems, as these newer models exhibit different quality deteriorations. Our work aims to bridge the conflicting demands for quality criteria that are easily computed and applied during TTS development, while at the same time remaining descriptive and meaningful for phonetic research. As a first step in this endeavor, we carried out an experiment to find suitable dimensions of TTS quality with a bottom-up approach based on descriptions provided by 11 participants (phonetic experts). The participants were instructed to label speech samples generated by 8 different state-of-the-art text-to-speech systems (varieties of English). Each system produced a stimulus consisting of two sentences of the phonetically balanced ‘caterpillar story’ [3]. To ensure that all systems were evaluated across different phonetic contexts in a balanced way, the sentences were rotated between participants so that each participant heard the complete story, but with different parts read by different systems. The experimental setup is loosely based on the work in [4]. The participants were instructed to write down nouns, adjectives or sentences describing the quality of a given stimulus.
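The rotation of story parts across systems described above can be illustrated with a simple cyclic (Latin-square style) assignment. This is only a sketch under stated assumptions: the abstract does not specify the actual rotation scheme, and the function name `rotation_plan`, the zero-based indices, and the assumption that the story is split into as many two-sentence parts as there are systems are all hypothetical.

```python
def rotation_plan(n_systems, n_participants):
    """Return plan[p][part] = index of the system reading that story part
    for participant p (hypothetical cyclic rotation, not the authors' scheme)."""
    return [
        [(part + p) % n_systems for part in range(n_systems)]
        for p in range(n_participants)
    ]


# Numbers taken from the abstract: 8 systems, 11 participants.
plan = rotation_plan(n_systems=8, n_participants=11)
for p, row in enumerate(plan):
    print(f"participant {p}: system per story part -> {row}")
```

With such a scheme every participant hears each story part exactly once and each system exactly once. Because 11 is not a multiple of 8, the rotation wraps around after the eighth participant, so some system/part pairings occur more often than others.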
Using embeddings generated by a pretrained BERT model [5] to compute semantic distances, we determined which of the participants' terms were semantically similar. A subsequent affinity propagation clustering revealed 39 meaningfully different clusters, each representing a dimension of quality for synthetic voices. Keeping in mind that these dimensions are later to be used for ratings in actual evaluation experiments, it was decided to reduce the number of clusters to a more practical 10 and to recalculate them using spectral clustering with a precomputed cosine affinity matrix. The resulting clusters and their respective quality descriptions are depicted in fig. 1. A manual analysis of the resulting dimensions led to the following descriptive labels: ‘artificiality/voice quality’, ‘intonation/noise/prosody’, ‘voice/audio quality’, ‘audio cuts’, ‘style/recording quality’, ‘emotion/voice quality/attitude’, ‘engagedness’, ‘human likeness’, ‘hyperarticulation’. From the assigned cluster descriptions it is evident that the semantic embeddings sometimes conflated several seemingly unrelated quality features into single dimensions (e.g. prosody and background noise), while occasionally splitting almost synonymous terms into multiple clusters (e.g. ‘artificiality’, ‘roboticness’ and ‘metallicness’). To evaluate these shortcomings of the semantic model, two independent manual clusterings were carried out. Both were limited to 10 clusters and reached a modified Jaccard agreement index of 63.44 with each other, while agreeing with the automatically computed clusters at 54.48 and 57.93, respectively. The low inter-rater agreement between the manual clusterings suggests that a panel decision process might be needed to determine the final quality dimensions. Subsequent research will evaluate clusters created by naïve listeners and quality dimensions of different sub-tasks in synthetic speech, such as voice conversion.
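Two of the computational steps mentioned above, the cosine affinity matrix and the agreement between two clusterings, can be sketched in plain Python. The abstract does not define its "modified" Jaccard index, so the standard pair-counting Jaccard index is used here as a stand-in; that substitution, the function names, and the toy vectors and labels are all assumptions made for illustration, not the authors' implementation or data.

```python
import math
from itertools import combinations


def cosine_affinity(vectors):
    """Pairwise cosine similarity matrix for a list of equal-length vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    return [
        [
            sum(a * b for a, b in zip(vectors[i], vectors[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two clusterings of the same items.

    Counts item pairs placed in the same cluster by both clusterings and
    divides by the pairs placed together by at least one of them.
    """
    both = only_a = only_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
    denom = both + only_a + only_b
    return both / denom if denom else 1.0


# Toy stand-ins for term embeddings (real BERT embeddings are much longer).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
affinity = cosine_affinity(toy_embeddings)

# Hypothetical cluster labels from two annotators for the same four terms.
rater_1 = [0, 0, 1, 1]
rater_2 = [0, 0, 0, 1]
print(pair_jaccard(rater_1, rater_2))
```

This index lies between 0 and 1, whereas the values reported above appear to be on a 0 to 100 scale. A matrix like `affinity` could be handed to a spectral clustering implementation that accepts precomputed affinities, for example scikit-learn's `SpectralClustering(affinity="precomputed")`. Cosine similarities can be negative, so they may need shifting or clipping first.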
Database: OpenAIRE