Exploring the relationship between voice similarity estimates by listeners and by an automatic speaker recognition system incorporating phonetic features

Authors:	Finnian Kelly, Kirsty McDougall, Francis Nolan, Anil Alexander, Linda Gerlach
Contributors:	Nolan, Francis [0000-0002-8302-5726], Apollo - University of Cambridge Repository
Publication year:	2020
Subject:	Linguistics and Language; Computer science; First language; Speech recognition; British English; 02 engineering and technology; 01 natural sciences; Language and Linguistics; Likert scale; German; 0103 physical sciences; Similarity (psychology); 0202 electrical engineering, electronic engineering, information engineering; Speaker similarity; Multidimensional scaling; 010301 acoustics; Human voice; Communication; Perceived voice similarity; Earwitness evidence; 020206 networking & telecommunications; Phonetics; Automatic speaker recognition; Voice parades; language.human_language; Computer Science Applications; Modeling and Simulation; language; Computer Vision and Pattern Recognition; Software
Source:	Speech Communication. 124:85-95
ISSN:	0167-6393
DOI:	10.1016/j.specom.2020.08.003
Description:	The present study investigates relationships between voice similarity ratings made by human listeners and comparison scores produced by an automatic speaker recognition system that includes phonetic, perceptually relevant features in its modelling. The study analyses human voice similarity ratings of pairs of speech samples from unrelated speakers from an accent-controlled database (DyViS, Standard Southern British English) and the comparison scores from an i-vector-based automatic speaker recognition system using ‘auto-phonetic’ (automatically extracted phonetic) features. The voice similarity ratings were obtained from 106 listeners who each rated the voice similarity of pairings of ten speakers on a Likert scale via an online test. Correlation analysis and Multidimensional Scaling showed a positive relationship between listeners’ judgements and the automatic comparison scores. A separate analysis of the subsets of listener responses from English and German native speaker groups showed that a positive relationship was present for both groups, but that the correlation was higher for the English listener group. This work has key implications for forensic phonetics: it highlights the potential to automate part of the process of selecting foil voices in voice parade construction, which currently requires the collection and processing of human judgements. Further, establishing that automatic voice comparisons using phonetic features can be used to select similar-sounding voices has important applications in ‘voice casting’ (finding voices that are similar to a given voice) and ‘voice banking’ (saving one's voice for future synthesis in case of an operation or degenerative disease).
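Illustrative note: the correlation analysis mentioned in the description can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the record does not say which correlation coefficient was used, so a Spearman rank correlation is assumed here; the 4-speaker matrices are made-up toy values (the study used ten speakers); and higher values are assumed to mean "more similar" in both matrices. All function and variable names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_values(matrix):
    """Flatten a symmetric speaker-by-speaker matrix to its unordered pairs (i < j)."""
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (assumption): mean listener rating per speaker pair, higher = more similar.
listener_ratings = [
    [0.0, 6.1, 2.3, 4.0],
    [6.1, 0.0, 3.2, 5.5],
    [2.3, 3.2, 0.0, 1.8],
    [4.0, 5.5, 1.8, 0.0],
]
# Toy data (assumption): automatic comparison score per speaker pair, higher = more similar.
automatic_scores = [
    [0.0, 1.9, -3.0, 1.2],
    [1.9, 0.0, -1.1, 1.0],
    [-3.0, -1.1, 0.0, -4.2],
    [1.2, 1.0, -4.2, 0.0],
]

rho = spearman(pair_values(listener_ratings), pair_values(automatic_scores))
print(f"Spearman rho over {len(pair_values(listener_ratings))} speaker pairs: {rho:.3f}")
```

Only the off-diagonal pairs enter the correlation, since each speaker's comparison with themselves carries no information about between-speaker similarity. With ten speakers the same functions would operate on 45 pairs.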
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::f6a6e30e5aa9ee825044b5483e59e64a https://doi.org/10.1016/j.specom.2020.08.003