Lexical speaker identification in TV shows

Authors:	Anindya Roy, Claude Barras, Viet Bac Le, Jean-Luc Gauvain, William Hartmann, Hervé Bredin
Contributors:	Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris-Sud - Paris 11 (UP11)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Université Paris Saclay (COmUE), Vocapia Research [Orsay], Vocapia
Year of publication:	2014
Subject:	Topic model; Computer Networks and Communications; Computer science; Speech recognition; 02 engineering and technology; Cepstrum; 0202 electrical engineering, electronic engineering, information engineering; Media Technology; [INFO]Computer Science [cs]; tf–idf; [INFO.INFO-MM]Computer Science [cs]/Multimedia [cs.MM]; 020206 networking & telecommunications; Speaker diarisation; Support vector machine; Hardware and Architecture; Identity (object-oriented programming); Speaker identification; NIST; 020201 artificial intelligence & image processing; Artificial intelligence; Software; Natural language processing
Source:	Multimedia Tools and Applications, Springer Verlag, 2015, 74 (4), pp.1377-1396. ⟨10.1007/s11042-014-1940-3⟩
ISSN:	1573-7721 1380-7501
DOI:	10.1007/s11042-014-1940-3
Description:	The final publication is available at https://link.springer.com/article/10.1007/s11042-014-1940-3; International audience; It is possible to use lexical information extracted from speech transcripts for speaker identification (SID), either on its own or, upon fusion, to improve the performance of standard cepstral-based SID systems. This has previously been established, typically using isolated speech from single speakers (NIST SRE corpora, parliamentary speeches). In contrast, this work applies lexical approaches for SID to a different type of data. It uses the REPERE corpus, consisting of unsegmented multiparty conversations, mostly debates, discussions and Q&A sessions from TV shows. It is hypothesized that people give out clues to their identity when speaking in such settings, which this work aims to exploit. The impact on SID performance of the diarization front-end required to pre-process the unsegmented data is also measured. Four lexical SID approaches are studied in this work, including TFIDF, BM25 and LDA-based topic modeling. Results are analysed in terms of TV shows and speaker roles. Lexical approaches achieve low error rates for certain speaker roles such as anchors and journalists, sometimes lower than a standard cepstral-based Gaussian Supervector-Support Vector Machine (GSV-SVM) system. Also, in certain cases, score-level sum fusion with the lexical system yields a modest improvement over the performance of the cepstral-based system. To highlight the potential of using lexical information not just to improve upon cepstral-based SID systems but as an independent approach in its own right, initial studies on crossmedia SID are briefly reported. Instead of using speech data, as all cepstral systems require, this approach uses Wikipedia texts to train lexical speaker models, which are then tested on speech transcripts to identify speakers.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::c04531b35677c95eb43184f93862fb0f https://doi.org/10.1007/s11042-014-1940-3
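
The description mentions TF-IDF lexical speaker models and score-level sum fusion with a cepstral system. The following is a minimal illustrative sketch of that general idea, not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, cosine scoring, and all function names (`train_tfidf_models`, `score_speakers`, `sum_fusion`) are assumptions made for this example. In practice, scores from the two systems would likely need normalization before being summed.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a real ASR transcript pipeline.
    return text.lower().split()


def train_tfidf_models(speaker_texts):
    """Build one TF-IDF vector per speaker from a dict {speaker: training text}."""
    term_counts = {spk: Counter(tokenize(txt)) for spk, txt in speaker_texts.items()}
    n_speakers = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each speaker counts once per term
    # Assumption: smoothed IDF, so terms shared by every speaker keep a small positive weight.
    idf = {t: math.log((1 + n_speakers) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    models = {
        spk: {t: c * idf[t] for t, c in counts.items()}
        for spk, counts in term_counts.items()
    }
    return idf, models


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_speakers(transcript, idf, models):
    """Score a test transcript against every speaker model; higher means more similar."""
    counts = Counter(tokenize(transcript))
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}  # ignore unseen terms
    return {spk: cosine(vec, model) for spk, model in models.items()}


def sum_fusion(lexical_scores, cepstral_scores):
    # Score-level sum fusion: add the per-speaker scores of the two systems.
    return {spk: lexical_scores[spk] + cepstral_scores[spk] for spk in lexical_scores}
```

The speaker with the highest (fused) score would be taken as the identification hypothesis, for example `max(scores, key=scores.get)`. For the crossmedia setting described above, `speaker_texts` would hold text such as Wikipedia articles rather than speech transcripts.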