Person instance graphs for mono-, cross- and multi-modal person recognition in multimedia data: application to speaker identification in TV broadcast

Authors: Viet Bac Le, Anindya Roy, Hervé Bredin, Claude Barras
Contributors: Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris Saclay (COmUE)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université - UFR d'Ingénierie (UFR 919), Sorbonne Université (SU)-Sorbonne Université (SU)-Université Paris-Saclay-Université Paris-Sud - Paris 11 (UP11), Vocapia Research [Orsay], Vocapia
Language: English
Year of publication: 2014
Subject:
Source: International Journal of Multimedia Information Retrieval
International Journal of Multimedia Information Retrieval, Springer, 2014, 3 (3), pp.161-175. ⟨10.1007/s13735-014-0055-y⟩
ISSN: 2192-6611
2192-662X
DOI: 10.1007/s13735-014-0055-y
Description: The final publication is available at https://link.springer.com/article/10.1007/s13735-014-0055-y; International audience; This work introduces a unified framework for mono-, cross- and multi-modal person recognition in multimedia data. Dubbed Person Instance Graph, it models the person recognition task as a graph mining problem, i.e. finding the best mapping between person instance vertices and identity vertices. Practically, we describe how the approach can be applied to speaker identification in TV broadcast. Then, a solution to the above-mentioned mapping problem is proposed. It relies on Integer Linear Programming to model the problem of clustering person instances based on their identity. We provide an in-depth theoretical definition of the optimization problem. Moreover, we improve two fundamental aspects of our previous related work: the problem constraints and the optimized objective function. Finally, a thorough experimental evaluation of the proposed framework is performed on a publicly available benchmark database. Depending on the graph configuration (i.e. the choice of its vertices and edges), we show that multiple tasks can be addressed interchangeably (e.g. speaker diarization, supervised or unsupervised speaker identification), significantly outperforming state-of-the-art mono-modal approaches.
Database: OpenAIRE
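
Illustrative note: the abstract describes mapping person instance vertices to identity vertices by solving a clustering problem. The sketch below is not the formulation from the paper, whose constraints and objective are not given in this record. It assumes a generic correlation-clustering-style objective, in which each edge carries a hypothetical probability that its two endpoints are the same person, and it replaces the Integer Linear Programming solver with exhaustive search over a toy graph. The function name `best_mapping`, the vertex names, and all probabilities are invented for illustration.

```python
from itertools import product


def best_mapping(instances, identities, prob):
    """Exhaustively search for the best labeling of person instances.

    Assumed objective (not taken from the paper): an edge (u, v) with
    same-person probability p contributes p if u and v share a label,
    and 1 - p otherwise. Assigning labels keeps clusters transitive and
    keeps distinct identity vertices in distinct clusters by construction.
    """
    # Each instance may take a known identity or one of several anonymous labels.
    labels = list(identities) + [f"unknown_{k}" for k in range(len(instances))]
    best_score, best = float("-inf"), None
    for assignment in product(labels, repeat=len(instances)):
        mapping = dict(zip(instances, assignment))
        score = 0.0
        for (u, v), p in prob.items():
            # Identity vertices are not in `mapping`; they act as their own label.
            label_u = mapping.get(u, u)
            label_v = mapping.get(v, v)
            score += p if label_u == label_v else 1.0 - p
        if score > best_score:
            best_score, best = score, mapping
    return best, best_score


# Hypothetical toy graph: three speech turns, two known speakers.
instances = ["turn1", "turn2", "turn3"]
identities = ["Alice", "Bob"]
prob = {
    ("turn1", "Alice"): 0.9,
    ("turn1", "turn2"): 0.8,
    ("turn2", "Bob"): 0.3,
    ("turn3", "Bob"): 0.7,
    ("turn1", "turn3"): 0.1,
    ("turn2", "turn3"): 0.2,
}

mapping, score = best_mapping(instances, identities, prob)
print(mapping, round(score, 3))
```

Exhaustive search is only workable for a handful of vertices. In the same toy setting, dropping the identity vertices leaves a pure clustering of instances (a diarization-like task), which echoes the abstract's remark that the choice of vertices and edges determines the task being addressed.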