Using audio-visual information to understand speaker activity: Tracking active speakers on and off screen

Authors:	Ian Sturdy, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, Ken Hoover
Year of publication:	2018
Subject:	Beamforming; Computer science; Speech recognition; 020206 networking & telecommunications; 02 engineering and technology; Image segmentation; Coherence (statistics); Task (project management); Visualization; 030507 speech-language pathology & audiology; 03 medical and health sciences; 0202 electrical engineering, electronic engineering, information engineering; Task analysis; Natural (music); 0305 other medical science
Source:	ICASSP
DOI:	10.1109/icassp.2018.8461891
Description:	We present a system that associates faces with voices in a video by fusing information from the audio and visual signals. The thesis underlying our work is that an extremely simple approach to generating (weak) speech clusters can be combined with strong visual signals to effectively associate faces and voices by aggregating statistics across a video. This approach does not need any training data specific to this task and leverages the natural coherence of information in the audio and visual streams. It is particularly applicable to tracking speakers in videos on the web where a priori information about the environment (e.g., number of speakers, spatial signals for beamforming) is not available.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::9838d68b201d0a3de7cc8dd4da88fcef https://doi.org/10.1109/icassp.2018.8461891
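
Illustrative note: the description above mentions associating faces and voices "by aggregating statistics across a video" but gives no implementation details. The following Python sketch shows one simple way such aggregation could look, and is not the authors' method. It assumes a hypothetical per-frame input of a weak speech-cluster label (or `None` for silence) plus the set of face tracks that some visual detector flagged as speaking; the function name and input format are invented for illustration.

```python
from collections import Counter, defaultdict


def associate_faces_with_voices(frames):
    """Map each speech cluster to the face it co-occurs with most often.

    `frames` is an iterable of (cluster_id, active_faces) pairs, where
    `cluster_id` is a speech-cluster label or None for non-speech frames,
    and `active_faces` is a set of face-track ids judged visually active.
    This input format is an assumption made for this sketch.
    """
    cooccurrence = defaultdict(Counter)
    for cluster_id, active_faces in frames:
        if cluster_id is None:
            continue  # no speech in this frame, nothing to count
        # Touch the entry so clusters with no visible speaker are still recorded.
        counts = cooccurrence[cluster_id]
        for face_id in active_faces:
            counts[face_id] += 1

    assignment = {}
    for cluster_id, counts in cooccurrence.items():
        if counts:
            # Pick the face with the highest co-occurrence count.
            assignment[cluster_id] = counts.most_common(1)[0][0]
        else:
            # Never seen with an active face: treat as an off-screen speaker.
            assignment[cluster_id] = None
    return assignment
```

Counting over the whole video, rather than deciding frame by frame, lets occasional errors in either the speech clusters or the visual signal be outvoted by the consistent pairings.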