Self-Supervised Learning from Automatically Separated Sound Scenes

Autor:	Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing Moore, Xavier Serra
Rok vydání:	2021
Předmět:	FOS: Computer and information sciences Computer Science - Machine Learning Sound (cs.SD) contrastive learning Audio and Speech Processing (eess.AS) source separation FOS: Electrical engineering electronic engineering information engineering audio representation learning self-supervision Computer Science - Sound Electrical Engineering and Systems Science - Audio and Speech Processing Machine Learning (cs.LG)
DOI:	10.48550/arxiv.2105.02132
Popis:	Comunicació presentada a 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), celebrat del 17 al 20 d'octubre de 2021 a New Paltz, Estats Units. Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::b61bcf54e52a20604b992da9fb020b48 Zobrazit plný text záznamu