TimeScaleNet: A Multiresolution Approach for Raw Audio Recognition Using Learnable Biquadratic IIR Filters and Residual Networks of Depthwise-Separable One-Dimensional Atrous Convolutions

Author: Eric Bavu, Alexandre Garcia, Aro Ramamonjy, Hadrien Pujol
Contributors: Laboratoire de Mécanique des Structures et des Systèmes Couplés (LMSSC), Conservatoire National des Arts et Métiers (CNAM)
Year: 2019
Source: IEEE Journal of Selected Topics in Signal Processing, IEEE, 2019, 13 (2), pp. 220-235. ⟨10.1109/JSTSP.2019.2908696⟩
ISSN: 1932-4553; eISSN: 1941-0484
DOI: 10.1109/jstsp.2019.2908696
Description: In this paper, we show the benefit of a multi-resolution approach that encodes the relevant information contained in unprocessed time-domain acoustic signals. TimeScaleNet aims at learning an efficient representation of a sound by learning time dependencies both at the sample level and at the frame level. The proposed approach improves the interpretability of the learning scheme by unifying advanced deep learning and signal processing techniques. In particular, TimeScaleNet's architecture introduces a new form of recurrent neural layer, directly inspired by digital infinite impulse response (IIR) signal processing. This layer acts as a learnable band-pass biquadratic digital IIR filterbank, which builds a time-frequency-like feature map that self-adapts to the specific recognition task and dataset, with a large receptive field and very few learnable parameters. The resulting frame-level feature map is then processed by a residual network of depthwise-separable atrous convolutions. This second scale of analysis efficiently encodes relationships between time fluctuations at the frame timescale, in different learnt pooled frequency bands, over the [20 ms, 200 ms] range. TimeScaleNet is tested on both the Speech Commands Dataset and the ESC-10 Dataset. We report a high mean accuracy of $94.87 \pm 0.24\%$ (macro-averaged F1-score: $94.9 \pm 0.24\%$) for speech recognition, and a more moderate accuracy of $69.71 \pm 1.91\%$ (macro-averaged F1-score: $70.14 \pm 1.57\%$) for the environmental sound classification task.
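The band-pass biquadratic IIR recursion underlying the layer described above can be sketched as follows. This is a minimal NumPy illustration of a second-order band-pass filter, not the paper's actual implementation: the parameterisation by a centre frequency `f0` and quality factor `q` (the quantities a training loop would treat as learnable, one pair per filter in the bank) follows the standard RBJ audio-EQ formulas and is an assumption for illustration.

```python
import numpy as np

def biquad_bandpass_coeffs(f0, q, fs):
    """RBJ-style band-pass biquad coefficients (0 dB peak gain) for centre
    frequency f0 (Hz) and quality factor q at sample rate fs (Hz).
    In a learnable filterbank, f0 and q would be trainable parameters;
    this parameterisation is illustrative, not the paper's."""
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([alpha, 0.0, -alpha])
    a = np.array([1.0 + alpha, -2.0 * np.cos(w0), 1.0 - alpha])
    # Normalise so that a[0] == 1.
    return b / a[0], a / a[0]

def biquad_filter(x, b, a):
    """Direct-form-I recursion, the recurrent update the layer performs:
    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        y[n] = b[0] * x[n]
        if n >= 1:
            y[n] += b[1] * x[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            y[n] += b[2] * x[n - 2] - a[2] * y[n - 2]
    return y

# Example: a 1 kHz band-pass filter passes a 1 kHz tone and
# strongly attenuates a 100 Hz tone.
fs = 16000.0
t = np.arange(int(0.5 * fs)) / fs
b, a = biquad_bandpass_coeffs(1000.0, 5.0, fs)
in_band = biquad_filter(np.sin(2.0 * np.pi * 1000.0 * t), b, a)
out_band = biquad_filter(np.sin(2.0 * np.pi * 100.0 * t), b, a)
```

Note that the recursion on `y` is what makes the layer recurrent: each filter has only a handful of parameters yet an effectively unbounded receptive field, which is the efficiency argument the abstract makes.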
Database: OpenAIRE