Explaining Deep Learning Models for Speech Enhancement

Author:	Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr
Contributors:	Microsoft Corporation [Redmond, Wash.]; Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est; Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD); Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Institut National de Recherche en Informatique et en Automatique (Inria); Centre National de la Recherche Scientifique (CNRS); Université de Lorraine (UL); ANR-16-CE33-0006, VOCADOM, Commande vocale robuste adaptée à la personne et au contexte pour l'autonomie à domicile (2016). This work was made with the support of the French National Research Agency, in the framework of the project VOCADOM 'Robust voice command adapted to the user and to the context for AAL' (ANR-16-CE33-0006). Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations (see https://www.grid5000.fr).
Publication year:	2021
Subject:	Artificial neural network; Computer science; Deep learning; Speech recognition; Word error rate; 020206 networking & telecommunications; Context (language use); 02 engineering and technology; explainable AI; Speech enhancement; 030507 speech-language pathology & audiology; 03 medical and health sciences; Noise; feature attribution; [INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG]; Robustness (computer science); [INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD]; 0202 electrical engineering, electronic engineering, information engineering; Feature (machine learning); Artificial intelligence; 0305 other medical science
Source:	INTERSPEECH 2021, Aug 2021, Brno, Czech Republic. ⟨10.21437/Interspeech.2021-1764⟩
DOI:	10.21437/interspeech.2021-1764
Description:	International audience; We consider the problem of explaining the robustness to mismatched noise conditions of neural networks used to compute time-frequency masks for speech enhancement. We employ the Deep SHapley Additive exPlanations (DeepSHAP) feature attribution method to quantify the contribution of every time-frequency bin in the input noisy speech signal to every time-frequency bin in the output time-frequency mask. We define an objective metric, referred to as the speech relevance score, that summarizes the obtained SHAP values, and we show that it correlates with the enhancement performance, as measured by the word error rate on the CHiME-4 real evaluation dataset. We use the speech relevance score to explain the generalization ability of three speech enhancement models trained using synthetically generated speech-shaped noise, noise from a professional sound effects library, or real CHiME-4 noise. To the best of our knowledge, this is the first study on neural network explainability in the context of speech enhancement.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::9f0e8c6b342f8c39c39ebff8ed0ca17e https://doi.org/10.21437/interspeech.2021-1764