Filterbank design for end-to-end speech separation

Authors: Emmanuel Vincent, Manuel Pariente, Antoine Deleforge, Samuele Cornell
Contributors: Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria), Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Université de Lorraine (UL), Centre National de la Recherche Scientifique (CNRS), Università Politecnica delle Marche [Ancona] (UNIVPM), Grid5000, Pariente, Manuel
Language: English
Year of publication: 2020
Subject:
Signal Processing (eess.SP)
FOS: Computer and information sciences
Masking (art)
Sound (cs.SD)
Computer Science - Machine Learning
Computer science
Speech recognition
02 engineering and technology
Computer Science - Sound
Machine Learning (cs.LG)
Set (abstract data type)
030507 speech-language pathology & audiology
03 medical and health sciences
End-to-end principle
[STAT.ML] Statistics [stat]/Machine Learning [stat.ML]
Audio and Speech Processing (eess.AS)
FOS: Electrical engineering, electronic engineering, information engineering
0202 electrical engineering, electronic engineering, information engineering
Electrical Engineering and Systems Science - Signal Processing
Short-time Fourier transform
020206 networking & telecommunications
Filter bank
Speaker recognition
[INFO.INFO-SD] Computer Science [cs]/Sound [cs.SD]
0305 other medical science
Electrical Engineering and Systems Science - Audio and Speech Processing
Source: ICASSP 2020-45th International Conference on Acoustics, Speech, and Signal Processing, May 2020, Barcelona, Spain
Description: Single-channel speech separation has recently made great progress thanks to learned filterbanks as used in ConvTasNet. In parallel, parameterized filterbanks have been proposed for speaker recognition where only center frequencies and bandwidths are learned. In this work, we extend real-valued learned and parameterized filterbanks into complex-valued analytic filterbanks and define a set of corresponding representations and masking strategies. We evaluate these filterbanks on a newly released noisy speech separation dataset (WHAM). The results show that the proposed analytic learned filterbank consistently outperforms the real-valued filterbank of ConvTasNet. We also validate the use of parameterized filterbanks and show that complex-valued representations and masks are beneficial in all conditions. Finally, we show that the STFT achieves its best performance for 2 ms windows.
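The description compresses several filterbank ideas into a few sentences. The following is a minimal NumPy sketch of how they fit together, assuming 8 kHz audio and 2 ms (16-sample) kernels with 50% overlap. It builds a real band-pass filter described only by a center frequency and a bandwidth (a sinc-based parameterization in the style of SincNet, which is an assumption here; the paper's exact parameterization may differ), extends it to a complex analytic filter via the Hilbert transform, encodes a signal with a strided convolution, derives real, magnitude, and stacked real/imaginary representations, and applies a complex mask using real arithmetic. All function names, filter counts, and frequency values are hypothetical, and nothing is learned: this is only the forward computation whose parameters (full filter taps, or just centers and bandwidths) a separation network would train.

```python
import numpy as np

SAMPLE_RATE = 8000          # assumption: 8 kHz, a common WHAM setting
KERNEL_SIZE = 16            # 16 samples = 2 ms at 8 kHz
STRIDE = KERNEL_SIZE // 2   # 50% overlap between frames


def sinc_bandpass(center_hz, band_hz, kernel_size=KERNEL_SIZE, sample_rate=SAMPLE_RATE):
    """Real band-pass FIR filter defined only by its center frequency and bandwidth (Hz),
    built as the difference of two windowed ideal low-pass (sinc) responses."""
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate  # centered time axis (s)
    f_low, f_high = center_hz - band_hz / 2, center_hz + band_hz / 2

    def lowpass(f):
        # np.sinc(x) = sin(pi x) / (pi x); 2 f sinc(2 f t) / fs is a sampled ideal low-pass
        return 2 * f * np.sinc(2 * f * t) / sample_rate

    return (lowpass(f_high) - lowpass(f_low)) * np.hamming(kernel_size)


def analytic(real_filters):
    """Analytic version h + j*Hilbert(h) of each real filter (one per row), obtained by
    zeroing negative frequencies (the same recipe as scipy.signal.hilbert)."""
    n = real_filters.shape[-1]
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(real_filters, axis=-1) * weights, axis=-1)


def encode(x, filters, stride=STRIDE):
    """Strided correlation of signal x with every filter (like a conv1d encoder).
    Returns an array of shape (n_filters, n_frames)."""
    k = filters.shape[-1]
    n_frames = 1 + (len(x) - k) // stride
    frames = np.stack([x[i * stride:i * stride + k] for i in range(n_frames)], axis=1)
    return filters @ frames


def complex_mask_real_arith(rep_re, rep_im, mask_re, mask_im):
    """Complex mask (m_r + j m_i) applied to X_r + j X_i with real arithmetic only,
    as a real-valued network working on stacked [Re; Im] features would do it."""
    return mask_re * rep_re - mask_im * rep_im, mask_re * rep_im + mask_im * rep_re


# Usage: 8 hypothetical filters with evenly spaced centers and a fixed 400 Hz bandwidth.
rng = np.random.default_rng(0)
centers = np.linspace(250, 3750, 8)
bands = np.full(8, 400.0)
real_fb = np.stack([sinc_bandpass(c, b) for c, b in zip(centers, bands)])
cplx_fb = analytic(real_fb)

x = rng.standard_normal(SAMPLE_RATE // 10)   # 0.1 s of noise standing in for a mixture
X = encode(x, cplx_fb)                        # complex-valued representation
representations = {
    "real": X.real,                              # what the real filterbank alone gives
    "magnitude": np.abs(X),                      # phase discarded
    "re_im": np.concatenate([X.real, X.imag]),   # stacked real/imaginary features
}
mask = rng.uniform(size=X.shape) + 1j * rng.uniform(size=X.shape)
out_re, out_im = complex_mask_real_arith(X.real, X.imag, mask.real, mask.imag)
```

Because the real part of each analytic filter is the original real filter, the real part of the complex representation is exactly what the real-valued filterbank alone would produce; the imaginary part adds the quadrature component from which magnitude and phase follow. A complex mask can therefore both rescale and rotate each coefficient, whereas a real-valued mask can only rescale it.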
Database: OpenAIRE