Author:
Da-Hee Yang, Joon-Hyuk Chang
Language:
English
Year of publication:
2023
Subject:

Source:
Journal of King Saud University: Computer and Information Sciences, Vol 35, Iss 3, Pp 202-210 (2023)
Document type:
article
ISSN:
1319-1578
DOI:
10.1016/j.jksuci.2023.02.007
Description:
In this paper, we propose a joint training framework that efficiently combines time-domain speech enhancement (SE) with an end-to-end (E2E) automatic speech recognition (ASR) system using attention-based latent features. Because the E2E ASR model is trained on these latent features, various time-domain SE models can be plugged in for noise-robust ASR, and our framework is the first to take this approach. By applying a time-domain SE model, we implement a fully E2E pipeline from SE to ASR that requires neither domain knowledge nor short-time Fourier transform (STFT) consistency constraints. The core idea of our framework is therefore to use the latent features of the time-domain SE model as input features for ASR. Furthermore, we apply an attention mechanism to the time-domain SE model that selectively concentrates on certain latent features, yielding features more relevant to the recognition task. Detailed experiments are conducted on the hybrid CTC/attention architecture for E2E ASR, and we demonstrate the superiority of our approach over baseline ASR systems trained on Mel filter bank features. Compared to the baseline ASR model trained only on clean data, the proposed joint training method achieves 63.6% and 86.8% relative error reductions on the TIMIT and WSJ "matched" test sets, respectively.
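The abstract describes a pipeline in which a time-domain SE front end produces latent features, an attention layer re-weights them, and the result replaces Mel filter bank input to the E2E ASR encoder. Below is a minimal sketch of that data flow in PyTorch; the module names (TimeDomainSEEncoder, LatentAttention, JointSEASR), layer choices, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: SE latent features -> attention -> ASR encoder, trained jointly.
# All sizes and layer types are assumptions for illustration.
import torch
import torch.nn as nn

class TimeDomainSEEncoder(nn.Module):
    """1-D conv encoder over raw waveform (stand-in for a time-domain SE model)."""
    def __init__(self, latent_dim=256, kernel=16, stride=8):
        super().__init__()
        self.conv = nn.Conv1d(1, latent_dim, kernel_size=kernel, stride=stride)

    def forward(self, wav):                      # wav: (batch, samples)
        x = self.conv(wav.unsqueeze(1))          # (batch, latent_dim, frames)
        return torch.relu(x).transpose(1, 2)     # (batch, frames, latent_dim)

class LatentAttention(nn.Module):
    """Self-attention that selectively emphasizes latent SE features."""
    def __init__(self, latent_dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, z):
        out, _ = self.attn(z, z, z)              # attend over time frames
        return out

class JointSEASR(nn.Module):
    """Attention-weighted SE latents feed the ASR encoder instead of Mel features."""
    def __init__(self, latent_dim=256, vocab=32):
        super().__init__()
        self.se = TimeDomainSEEncoder(latent_dim)
        self.attn = LatentAttention(latent_dim)
        self.asr = nn.LSTM(latent_dim, latent_dim, batch_first=True)
        # Simplified stand-in for the CTC branch of a hybrid CTC/attention model.
        self.ctc_head = nn.Linear(latent_dim, vocab)

    def forward(self, wav):
        z = self.attn(self.se(wav))
        h, _ = self.asr(z)
        return self.ctc_head(h).log_softmax(-1)  # per-frame token log-probs

if __name__ == "__main__":
    model = JointSEASR()
    noisy = torch.randn(2, 16000)                # two 1-second utterances at 16 kHz
    print(model(noisy).shape)                    # (2, frames, vocab)
```

In this sketch the whole stack is differentiable end to end, so a CTC (or joint CTC/attention) loss on the ASR output also updates the SE encoder, which is the essence of the joint training the abstract describes.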
Database:
Directory of Open Access Journals
External link: