Enhancing spoken term detection with deep acoustic word embeddings and cross-modal matching techniques.

Author: Chantangphol, Pantid; Sakdejayont, Theerat; Chalothorn, Tawunrat
Subject:
Source: International Journal of Speech Technology; Dec 2024, Vol. 27, Issue 4, p875-886, 12p
Abstract: This study proposes a new approach to improving spoken term detection by employing acoustic word embeddings. Our model combines convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to capture sequential information and generate fixed-dimensional word-level embeddings. We introduce a novel deep word discrimination loss that increases the distinctiveness of these embeddings, thereby improving word differentiation. Additionally, we develop a matching scheme that combines a neural network framework with a text-to-speech technique to generate acoustic embeddings from text. These embeddings are crucial for effective cross-modal retrieval and audio indexing, especially for detecting unseen words. Our experimental results demonstrate that our method outperforms traditional baselines on word discrimination tasks, achieving higher mean Average Precision scores. Furthermore, our matching scheme significantly enhances spoken term detection for both regular and unseen words, paving the way for future advances in audio indexing, cross-modal retrieval, and search functionalities. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
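
To make the kind of model the abstract describes concrete, the sketch below shows a CNN + LSTM encoder that maps a variable-length acoustic segment to a fixed-dimensional, unit-norm word embedding that can be compared by cosine similarity. It is a minimal illustration, not the authors' implementation: the use of log-Mel features, all layer sizes, and the cosine-similarity scoring are assumptions, and the paper's deep word discrimination loss and text-to-speech matching scheme are not reproduced here.

```python
# Minimal sketch (not the authors' code): a CNN + LSTM acoustic word embedding
# encoder of the kind the abstract describes. Feature type, layer sizes, and
# the cosine-similarity comparison are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AcousticWordEncoder(nn.Module):
    """Maps a variable-length acoustic feature sequence (e.g. log-Mel frames)
    to a single fixed-dimensional word-level embedding."""

    def __init__(self, n_feats: int = 40, conv_channels: int = 64,
                 lstm_hidden: int = 256, embed_dim: int = 128):
        super().__init__()
        # 1-D convolutions over time capture local spectral patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # A bidirectional LSTM summarizes the sequential structure.
        self.lstm = nn.LSTM(conv_channels, lstm_hidden,
                            batch_first=True, bidirectional=True)
        # Project the final LSTM states to the fixed embedding dimension.
        self.proj = nn.Linear(2 * lstm_hidden, embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, n_feats)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)  # (batch, time, channels)
        _, (h_n, _) = self.lstm(x)                            # h_n: (2, batch, hidden)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)               # concat both directions
        return F.normalize(self.proj(h), dim=-1)              # unit-norm embedding


if __name__ == "__main__":
    encoder = AcousticWordEncoder()
    # Two word segments padded to a common length for simplicity; packed
    # sequences would handle truly variable lengths.
    segment_a = torch.randn(1, 80, 40)  # ~0.8 s of 10 ms frames, 40-dim features
    segment_b = torch.randn(1, 80, 40)
    emb_a, emb_b = encoder(segment_a), encoder(segment_b)
    # Spoken term detection then reduces to comparing embeddings, e.g. by
    # cosine similarity against an indexed collection of word embeddings.
    print("cosine similarity:", F.cosine_similarity(emb_a, emb_b).item())
```

In a setup like this, a discriminative training loss would pull embeddings of the same word together and push different words apart, and a text-side encoder (or synthesized speech, as in the abstract's matching scheme) would produce query embeddings in the same space for cross-modal retrieval of unseen words.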