Popis: |
Spoken term detection (STD) can be implemented effectively using fundamental techniques such as automatic speech recognition (ASR) and information retrieval: queried keywords are located in the decoded text and indexed lattices produced by the ASR system. However, this approach depends heavily on ASR performance and may not produce the desired results for out-of-vocabulary (OOV) words, which are absent from the ASR lexicon. To address this limitation, we analyzed semantic query expansion through extensive, reproducible experiments to assess its impact on search quality for OOV words. We propose enhancing existing spoken content retrieval methods by searching semantically expanded query sets and leveraging the advanced features of search engines. Our experiments, conducted on the Wall Street Journal (WSJ) datasets and the most frequent Google queries, demonstrate that the proposed approach significantly improves retrieval accuracy over the traditional word-based STD method for in-vocabulary (IV) terms. Specifically, the Actual Term Weighted Value (ATWV) score improved from 0 to 0.5776 for the trigram query category. Our approach also outperforms the proxy-based method for OOV words: whereas the proxy-based technique fails to retrieve results for both bigrams and trigrams, the semantic-based approach achieves ATWV scores of 0.7143 and 0.8846 for bigrams and trigrams, respectively. Furthermore, combining semantic-based query expansion with a full-text search engine yields substantial gains, improving the performance of the word-based STD system by a factor of approximately 3 to 4 on the bigram and trigram query categories. |