Description: |
Model-based spoken term detection usually requires a large amount of annotated training data. When such data are scarce, a method based on dynamic time warping (DTW) is a better choice. However, both model-based and classical DTW-based methods rely on frame-by-frame template matching, so the computational load is heavy and search efficiency is poor. We propose a fast two-stage approach to spoken term detection. In the first stage, prosodic dynamic features are used to rapidly locate hypothesized spoken term regions; in the second stage, Gaussian posteriorgrams are used to verify these local hypothesized regions more precisely. Because each prosodic feature vector has only three dimensions and represents several consecutive frames of speech at once, template matching can be performed segment by segment rather than frame by frame, which accelerates the whole keyword detection process. The two-stage method thus exploits both the long-time and short-time characteristics of speech. An experiment is conducted to demonstrate that our method improves speed while obtaining similar detection performance under the same conditions. |