Abstract: |
The quality of speech recognition systems has improved, with the focus shifting from short-utterance scenarios such as voice assistants and voice search to long-form scenarios such as voice typing and meeting transcription. In short-utterance setups, speech end-pointing plays a crucial role in perceived latency and user experience. In long-form scenarios, the primary goal is to generate well-formatted, highly readable transcriptions that can replace keyboard typing for tasks such as writing e-mails or text documents, so punctuation and capitalization become as important as recognition errors. For long utterances, valuable processing time, bandwidth, and other resources can be conserved by discarding unnecessary portions of the audio signal, which ultimately enhances system throughput. In this study, we develop a framework called Speech Segments Endpoint Detection, which uses short-time energy features, a simple Mel-spectrogram, and a hybrid Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) model for classification. We evaluated the CNN-BiLSTM classifier on a 35-h audio dataset comprising 16 h of speech and 19 h of audio containing music and noise, split into training and validation sets in an 80:20 ratio. Our model attained an accuracy of 98.67% on the training set and 93.62% on the validation set.
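
The abstract does not include implementation details, so the following is a minimal sketch of what a hybrid CNN-BiLSTM classifier over Mel-spectrogram windows could look like, assuming Keras/TensorFlow. The layer sizes, input dimensions (n_frames, n_mels), and training settings are illustrative assumptions, not the authors' exact configuration.

    # Minimal sketch of a CNN-BiLSTM speech/non-speech classifier.
    # All hyperparameters below are assumptions for illustration only.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_cnn_bilstm(n_frames=100, n_mels=64, n_classes=2):
        """Classify fixed-size Mel-spectrogram windows as speech or non-speech."""
        inputs = layers.Input(shape=(n_frames, n_mels, 1))

        # CNN front end: learn local time-frequency patterns.
        x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)

        # Collapse the frequency and channel axes so the BiLSTM
        # receives one feature vector per (downsampled) time step.
        x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)

        # BiLSTM back end: model temporal context in both directions.
        x = layers.Bidirectional(layers.LSTM(64))(x)
        x = layers.Dense(64, activation="relu")(x)
        outputs = layers.Dense(n_classes, activation="softmax")(x)

        model = models.Model(inputs, outputs)
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

Collapsing the frequency axis before the recurrent layer is one common way to hand the BiLSTM a per-frame feature sequence; the paper itself may arrange the layers differently.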