Popis: |
We might be living in a Screen Age: almost every day a new device with a bright touch screen is introduced. A possible antidote to our screen addiction is the voice interface. Current voice assistants are activated by keywords such as “hey Siri” or “okay Google” [1]. For the initial detection of these keywords, it is impractical to stream audio from every device over the Web all the time, as this would increase privacy risks and be costly to maintain. Voice interfaces therefore run a keyword-detection module locally on the device. For independent makers and entrepreneurs, it is hard to build a simple speech detector using free, open data and code. We have published our results as easy-to-train “Kaggle notebooks” [2]. With further improvement, such models could substitute for the keypads of our touch screens. In this work, we use convolutional neural networks (CNNs) to detect the keywords because of their ability to extract important features while discarding unimportant ones, which results in fewer parameters than networks built from fully connected layers. The networks used in this work are derived from CNNs that gave state-of-the-art results in image classification, e.g., the dense convolutional network (DenseNet) [3], the residual learning network (ResNet) [4], the squeeze-and-excitation network (SENet) [5], and VGG [6]. We discuss the performance of these CNN architectures for keyword recognition and also describe how to reproduce the results. The models achieve a top-one accuracy of ~96–97%, with an ensemble of all of them reaching ~98%, on the voice command dataset [7]. We conclude by analyzing the performance of all ten models and their ensemble. Our models recognize some keywords that were not recognized by humans. To promote further research, the code is available at https://github.com/xiaozhouwang/tensorflow_speech_recognition_solution.