Popis: |
In recent years, there has been growing interest in applying knowledge distillation (KD) techniques to the connectionist temporal classification (CTC) framework to train more efficient speech recognition models. Although conventional KD approaches successfully reduce the computational burden, they struggle to handle the inconsistency problem caused by dropout regularization, namely the gap between the training and inference stages. In the context of KD, this inconsistency can hinder the performance improvement of the student model. To overcome this issue, we propose a novel approach, Cons-KD, that combines KD with consistency regularization: the former trains the student model to benefit from the knowledge of the teacher model, while the latter trains the student model to be robust to the dropout-induced inconsistency. By directly mitigating this inconsistency, our KD framework further improves the student's performance over vanilla KD. Experimental results on the LibriSpeech dataset demonstrate that Cons-KD significantly outperforms previous KD methods, improving the word error rate (WER) from 5.10% to 4.13% on the test-clean subset and from 12.87% to 10.32% on the test-other subset. These improvements correspond to relative error rate reductions (RERR) of 19.02% and 19.81%, respectively, marking a notable advance over conventional KD methods. Additionally, we conduct an in-depth analysis to verify the effect of each proposed objective.
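
To make the combined objective concrete, the following is a minimal PyTorch-style sketch of a training loss in the spirit of Cons-KD. The function name `cons_kd_loss`, the loss weights `alpha`/`beta`/`gamma`, the temperature scaling, and the use of symmetric KL between two dropout forward passes are illustrative assumptions; the description above does not specify the exact formulation.

```python
# Illustrative sketch of a Cons-KD-style objective: CTC supervision + teacher-student
# KD + dropout-consistency regularization. Loss names, weights, and the symmetric-KL
# consistency term are assumptions for illustration, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def cons_kd_loss(student, teacher, feats, feat_lens, targets, target_lens,
                 alpha=1.0, beta=1.0, gamma=1.0, temperature=1.0):
    """Combine CTC loss, KD from a frozen teacher, and consistency between dropout passes."""
    # Two stochastic forward passes of the student (dropout active in train mode).
    s_logits1 = student(feats)          # assumed layout (T, B, V) for CTC
    s_logits2 = student(feats)
    with torch.no_grad():
        t_logits = teacher(feats)       # teacher is frozen during distillation

    log_p1 = F.log_softmax(s_logits1, dim=-1)
    log_p2 = F.log_softmax(s_logits2, dim=-1)

    # Supervised CTC loss on one student pass.
    ctc = F.ctc_loss(log_p1, targets, feat_lens, target_lens,
                     blank=0, zero_infinity=True)

    # KD loss: KL between temperature-scaled teacher and student frame distributions.
    t_prob = F.softmax(t_logits / temperature, dim=-1)
    kd = F.kl_div(F.log_softmax(s_logits1 / temperature, dim=-1),
                  t_prob, reduction="batchmean") * temperature ** 2

    # Consistency loss: symmetric KL between the two dropout passes of the student.
    cons = 0.5 * (F.kl_div(log_p1, log_p2.exp(), reduction="batchmean")
                  + F.kl_div(log_p2, log_p1.exp(), reduction="batchmean"))

    return alpha * ctc + beta * kd + gamma * cons
```

In this sketch the consistency term penalizes the divergence between two stochastic forward passes of the same student (R-Drop-style regularization); whether Cons-KD uses exactly this form of the consistency objective is an assumption made here for illustration.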