General Sequence Teacher–Student Learning
Author: | Jeremy H. M. Wong, Yu Wang, Mark J. F. Gales |
---|---|
Year of publication: | 2019 |
Subject: | Sequence; Acoustics and Ultrasonics; Computer science; Frame (networking); Flexibility (personality); Machine learning; Upper and lower bounds; Speech-language pathology & audiology; Medical and health sciences; Computational Mathematics; ComputingMethodologies_PATTERNRECOGNITION; Computer Science (miscellaneous); State (computer science); Limit (mathematics); Artificial intelligence; Electrical and Electronic Engineering; Transcription (software); Other medical science; Word (computer architecture) |
Source: | IEEE/ACM Transactions on Audio, Speech, and Language Processing. 27:1725-1736 |
ISSN: | 2329-9304 2329-9290 |
DOI: | 10.1109/taslp.2019.2929859 |
Description: | In automatic speech recognition, performance gains can often be obtained by combining an ensemble of multiple models. However, this can be computationally expensive when performing recognition. Teacher–student learning alleviates this cost by training a single student model to emulate the combined ensemble behaviour. Only this student needs to be used for recognition. Previously investigated teacher–student criteria often limit the forms of diversity allowed in the ensemble, and only propagate information from the teachers to the student at the frame level. This paper addresses both of these issues by examining teacher–student learning within a sequence-level framework, and assessing the flexibility that these approaches offer. Various sequence-level teacher–student criteria are examined in this work, to propagate sequence posterior information. A training criterion based on the Kullback–Leibler (KL) divergence between context-dependent state sequence posteriors is proposed, which allows a diversity of state cluster sets to be present in the ensemble. This criterion is shown to be an upper bound on a more general KL divergence between word sequence posteriors, which places even fewer restrictions on the ensemble diversity, but whose gradient can be expensive to compute. These methods are evaluated on the Augmented Multi-party Interaction (AMI) meeting transcription and MGB-3 television broadcast audio tasks. |
Database: | OpenAIRE |
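The relationship between the two KL divergences mentioned in the description can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a tiny, explicitly enumerated set of state sequences shared by all models, a fixed mapping from each state sequence to one word sequence, and a simple average of teacher posteriors as the ensemble combination. All names (`combine_teachers`, `marginalise`, the toy posteriors) are hypothetical. Under these assumptions, word sequence posteriors are marginals of state sequence posteriors, so the word-level KL divergence cannot exceed the state-level one.

```python
import math
from collections import defaultdict


def kl_divergence(p, q):
    """KL(p || q) for two dicts mapping the same keys to probabilities."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0.0)


def combine_teachers(posteriors, weights=None):
    """Weighted average of teacher posteriors (an assumed combination rule)."""
    n = len(posteriors)
    weights = weights or [1.0 / n] * n
    combined = defaultdict(float)
    for w, post in zip(weights, posteriors):
        for key, prob in post.items():
            combined[key] += w * prob
    return dict(combined)


def marginalise(state_post, state_to_word):
    """Sum state sequence posteriors that map to the same word sequence."""
    word_post = defaultdict(float)
    for state_seq, prob in state_post.items():
        word_post[state_to_word[state_seq]] += prob
    return dict(word_post)


# Hypothetical toy data: four state sequences mapping to three word sequences.
state_to_word = {
    "s1": "hello world",
    "s2": "hello world",
    "s3": "hello word",
    "s4": "yellow world",
}
teacher_a = {"s1": 0.5, "s2": 0.2, "s3": 0.2, "s4": 0.1}
teacher_b = {"s1": 0.3, "s2": 0.4, "s3": 0.1, "s4": 0.2}
student = {"s1": 0.25, "s2": 0.25, "s3": 0.25, "s4": 0.25}

ensemble = combine_teachers([teacher_a, teacher_b])

# State-level criterion: KL between state sequence posteriors.
kl_state = kl_divergence(ensemble, student)

# Word-level criterion: KL between the marginalised word sequence posteriors.
kl_word = kl_divergence(
    marginalise(ensemble, state_to_word),
    marginalise(student, state_to_word),
)

print(f"state-level KL: {kl_state:.4f}")
print(f"word-level KL:  {kl_word:.4f}")
```

In a real recogniser the sequence posteriors would be computed over lattices rather than an enumerated list, and the teachers may use different state cluster sets, which this sketch does not model.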