ASR for Under-Resourced Languages From Probabilistic Transcription
Author: | Tyler Kekona, Adrian K. C. Lee, Rose Sloan, Bradley Ekin, Chunxi Liu, Majid Mirbagheri, Daniel McCloy, Preethi Jyothi, Paul Hager, Amit Das, Edmund C. Lalor, Giovanni M. Di Liberto, Vimal Manohar, Nancy F. Chen, Mark Hasegawa-Johnson, Hao Tang |
Year of publication: | 2017 |
Subject: | Speech perception; Acoustics and Ultrasonics; Computer science; Speech recognition; Crowdsourcing; Probability mass function; Probabilistic logic; Computational Mathematics; Pattern recognition; Language model; Artificial intelligence; Natural language processing; Electrical and Electronic Engineering |
Source: | IEEE/ACM Transactions on Audio, Speech, and Language Processing. 25:50-63 |
ISSN: | 2329-9290 (print), 2329-9304 (electronic) |
DOI: | 10.1109/taslp.2016.2621659 |
Description: | In many under-resourced languages it is possible to find text, and it is possible to find speech, but transcribed speech suitable for training automatic speech recognition (ASR) is unavailable. In the absence of native transcripts, this paper proposes the use of a probabilistic transcript: a probability mass function over possible phonetic transcripts of the waveform. Three sources of probabilistic transcripts are demonstrated. First, self-training is a well-established semi-supervised learning technique in which a cross-lingual ASR first labels unlabeled speech and is then adapted using the same labels. Second, mismatched crowdsourcing is a recent technique in which non-speakers of the language are asked to write what they hear, and their nonsense transcripts are decoded using noisy-channel models of second-language speech perception. Third, EEG distribution coding is a new technique in which non-speakers of the language listen to it, and their electrocortical response signals are interpreted to indicate probabilities. ASR systems were trained in four languages without native transcripts. Adaptation using mismatched crowdsourcing significantly outperformed self-training, and both significantly outperformed a cross-lingual baseline. Both EEG distribution coding and text-derived phone language models were shown to improve the quality of probabilistic transcripts derived from mismatched crowdsourcing. |
Database: | OpenAIRE |
External link: |
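
The probabilistic transcript in the description above is simply a probability mass function over candidate phone strings. For mismatched crowdsourcing, it can be obtained as a noisy-channel posterior, P(phones | annotation) proportional to P(annotation | phones) x P(phones), where the second factor could come from a text-derived phone language model. The following is a minimal Python sketch of that computation under stated assumptions: the candidate phone strings, channel probabilities, and phone-language-model probabilities are invented placeholders, not values from the paper.

```python
# Minimal sketch: derive a probabilistic transcript (a PMF over phone
# strings) from one mismatched-crowdsourcing annotation via Bayes' rule.
# All phone strings and probabilities below are illustrative placeholders.

channel = {  # assumed P(nonsense annotation | true phone string)
    ("pata", "p a t a"): 0.5,
    ("pata", "b a d a"): 0.2,
    ("pata", "p a d a"): 0.3,
}
phone_lm = {  # assumed P(phone string), e.g. from a text-derived phone LM
    "p a t a": 0.40,
    "b a d a": 0.35,
    "p a d a": 0.25,
}

def probabilistic_transcript(annotation: str) -> dict[str, float]:
    """Posterior PMF over candidate phone strings given one annotation."""
    scores = {phones: channel.get((annotation, phones), 0.0) * prior
              for phones, prior in phone_lm.items()}
    total = sum(scores.values())
    return {phones: score / total for phones, score in scores.items()}

if __name__ == "__main__":
    for phones, prob in probabilistic_transcript("pata").items():
        print(f"P({phones!r} | 'pata') = {prob:.3f}")
```

Normalizing by the total turns the channel-times-prior scores into a proper probability mass function, so downstream acoustic-model adaptation can weight each candidate phone transcript by its posterior rather than committing to a single one-best label.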