Web resources for language modeling in conversational speech recognition

Authors:	Mari Ostendorf, Ivan Bulyko, Özgür Çetin, Andreas Stolcke, Tim Ng, Man-Hung Siu
Year of publication:	2007
Subject:	Register (sociolinguistics); Data collection; Computer science; Mixture model; Variety (linguistics); Filter (software); Computational Mathematics; Computer Science (miscellaneous); Language model; Artificial intelligence; Web resource; Natural language processing; Sublanguage
Source:	ACM Transactions on Speech and Language Processing. 5:1-25
ISSN:	1550-4883 1550-4875
DOI:	10.1145/1322391.1322392
Description:	This article describes a methodology for collecting text from the Web to match a target sublanguage both in style (register) and topic. Unlike other work that estimates n-gram statistics from page counts, the approach here is to select and filter documents, which provides more control over the type of material contributing to the n-gram counts. The data can be used in a variety of ways; here, the different sources are combined in two types of mixture models. Focusing on conversational speech, where data collection can be quite costly, experiments demonstrate the positive impact of Web collections on several tasks with varying amounts of data, including Mandarin and English telephone conversations and English meetings and lectures.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::4286f2401b7cd053dcc49e240297c0bf https://doi.org/10.1145/1322391.1322392