Improving Trigram Language Modeling with the World Wide Web
Author: | Roni Rosenfeld, Xiaojin Zhu |
---|---|
Year of publication: | 2023 |
Subject: |
FOS: Computer and information sciences
Information retrieval; Phrase; Computer science; Word error rate; World Wide Web; Trigram tagger; Test set; Web page; 89999 Information and Computing Sciences not elsewhere classified; Trigram; Language model; Artificial intelligence; 80107 Natural Language Processing; Natural language; Natural language processing |
Source: | ICASSP Scopus-Elsevier |
DOI: | 10.1184/r1/21710042 |
Description: | We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to web search engines. The search engines return the number of web pages containing the phrase, from which the N-gram count is estimated. The N-gram counts are then used to form web-based trigram probability estimates. We discuss the properties of such estimates, and methods to interpolate them with traditional corpus-based trigram estimates. We show that the interpolated models significantly improve speech recognition word error rate on a small test set. |
Database: | OpenAIRE |
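
The pipeline outlined in the description (phrase-query page counts, then a web-based trigram estimate, then interpolation with a corpus-based estimate) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the page counts are invented and no search engine is contacted, the estimate is a plain relative frequency, and the interpolation uses a single fixed weight. The function names `web_trigram_estimate` and `interpolate` are hypothetical, and the estimators actually used in the paper may differ.

```python
from typing import Dict, Optional, Tuple

# Hypothetical page counts standing in for search-engine hit counts.
# The numbers are invented for illustration only.
PAGE_COUNTS: Dict[Tuple[str, ...], int] = {
    ("the", "world"): 3000,
    ("the", "world", "wide"): 900,
}


def web_trigram_estimate(
    w1: str, w2: str, w3: str, counts: Dict[Tuple[str, ...], int]
) -> Optional[float]:
    """Relative-frequency estimate count(w1 w2 w3) / count(w1 w2).

    Returns None when the history (w1, w2) has no web evidence.
    """
    history_count = counts.get((w1, w2), 0)
    if history_count == 0:
        return None
    trigram_count = counts.get((w1, w2, w3), 0)
    # Assumed guard: hit counts are only estimates, so cap the ratio at 1.
    return min(trigram_count / history_count, 1.0)


def interpolate(
    p_corpus: float, p_web: Optional[float], web_weight: float = 0.5
) -> float:
    """Linear interpolation of the two estimates with a fixed weight (assumed).

    Falls back to the corpus-based estimate when no web estimate exists.
    """
    if p_web is None:
        return p_corpus
    return web_weight * p_web + (1.0 - web_weight) * p_corpus


p_web = web_trigram_estimate("the", "world", "wide", PAGE_COUNTS)
p_mixed = interpolate(p_corpus=0.1, p_web=p_web, web_weight=0.5)
print(p_web, p_mixed)
```

In this sketch the interpolation weight is a free parameter; in practice it would be tuned on held-out data rather than fixed at 0.5.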