Web Text Data Mining for Building Large Scale Language Modelling Corpus
Authors: | Daniel Soutner, Jan Švec, Jan Hoidekr, Jan Vavruška |
---|---|
Year of publication: | 2011 |
Subject: |
Text corpus; Perplexity; business.industry; Computer science; computer.software_genre; Consistency (database systems); ComputingMethodologies_PATTERNRECOGNITION; Server; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Preprocessor; The Internet; Artificial intelligence; Data mining; Architecture; business; Scale (map); computer; Natural language processing |
Source: | Text, Speech and Dialogue (TSD), ISBN: 9783642235375 |
Description: | The paper describes a system for collecting a large text corpus from Internet news servers. It presents the system architecture, the text preprocessing algorithms, and the duplicate detection algorithm used. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are presented. |
Database: | OpenAIRE |
External link: |