Web Text Data Mining for Building Large Scale Language Modelling Corpus

Authors: Daniel Soutner, Jan Švec, Jan Hoidekr, Jan Vavruška
Year of publication: 2011
Subject:
Source: Text, Speech and Dialogue ISBN: 9783642235375
TSD
Description: The paper describes a system for collecting a large text corpus from Internet news servers. It presents the system architecture, the text preprocessing algorithms, and the duplicate detection algorithm used. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are also presented.
Database: OpenAIRE
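
Illustrative note: the record does not say which duplicate detection algorithm the paper uses. The sketch below shows one common approach, comparing articles by the Jaccard similarity of their word 3-gram (shingle) sets. The function names, the shingle size, and the 0.8 threshold are all assumptions made for illustration and are not taken from the paper.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(articles, threshold=0.8, n=3):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    The threshold and shingle size are hypothetical defaults.
    """
    sets = [shingles(text, n) for text in articles]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

This all-pairs comparison is quadratic in the number of articles, so it would not scale to millions of documents as written. At that size, approximate methods such as MinHash with locality-sensitive hashing are commonly used instead.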