Web Text Data Mining for Building Large Scale Language Modelling Corpus

Authors:	Daniel Soutner, Jan Švec, Jan Hoidekr, Jan Vavruška
Year of publication:	2011
Subject:	Text corpus; Perplexity; business.industry; Computer science; computer.software_genre; Consistency (database systems); ComputingMethodologies_PATTERNRECOGNITION; Server; ComputingMethodologies_DOCUMENTANDTEXTPROCESSING; Preprocessor; The Internet; Artificial intelligence; Data mining; Architecture; business; Scale (map); computer; Natural language processing
Source:	Text, Speech and Dialogue ISBN: 9783642235375 TSD
Description:	The paper describes a system for collecting a large text corpus from Internet news servers. The architecture and text preprocessing algorithms are described, along with the duplicate detection algorithm used. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are presented.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::a7f89e5ecf8ad4fca6d55ffab46d92f2 https://doi.org/10.1007/978-3-642-23538-2_45