Towards acquisition of a thematic Persian corpus from the Tebyan Portal: TebCorp

Authors:	Ali Cholmaghani, Ali Vahdani, Sayed Nasir Khalifehsoltani, Reza Moallemi
Year of publication:	2010
Subject:	Information retrieval; Exploit; Computer science; Pragmatics; Information extraction; Resource (project management); Thematic map; Web mining; Artificial intelligence; Natural language; Natural language processing; Persian
Source:	2010 2nd International Conference on Computer Engineering and Technology.
DOI:	10.1109/iccet.2010.5485685
Description:	The TebCorp collection is a large thematic collection of modern Persian text, consisting of 500 MB of text from the Tebyan Portal. TebCorp contains more than 93,000 articles on 1,097 topics, with more than 44 million total words and about 550,000 distinct words, which makes it suitable for information retrieval research. In this paper we exploit the Tebyan Portal, which contains a vast number of prominent Persian articles, as a linguistic resource for building a multipurpose thematic corpus for Persian. We present details of building this corpus, including information retrieval and collection assessment, and conclude with practical information about the corpus.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::97b9c3f600c634f6afe15c7bb3c2f327 https://doi.org/10.1109/iccet.2010.5485685