Towards acquisition of a thematic Persian corpus from the Tebyan Portal: TebCorp

Authors: Ali Cholmaghani, Ali Vahdani, Sayed Nasir Khalifehsoltani, Reza Moallemi
Publication year: 2010
Subject:
Source: 2010 2nd International Conference on Computer Engineering and Technology.
DOI: 10.1109/iccet.2010.5485685
Description: The TebCorp collection is a large thematic collection of modern Persian text, consisting of 500 MB of text from the Tebyan Portal. TebCorp contains more than 93,000 articles in 1,097 topics, with more than 44 million total words and about 550,000 distinct words, which makes it suitable for information retrieval research. In this paper we attempt to exploit the Tebyan Portal, which contains a vast number of prominent Persian articles, as a linguistic resource to build a multipurpose thematic corpus for Persian. We present details on building this corpus, including information retrieval and collection assessment, and conclude with practical information about the corpus.
Database: OpenAIRE