Towards acquisition of a thematic Persian corpus from the Tebyan Portal: TebCorp
Author: | Ali Cholmaghani, Ali Vahdani, Sayed Nasir Khalifehsoltani, Reza Moallemi |
---|---|
Year of publication: | 2010 |
Subject: | Information retrieval; Exploit; Computer science; Pragmatics; Information extraction; Resource (project management); Thematic map; Web mining; Artificial intelligence; Natural language; Natural language processing; Persian |
Source: | 2010 2nd International Conference on Computer Engineering and Technology. |
DOI: | 10.1109/iccet.2010.5485685 |
Description: | The TebCorp collection is a large thematic corpus of modern Persian text, consisting of 500 MB of text from the Tebyan Portal. TebCorp contains more than 93,000 articles on 1097 topics, with more than 44 million words in total and about 550,000 distinct words, which makes it suitable for information retrieval research. In this paper we attempt to exploit the Tebyan Portal, which contains a vast number of prominent Persian articles, as a linguistic resource for building a multipurpose thematic corpus for Persian. We present details on building this corpus, including information retrieval and collection assessment, and conclude with practical information about the corpus. |
Database: | OpenAIRE |
External link: |