The automatic generation of thesauri of related words for English, French, German, and Russian
Author: | Reinhard Rapp |
---|---|
Publication year: | 2008 |
Subject: |
Text corpus; Linguistics and Language; Generalization; business.industry; Computer science; First language; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; Thesaurus; computer.software_genre; Language and Linguistics; language.human_language; Human-Computer Interaction; German; Word lists by frequency; Corpus linguistics; language; Computer Vision and Pattern Recognition; Artificial intelligence; business; computer; Software; Word (computer architecture); Natural language processing |
Source: | International Journal of Speech Technology. 11:147-156 |
ISSN: | 1572-8110 1381-2416 |
DOI: | 10.1007/s10772-009-9043-7 |
Description: | A method for the automatic extraction of words with similar meanings is presented, based on the analysis of word distribution in large monolingual text corpora. It involves compiling matrices of word co-occurrences and reducing the dimensionality of the semantic space by means of a singular value decomposition. This reduces problems of data sparseness and achieves a generalization effect that considerably improves the results. The method is largely language independent and has been applied to corpora of English, French, German, and Russian; the resulting thesauri are freely available. The English thesaurus has been evaluated by comparing it with experimental results obtained from test persons who were asked to judge word similarities. According to this evaluation, the machine-generated results come close to native speakers' performance. |
Database: | OpenAIRE |
External link: |
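
The pipeline summarized in the description (co-occurrence counts, dimensionality reduction by singular value decomposition, then comparison of the reduced word vectors) can be illustrated with a minimal sketch. This is not the author's implementation: the toy corpus, the window size, the use of raw counts without any association weighting, the number of retained dimensions `k`, the cosine similarity measure, and all function names are assumptions made for illustration only.

```python
import numpy as np


def cooccurrence_matrix(sentences, window=2):
    """Count how often each word pair occurs within `window` tokens (assumed setting)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    return vocab, counts


def reduce_dimensions(counts, k):
    """Truncated SVD: keep only the k largest singular dimensions."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]


def most_similar(word, vocab, vectors, top_n=3):
    """Rank all other words by cosine similarity in the reduced space."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = vectors / norms[:, None]
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:top_n]


# Hypothetical toy corpus; a real thesaurus needs a large monolingual corpus.
toy_corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases mice".split(),
    "the dog chases birds".split(),
]

vocab, counts = cooccurrence_matrix(toy_corpus)
vectors = reduce_dimensions(counts, k=2)
neighbours = most_similar("cat", vocab, vectors)
print(neighbours)
```

On a corpus this small the ranking carries no linguistic weight; the sketch only shows how the three steps fit together. The generalization effect mentioned in the description comes from the truncation step, which lets words that never co-occur directly still end up with similar vectors.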