The automatic generation of thesauri of related words for English, French, German, and Russian

Author: Reinhard Rapp
Year of publication: 2008
Subject:
Source: International Journal of Speech Technology. 11:147-156
ISSN: 1572-8110, 1381-2416
DOI: 10.1007/s10772-009-9043-7
Description: This paper presents a method for automatically extracting words with similar meanings, based on an analysis of word distribution in large monolingual text corpora. The method compiles matrices of word co-occurrences and reduces the dimensionality of the semantic space by means of a singular value decomposition. This reduces problems of data sparseness and achieves a generalization effect that considerably improves the results. The method is largely language independent and has been applied to corpora of English, French, German, and Russian; the resulting thesauri are freely available. The English thesaurus was evaluated by comparing it with experimental results obtained from test persons who were asked to judge word similarities. According to this evaluation, the machine-generated results come close to the performance of native speakers.
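The pipeline outlined in the description (co-occurrence counts, then dimensionality reduction by singular value decomposition, then a similarity lookup) can be illustrated with a minimal sketch. This is not the author's implementation: the toy corpus, the symmetric window of two words, the use of raw counts without any weighting, the choice of k, and the cosine similarity measure are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count how often each word pair occurs within `window` positions (assumed window size)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo = max(0, i - window)
            hi = min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1
    return vocab, counts

def reduce_dimensions(counts, k):
    """Project each word onto the k largest singular directions of the count matrix."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]

def most_similar(word, vocab, vectors, top_n=3):
    """Rank the other words by cosine similarity in the reduced space (assumed measure)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)  # avoid division by zero
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:top_n]

# Toy corpus, invented for illustration only.
sentences = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the cat ate the fish".split(),
    "the dog ate the bone".split(),
]

vocab, counts = cooccurrence_matrix(sentences, window=2)
vectors = reduce_dimensions(counts, k=2)
print(most_similar("cat", vocab, vectors))
```

On a corpus this small the ranking carries little meaning; the point is only the shape of the computation. Keeping just k singular directions is what yields the smoothing of sparse counts that the description refers to as a generalization effect.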
Database: OpenAIRE