Finding Multiword Term Candidates in Croatian

Author: Tadić, Marko; Šojat, Krešimir
Language: English
Year of publication: 2003
Subject:
Description: The paper presents research on the statistical processing of a corpus of Croatian texts, with the primary aim of finding statistically significant co-occurrences of n-grams of tokens (digrams, trigrams and tetragrams). The collocations found with this method form a list of candidates for multiword terminological units, which is submitted to terminologists for further processing, i.e. manual selection of the “real terms”. The statistical measure of co-occurrence used is mutual information (MI3), accompanied by linguistic filters: stop-words and POS. The results on non-lemmatized material of a highly inflected language such as Croatian show that the MI measure alone is not sufficient to find a satisfactory number of multiword term candidates. In this case, the use of absolute frequency combined with linguistic filtering techniques gives a broader list of candidates for real terms.
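As a rough illustration of the kind of scoring the abstract describes, the following minimal Python sketch ranks digrams by cubic mutual information after a stop-word filter. It is not the authors' implementation. The function name `mi3_bigrams`, the toy token list, and the stop-word set are hypothetical, and the formula used, log2(f(x,y)^3 * N / (f(x) * f(y))), is one common formulation of MI3 assumed here; other variants differ only by a constant for a fixed corpus, so the ranking is unchanged. The POS filter mentioned in the abstract is omitted.

```python
import math
from collections import Counter


def mi3_bigrams(tokens, stop_words=frozenset(), min_freq=1):
    """Rank adjacent token pairs (digrams) by cubic mutual information.

    Assumed formulation: MI3 = log2(f(x,y)^3 * N / (f(x) * f(y))),
    where N is the number of tokens in the corpus.
    """
    n = len(tokens)
    unigram_freq = Counter(tokens)
    bigram_freq = Counter(zip(tokens, tokens[1:]))

    scores = {}
    for (x, y), f_xy in bigram_freq.items():
        # Frequency threshold: skip digrams rarer than min_freq.
        if f_xy < min_freq:
            continue
        # Stop-word filter: drop candidates containing a stop word.
        if x in stop_words or y in stop_words:
            continue
        scores[(x, y)] = math.log2(
            f_xy ** 3 * n / (unigram_freq[x] * unigram_freq[y])
        )

    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy input (non-lemmatized word forms), for illustration only.
tokens = ["strojno", "prevođenje", "je", "strojno", "prevođenje",
          "i", "strojno", "učenje"]
ranked = mi3_bigrams(tokens, stop_words={"je", "i"})
for pair, score in ranked:
    print(pair, round(score, 3))
```

Because the counts are taken over surface word forms, each inflected variant of a term is counted separately. That fragmentation is consistent with the abstract's observation that MI alone yields too few candidates on non-lemmatized Croatian text.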
Database: OpenAIRE