Recognition of irrelevant phrases in automatically extracted lists of domain terms

Authors: Piotr Rychlik, Malgorzata Marciniak, Agnieszka Mykowiecka
Year of publication: 2018
Subject:
Source: Computational terminology and filtering of terminological information. 24:66-90
ISSN: 1569-9994, 0929-9971
Description: In our paper, we address the problem of recognizing irrelevant phrases in terminology lists obtained with an automatic term extraction tool. We focus on identifying multi-word phrases that are general terms or discourse expressions. We defined several methods based on a comparison of domain corpora, and one method based on the contexts of phrases identified in a large corpus of general language. The methods were tested on Polish data, using six domain corpora and one general corpus. Two test sets were prepared to evaluate the methods. The first consisted of many presumably irrelevant phrases, as we selected phrases that occurred in at least three domain corpora. The second consisted mainly of domain terms, as it was composed of the top-ranked phrases automatically extracted from the analyzed domain corpora. The results show that the task is quite hard, as the inter-annotator agreement is low. Several of the tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method, with a precision of about 0.75 on half of the tested list, was the context-based method using a modified contextual diversity coefficient. Although the methods were tested on Polish, they seem to be language independent.
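The selection rule behind the first test set (phrases occurring in at least three domain corpora) can be illustrated with a minimal sketch. It assumes each domain corpus has already been reduced to a list of extracted phrases. The function name, the corpus names, and the English toy phrases are hypothetical and are not taken from the paper, which worked with Polish data.

```python
from collections import Counter


def phrases_in_min_corpora(corpora, min_corpora=3):
    """Return phrases that occur in at least `min_corpora` of the given corpora.

    `corpora` maps a corpus name to an iterable of phrases extracted from it.
    """
    counts = Counter()
    for phrases in corpora.values():
        # Use a set so each phrase is counted once per corpus,
        # no matter how often it appears inside that corpus.
        counts.update(set(phrases))
    return {phrase for phrase, n in counts.items() if n >= min_corpora}


# Hypothetical toy data: four small "domain corpora".
corpora = {
    "medicine": ["in this case", "blood pressure", "on the other hand"],
    "law": ["in this case", "civil code", "on the other hand"],
    "economics": ["in this case", "interest rate", "on the other hand"],
    "music": ["in this case", "string quartet"],
}

candidates = phrases_in_min_corpora(corpora)
print(sorted(candidates))
```

Phrases shared across many unrelated domains are only candidates for being irrelevant; as the description notes, deciding whether they really are general terms or discourse expressions is hard even for human annotators.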
Database: OpenAIRE