Determining the Semantic Similarity of Texts Using the DKPro Similarity Tool

Year of publication: 2018
Subject:
Source: Компьютерная лингвистика и вычислительные онтологии. :87-97
ISSN: 2541-9781
DOI: 10.17586/2541-9781-2017-1-87-97
Description: This paper addresses the problem of computing the semantic similarity of texts in Russian. In our experiments we use the open-source framework DKPro Similarity and describe its advantages for this purpose. We focus on string-based metrics of text similarity. The experiments are carried out on test samples that include similar extracts from fiction, research, and news texts. For each pair of texts we compute several string-based similarity metrics implemented in DKPro Similarity and pass the resulting values as features to machine learning algorithms. We also present a method for evaluating the relevance of similarity measures for particular purposes. The results show that simple string-based metrics contribute to the performance of linear models in identifying whether texts belong to the same group, with an average F-measure of 0.88 across eight datasets. In future work we plan to use semantic text similarity measures that draw on external knowledge sources, e.g. Wikipedia, and to employ more sophisticated machine learning algorithms to improve performance in some difficult cases.
Database: OpenAIRE
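
The abstract describes computing several string-based similarity metrics for a pair of texts and passing the values as features to a machine learning model. The following is a minimal sketch of that idea only; it is not the paper's code and does not use the DKPro Similarity API (a Java framework). The three metrics and all function names (`ngram_jaccard`, `longest_common_substring_ratio`, `word_overlap`, `similarity_features`) are assumptions chosen for illustration, built on the Python standard library.

```python
from difflib import SequenceMatcher


def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of a text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Jaccard coefficient over character n-gram sets (illustrative metric)."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 1.0  # both texts too short to yield n-grams: treat as identical
    return len(x & y) / len(x | y)


def longest_common_substring_ratio(a, b):
    """Length of the longest common substring, normalized by the shorter text."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


def word_overlap(a, b):
    """Share of whitespace-separated tokens common to both texts."""
    x, y = set(a.lower().split()), set(b.lower().split())
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def similarity_features(a, b):
    """Feature vector for one text pair: one value per string metric."""
    return [
        ngram_jaccard(a, b),
        longest_common_substring_ratio(a, b),
        word_overlap(a, b),
    ]
```

Each text pair yields one such vector; a list of vectors with same-group/different-group labels could then be given to a linear classifier of the reader's choice. Which metrics and which classifier the authors actually used is not stated in this record.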