Using character n-grams to match a list of publications to references in bibliographic databases

Authors: Wouter Jeuris, Mehmet Ali Abdulhayoglu, Bart Thijs
Publication year: 2016
Subject:
Source: Scientometrics. 109:1525-1546
ISSN: 1588-2861
0138-9130
DOI: 10.1007/s11192-016-2066-3
Description: For research evaluation, publication lists need to be matched to entries in large bibliographic databases, such as Thomson Reuters Web of Science. This matching is often done manually, which makes it very time-consuming. This paper presents the use of character n-grams as an automated indicator to inform and ease the manual matching process. The similarity of two references is measured by calculating Salton's cosine over their common character n-grams. As a complementary and confirmatory measure, Kondrak's Levenshtein distance score, also based on character n-grams, is used to re-measure the similarity of the top matches returned by Salton's cosine. These automated matches were compared with the results of completely manual matching. Incorrect matches were examined in depth, and possible solutions are suggested. The method was applied to two independent datasets to validate the results and the inferences drawn. For both datasets, the Salton's cosine score based on character n-grams proves to be a useful indicator for distinguishing between correct and incorrect matches. The suggested method is compared with a baseline based on word unigrams. The accuracies of the character-based and word-based systems are 96.0 % and 94.7 %, respectively. Although the difference in accuracy is small, the character-based system provides more correct matches when the data contain abbreviations, mathematical expressions or erroneous text.
Database: OpenAIRE
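
The first step in the description, scoring two reference strings by Salton's cosine over their character n-grams, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the n-gram length (n = 3), the lowercasing and whitespace normalization, and the function names char_ngrams, salton_cosine and best_match are all assumptions made for illustration. The confirmatory Kondrak n-gram distance step is not shown.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumed preprocessing:
    lowercase the text and collapse runs of whitespace)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def salton_cosine(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Only n-grams present in both strings contribute to the dot product.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a string shorter than n has no n-grams
    return dot / (norm_a * norm_b)


def best_match(query, candidates, n=3):
    """Return the candidate reference with the highest cosine score."""
    return max(candidates, key=lambda c: salton_cosine(query, c, n))
```

In such a sketch, an abbreviated or misspelled reference still shares most of its n-grams with the correct database entry, which is consistent with the advantage over word unigrams that the description reports for abbreviations and erroneous text.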