Using character n-grams to match a list of publications to references in bibliographic databases
Author: | Wouter Jeuris, Mehmet Ali Abdulhayoglu, Bart Thijs |
---|---|
Year of publication: | 2016 |
Subject: |
Measure (data warehouse)
Matching (statistics); Information retrieval; Database; Computer science; Process (computing); General Social Sciences; 02 engineering and technology; String searching algorithm; Library and Information Sciences; computer.software_genre; Levenshtein distance; Computer Science Applications; Character (mathematics); Similarity (network science); 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; computer; Word (computer architecture) |
Source: | Scientometrics. 109:1525-1546 |
ISSN: | 1588-2861 0138-9130 |
DOI: | 10.1007/s11192-016-2066-3 |
Description: | For research evaluation, publication lists need to be matched to entries in large bibliographic databases, such as Thomson Reuters Web of Science. This matching process is often done manually, making it very time-consuming. This paper presents the use of character n-grams as an automated indicator to inform and ease the manual matching process. The similarity of two references was identified by calculating Salton's cosine for their common character n-grams. As a complementary and confirmatory measure, Kondrak's Levenshtein distance score, based on the character n-grams, is used to re-measure the similarity of the top matches resulting from Salton's cosine. These automated matches were compared to results from completely manual matching. Incorrect matches were examined in depth and possible solutions suggested. This method was applied to two independent datasets to validate the results and inferences drawn. For both datasets, the Salton's score based on character n-grams proves to be a useful indicator to distinguish between correct and incorrect matches. The suggested method is compared with a baseline based on word unigrams. The accuracies of the character-based and word-based systems are 96.0% and 94.7%, respectively. Despite the small difference in accuracy, we observed that the character-based system provides more correct matches when the data contains abbreviations, mathematical expressions or erroneous text. |
Database: | OpenAIRE |
External link: |
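
The description above mentions scoring two references by Salton's cosine over their character n-grams. The following is a minimal sketch of that idea, not the authors' implementation: the n-gram size of 3, the lower-casing and whitespace normalization, and the function names `char_ngrams`, `salton_cosine` and `best_match` are all assumptions made for illustration. Kondrak's n-gram Levenshtein re-scoring step is not shown.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (n=3 is an assumed default)."""
    # Assumed preprocessing: lower-case and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def salton_cosine(a, b):
    """Cosine of the angle between two n-gram count vectors."""
    # Counter returns 0 for n-grams absent from b, so only shared n-grams contribute.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # strings shorter than n have no n-grams
    return dot / (norm_a * norm_b)


def best_match(reference, candidates, n=3):
    """Return the (candidate, score) pair with the highest cosine score."""
    ref_grams = char_ngrams(reference, n)
    scored = [(c, salton_cosine(ref_grams, char_ngrams(c, n))) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical publication-list entry and database records.
    entry = "Using character n-grams to match publications, Scientometrics 2016"
    records = [
        "Using character n-grams to match a list of publications to references",
        "A study of citation networks in physics",
    ]
    match, score = best_match(entry, records)
    print(match, round(score, 3))
```

Because the score is computed on overlapping character sequences rather than whole words, a misspelled or abbreviated word still shares most of its n-grams with the correct form, which is in line with the description's observation about abbreviations and erroneous text.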