Matching bibliographic data from publication lists with large databases using N-Grams : MSI Working Paper

Authors: Abdulhayoglu, Mehmet Ali; Thijs, Bart; Jeuris, Wouter
Language: English
Year of publication: 2014
Subject:
Description: This paper presents a text matching process for identifying scholarly publications, extracted from publication lists provided by authors or research institutes, and assigning them correctly to records in large bibliographic databases such as Thomson Reuters’ Web of Science (WoS). Identification is implemented by means of overlapping 3-grams common to the two sources, and the match with the highest score on the applied cosine measure is retained. Levenshtein similarities based on N-grams are used as a complementary and confirmatory measure of the closeness between a given CV publication and its best retrieved WoS match. The suggested method is shown to have considerable potential for reducing the manual effort needed to determine whether a given publication is indexed in WoS. The similarity scores derived from the Levenshtein measure are consistent with those derived from Salton’s similarity measure. Incorrect matches are examined in depth, and possible thresholds are suggested to reduce the effort of manual cleaning.
Is part of: FEB Research Report MSI_1413
Pages: 29
Status: published
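The matching idea summarised in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the lower-casing and whitespace normalisation, the use of character-level 3-grams, and the normalisation of the Levenshtein distance by the longer n-gram sequence are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n=3):
    """Overlapping character n-grams of a normalised string (assumed normalisation)."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def cosine(a, b, n=3):
    """Cosine (Salton) measure between the n-gram count vectors of a and b."""
    ca, cb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def levenshtein(x, y):
    """Edit distance between two sequences (strings or lists of n-grams)."""
    prev = list(range(len(y) + 1))
    for i, xi in enumerate(x, 1):
        cur = [i]
        for j, yj in enumerate(y, 1):
            # deletion, insertion, substitution (cost 0 when items are equal)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (xi != yj)))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a, b, n=3):
    """1 - distance / longer length, computed over n-gram sequences (assumed form)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    longest = max(len(ga), len(gb))
    return 1.0 - levenshtein(ga, gb) / longest if longest else 1.0


def best_match(cv_entry, candidates, n=3):
    """Pick the candidate with the highest cosine score, then confirm with Levenshtein."""
    best = max(candidates, key=lambda c: cosine(cv_entry, c, n))
    return best, cosine(cv_entry, best, n), levenshtein_similarity(cv_entry, best, n)
```

For example, `best_match(cv_title, wos_titles)` returns the best-scoring candidate together with both similarity scores; a pair that scores low on either measure could then be flagged for manual checking against a threshold chosen by the user.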
Database: OpenAIRE