Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus
Authors: | Bart Thijs, Mehmet Ali Abdulhayoglu |
---|---|
Year of publication: | 2017 |
Subject: | Matching (statistics); Information retrieval; business.industry; Computer science; 05 social sciences; Hash function; Scopus; General Social Sciences; Subject (documents); Context (language use); Library and Information Sciences; 050905 science studies; Computer Science Applications; Locality-sensitive hashing; Analytics; 0509 other social sciences; 050904 information & library sciences; business; Heuristics; Algorithm |
Source: | Scientometrics. 116:1229-1245 |
ISSN: | 1588-2861 0138-9130 |
DOI: | 10.1007/s11192-017-2569-6 |
Description: | A novel hashing algorithm is applied to match two prominent bibliographic databases at the paper level. Such matching has been studied and carried out many times in the literature, but only on the basis of journal information, owing to the massive volume of indexed publications. With a paper-level match, missing or erroneous items in one source can be completed from the other, and the overlap can be measured more reliably. In this context, we measure the overlap between Clarivate Analytics Web of Science (WoS) and Elsevier’s Scopus at the paper level. We aim to detect exact matches only; that is, no false positives are tolerated. To this end, we follow a two-step matching procedure. First, a locality sensitive hashing algorithm, which provides fast approximate nearest neighbours and similarities, is applied to obtain WoS-Scopus pair suggestions. Second, for each suggested pair, different heuristics are applied to identify those pairs of records that indeed refer to the same publication. We observe that at least 74% of WoS publications are also indexed by Scopus. The percentage increases to 92% when only cited publications are retained. The overlapping WoS records are also broken down by Institute for Scientific Information subject categories (SCs). Of those, three large SCs with relatively low overlap ratios are chosen and examined in detail. Finally, matching 14.2 million against 19.6 million publications from the publication years 2004–2013 takes only about an hour in a high-performance computing environment. |
Database: | OpenAIRE |
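
The two-step procedure described in the abstract (LSH to suggest WoS-Scopus pairs, then heuristics to confirm them) can be illustrated with a minimal sketch. The abstract does not specify the hashing scheme or the heuristics, so everything below is an assumption made for illustration: MinHash signatures over character 3-grams of the title, a banding scheme with 8 bands of 4 rows, and a single hypothetical confirmation rule (identical normalized title and publication year). The record layout and all function names are likewise hypothetical.

```python
import hashlib
import re
from collections import defaultdict

NUM_HASHES = 32              # signature length (assumed, not from the paper)
BANDS = 8                    # number of LSH bands (assumed)
ROWS = NUM_HASHES // BANDS   # rows per band


def normalize(title):
    """Lowercase, replace non-alphanumerics with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return cleaned.strip()


def shingles(title, k=3):
    """Set of character k-grams of the normalized title."""
    text = normalize(title)
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed acts as one MinHash permutation."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(title):
    """MinHash signature: the minimum hash of the shingle set per seed."""
    grams = shingles(title)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_HASHES))


def bands(signature):
    """Split a signature into (band index, band slice) bucket keys."""
    return [(b, signature[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]


def candidate_pairs(wos, scopus):
    """Step 1: records sharing at least one band bucket become pair suggestions."""
    buckets = defaultdict(list)
    for scopus_id, rec in scopus.items():
        for key in bands(minhash(rec["title"])):
            buckets[key].append(scopus_id)
    pairs = set()
    for wos_id, rec in wos.items():
        for key in bands(minhash(rec["title"])):
            for scopus_id in buckets.get(key, ()):
                pairs.add((wos_id, scopus_id))
    return pairs


def is_same_publication(a, b):
    """Step 2 (hypothetical heuristic): same normalized title and same year."""
    return normalize(a["title"]) == normalize(b["title"]) and a["year"] == b["year"]


def match(wos, scopus):
    """Keep only the suggested pairs that pass the confirmation heuristic."""
    return sorted(
        (w, s)
        for w, s in candidate_pairs(wos, scopus)
        if is_same_publication(wos[w], scopus[s])
    )
```

Calling `match(wos, scopus)` on two dictionaries of the form `{record_id: {"title": ..., "year": ...}}` returns the confirmed ID pairs. The design mirrors the trade-off stated in the abstract: the LSH step may suggest pairs that are only similar, so the strict second step is what keeps false positives out of the final result.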