Fast and Precise Convolutional Jaro and Jaro-Winkler Similarity

Autor:	OndЕ™ej Rozinek
Jazyk:	angličtina
Rok vydání:	2024
Předmět:	jaro jaro-winkler approximate string matching string similarity convolution Telecommunication TK5101-6720
Zdroj:	Proceedings of the XXth Conference of Open Innovations Association FRUCT, Vol 35, Iss 1, Pp 613-https://youtu.be/fIl3ebeEg7M (2024)
Druh dokumentu:	article
ISSN:	2305-7254 2343-0737 10516360
DOI:	10.23919/FRUCT61870.2024.10516360
Popis:	In the domain of character-based approximate string matching, edit distances such as Levenshtein have remained predominant despite their quadratic time complexity. This reality has prompted the adoption of more efficient metrics like Jaro and Jaro-Winkler. However, these methods often overlook the significance of character order within the matching window, which can adversely affect accuracy. For the first time, we introduce a novel class of character-based approximate string matching algorithms that leverage a convolutional kernel, surpassing the performance of existing state-of-the-art unsupervised character-based approximate string matching algorithms. This paper presents Convolutional Jaro (ConvJ) and Convolutional Jaro-Winkler (ConvJW), innovative similarity metrics designed to overcome these shortcomings. ConvJ and ConvJW utilize a convolutional approach with Gaussian weighting to effectively capture the positional proximity of matching characters, resulting in a more precise similarity evaluation. This method not only achieves computational efficiency comparable to that of Jaro and Jaro-Winkler but also surpasses the state-of-the-art in terms of F1-score, demonstrating faster execution times compared to the conventional Jaro and Jaro-Winkler implementations across various datasets. Our extensive experimental analysis highlights the exceptional performance of ConvJ and ConvJW across a range of datasets. Remarkably, ConvJ exhibits a 7x faster execution time than the fast Jaro implementation and exceeds the state-of-the-art F1-score by a significant margin of 10% more than Jaro. By setting a new benchmark in unsupervised character-based approximate string matching, our research shows the new way for future exploration and development in this field. The ConvJ and ConvJW algorithms, characterized by their quasilinear time complexity and improved accuracy, provide a solid foundation for the advancement of string matching techniques. These developments hold promise for a broad spectrum of applications in data mining, bioinformatics, and related areas.
Databáze:	Directory of Open Access Journals
Externí odkaz:	https://doaj.org/article/12dce6ff9dd14b7296382e39318988aa Zobrazit plný text záznamu View record in DOAJ