Scalable Feature Matching Across Large Data Collections

Author:	David Degras
Year of publication:	2021
Subject:	Statistics and Probability; Methodology (stat.ME); FOS: Computer and information sciences; Discrete Mathematics and Combinatorics; Statistics, Probability and Uncertainty; Statistics - Computation; Statistics - Methodology; Computation (stat.CO)
DOI:	10.48550/arxiv.2101.02035
Description:	This article is concerned with matching feature vectors in a one-to-one fashion across large collections of datasets. Formulating this task as a multidimensional assignment problem with decomposable costs (MDADC), we develop fast algorithms with time complexity roughly linear in the number n of datasets and space complexity a small fraction of the data size. These remarkable properties hinge on using the squared Euclidean distance as the dissimilarity function, which can reduce $\binom{n}{2}$ matching problems between pairs of datasets to n problems and enables calculating assignment costs on the fly. To our knowledge, no other method applicable to the MDADC possesses the linear scaling and low-storage properties necessary for large-scale applications. In numerical experiments, the novel algorithms outperform competing methods and show excellent computational and optimization performance. An application of feature matching to a large neuroimaging database is presented. The algorithms of this article are implemented in the R package matchFeat, available at github.com/ddegras/matchFeat. Supplementary materials for this article are available online.
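The reduction from pairwise problems to n problems can be illustrated with a small sketch. It is not the matchFeat implementation; the function names (`total_cost`, `update_one`, `match`) and the toy data are assumptions made for illustration. The idea relies on expanding the squared Euclidean distance: when one dataset is re-matched with all others held fixed, the squared norms of its vectors sum to a constant, so minimizing the total cost is equivalent to maximizing inner products with the running sum of the vectors assigned to each label in the other datasets. That is a single assignment problem per dataset. The brute-force search over permutations is used only to keep the sketch dependency-free and is suitable only for a handful of features; a linear assignment solver would replace it in practice.

```python
from itertools import permutations

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def total_cost(data, sigma):
    """Sum of squared distances over all pairs of datasets, label by label."""
    n, m = len(data), len(data[0])
    return sum(
        sq_dist(data[i][sigma[i][k]], data[j][sigma[j][k]])
        for k in range(m)
        for i in range(n)
        for j in range(i + 1, n)
    )

def update_one(data, sigma, i):
    """Re-match dataset i against the per-label sums of all other datasets."""
    n, m, p = len(data), len(data[0]), len(data[0][0])
    # sums[k] = sum over j != i of the vector of dataset j assigned to label k.
    sums = [
        [sum(data[j][sigma[j][k]][d] for j in range(n) if j != i) for d in range(p)]
        for k in range(m)
    ]
    # One assignment problem (brute force here, assumption: m is tiny).
    best = max(
        permutations(range(m)),
        key=lambda perm: sum(dot(data[i][perm[k]], sums[k]) for k in range(m)),
    )
    sigma[i] = list(best)

def match(data, sweeps=5):
    """Cycle through the datasets, updating one matching at a time."""
    n, m = len(data), len(data[0])
    sigma = [list(range(m)) for _ in range(n)]  # sigma[i][k]: row of dataset i given label k
    for _ in range(sweeps):
        for i in range(n):
            update_one(data, sigma, i)
    return sigma

# Toy data: 3 datasets, 2 features each in 2D; dataset 1 lists its features in swapped order.
data = [
    [[0.0, 0.0], [5.0, 5.0]],
    [[5.1, 4.9], [0.1, -0.1]],
    [[-0.2, 0.1], [4.8, 5.2]],
]
sigma = match(data)
```

Each call to `update_one` considers the current matching among its candidates, so the total cost never increases from one update to the next. The sketch recomputes the per-label sums from scratch for clarity; keeping them as running totals that are adjusted after each update would avoid touching every dataset at every step.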
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::25133db9411c3de93cdb28d80fcee8fd