An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data

Author:	Wanqing Wu, Wenyu Mao
Language:	English
Year of publication:	2022
Subject:	data mining; functional dependency; distributed computing; big data; Chemical technology; TP1-1185
Source:	Sensors, Vol 22, Iss 10, p 3856 (2022)
Document type:	article
ISSN:	1424-8220
DOI:	10.3390/s22103856
Description:	A crucial step in improving data quality is discovering semantic relationships in the data. Functional dependencies are rules that describe such relationships in relational databases, and they have recently been applied to improve data quality. However, traditional functional dependency discovery algorithms, when applied to distributed data, may produce incorrect results and fail to scale to large datasets. To address these problems, we propose a novel distributed functional dependency discovery algorithm based on Apache Spark that can effectively discover functional dependencies in large-scale data. The basic idea is to redistribute the data so that functional dependencies can be discovered in parallel on multiple nodes. The algorithm uses sampling to quickly remove invalid functional dependencies and a greedy task assignment strategy to balance the load. In addition, a prefix tree stores intermediate results during validation to avoid recomputing equivalence classes. Experimental results on real and synthetic datasets show that the proposed algorithm is more efficient than existing methods while ensuring accuracy.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/1cf20cb9266049139103a41e9b5958a0
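The description mentions three ideas that a small example may make more concrete: validating a functional dependency through equivalence classes, pruning candidates on a sample, and greedy load balancing. The following is a minimal single-machine sketch in Python and is not the authors' Spark implementation. All function names (`partition`, `holds`, `violated_on_sample`, `greedy_assign`) and the cost model are assumptions made for illustration, and the greedy rule shown (largest task first, onto the least-loaded worker) is one common strategy that may differ from the one in the paper.

```python
import heapq
import random
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())


def holds(rows, lhs, rhs):
    """lhs -> rhs holds iff every equivalence class of lhs agrees on rhs."""
    for cls in partition(rows, lhs):
        if len({rows[i][rhs] for i in cls}) > 1:
            return False
    return True


def violated_on_sample(rows, lhs, rhs, k, seed=0):
    """Cheap pre-check (hypothetical helper): a dependency that fails on a
    sample also fails on the full data, so it can be discarded early.
    Passing the sample proves nothing; full validation is still required."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(k, len(rows)))
    return not holds(sample, lhs, rhs)


def greedy_assign(task_costs, n_workers):
    """Assign tasks in decreasing cost order to the least-loaded worker.
    task_costs maps a task name to an estimated cost (assumed to be known)."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    assignment = {w: [] for w in range(n_workers)}
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    return assignment


if __name__ == "__main__":
    rows = [
        {"zip": "10001", "city": "New York", "name": "Ann"},
        {"zip": "10001", "city": "New York", "name": "Bob"},
        {"zip": "94105", "city": "San Francisco", "name": "Cy"},
    ]
    print(holds(rows, ["zip"], "city"))   # zip determines city in this toy table
    print(holds(rows, ["city"], "name"))  # city does not determine name
    print(greedy_assign({"t1": 5, "t2": 3, "t3": 2, "t4": 2}, 2))
```

In a distributed setting, the redistribution step would presumably ensure that all rows sharing the same left-hand-side values reach the same node, so that a per-class check like the one in `holds` can run locally; that part is not modeled here.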