Joining Distributed Database Summaries

Autor: Bechchi, Mounir, Raschia, Guillaume, Mouaddib, Noureddine
Přispěvatelé: Laboratoire d'Informatique de Nantes Atlantique (LINA), Mines Nantes (Mines Nantes)-Université de Nantes (UN)-Centre National de la Recherche Scientifique (CNRS), Complex data management in distributed systems (ATLAS), Université de Nantes (UN)-Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), INRIA
Jazyk: angličtina
Rok vydání: 2008
Předmět:
Zdroj: [Research Report] RR-6768, INRIA. 2008, pp.29
Popis: The database summarization system coined SaintEtiQ provides multi-level summaries of tabular data stored into a centralized database. Summaries are computed online with a conceptual hierarchical clustering algorithm. However, in many companies, data are distributed among several sites, either homogeneously (i.e. , sites contain data for a common set of features) or heterogeneously (i.e. , sites contain data for different features). Consequently, the current centralized version of SaintEtiQ is either not feasible or even not desirable due to privacy or resource issues. In this paper, we propose two new algorithms for summarizing heterogeneously distributed data without a prior "unification" of the data sources: Subspace-Oriented Join Algorithm (SOJA) and Tree Alignement-based Join Algorithm (TAJA). The main idea of such algorithms consists in applying innovative joins on two local models, computed over two disjoint sets of features, to provide a global summary over the full feature set without scanning the raw data. SOJA takes one of the two input trees as the base model and the other one is processed to complete the first one, whereas TAJA rearranges summaries by levels in a top-down manner. Then, we propose a consistent quality measure to quantify how good our joined hierarchies are. Finally, an experimental study, using synthetic data sets, shows that our joining processes (SOJA and TAJA) result in high quality clustering schemas of the entire distributed data and are very efficient in terms of computational time w.r.t. the centralized approach.
Databáze: OpenAIRE