Discovering Approximate Functional Dependencies using Smoothed Mutual Information
Autor: | Frédéric Pennerath, Panagiotis Mandros, Jilles Vreeken |
---|---|
Přispěvatelé: | Knowledge representation, reasonning (ORPAILLEUR), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), CentraleSupélec, Max Planck Institute for Informatics [Saarbrücken], Helmholtz Center for Information Security [Saarbrücken] (CISPA) |
Jazyk: | angličtina |
Rok vydání: | 2020 |
Předmět: |
Clustering high-dimensional data
Computer science Estimator 02 engineering and technology Mutual information Expected value functional dependencies [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] [STAT.ML]Statistics [stat]/Machine Learning [stat.ML] 020204 information systems Prior probability 0202 electrical engineering electronic engineering information engineering Search problem 020201 artificial intelligence & image processing Additive smoothing Functional dependency Null hypothesis mutual information Algorithm Smoothing Laplace smoothing |
Zdroj: | KDD 2020-26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining KDD 2020-26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Aug 2020, San Diego / Virtual, United States. pp.1254--1264 KDD |
Popis: | International audience; We consider the task of discovering the top-K reliable approximate functional dependencies X → Y from high dimensional data. While naively maximizing mutual information involving high dimensional entropies over empirical data is subject to false discoveries, correcting the empirical estimator against data sparsity can lead to efficient exact algorithms for robust dependency discovery. Previous approaches focused on correcting by subtracting expected values of different null hypothesis models. In this paper, we consider a different correction strategy and counter data sparsity using uniform priors and smoothing techniques, that leads to an efficient and robust estimating process. In addition, we derive an admissible and tight bounding function for the smoothed estimator that allows us to efficiently solve via branch-and-bound the hard search problem for the top-K dependencies. Our experiments show that our approach is much faster than previous proposals, and leads to the discovery of sparse and informative functional dependencies. CCS CONCEPTS • Information systems → Data mining. |
Databáze: | OpenAIRE |
Externí odkaz: |