Generalizing Correspondence Analysis for Applications in Machine Learning
Author: | Hsiang Hsu, Flavio P. Calmon, Salman Salamatian |
---|---|
Publication year: | 2022 |
Subject: |
FOS: Computer and information sciences
Computer Science - Machine Learning; Current (mathematics); Scale (ratio); Computer science; Computer Science - Information Theory; Boundary (topology); Machine Learning (stat.ML); Correspondence analysis; Machine Learning (cs.LG); Machine Learning; Statistics - Machine Learning; Artificial Intelligence; Leverage (statistics); business.industry; Information Theory (cs.IT); Applied Mathematics; Principal (computer security); Visualization; Computational Theory and Mathematics; Neural Networks, Computer; Computer Vision and Pattern Recognition; Artificial intelligence; business; Random variable; Algorithm; Algorithms; Software |
Source: | IEEE Transactions on Pattern Analysis and Machine Intelligence. 44:9347-9362 |
ISSN: | 1939-3539 0162-8828 |
DOI: | 10.1109/tpami.2021.3127870 |
Description: | Correspondence analysis (CA) is a multivariate statistical tool used to visualize and interpret data dependencies by finding maximally correlated embeddings of pairs of random variables. CA has found applications in fields ranging from epidemiology to the social sciences; however, current methods do not scale to large, high-dimensional datasets. In this paper, we provide a novel interpretation of CA in terms of an information-theoretic quantity called the principal inertia components. We show that estimating the principal inertia components, which amounts to solving a functional optimization problem over the space of finite-variance functions of two random variables, is equivalent to performing CA. We then leverage this insight to design novel algorithms that perform CA at an unprecedented scale. In particular, we demonstrate how the principal inertia components can be reliably approximated from data using deep neural networks. Finally, we show how these maximally correlated embeddings of pairs of random variables in CA play a central role in several learning problems, including visualizing classification boundaries and the training process, and how they underlie recent multi-view and multi-modal learning methods. 30 pages, 7 figures, 6 tables. arXiv admin note: text overlap with arXiv:1902.07828 |
Database: | OpenAIRE |
External link: |
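
As a rough illustration of the quantities named in the description, the following is a minimal sketch of classical, matrix-based CA on a small contingency table. It is not the neural-network estimator the paper proposes. It relies on the standard fact that, for discrete variables, the principal inertia components are the squared singular values of the standardized residual matrix. The function names (`standardized_residuals`, `total_inertia`, `top_principal_inertia`) are hypothetical, and the sketch assumes every row and column of the table has a nonzero total.

```python
import math


def standardized_residuals(counts):
    """Return B = D_r^{-1/2} (P - r c^T) D_c^{-1/2} for a table of counts.

    Assumes every row and column total is nonzero.
    """
    total = float(sum(sum(row) for row in counts))
    p = [[x / total for x in row] for row in counts]  # joint distribution
    r = [sum(row) for row in p]                       # row marginals
    c = [sum(col) for col in zip(*p)]                 # column marginals
    return [
        [(p[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(len(c))]
        for i in range(len(r))
    ]


def total_inertia(counts):
    """Sum of all principal inertias (squared Frobenius norm of B)."""
    b = standardized_residuals(counts)
    return sum(x * x for row in b for x in row)


def top_principal_inertia(counts, iters=500):
    """Largest squared singular value of B, by power iteration on B^T B."""
    b = standardized_residuals(counts)
    n_rows, n_cols = len(b), len(b[0])
    # Arbitrary non-constant start vector, normalized to unit length.
    v = [1.0 + j for j in range(n_cols)]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iters):
        u = [sum(b[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(b[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # B is exactly zero: the variables are independent
            return 0.0
        v = [x / norm for x in w]
    # v has unit length, so ||B v||^2 is the Rayleigh quotient of B^T B at v.
    u = [sum(b[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return sum(x * x for x in u)
```

For the 2x2 table `[[30, 10], [10, 30]]`, `top_principal_inertia` and `total_inertia` should both come out close to 0.25, since a 2x2 table has only one nontrivial component. Forming and decomposing this matrix explicitly is the step that becomes impractical for large, high-dimensional data, which is the scaling limitation the description refers to.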