Scalable Combinatorial Algorithms for Optimal Transport Based Similarity Metrics
Autor: | Zhang, Kaiyi |
---|---|
Jazyk: | angličtina |
Rok vydání: | 2024 |
Předmět: | |
Druh dokumentu: | Dissertation |
Popis: | Optimal Transport (OT), also known as Wasserstein distance, is a valuable metric for comparing probability distributions. Owing to its appealing statistical properties, researchers in various fields, such as machine learning, use OT within applications. However, computing both exact and approximate OT is computationally expensive and impractical for large datasets. Furthermore, OT is sensitive to small noise in the input distributions. In this document, we propose to use combinatorial methods to design scalable and noise-resistant solutions for OT. We present four key contributions in this work. First, we introduce a novel combinatorial parallel algorithm for approximating OT, which achieves a parallel time complexity of $O(log n/varepsilon^2)$, where $n$ is the input size and $varepsilon$ is the addtitive error, Our algorithm outperforms the state-of-the-art in experiments. Second, we propose a new concept, OT-profile, representing the function of minimum partial optimal transport cost $sigma_alpha$ versus the transported mass $alpha$. This can be used to identify outliers in real-world data. The utility of OT-profile is demonstrated in outlier detection and PU-learning jobs and outperforms the state-of-the-art. Third, building upon the OT-profile, we propose a new OT-based metric for comparing distributions that is more robust to noise. This metric preserves desirable properties while reducing its sensitivity to noise for high $p$ values, providing a robust solution for real-world datasets. Lastly, we have developed a Python library that integrates our algorithms and methods into a user-friendly framework, making it easier for practitioners to adopt our methods. Our work enhances the computational efficiency and robustness of OT, making it practical for machine learning applicaitons. Doctor of Philosophy Imagine a task in which you are required to fill up a set of holes with piles of sand using the least amount of effort. The idea behind this is Optimal Transport (OT), also known as the Wasserstein distance. This distance is a popular math tool that measures the similarity (or distance) between two probability distributions. This tool has many applications in different areas, including machine learning and data analysis. For example, it is employed in generative models to measure the difference between generated and real images, or in analyzing real-word datasets containing noise. However, the practice of OT faces two major problems. First, the computation becomes expensive with larger datasets, making it infeasible for large-scale applications. Therefore, it is important to make the computation of OT more efficient in processing a large volume of data. Second, OT is sensitive to outliers or noise in the input distributions, which can significantly influence the results. In this thesis, we try to address these challenges through four main contributions: First, we present novel parallel combinatorial algorithms that reduce the OT computational burden. Our approach achieves a state-of-the-art time complexity and, at the same time, is easy to implement. Second, we introduce an innovative method for computing an 'OT-profile', which enables the to identify the outliers from input distributions. This approach improves the robustness of OT in real-world scenarios where data always contains noise and perturbation. Third, building on the OT-profile concept, we propose a new robust OT-based metric for comparing distributions in the presence of noise. To facilitate the impact of these advancements, we have open-sourced a Python library that implements our methods in a user-friendly interface, making it easier for researchers and practitioners to apply our solutions to their data analysis tasks. This work not only advances the theoretical analysis of OT but also contributes to future practical applications. By addressing the two challenges in scalability and robustness, we made the application of OT more robust and easier with large-scale data. |
Databáze: | Networked Digital Library of Theses & Dissertations |
Externí odkaz: |