Loss-Aware Histogram Binning and Principal Component Analysis for Customer Fleet Analytics

Authors: Kunxiong Ling, Jan Thiele, Thomas Setzer
Language: English
Publication year: 2024
Subject:
Source: IEEE Open Journal of Intelligent Transportation Systems, Vol 5, Pp 160-173 (2024)
Document type: article
ISSN: 2687-7813
DOI: 10.1109/OJITS.2024.3366279
Description: We propose a method to estimate information loss when conducting histogram binning and principal component analysis (PCA) sequentially, as is usually done in practice for fleet analytics. Coarser-grained histogram binning results in lower data volume and fewer dimensions, but more information loss. Considering fewer principal components (PCs) likewise results in fewer data dimensions but increased information loss. Although the information loss of each step is well understood, little guidance exists on the overall information loss when conducting both steps sequentially. We use Monte Carlo simulations to regress information loss on the number of bins and PCs, given a few parameters of a dataset related to its scale and correlation structure. A sensitivity study shows that information loss can be approximated well given sufficiently large datasets. Using the number of bins, the number of PCs, and two correlation measures, we derive an empirical loss model with high accuracy. Furthermore, we demonstrate the benefits of estimating information losses and the representativeness of total loss in evaluating the accuracy of k-means clustering for a real-world customer fleet dataset. For preprocessing sensor data that are aggregated from a sufficient number of samples, continuously distributed, and representable by Beta distributions, we recommend not coarsening the histogram binning before PCA.
Database: Directory of Open Access Journals
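
The sequential pipeline described in the abstract (histogram binning followed by PCA) can be illustrated with a minimal Python sketch. This is not the authors' implementation or their loss definition: the abstract does not state how information loss is measured, so the sketch assumes a relative squared reconstruction error against the fine-grained histograms. The function names, the Beta parameter ranges, and the uniform "refine" step that maps coarse bins back to fine bins are all hypothetical choices made for illustration.

```python
import numpy as np


def fleet_histograms(n_vehicles, n_samples, n_bins, rng):
    """Simulate one normalized histogram per vehicle from Beta-distributed data.

    The Beta parameter ranges are an assumption made for this sketch.
    """
    a = rng.uniform(1.0, 5.0, n_vehicles)
    b = rng.uniform(1.0, 5.0, n_vehicles)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    hists = np.empty((n_vehicles, n_bins))
    for i in range(n_vehicles):
        samples = rng.beta(a[i], b[i], n_samples)
        counts, _ = np.histogram(samples, bins=edges)
        hists[i] = counts / n_samples
    return hists


def coarsen(hists, factor):
    """Merge every `factor` adjacent bins (n_bins must be divisible by factor)."""
    n, m = hists.shape
    return hists.reshape(n, m // factor, factor).sum(axis=2)


def refine(coarse, factor):
    """Spread each coarse bin uniformly over its fine bins (assumed inverse)."""
    return np.repeat(coarse / factor, factor, axis=1)


def pca_reconstruct(X, k):
    """Project X onto its first k principal components and map back."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k] + mean


def total_loss(fine, factor, k):
    """Assumed loss: squared error of the binned-then-PCA reconstruction,
    relative to the total variance of the fine-grained histograms."""
    approx = refine(pca_reconstruct(coarsen(fine, factor), k), factor)
    centered = fine - fine.mean(axis=0)
    return np.sum((fine - approx) ** 2) / np.sum(centered ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    fine = fleet_histograms(n_vehicles=200, n_samples=2000, n_bins=32, rng=rng)
    for factor in (1, 2, 4):
        for k in (1, 2, 4):
            print(f"bins={32 // factor:2d} PCs={k} loss={total_loss(fine, factor, k):.4f}")
```

Under these assumptions the error from binning and the error from truncating PCs are orthogonal, so the total loss for a fixed bin count cannot increase as more PCs are kept. Repeating the simulation over many random datasets and regressing the loss on the number of bins and PCs would mirror the Monte Carlo approach the abstract outlines, though the empirical loss model itself is not reproduced here.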