A Framework for Statistically-Sound Customer Segment Search Authors' Copy

Authors: Amer-Yahia, Sihem; Berti-Equille, Laure; Chibah, Abdelouahab
Contributors: Laboratoire d'Informatique de Grenoble (LIG), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Université Grenoble Alpes (UGA), UMR 228 Espace-Dev, Espace pour le développement, Institut de Recherche pour le Développement (IRD)-Université de Perpignan Via Domitia (UPVD)-Avignon Université (AU)-Université de La Réunion (UR)-Université de Montpellier (UM)-Université de Guyane (UG)-Université des Antilles (UA), ANR-19-P3IA-0003, MIAI, MIAI @ Grenoble Alpes (2019), Université de Guyane (UG)-Université des Antilles (UA)-Institut de Recherche pour le Développement (IRD)-Université de Perpignan Via Domitia (UPVD)-Avignon Université (AU)-Université de La Réunion (UR)-Université de Montpellier (UM)
Language: English
Year of publication: 2021
Subject:
Source: The 8th IEEE International Conference on Data Science and Advanced Analytics, Oct 2021, Porto (virtual), Portugal. ⟨10.1109/DSAA53316.2021.9564199⟩
DOI: 10.1109/DSAA53316.2021.9564199
Description: International audience; We develop S4, a Statistically-Sound Segment Search framework that combines principled data partitioning and sound statistical testing to verify common hypotheses in retail data and return interpretable customer data segments. Our framework accommodates one-sample, two-sample, and multiple-sample testing to provide various aggregations and comparisons of customer transactions. To control the proportion of false discoveries in multiple hypothesis testing, we enforce an FDR-controlling procedure and formulate a unified optimization problem that returns customer data segments that satisfy the test for a given significance level, maximize coverage of the input data, and are within a risk capital. We develop a greedy algorithm to explore different data partitions and test multiple hypotheses in a sound manner. Our extensive experiments on four retail data sets examine the interaction between significance, risk, and coverage, and demonstrate the expressivity, usefulness, and scalability of S4 in practice.
Database: OpenAIRE