Tailoring data source distributions for fairness-aware data integration
Author: | Fatemeh Nargesian, H. V. Jagadish, Abolfazl Asudeh |
---|---|
Year of publication: | 2021 |
Subject: |
FOS: Computer and information sciences
FOS: Media and communications; Data source; Computer science; General Engineering; Approximation algorithm; Binary number; Computation Theory and Mathematics; Function (mathematics); computer.software_genre; Data set; Distribution (mathematics); Library and Information Studies; Data mining; Representation (mathematics); computer; Information Systems; Data integration |
Source: | Proceedings of the VLDB Endowment. 14:2519-2532 |
ISSN: | 2150-8097 |
Description: | Data scientists often develop data sets for analysis by drawing upon sources of data available to them. A major challenge is to ensure that the data set used for analysis has an appropriate representation of relevant (demographic) groups: it meets desired distribution requirements. Whether data is collected through some experiment or obtained from some data provider, the data from any single source may not meet the desired distribution requirements. Therefore, a union of data from multiple sources is often required. In this paper, we study how to acquire such data in the most cost-effective manner, for typical cost functions observed in practice. We present an optimal solution for binary groups when the underlying distributions of data sources are known and all data sources have equal costs. For the generic case with unequal costs, we design an approximation algorithm that performs well in practice. When the underlying distributions are unknown, we develop an exploration-exploitation-based strategy with a reward function that captures the cost and approximations of group distributions in each data source. Besides theoretical analysis, we conduct comprehensive experiments that confirm the effectiveness of our algorithms. |
Database: | OpenAIRE |
External link: |
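
The known-distribution setting described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it is a simple greedy heuristic chosen for illustration, and the function name `collect`, the `(cost, distribution)` source representation, and the `max_draws` guard are all hypothetical. At each step it queries the source with the highest probability of returning a still-needed group per unit cost, then draws one record at random from that source.

```python
import random


def collect(sources, needed, seed=0, max_draws=100_000):
    """Greedy illustration (an assumption, not the paper's method).

    sources: list of (cost, {group: probability}) with known distributions.
    needed:  {group: required count}.
    Returns (total_cost, number_of_draws).
    """
    rng = random.Random(seed)
    remaining = dict(needed)
    total_cost = 0.0
    draws = 0

    def score(src):
        # Probability that one draw is still useful, per unit cost.
        cost, dist = src
        useful = sum(p for g, p in dist.items() if remaining.get(g, 0) > 0)
        return useful / cost

    while any(count > 0 for count in remaining.values()):
        if draws >= max_draws:
            raise RuntimeError("draw budget exhausted")
        best = max(sources, key=score)
        if score(best) == 0:
            raise ValueError("no source can supply the remaining groups")
        cost, dist = best
        groups = list(dist)
        # Simulate one random record from the chosen source.
        group = rng.choices(groups, weights=[dist[g] for g in groups])[0]
        total_cost += cost
        draws += 1
        if remaining.get(group, 0) > 0:
            remaining[group] -= 1
    return total_cost, draws
```

For example, `collect([(1.0, {"a": 0.9, "b": 0.1}), (1.0, {"a": 0.2, "b": 0.8})], {"a": 5, "b": 5})` draws from the two equal-cost sources until five records of each group have been collected. Draws that return an already-satisfied group still incur cost, which is the waste a cost-aware strategy tries to limit.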