Scalable Iterative Classification for Sanitizing Large-Scale Datasets
Author: | Bradley A. Malin, Bo Li, Yevgeniy Vorobeychik, Muqun Li |
---|---|
Language: | English |
Year of publication: | 2016 |
Subject: |
game theory; Ubiquitous computing; Computer science; 02 engineering and technology; computer.software_genre; Machine learning; Article; Data modeling; Set (abstract data type); Text mining; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; weak structured data sanitization; Greedy algorithm; business.industry; Computer Science Applications; Privacy preserving; Data set; Information sensitivity; Computational Theory and Mathematics; Scalability; 020201 artificial intelligence & image processing; Artificial intelligence; Data mining; business; Personally identifiable information; computer; Information Systems |
Source: | IEEE Transactions on Knowledge and Data Engineering |
ISSN: | 1558-2191 1041-4347 |
Description: | Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93 percent of the original data, and completes after at most five iterations. |
Database: | OpenAIRE |
External link: |
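
The iterative idea summarized in the abstract (repeatedly train a classifier, withhold what it flags, and stop once a learner finds nothing more) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the token-ratio learner, the function names `train_token_classifier` and `iterative_sanitize`, the `min_ratio` threshold, and the toy data are all assumptions made for illustration, and the sketch assumes the publisher holds ground-truth labels for its own data.

```python
from collections import Counter


def train_token_classifier(docs, labels, min_ratio=0.5):
    """Toy learner (an assumption, not the paper's model): flag tokens
    that occur mostly in documents labelled sensitive."""
    sensitive_counts, total_counts = Counter(), Counter()
    for text, is_sensitive in zip(docs, labels):
        for token in set(text.lower().split()):
            total_counts[token] += 1
            if is_sensitive:
                sensitive_counts[token] += 1
    flagged = {t for t, n in total_counts.items()
               if sensitive_counts[t] / n > min_ratio}
    # A document is predicted sensitive if it contains any flagged token.
    return lambda text: any(tok in flagged for tok in text.lower().split())


def iterative_sanitize(docs, labels, max_rounds=5):
    """Retrain on what is left and withhold predicted-sensitive documents
    until the learner flags nothing new or max_rounds is reached."""
    remaining = list(zip(docs, labels))
    rounds = 0
    while rounds < max_rounds and remaining:
        texts, labs = zip(*remaining)
        if not any(labs):
            break  # no known sensitive instances are left
        predict = train_token_classifier(texts, labs)
        kept = [(t, l) for t, l in remaining if not predict(t)]
        if len(kept) == len(remaining):
            break  # the learner can no longer separate anything
        remaining = kept
        rounds += 1
    return [t for t, _ in remaining], rounds


# Hypothetical toy data: True marks a sensitive document.
docs = [
    "patient jones admitted",
    "jones discharged",
    "jones street clinic",
    "baker street",
    "baker sells bread",
    "clinic opens monday",
]
labels = [True, True, False, True, False, False]

published, rounds = iterative_sanitize(docs, labels)
print(published, rounds)
```

On this toy data the first round cannot flag "baker street", because each of its tokens is evenly split between sensitive and non-sensitive documents. Once the first round has withheld "jones street clinic", retraining makes "street" a sensitive-only token, so the second round catches it. That is the kind of leftover instance a single classifier pass would publish.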