Learning New Words from Keystroke Data with Local Differential Privacy
Author: Sungwook Kim, Soo-Hyung Kim, Hyejin Shin, Baek Chunghun, Jun-bum Shin
Year of publication: 2020
Subject: Information privacy; Theoretical computer science; Computer science; String (computer science); Hash function; Keystroke logging; Information sensitivity; Server; Differential privacy; Word (computer architecture); Computer Science Applications; Computational Theory and Mathematics; Information Systems
Source: IEEE Transactions on Knowledge and Data Engineering, 32:479-491
ISSN: 2326-3865, 1041-4347
DOI: 10.1109/tkde.2018.2885749
Description: Keystroke data collected from smart devices includes a variety of sensitive information about users, so collecting and analyzing such data raises serious privacy concerns. Google and Apple have recently applied local differential privacy (LDP) to address the privacy issue of learning new words from users' keystroke data. However, these solutions require multiple LDP reports for a single word, which results in inefficient use of the privacy budget and high computational cost. In this paper, we develop a novel algorithm for learning new words under LDP. Unlike the existing solutions, the proposed method generates only one LDP report per word. This allows the full privacy budget to be spent on a single report, so the proposed method provides better utility than the existing methods at the same privacy level. In our algorithm, each user appends a hash value to the new word and sends only one LDP report of an $n$-gram selected randomly from the string packed from the new word and its hash value. The server then decodes frequent $n$-grams at each position of the string, discovers candidate words by exploring graph-theoretic links between $n$-grams, and checks the integrity of the candidates against the hash values. The frequencies of the discovered new words are estimated from the distribution estimates of the $n$-grams by robust regression. We theoretically show that our algorithm can recover popular new words even though the server does not know the domain of the raw data. In addition, we theoretically and empirically demonstrate that our algorithm achieves higher accuracy than the existing solutions.
Database: OpenAIRE
External link:
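The description above outlines the client-side step: pack the new word together with a short hash, pick one $n$-gram from the packed string, and send a single LDP report of it. The following Python fragment is a minimal sketch of that flow under stated assumptions; the alphabet, string and hash lengths, block-wise position choice, and the generalized randomized-response mechanism are illustrative stand-ins, not the paper's actual encoding or perturbation scheme.

```python
import hashlib
import math
import random
import string

# All constants below are illustrative assumptions, not the paper's parameters.
ALPHABET = string.ascii_lowercase + string.digits + "*"  # "*" doubles as padding
N = 2            # n-gram length
WORD_LEN = 6     # packed-word part of the string
HASH_LEN = 2     # hash characters appended for the integrity check
STR_LEN = WORD_LEN + HASH_LEN

def pack(word: str) -> str:
    """Pad/truncate the word, then append a short hash used later to validate candidates."""
    body = word[:WORD_LEN].ljust(WORD_LEN, "*")
    digest = hashlib.sha256(word.encode()).hexdigest()[:HASH_LEN]
    return body + digest

def perturb_ngram(ngram: str, epsilon: float) -> str:
    """Generalized randomized response over the n-gram domain (stand-in LDP mechanism)."""
    domain = len(ALPHABET) ** len(ngram)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + domain - 1)
    if random.random() < p_keep:
        return ngram
    while True:  # otherwise report a uniformly random *different* n-gram
        fake = "".join(random.choice(ALPHABET) for _ in range(len(ngram)))
        if fake != ngram:
            return fake

def client_report(word: str, epsilon: float = 2.0) -> tuple[int, str]:
    """One LDP report per word: a randomly chosen block index and its perturbed n-gram."""
    packed = pack(word)
    block = random.randrange(STR_LEN // N)        # non-overlapping blocks, an assumption
    ngram = packed[block * N:(block + 1) * N]
    return block, perturb_ngram(ngram, epsilon)

# Example: one user's single report for the new word "yeet"
print(client_report("yeet"))
```

The server side, omitted here, aggregates such reports per position, estimates frequent $n$-grams, links them into candidate words, and discards candidates whose appended hash does not match, as described in the paper.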