Learning New Words from Keystroke Data with Local Differential Privacy
Author: Sungwook Kim, Soo-Hyung Kim, Hyejin Shin, Baek Chunghun, Jun-bum Shin
Year of publication: 2020
Subject: Information privacy; Theoretical computer science; Computer science; String (computer science); Hash function; Keystroke logging; Information sensitivity; Server; Differential privacy; Word (computer architecture); Computer Science Applications; Computational Theory and Mathematics; Information Systems
Source: IEEE Transactions on Knowledge and Data Engineering, 32:479-491
ISSN: 2326-3865, 1041-4347
DOI: 10.1109/tkde.2018.2885749
Description: Keystroke data collected from smart devices includes a variety of sensitive information about users, so collecting and analyzing such data raises serious privacy concerns. Google and Apple have recently applied local differential privacy (LDP) to address the privacy issue of learning new words from users' keystroke data. However, these solutions require multiple LDP reports for a single word, which results in inefficient use of the privacy budget and high computational cost. In this paper, we develop a novel algorithm for learning new words under LDP. Unlike the existing solutions, the proposed method generates only one LDP report per word. This allows the full privacy budget to be spent on a single report, so the proposed method provides better utility than the existing methods at the same privacy level. In our algorithm, each user appends a hash value to the new word and sends only one LDP report of an $n$-gram selected randomly from the string packed from the new word and its hash value. The server then decodes frequent $n$-grams at each position of the string, discovers candidate words by exploring graph-theoretic links between $n$-grams, and checks the integrity of the candidates against the hash values. The frequencies of the discovered new words are estimated from the distribution estimates of the $n$-grams by robust regression. We theoretically show that our algorithm can recover popular new words even though the server does not know the domain of the raw data. In addition, we theoretically and empirically demonstrate that our algorithm achieves higher accuracy than the existing solutions.
Database: OpenAIRE
External link:
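The description above outlines the client-side step: pack the new word together with a short hash, pick one $n$-gram from the packed string, and send a single LDP report of it. The following Python fragment is a minimal sketch of that flow under stated assumptions; the alphabet, string and hash lengths, block-wise position choice, and the generalized randomized-response mechanism are illustrative stand-ins, not the paper's actual encoding or perturbation scheme.

```python
import hashlib
import math
import random
import string

# All constants below are illustrative assumptions, not the paper's parameters.
ALPHABET = string.ascii_lowercase + string.digits + "*"  # "*" doubles as padding
N = 2            # n-gram length
WORD_LEN = 6     # packed-word part of the string
HASH_LEN = 2     # hash characters appended for the integrity check
STR_LEN = WORD_LEN + HASH_LEN

def pack(word: str) -> str:
    """Pad/truncate the word, then append a short hash used later to validate candidates."""
    body = word[:WORD_LEN].ljust(WORD_LEN, "*")
    digest = hashlib.sha256(word.encode()).hexdigest()[:HASH_LEN]
    return body + digest

def perturb_ngram(ngram: str, epsilon: float) -> str:
    """Generalized randomized response over the n-gram domain (stand-in LDP mechanism)."""
    domain = len(ALPHABET) ** len(ngram)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + domain - 1)
    if random.random() < p_keep:
        return ngram
    while True:  # otherwise report a uniformly random *different* n-gram
        fake = "".join(random.choice(ALPHABET) for _ in range(len(ngram)))
        if fake != ngram:
            return fake

def client_report(word: str, epsilon: float = 2.0) -> tuple[int, str]:
    """One LDP report per word: a randomly chosen block index and its perturbed n-gram."""
    packed = pack(word)
    block = random.randrange(STR_LEN // N)        # non-overlapping blocks, an assumption
    ngram = packed[block * N:(block + 1) * N]
    return block, perturb_ngram(ngram, epsilon)

# Example: one user's single report for the new word "yeet"
print(client_report("yeet"))
```

The server side, omitted here, aggregates such reports per position, estimates frequent $n$-grams, links them into candidate words, and discards candidates whose appended hash does not match, as described in the paper.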