EHR-Safe: Generating High-Fidelity and Privacy-Preserving Synthetic Electronic Health Records

Authors: Jinsung Yoon, Michel Mizrahi, Nahid Ghalaty, Thomas Jarvinen, Ashwin Ravi, Peter Brune, Fanyu Kong, Dave Anderson, George Lee, Arie Meir, Farhana Bandukwala, Elli Kanal, Sercan Arik, Tomas Pfister
Publication year: 2022
Description: Privacy concerns often arise as the key bottleneck for the sharing of data between consumers and data holders, particularly for sensitive data such as Electronic Health Records (EHR). This impedes the application of data analytics and ML-based innovations with tremendous potential. One promising approach to avoid such privacy concerns is to use synthetic data instead. We propose a novel generative modeling framework, EHR-Safe, for generating highly realistic and privacy-preserving synthetic EHR data. EHR-Safe is based on a two-stage model that consists of sequential encoder-decoder networks and generative adversarial networks. Our innovations focus on the key challenging aspects of real-world EHR data: the data are heterogeneous, consisting of numerical and categorical features with distinct characteristics; they contain time-varying features with highly varying sequence lengths; and the features are often highly sparse. Under numerous evaluations, we demonstrate that the fidelity of EHR-Safe is very high, i.e., its synthetic data have almost identical properties to real data, while yielding almost ideal performance in practical privacy metrics.
Database: OpenAIRE