Automatic privacy and utility evaluation of anonymized documents via deep learning
Author: | Manzanares Salor, Benet |
---|---|
Contributors: | Universitat Politècnica de Catalunya. Universitat Rovira i Virgili, Moreno Ribas, Antonio, Sánchez Ruenes, David |
Language: | English |
Year of publication: | 2023 |
Subject: |
Utility assessment, Artificial intelligence, Text anonymization, Neural language models, Deep learning, Re-identification risk, Record linkage, Privacy-preserving data publishing, Information content, Informàtica::Intel·ligència artificial [Àrees temàtiques de la UPC] |
Description: | Text anonymization methods are evaluated by comparing their outputs with human-based anonymizations through standard information retrieval (IR) metrics. On the one hand, the residual disclosure risk is quantified with the recall metric, which gives the proportion of re-identifying terms successfully detected by the anonymization algorithm. On the other hand, the preserved utility is measured with the precision metric, which accounts for the proportion of masked terms that were also annotated by the human experts. Nevertheless, because these evaluation metrics were meant for information retrieval rather than privacy-oriented tasks, they suffer from several drawbacks. First, they assume a unique ground truth, and this does not hold for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, annotation-based evaluation relies on human judgements, which are inherently subjective and may be prone to errors. Finally, both metrics weight terms uniformly, thereby ignoring the fact that the influence of some terms on the disclosure risk or on utility preservation may be much larger than that of others. To overcome these drawbacks, in this thesis we propose two novel methods to evaluate both the disclosure risk and the utility preserved in anonymized texts. Our approach leverages deep learning methods to perform this evaluation automatically, thereby not requiring human annotations. For assessing disclosure risks, we propose using a re-identification attack, which we define as a multi-class classification task built on top of state-of-the-art language models. To make it feasible, the attack has been designed to capture the means and computational resources expected to be available at the attacker's end. For utility assessment, we propose a method that measures the information loss incurred during the anonymization process, which relies on neural masked language modeling.
We illustrate the effectiveness of our methods by evaluating the disclosure risk and retained utility of several well-known techniques and tools for text anonymization on a common dataset. Empirical results show significant privacy risks for all of them (including manual anonymization) and consistently proportional utility preservation. |
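The IR-style evaluation criticized in the description can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the term sets and the example annotations below are hypothetical, chosen only to show how recall (risk proxy) and precision (utility proxy) are computed against a single human ground truth.

```python
def ir_scores(system_masked, human_masked):
    """Return (recall, precision) of a system's masked terms against
    one human annotator's ground truth, as in standard IR evaluation."""
    system, human = set(system_masked), set(human_masked)
    true_positives = len(system & human)
    # Recall: share of human-flagged (re-identifying) terms the system caught.
    recall = true_positives / len(human) if human else 1.0
    # Precision: share of the system's masks that the human also annotated.
    precision = true_positives / len(system) if system else 1.0
    return recall, precision

# Hypothetical case: the human masked three re-identifying terms; the
# system caught two of them and additionally masked one harmless term.
recall, precision = ir_scores({"Alice", "Barcelona", "nurse"},
                              {"Alice", "Barcelona", "1984"})
print(recall, precision)  # both 2/3
```

Note how the example exposes the drawbacks listed above: "nurse" may be a perfectly valid masking choice, yet it lowers precision because it is absent from the unique ground truth, and all terms count equally regardless of how much each one contributes to re-identification.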
Database: | OpenAIRE |
External link: |