Web-based text anonymization with Node.js: Introducing NETANOS (Named entity-based Text Anonymization for Open Science)
Authors: | Bennett Kleinberg, Maximilian Mozes |
---|---|
Contributors: | Klinische Psychologie (Psychologie, FMG) |
Publication year: | 2017 |
Subject: | Information retrieval; Computer science; 05 social sciences; String (computer science); 050109 social psychology; Context (language use); JavaScript; 050105 experimental psychology; Software; Named-entity recognition; Index (publishing); Node (computer science); Web application; 0501 psychology and cognitive sciences |
Source: | The Journal of Open Source Software, 2(14):293 |
ISSN: | 2475-9066 |
DOI: | 10.21105/joss.00293 |
Description: | NETANOS (Named Entity-based Text ANonymization for Open Science) is natural language processing software that anonymizes texts by identifying and replacing named entities. Its key feature is that the anonymization preserves critical context, which allows secondary linguistic analyses to be run on the anonymized texts.
Consider the example string “Max and Ben spent more than 1000 hours on writing the software. They started in August 2016 in Amsterdam.” Coarse anonymization such as simple “XXX” replacement would suffice to mask the true content of the string, but it discards text properties that secondary analyses need. For example, content-based deception detection approaches rely on the number of specific times and dates to differentiate between deceptive and truthful texts (Warmelink et al. 2013).
The architecture of NETANOS relies on two software libraries capable of identifying named entities: (1) the Stanford Named Entity Recognizer (NER) (Finkel, Grenager, and Manning 2005), integrated through the ner Node.js package (Srivastava 2016), and (2) the NLP-compromise JavaScript frontend library (Kelly 2016). Both libraries are used in a layered architecture to identify persons (e.g. “Max”, “Ben”), locations (e.g. “Amsterdam”, “Munich”), organizations (e.g. “Google”), dates (e.g. “August 2016”), and values (e.g. “42”).
Specifically, the text anonymization follows a stepwise procedure: (1) The input string is analyzed by Stanford’s NER, which identifies organizations, locations, persons, and dates. (2) All identified entities are replaced with their context-preserving anonymized versions. (3) NLP-compromise’s named entity recognition tool is applied to identify potentially remaining, unrecognized entities.
Besides the key feature of context-preserving text anonymization, NETANOS also provides three alternative anonymization types.
• Context-preserving anonymization (key feature): Identified named entities are replaced with a composite string consisting of the entity type and the corresponding index of occurrence. “[PERSON_1] and [PERSON_2] spent more than [DATE/TIME_1] on writing the software. They started in [DATE/TIME_2] in [LOCATION_1].”
• Named entity-based replacement: Identified entities are replaced with a different, randomly chosen named entity of the same type. “Barry and Rick spent more than 997 hours on writing the software. They started in January 14 2016 in Odessa.”
• Non-context-preserving anonymization: This replacement type is inspired by the anonymization procedure suggested by the UK Data Service (UK Data Service, n.d.). It replaces all strings having a capital first letter and all numeric values with XXX. “XXX and XXX spent more than XXX hours on writing the software. XXX started in XXX XXX in XXX.”
• Combined, non-context-preserving anonymization: The context-preserving replacement is used to identify candidates for replacement, which are then replaced using the procedure of the non-context-preserving replacement. “XXX and XXX spent more than XXX XXX on writing the software. XXX started in XXX XXX in XXX.”
Note that all replacements are applied globally across the input string. |
Database: | OpenAIRE |
External link: |
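
The two replacement styles named in the description, context-preserving and non-context-preserving, can be illustrated with a minimal Node.js sketch. This is not the NETANOS source code. The function names `anonymizeContextPreserving` and `anonymizeNonContextPreserving` are hypothetical, the entity list is supplied by hand rather than produced by Stanford NER or NLP-compromise, and the regular expression is only an approximation of "capitalized strings and numeric values".

```javascript
// Illustrative sketch only: hypothetical helper names, not the NETANOS API.
// Entity detection is assumed to have happened already; entities are passed in.

// Escape regex metacharacters so entity text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Context-preserving replacement: each distinct entity becomes [TYPE_n],
// where n counts distinct entities of that type in the order they are listed.
function anonymizeContextPreserving(input, entities) {
  const counters = {};
  const labels = new Map();
  for (const { text, type } of entities) {
    if (labels.has(text)) continue; // same entity keeps the same label
    counters[type] = (counters[type] || 0) + 1;
    labels.set(text, `[${type}_${counters[type]}]`);
  }
  if (labels.size === 0) return input;

  // Longest entities first, so "August 2016" is preferred over a bare "2016".
  const ordered = [...labels.keys()].sort((a, b) => b.length - a.length);
  // One global pass, so already inserted labels are never rescanned.
  const pattern = new RegExp(ordered.map(escapeRegExp).join("|"), "g");
  return input.replace(pattern, (match) => labels.get(match));
}

// Non-context-preserving replacement: capitalized words and numbers become XXX.
// Assumption: ASCII letters only; a real tool would need Unicode handling.
function anonymizeNonContextPreserving(input) {
  return input.replace(/\b(?:[A-Z][A-Za-z]*|\d+(?:[.,]\d+)*)\b/g, "XXX");
}

const example =
  "Max and Ben spent more than 1000 hours on writing the software. " +
  "They started in August 2016 in Amsterdam.";

// Hand-written entity list standing in for the output of an NER library.
const exampleEntities = [
  { text: "Max", type: "PERSON" },
  { text: "Ben", type: "PERSON" },
  { text: "1000 hours", type: "DATE/TIME" },
  { text: "August 2016", type: "DATE/TIME" },
  { text: "Amsterdam", type: "LOCATION" },
];

console.log(anonymizeContextPreserving(example, exampleEntities));
console.log(anonymizeNonContextPreserving(example));
```

With the hand-written entity list, both calls reproduce the corresponding example strings quoted in the description. Building a single alternation pattern rather than replacing entity by entity is a design choice of this sketch: it keeps the replacement global, as the description requires, while ensuring that text inside an inserted label such as `[PERSON_1]` cannot be matched again by a later entity.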