Proposal and Comparison of Health Specific Features for the Automatic Assessment of Readability
Author: | Carla Teixeira Lopes, Hélder Antunes |
---|---|
Publication year: | 2020 |
Subject: | Medical informatics; Computer science; Health literacy; Engineering and technology; Readability; Domain (software engineering); Electrical engineering, electronic engineering, information engineering; Artificial intelligence; Language model; Natural language processing |
Source: | SIGIR |
Description: | Looking for health information is one of the most popular activities online. However, the specificity of language in this domain is frequently an obstacle to comprehension, especially for people with lower levels of health literacy. For this reason, search engines should consider the readability of health content and, if possible, adapt it to the user behind the search. In this work, we explore methods to assess the readability of health content automatically. We propose features capable of measuring the specificity of a medical text and estimating the knowledge necessary to comprehend it. The features are based on information retrieval metrics and on the log-likelihood of a text under lay and medico-scientific language models. To evaluate our methods, we built and used a dataset composed of health articles from Simple English Wikipedia and the corresponding articles in ordinary Wikipedia. We achieved a maximum accuracy of 88% in binary classification (easy versus hard to read). We found that the choice of machine learning algorithm does not significantly affect performance. We also experimented with and compared different feature combinations. The features based on the log-likelihood of a text under lay and medico-scientific language models perform better than all the others. |
Database: | OpenAIRE |
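
The description mentions features built from the log-likelihood of a text under a lay and a medico-scientific language model. The record does not say which kind of language model the authors used, so the following is only a minimal sketch under stated assumptions: unigram models with add-one smoothing, a naive whitespace tokenizer, and a per-token average so that texts of different lengths are comparable. The names `UnigramModel`, `readability_features`, and `ll_diff` are hypothetical and introduced here for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class UnigramModel:
    """Unigram language model with add-one smoothing (an assumption, not the paper's model)."""

    def __init__(self, corpus_texts):
        self.counts = Counter()
        for text in corpus_texts:
            self.counts.update(tokenize(text))
        self.total = sum(self.counts.values())
        # One extra vocabulary slot reserves probability mass for unseen words.
        self.vocab_size = len(self.counts) + 1

    def log_prob(self, token):
        # Counter returns 0 for tokens that never appeared in the corpus.
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def avg_log_likelihood(self, text):
        """Average per-token log-likelihood; 0.0 for empty text."""
        tokens = tokenize(text)
        if not tokens:
            return 0.0
        return sum(self.log_prob(t) for t in tokens) / len(tokens)


def readability_features(text, lay_model, sci_model):
    """Return the two log-likelihoods and their difference as candidate features."""
    lay_ll = lay_model.avg_log_likelihood(text)
    sci_ll = sci_model.avg_log_likelihood(text)
    # A positive difference means the text looks more like the lay corpus.
    return {"lay_ll": lay_ll, "sci_ll": sci_ll, "ll_diff": lay_ll - sci_ll}
```

In practice the two models would be trained on large lay and medico-scientific corpora, and the resulting feature values would be passed to a classifier together with any other features; the choice of corpora and classifier is left open here.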