Proposal and Comparison of Health Specific Features for the Automatic Assessment of Readability

Autor: Carla Teixeira Lopes, Hélder Antunes
Rok vydání: 2020
Předmět:
Zdroj: SIGIR
Popis: Looking for health information is one of the most popular activities online. However, the specificity of language on this domain is frequently an obstacle to comprehension, especially for the ones with lower levels of health literacy. For this reason, search engines should consider the readability of health content and, if possible, adapt it to the user behind the search. In this work, we explore methods to assess the readability of health content automatically. We propose features capable of measuring the specificity of a medical text and estimate the knowledge necessary to comprehend it. The features are based on information retrieval metrics and the log-likelihood of a text with lay and medico-scientific language models. To evaluate our methods, we built and used a dataset composed of health articles of Simple English Wikipedia and the respective documents in ordinary Wikipedia. We achieved a maximum accuracy of 88% in binary classifications (easy versus hard-to-read). We found out that the machine learning algorithm does not significantly interfere with performance. We also experimented and compared different features combinations. The features using the values of the log-likelihood of a text with lay and medico-scientific language models perform better than all the others.
Databáze: OpenAIRE