Preprocessing of unstructured medical data: the impact of each preprocessing stage on classification

Authors:	Mariya Kashina, Georgy Kopanitsa, Iuliia Lenivtceva
Year of publication:	2020
Subject:	Normalization (statistics); Stop words; Computer science; business.industry; 020206 networking & telecommunications; Pattern recognition; 02 engineering and technology; Logistic regression; 0202 electrical engineering, electronic engineering, information engineering; General Earth and Planetary Sciences; Preprocessor; 020201 artificial intelligence & image processing; Artificial intelligence; business; Error detection and correction; Classifier (UML); General Environmental Science
Source:	Procedia Computer Science. 178:284-290
ISSN:	1877-0509
DOI:	10.1016/j.procs.2020.11.030
Description:	Methods for processing text data in Russian, medical texts in particular, still need development. In this paper, we examine how each stage of text pre-processing affects the classifier's results. We analyzed 269,923 patient allergy anamnesis records, 11,670 of which were selected for further processing. We consider the main pre-processing stages: tokenization, stop-word removal, error correction, document cropping, normalization, class harmonization, and vectorization. The data were vectorized with the Bag-of-Words model. Logistic regression was chosen for classification because it is easy to reproduce and interpret. Precision, recall, and F-measure were used as evaluation metrics. The results (F = 88.12%) showed that normalization and error correction were the most effective stages.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::567435b09fd0dba0999cf16234354dae https://doi.org/10.1016/j.procs.2020.11.030
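
The pipeline summarized in the description (tokenization, stop-word removal, Bag-of-Words vectorization, logistic regression, and precision/recall/F-measure) can be illustrated with a minimal, standard-library-only Python sketch. This is not the authors' implementation: the stop-word list, the four toy records, their labels, and all function names are hypothetical, and the error-correction, document-cropping, normalization, and class-harmonization stages are left out.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real list for Russian clinical text would be larger.
STOP_WORDS = {"на", "и", "в", "у"}

def preprocess(text):
    """Tokenize (lowercased word tokens) and remove stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(docs):
    """Map every distinct token in the corpus to a column index."""
    return {t: i for i, t in enumerate(sorted({t for d in docs for t in d}))}

def bag_of_words(tokens, vocab):
    """Return a token-count vector; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for token, count in Counter(tokens).items():
        if token in vocab:
            vec[vocab[token]] = count
    return vec

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit binary logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
            b -= lr * grad
    return w, b

def predict(x, w, b):
    """Predict class 1 when the linear score is non-negative."""
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F-measure for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy records (1 = allergy reported, 0 = no allergy); not data from the paper.
texts = ["аллергия на пенициллин", "аллергия на пыльцу", "аллергии отрицает", "аллергии нет"]
labels = [1, 1, 0, 0]

docs = [preprocess(t) for t in texts]
vocab = build_vocabulary(docs)
X = [bag_of_words(d, vocab) for d in docs]
w, b = train_logreg(X, labels)

# Evaluated on the training records only to keep the sketch short;
# a real experiment would use a held-out test set.
predictions = [predict(x, w, b) for x in X]
precision, recall, f1 = precision_recall_f1(labels, predictions)
```

Each stage is a separate function, so a single stage can be disabled or swapped and the metrics recomputed, which is one simple way to measure the contribution of each pre-processing step.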