Popis: |
In natural language processing (NLP), automated ICD coding from text data has been studied. Because training data are sparse, codes that represent rare diseases are difficult to handle. Many previous studies have therefore restricted the target codes to those that appear frequently in the training data. In this work, we propose a method for generating ICD-10 codes with a sequence-to-sequence (seq2seq) model. A seq2seq model can automatically learn the regularities, specifically the hierarchical structure, in the character sequences of ICD codes, so during training it exploits the interactions and relations among codes that a conventional classifier fails to use. As a result, the proposed method can predict even codes that do not appear in the training data. In our experiment, we randomly divided all pairs of an ICD-10 code and its natural-language description into two equal halves. Using the first half, we trained a seq2seq model that converts the token sequence of an ICD-10 description into the character sequence of the corresponding ICD-10 code. We then tested the model on the second half, which was not included in the training data. The model assigned the correct code to 80% of the descriptions, whereas a conventional classifier trained on the same data did not assign any correct code. In addition, we conducted an ICD code assignment experiment on medical test data, in which our proposed model achieved higher accuracy than the conventional classifier. Moreover, we implemented a seq2seq model with attention and analyzed its performance by investigating the tokens it attended to during prediction. |