Popis: |
Data, information, and knowledge processing systems, in the domain of healthcare, are currently plagued by heterogeneity at various levels. Current solutions have focused on developing a standard-based, manual intervention mechanism, which requires a large number of human resources and necessitates the realignment of existing systems. State-of-the-art methodologies in the field of natural language processing and machine learning can help to partially automate this process, reducing the resource requirements and providing a relatively good multi-class-based classification algorithm. We present a novel methodology for bridging the gap between various healthcare data management solutions by leveraging the strength of transformer-based machine learning models, to create mappings between the data elements. Additionally, the annotated data, collected against five medical schemas and labeled by four annotators is made available for helping future researchers. Our results indicate, that for biased, dependent multi-class text classification, transformer-based models provide better results than linguistic and other classical models. In particular, the Robustly Optimized BERT Pretraining Approach (RoBERTa) provides the best schema matching performance by achieving a Cohen’s kappa score of 0.47 and Matthews Correlation Coefficient (MCC) score of 0.48, with human-annotated data. |