SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.
Authors: | Oliveira LESE; Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil. kunkaweb@gmail.com., Peters AC; Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil., da Silva AMP; Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil., Gebeluca CP; Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil., Gumiel YB; Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil., Cintho LMM; Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil., Carvalho DR; Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil., Al Hasan S; AI Lab, Philips Research North America, Cambridge, MA, USA., Moro CMC; Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil. |
---|---|
Language: | English |
Source: | Journal of biomedical semantics [J Biomed Semantics] 2022 May 08; Vol. 13 (1), pp. 13. Date of Electronic Publication: 2022 May 08. |
DOI: | 10.1186/s13326-022-00269-1 |
Abstract: | Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations. Results: This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying a strict match) to 0.92 (applying a relaxed match that accepts partial overlaps and hierarchically related semantic types). The extrinsic evaluation, in which the corpus was applied to two downstream NLP tasks, demonstrated the reliability and usefulness of the annotations, with the systems achieving results that were consistent with the agreement scores. Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. 
To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus. (© 2022. The Author(s).) |
Database: | MEDLINE |
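
The abstract reports annotator agreement under a strict match and under a relaxed match. The difference between the two can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: agreement is computed here as a pairwise F1 score (the abstract does not name the metric), the `Entity` type, the function names, and the labels `Disorder` and `Procedure` are hypothetical, and the relaxed match below accepts only partial span overlap with an identical label, leaving out the hierarchically related semantic types that the abstract also accepts.

```python
from typing import Callable, List, NamedTuple


class Entity(NamedTuple):
    """A hypothetical annotated span: character offsets plus a semantic label."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def strict_match(a: Entity, b: Entity) -> bool:
    # Strict: identical boundaries and identical label.
    return a == b


def relaxed_match(a: Entity, b: Entity) -> bool:
    # Relaxed (simplified): spans overlap by at least one character
    # and carry the same label. Hierarchical label matching is omitted.
    overlaps = a.start < b.end and b.start < a.end
    return overlaps and a.label == b.label


def agreement_f1(ann_a: List[Entity], ann_b: List[Entity],
                 match: Callable[[Entity, Entity], bool]) -> float:
    # Treat annotator B as the reference and annotator A as the prediction.
    if not ann_a or not ann_b:
        return 0.0
    matched_a = sum(any(match(a, b) for b in ann_b) for a in ann_a)
    matched_b = sum(any(match(b, a) for a in ann_a) for b in ann_b)
    precision = matched_a / len(ann_a)
    recall = matched_b / len(ann_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two annotators agree on the second entity but disagree on where the first ends.
annotator_a = [Entity(0, 8, "Disorder"), Entity(20, 27, "Procedure")]
annotator_b = [Entity(0, 12, "Disorder"), Entity(20, 27, "Procedure")]

strict_score = agreement_f1(annotator_a, annotator_b, strict_match)
relaxed_score = agreement_f1(annotator_a, annotator_b, relaxed_match)
print(strict_score, relaxed_score)
```

In this toy case the boundary disagreement on the first entity counts as a miss under the strict match and as a hit under the relaxed match, which is the kind of gap that separates the two agreement figures reported in the abstract.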