Transformer-based structuring of free-text radiology report databases.

Author: Nowak S; Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany. sebastian.nowak@ukbonn.de., Biesner D; Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Sankt Augustin, Germany., Layer YC; Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany., Theis M; Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany., Schneider H; Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Sankt Augustin, Germany., Block W; Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany., Wulff B; Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Sankt Augustin, Germany., Attenberger UI; Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany., Sifa R; Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Sankt Augustin, Germany., Sprinkart AM; Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany.
Language: English
Source: European radiology [Eur Radiol] 2023 Jun; Vol. 33 (6), pp. 4228-4236. Date of Electronic Publication: 2023 Mar 11.
DOI: 10.1007/s00330-023-09526-y
Abstract: Objectives: To provide insights for on-site development of transformer-based structuring of free-text report databases by investigating different labeling and pre-training strategies.
Methods: A total of 93,368 German chest X-ray reports from 20,912 intensive care unit (ICU) patients were included. Two labeling strategies were investigated to tag six findings of the attending radiologist. First, a system based on human-defined rules was applied to annotate all reports (termed "silver labels"). Second, 18,000 reports were manually annotated in 197 h (termed "gold labels"), of which 10% were used for testing. An on-site pre-trained model (T_mlm) using masked-language modeling (MLM) was compared to a public, medically pre-trained model (T_med). Both models were fine-tuned for text classification on silver labels only, on gold labels only, and first on silver and then on gold labels (hybrid training), using varying numbers (N: 500, 1000, 2000, 3500, 7000, 14,580) of gold labels. Macro-averaged F1-scores (MAF1) in percent were calculated with 95% confidence intervals (CI).
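For orientation, the two-stage strategy described in the Methods (on-site MLM pre-training followed by fine-tuning for classification of the six findings) could be implemented with the Hugging Face transformers library. The following Python sketch is illustrative only: the base checkpoint name, file paths, hyperparameters, and the multi-label problem formulation are assumptions, not the authors' actual configuration.

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForSequenceClassification,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

base = "bert-base-german-cased"  # assumed German base checkpoint (illustrative)
tok = AutoTokenizer.from_pretrained(base)

# Stage 1: on-site MLM pre-training on the unlabeled report corpus (T_mlm)
raw = load_dataset("text", data_files={"train": "reports_unlabeled.txt"})  # hypothetical file
tokenized = raw.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])
mlm_model = AutoModelForMaskedLM.from_pretrained(base)
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
trainer = Trainer(model=mlm_model,
                  args=TrainingArguments("tmlm_pretrained",
                                         num_train_epochs=3,
                                         per_device_train_batch_size=16),
                  train_dataset=tokenized["train"],
                  data_collator=collator)
trainer.train()
trainer.save_model("tmlm_pretrained")
tok.save_pretrained("tmlm_pretrained")

# Stage 2: fine-tuning for classification of the six findings; hybrid training
# would run this first on silver labels and then continue on gold labels.
clf = AutoModelForSequenceClassification.from_pretrained(
    "tmlm_pretrained", num_labels=6,
    problem_type="multi_label_classification")
# ... fine-tune clf with another Trainer on tokenized, labeled reports ...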
Results: T_mlm,gold (95.5 [94.5-96.3]) showed significantly higher MAF1 than T_med,silver (75.0 [73.4-76.5]) and T_mlm,silver (75.2 [73.6-76.7]), but not significantly higher MAF1 than T_med,gold (94.7 [93.6-95.6]), T_med,hybrid (94.9 [93.9-95.8]), and T_mlm,hybrid (95.2 [94.3-96.0]). When using 7000 or fewer gold-labeled reports, T_mlm,gold (N: 7000, 94.7 [93.5-95.7]) showed significantly higher MAF1 than T_med,gold (N: 7000, 91.5 [90.0-92.8]). With at least 2000 gold-labeled reports, utilizing silver labels did not lead to a significant improvement of T_mlm,hybrid (N: 2000, 91.8 [90.4-93.2]) over T_mlm,gold (N: 2000, 91.4 [89.9-92.8]).
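A macro-averaged F1 score in percent with a 95% confidence interval, as reported in the Results, could be obtained with a percentile bootstrap over the test reports. The sketch below is an assumed evaluation procedure, not the authors' code; the function name and bootstrap settings are illustrative.

import numpy as np
from sklearn.metrics import f1_score

def maf1_with_ci(y_true, y_pred, n_boot=1000, seed=0):
    """Macro-averaged F1 in percent with a 95% percentile-bootstrap CI (illustrative)."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    point = 100 * f1_score(y_true, y_pred, average="macro", zero_division=0)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample test reports with replacement
        boot.append(100 * f1_score(y_true[idx], y_pred[idx],
                                   average="macro", zero_division=0))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return point, (lo, hi)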
Conclusions: Custom pre-training of transformers combined with fine-tuning on manual annotations promises to be an efficient strategy to unlock report databases for data-driven medicine.
Key Points: • On-site development of natural language processing methods that retrospectively unlock free-text databases of radiology clinics for data-driven medicine is of great interest. • For clinics seeking to develop methods on-site for retrospective structuring of a departmental report database, it remains unclear which of the previously proposed strategies for labeling reports and pre-training models is most appropriate in the context of, e.g., available annotator time. • Using a custom pre-trained transformer model together with modest annotation effort promises to be an efficient way to retrospectively structure radiological databases, even when millions of reports are not available for pre-training.
(© 2023. The Author(s).)
Database: MEDLINE