Bidirectional long short-term memory for surgical skill classification of temporally segmented tasks.
Author: | Kelly JD; Department of Mechanical Engineering, University of Minnesota, Minneapolis, MN, USA. kell1917@umn.edu., Petersen A; Division of Biostatistics, University of Minnesota, Minneapolis, MN, USA., Lendvay TS; Department of Urology, Seattle Children's Hospital, Seattle, WA, USA., Kowalewski TM; Department of Mechanical Engineering, University of Minnesota, Minneapolis, MN, USA. |
Language: | English |
Source: | International journal of computer assisted radiology and surgery [Int J Comput Assist Radiol Surg] 2020 Dec; Vol. 15 (12), pp. 2079-2088. Date of Electronic Publication: 2020 Sep 30. |
DOI: | 10.1007/s11548-020-02269-x |
Abstract: | Purpose: Most historical surgical skill research analyzes holistic, task-level summary metrics to classify the skill of a performance. Recent advances in machine learning enable time-series classification at the sub-task level, yielding predictions on segments of tasks, which could improve task-level technical skill assessment. Methods: A bidirectional long short-term memory (LSTM) network was used with 8-s windows of multidimensional time-series data from the Basic Laparoscopic Urologic Skills dataset. The network was trained on experts and novices from four common surgical tasks. Stratified cross-validation with regularization was used to avoid overfitting. The misclassified cases were re-submitted for surgical technical skill assessment to crowds using Amazon Mechanical Turk to re-evaluate and to analyze the level of agreement with previous scores. Results: Performance was best for the suturing task, with 96.88% accuracy in predicting whether a performance came from an expert or a novice (1 misclassification), when compared to previously obtained crowd evaluations. When compared with expert surgeon ratings, the LSTM predictions resulted in a Spearman coefficient of 0.89 for suturing tasks. When crowds re-evaluated misclassified performances, it was found that for all 5 misclassified cases from the peg transfer and suturing tasks, the crowds agreed more with our LSTM model than with the previously obtained crowd scores. Conclusion: The presented technique produces results comparable to crowd-sourced labels of surgical tasks. However, these results raise questions about the reliability of crowd-sourced labels in videos of surgical tasks. We, as a research community, should scrutinize crowd labeling more closely, systematically examine biases, and quantify label noise. |
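The Methods describe segmenting each task recording into 8-second windows of multidimensional time-series data and producing a task-level expert/novice label. A minimal sketch of that windowing and aggregation step is below; the 30 Hz sampling rate, the score-averaging aggregation, and the `window_score` callable are assumptions (the abstract does not specify them), and the callable stands in for the paper's bidirectional LSTM rather than reproducing it.

```python
# Sketch: split a (time x channels) series into fixed-length windows and
# aggregate per-window expert/novice scores into one task-level label.
# ASSUMPTIONS: 30 Hz sampling rate and mean-score aggregation are not
# stated in the abstract; `window_score` is a placeholder for the
# paper's bidirectional LSTM classifier.
from typing import Callable, List, Sequence

SAMPLE_RATE_HZ = 30                            # assumed sampling rate
WINDOW_SECONDS = 8                             # 8-s windows, per the Methods
WINDOW_LEN = SAMPLE_RATE_HZ * WINDOW_SECONDS   # samples per window


def make_windows(series: Sequence[Sequence[float]],
                 window_len: int = WINDOW_LEN) -> List[List[Sequence[float]]]:
    """Split the series into non-overlapping windows of `window_len`
    samples, dropping any trailing partial window."""
    return [list(series[i:i + window_len])
            for i in range(0, len(series) - window_len + 1, window_len)]


def classify_task(series: Sequence[Sequence[float]],
                  window_score: Callable[[List[Sequence[float]]], float],
                  threshold: float = 0.5) -> str:
    """Score each window (1.0 = expert-like, 0.0 = novice-like), then
    average the window scores into a single task-level label."""
    scores = [window_score(w) for w in make_windows(series)]
    mean_score = sum(scores) / len(scores)
    return "expert" if mean_score >= threshold else "novice"
```

In the paper, each window's score would come from the trained bidirectional LSTM; here any per-window scoring function can be plugged in, which makes the segmentation and aggregation logic testable independently of the model.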
Database: | MEDLINE |
External link: |