Predicting Duplicate in Bug Report Using Topic-Based Duplicate Learning With Fine Tuning-Based BERT Algorithm

Author:	Taemin Kim, Geunseok Yang
Language:	English
Year of publication:	2022
Subject:	BERT; feature selection; topic modeling; bug duplicate; software evolution; Electrical engineering. Electronics. Nuclear engineering; TK1-9971
Source:	IEEE Access, Vol 10, Pp 129666-129675 (2022)
Document type:	article
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2022.3226238
Description:	As the usage and coverage of software increase, more functional improvements are requested and more bugs occur. The Eclipse and Mozilla open-source projects receive roughly 300 or more bug reports per day. Usually, when a user finds a bug, they write a bug report. The developer assigned to the bug reads the report, and if the bug has already been fixed, the developer marks the report as a duplicate. However, when duplicate bug reports are submitted, the developer must manually identify that they describe the same bug, and this process requires a lot of effort. If duplicate bug reports can be identified automatically, unnecessary effort on the part of the developer can be reduced. To address this problem, this paper predicts duplicates using the BERT (Bidirectional Encoder Representations from Transformers) algorithm and topic-based duplicate/non-duplicate feature extraction. First, bug reports are extracted from the bug repository and grouped by status, and a topic model is built for each status. Within each topic, feature selection is performed using the duplicate and non-duplicate statuses. The extracted features are then fed to the BERT algorithm as training input, and the model predicts duplicate bug reports. Precision, Recall, F-measure, and Accuracy were used to evaluate the proposed model on the Eclipse, Mozilla, Apache, and KDE open-source projects. The proposed model shows performance of about 87.67%, 89.85%, 87.03%, and 88.95% on Eclipse, Mozilla, Apache, and KDE, respectively. In addition, compared with the baselines (Naïve Bayes, Random Forest, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and CNN-LSTM), the proposed model improves by about 36.33%, 44.46%, 47.77%, and 45.17% on Eclipse, Mozilla, Apache, and KDE, respectively, showing that it detects duplicates better than the baselines.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/c286853873284a95bbec00144e1f50d7
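
The abstract names Precision, Recall, F-measure, and Accuracy as the evaluation measures for duplicate prediction. As a minimal sketch of how these four measures are conventionally computed for a binary duplicate/non-duplicate classifier, the following may help. The function name `duplicate_metrics`, the label encoding (1 = duplicate, 0 = non-duplicate), and the sample labels are illustrative assumptions, not taken from the paper, and F-measure is assumed here to be the usual harmonic mean of precision and recall.

```python
def duplicate_metrics(y_true, y_pred):
    """Return precision, recall, F-measure and accuracy for binary labels.

    Assumed encoding (hypothetical): 1 = duplicate report, 0 = non-duplicate.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # duplicates found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed duplicates
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct rejections

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Made-up labels for illustration only; not data from the paper.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    for name, value in duplicate_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.2f}")
```

The abstract does not say which of the four measures the reported percentages refer to, so this sketch makes no claim about how the published figures were derived.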