CMCS: contrastive-metric learning via vector-level sampling and augmentation for code search

Authors: Qihong Song, Haize Hu, Tebo Dai
Language: English
Year of publication: 2024
Subject:
Source: Scientific Reports, Vol 14, Iss 1, Pp 1-19 (2024)
Document type: article
ISSN: 2045-2322
DOI: 10.1038/s41598-024-64205-2
Description: Code search aims to retrieve, from a large codebase, code snippets that are semantically related to natural-language queries. Deep learning is a valuable approach to code search, and the quality of the training data directly affects the performance of deep-learning models. However, most existing research on deep-learning models for code search has overlooked the critical role that training data within batches, particularly hard negative samples, plays in optimizing model parameters. In this paper, we propose CMCS, a contrastive-metric learning method for code search based on vector-level sampling and augmentation. Specifically, we propose a sampling method that obtains hard negative samples using the K-means algorithm, and a hardness-controllable sample augmentation method that obtains positive and hard negative samples using vector-level augmentation techniques. We then design an optimization objective that combines metric learning and multimodal contrastive learning and uses the obtained positive and hard negative samples. Extensive experiments were conducted on the large-scale CodeSearchNet dataset using seven advanced code search models. The results show that the proposed method significantly improves the training efficiency and search performance of code search models, which is conducive to promoting software engineering development.
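
Illustration: the abstract names two ingredients, K-means-based hard negative sampling and hardness-controllable vector-level augmentation, without giving details in this record. The sketch below is one possible reading, not the authors' implementation. It assumes that code vectors falling in the same K-means cluster serve as hard negatives for one another, and that a negative is made harder by linearly interpolating it toward the positive vector. The function names (`kmeans`, `hard_negatives`, `harden`), the interpolation rule, the parameter `lam`, and the centroid seeding are all hypothetical.

```python
import math


def kmeans(vectors, k, iters=20):
    """Plain Lloyd's K-means. Centroids are seeded with the first k vectors
    (an assumption made here to keep the sketch deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels


def hard_negatives(code_vecs, k):
    """For each sample i, return the indices of the other code vectors that
    share its cluster (assumed here to act as hard negatives)."""
    labels = kmeans(code_vecs, k)
    return {
        i: [j for j, lab in enumerate(labels) if lab == labels[i] and j != i]
        for i in range(len(code_vecs))
    }


def harden(negative, positive, lam):
    """Hypothetical hardness control: move the negative a fraction lam
    (0 <= lam < 1) of the way toward the positive vector."""
    return [n + lam * (p - n) for n, p in zip(negative, positive)]


# Toy batch of 2-D "code vectors" forming two obvious groups.
batch = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
negatives = hard_negatives(batch, k=2)
harder = harden([0.0, 0.0], [10.0, 10.0], lam=0.5)
```

In this toy batch, samples 0 and 2 fall in one cluster and samples 1 and 3 in the other, so each sample's hard-negative list contains its cluster mate. A larger `lam` yields a synthetic negative closer to the positive, which is the sense in which hardness is controllable in this sketch.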
Database: Directory of Open Access Journals