PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs

Author:	Yadav, Ankit; Beniwal, Himanshu; Singh, Mayank
Publication year:	2024
Subject:	Computer Science - Computation and Language; Computer Science - Artificial Intelligence
Document type:	Working Paper
Description:	Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks, potentially inflating model performance estimations. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts on a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs.
Database:	arXiv
External link:	http://arxiv.org/abs/2401.03855