A Machine Learning-Based Approach For Detecting Malicious PyPI Packages

Author:	Samaana, Haya; Costa, Diego Elias; Shihab, Emad; Abdellatif, Ahmad
Year of publication:	2024
Subject:	Computer Science - Software Engineering
Document type:	Working Paper
Description:	Background. In modern software development, the use of external libraries and packages is increasingly prevalent, streamlining the development process and enabling developers to deploy feature-rich systems with little coding. While this reliance on reused code offers substantial benefits, it also introduces serious risks for deployed software in the form of malicious packages: harmful and vulnerable code disguised as useful libraries. Aims. Popular ecosystems, such as PyPI, receive thousands of new package contributions every week, and distinguishing safe contributions from harmful ones presents a significant challenge. There is a dire need for reliable methods to detect and address the presence of malicious packages in these environments. Method. To address these challenges, we propose a data-driven approach that uses machine learning and static analysis to examine a package's metadata, code, files, and textual characteristics to identify malicious packages. Results. In evaluations conducted within the PyPI ecosystem, we achieved an F1-measure of 0.94 for identifying malicious packages using a stacking ensemble classifier. Conclusions. This tool can be seamlessly integrated into package vetting pipelines and can flag entire packages, not just malicious function calls. This capability strengthens security measures and reduces the manual workload for developers and registry maintainers, thereby contributing to the overall integrity of the ecosystem.
Database:	arXiv
External link:	http://arxiv.org/abs/2412.05259
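
The Method part of the description mentions static analysis of a package's code as one input to the classifier. The record does not say which features the authors extract, so the following is only a minimal sketch of the general idea, using Python's standard `ast` module. The watch-lists `SUSPICIOUS_CALLS` and `SUSPICIOUS_MODULES`, the function name `extract_code_features`, and the feature names are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import ast

# Hypothetical watch-lists chosen for illustration; not taken from the paper.
SUSPICIOUS_CALLS = {"eval", "exec", "compile", "__import__"}
SUSPICIOUS_MODULES = {"subprocess", "socket", "base64", "ctypes"}


def extract_code_features(source: str) -> dict:
    """Count simple static indicators in one Python source file.

    The code is parsed, never executed, so it is safe to run on
    untrusted input.
    """
    tree = ast.parse(source)
    features = {"suspicious_calls": 0, "suspicious_imports": 0, "num_functions": 0}
    for node in ast.walk(tree):
        # Direct calls by bare name, e.g. exec(...); attribute calls are ignored.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPICIOUS_CALLS:
                features["suspicious_calls"] += 1
        # "import a, b.c": check the top-level module of each alias.
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in SUSPICIOUS_MODULES:
                    features["suspicious_imports"] += 1
        # "from a.b import c": check the top-level module once.
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in SUSPICIOUS_MODULES:
                features["suspicious_imports"] += 1
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            features["num_functions"] += 1
    return features


sample = '''
import base64, subprocess
def run(payload):
    exec(base64.b64decode(payload))
'''
print(extract_code_features(sample))
```

In a pipeline like the one the description outlines, counts of this kind would be combined with metadata, file, and textual features into one vector per package before being passed to a classifier. The classifier itself is not shown here.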