Data augmentation and transfer learning to classify malware images in a deep learning context

Author:	Mila Dalla Preda, Niccolò Marastoni, Roberto Giacobazzi
Year of publication:	2021
Subject:	Computer science; Feature extraction; 02 engineering and technology; Overfitting; Machine learning; computer.software_genre; Malware; Convolutional neural network; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Computer Science (miscellaneous); Artificial neural network; business.industry; Deep learning; Binaries; 020206 networking & telecommunications; Computational Theory and Mathematics; Hardware and Architecture; Test set; Artificial intelligence; Transfer of learning; business; computer; Software
Source:	Journal of Computer Virology and Hacking Techniques. 17:279-297
ISSN:	2263-8733
Description:	In the past few years, malware classification techniques have shifted from shallow traditional machine learning models to deeper neural network architectures. The main benefit of some of these architectures is their ability to work with raw data, made possible by their automatic feature extraction capabilities. As a result, less technical expertise is needed to build the models, and fewer resources are spent on initial pre-processing. This advantage comes with drawbacks, however, since deep learning models require huge quantities of data to produce a model that generalizes well. The amount of data required to train a deep network without overfitting is often unobtainable for malware analysts. We take inspiration from image-based data augmentation techniques and apply a sequence of semantics-preserving syntactic code transformations (obfuscations) to a small dataset of programs to generate a larger dataset. We then design two learning models, a convolutional neural network and a bi-directional long short-term memory network, and train them on images extracted from compiled binaries of the newly generated dataset. Through transfer learning, we then take the features learned from the obfuscated binaries and train the models on two state-of-the-art malware datasets, each containing around 10,000 samples. Our models achieve up to 98.5% accuracy on the test set, which is on par with or better than current state-of-the-art approaches, thus validating the approach.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::a376a210742a6120f68a7dad6a8af476 https://doi.org/10.1007/s11416-021-00381-3
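
Note:	The description refers to images extracted from compiled binaries but does not say how they are produced. A common way to do this, shown here only as a minimal sketch and not as the authors' method, is to read each byte of the binary as one grayscale pixel and lay the bytes out row by row. The function name `bytes_to_grayscale` and the default width of 256 are hypothetical choices for illustration.

```python
from typing import List


def bytes_to_grayscale(data: bytes, width: int = 256) -> List[List[int]]:
    """Lay out raw bytes row by row as a grayscale image.

    Assumption: one byte maps to one pixel intensity in 0..255.
    The last row is zero-padded so that every row has `width` pixels.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Number of zero bytes needed to fill the final row.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Usage: the first six bytes of a 64-bit ELF header as a 4-pixel-wide image.
image = bytes_to_grayscale(b"\x7fELF\x02\x01", width=4)
print(image)  # [[127, 69, 76, 70], [2, 1, 0, 0]]
```

In practice the rows would typically be converted to an array and resized to the fixed input shape a network expects; those steps depend on the chosen library and are left out here.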