Data augmentation and transfer learning to classify malware images in a deep learning context
Author: | Mila Dalla Preda, Niccolò Marastoni, Roberto Giacobazzi |
---|---|
Year of publication: | 2021 |
Subject: |
Computer science
Feature extraction; Overfitting; Machine learning; Malware; Convolutional neural network; Artificial neural network; Deep learning; Binaries; Test set; Artificial intelligence; Transfer of learning; Software; Information systems; Networking & telecommunications; Electrical engineering, electronic engineering, information engineering; Engineering and technology; Computer Science (miscellaneous); Computational Theory and Mathematics; Hardware and Architecture |
Source: | Journal of Computer Virology and Hacking Techniques. 17:279-297 |
ISSN: | 2263-8733 |
Description: | In the past few years, malware classification techniques have shifted from shallow traditional machine learning models to deeper neural network architectures. The main benefit of some of these architectures is their ability to work with raw data, made possible by their automatic feature extraction capabilities. As a result, building the models requires less technical expertise and fewer initial pre-processing resources. This advantage comes with a drawback, however: deep learning models require huge quantities of data in order to generalize well. The amount of data required to train a deep network without overfitting is often unobtainable for malware analysts. We take inspiration from image-based data augmentation techniques and apply a sequence of semantics-preserving syntactic code transformations (obfuscations) to a small dataset of programs to generate a larger dataset. We then design two learning models, a convolutional neural network and a bidirectional long short-term memory network, and we train them on images extracted from compiled binaries of the newly generated dataset. Through transfer learning we then take the features learned from the obfuscated binaries and train the models against two state-of-the-art malware datasets, each containing around 10,000 samples. Our models achieve up to 98.5% accuracy on the test set, which is on par with or better than current state-of-the-art approaches, thus validating the approach. |
Database: | OpenAIRE |
External link: |
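
The description mentions training on images extracted from compiled binaries but does not say how the images are built. As an illustration only, the sketch below uses a common byte-to-pixel mapping: each byte becomes one grayscale pixel (0 to 255) and the byte stream is wrapped into rows of a fixed width. The function name `bytes_to_grayscale_rows`, the `width` parameter, and the zero-padding of the last row are assumptions made for this example, not details taken from the paper.

```python
from typing import List


def bytes_to_grayscale_rows(data: bytes, width: int = 16) -> List[List[int]]:
    """Map each byte to one pixel (0-255) and wrap the stream into rows.

    Hypothetical helper: the fixed width and the zero-padding are
    illustrative choices, not the authors' documented procedure.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Zero-pad the tail so that the last row is complete.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    # Slice the padded stream into consecutive rows of `width` pixels.
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


if __name__ == "__main__":
    # Ten sample bytes wrapped into rows of four pixels.
    sample = bytes(range(10))
    for row in bytes_to_grayscale_rows(sample, width=4):
        print(row)
```

In practice the rows would be built from the contents of a compiled binary read in binary mode, then resized or cropped to the input shape expected by the classifier.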