Description: |
Finding similarities in a group of sequences often involves studying their common subsequences or common substrings. In our case, Android malware detection and classification, we study the event sequences produced by the dynamic analysis of applications. For several reasons, these sequences consist mostly of benign events. In this specific setup, classic sequence similarity criteria are useless without machine learning. The membership of a sequence in a group is characterized by subsequences of any length. Heuristic algorithms for extracting short subsequences already exist, but no systematic solution to the problem has been proposed. We propose a new algorithm for building the Embedding Antichain from the set of common subsequences (denoted AΓ). We show that this mathematical representation is very compact and embeds all common subsequences of a sequence set. It is a tool for characterizing a group of sequences. Constructing this representation reveals several complex subproblems. A few of them are solved in this article, along with practical implementations. Moreover, we solve several reduced problems and provide suboptimal solutions for the others. This article opens a new path with cross-domain applications. Specifically, in the malware detection and classification domain, the Systematic Characterization of Sequence Groups is a tool that can be used to automatically generate malware family signatures and detection heuristics. We used AΓ to build an Android malware family detector on sequences of executed Android API calls, and it achieves an accuracy of 97.74%. |