Popis: |
Binary code similarity detection has been widely used in vulnerability searching,malware detection,advanced program analysis and other fields in recent years,while program code is similar to natural language in a degree,researchers start to use pre-training and other natural language processing related technologies to improve accuracy.A binary code similarity detection method based on pre-training assembly instruction representation is proposed to deal with the accuracy bottleneck due to insufficient consideration of instruction probability features.It includes tokenization method for multi-arch assembly instructions,and pre-trai-ning tasks that considering control flow,data flow,instruction logic and probability of occurrence,to achieve better vectorized representation of instructions.Downstream binary code similarity detection task is improved by combining pre-training method to gain accuracy boost.Experiments show that,compared with the existing methods,the proposed method improves instruction representing performance by 23.7% at the maximum,and improves block searching ability and similarity detection performance by up to 33.97% and 400% respectively. |