Abstract |
---|
Clustering is an important problem in malware research, as the number of malicious samples that appear every day makes manual analysis impractical. Although these samples belong to a limited number of malware families, obfuscation makes it difficult to categorize them automatically. By extracting relevant features, we can apply clustering algorithms and then analyze only a few representatives from each cluster. However, classic clustering algorithms that compute the similarity between each pair of samples are slow on large collections. In this paper, the features are strings of operation codes extracted from the binary code of each sample. With a modified suffix tree data structure, we can find sufficiently long substrings that correspond to portions of a program's code. These substrings must be filtered against a database of known substrings so that common library code is ignored. Items whose common substrings exceed a certain threshold are grouped into the same cluster. Our algorithm was tested with data extracted from real-world malware and produced good-quality clusters. |
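
The pipeline described in the abstract (opcode strings, shared substrings, filtering of known library code, threshold-based grouping) can be illustrated with a small sketch. This is not the authors' implementation: it replaces the modified suffix tree with a plain inverted index over fixed-length opcode windows, and it reads "above a certain threshold" as a count of shared windows. The names `opcode_windows` and `cluster_samples`, and the parameters `k`, `threshold`, and `known`, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def opcode_windows(opcodes, k):
    """Return the set of all length-k opcode windows of one sample."""
    return {tuple(opcodes[i:i + k]) for i in range(len(opcodes) - k + 1)}


def cluster_samples(samples, k=3, threshold=2, known=frozenset()):
    """Group samples that share at least `threshold` opcode windows.

    samples: dict mapping a sample name to its list of opcode mnemonics.
    known:   windows to ignore (stand-in for the known-substring database).
    Assumption: fixed-length windows replace the paper's suffix tree.
    """
    # Inverted index: window -> names of the samples that contain it.
    index = defaultdict(set)
    for name, opcodes in samples.items():
        for window in opcode_windows(opcodes, k) - set(known):
            index[window].add(name)

    # Count shared windows only for pairs that co-occur in the index.
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1

    # Union-find merges every pair that reaches the threshold.
    parent = {name: name for name in samples}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for name in samples:
        clusters[find(name)].append(name)
    return sorted(sorted(members) for members in clusters.values())


samples = {
    "a": ["push", "mov", "call", "add", "ret", "xor"],
    "b": ["nop", "push", "mov", "call", "add", "ret"],
    "c": ["xor", "xor", "inc", "dec", "jmp"],
}
print(cluster_samples(samples, k=3, threshold=2))
```

The inverted index avoids comparing samples that share nothing, which is the same motivation the abstract gives for avoiding all-pairs similarity. A suffix tree would additionally find shared substrings of arbitrary length rather than a fixed `k`.
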

Year | DOI | Venue |
---|---|---|
2016 | 10.1007/s11416-014-0227-6 | J. Computer Virology and Hacking Techniques |

Field | DocType | Volume |
---|---|---|
Edit distance, Malware research, Data mining, Data structure, Substring, Tree traversal, Computer science, Suffix tree, Cluster analysis, Malware | Journal | 12 |

Issue | ISSN | Citations |
---|---|---|
1 | 2263-8733 | 1 |

PageRank | References | Authors |
---|---|---|
0.35 | 14 | 3 |

Name | Order | Citations | PageRank |
---|---|---|---|
Ciprian Oprisa | 1 | 16 | 5.48 |
George Cabau | 2 | 4 | 1.48 |
Gheorghe Sebestyen | 3 | 5 | 6.25 |