An Empirical Evaluation of Automated Machine Learning Techniques for Malware Detection - Citegraph

Paper Info

Title
An Empirical Evaluation of Automated Machine Learning Techniques for Malware Detection

Abstract
Nowadays, it is increasingly difficult even for a machine learning expert to incorporate all of the recent best practices into their modeling due to the fast development of state-of-the-art machine learning techniques. For applications that handle big data sets, choosing the best-performing model with the best hyper-parameter setting becomes even harder. In this work, we present an empirical evaluation of automated machine learning (AutoML) frameworks and techniques that aim to optimize hyper-parameters for machine learning models to achieve the best achievable performance. We apply AutoML techniques to the malware detection problem, which requires a true positive rate as high as possible together with a false positive rate as low as possible. We adopt two AutoML frameworks, namely AutoGluon-Tabular and Microsoft Neural Network Intelligence (NNI), to optimize the hyper-parameters of a Light Gradient Boosted Machine (LightGBM) model for classifying malware samples. We carry out extensive experiments on two data sets. The first is the publicly available EMBER data set, which has been used as a benchmark in many malware detection works. The second is a private data set of recently collected malware samples that we acquired from a security company. We provide an empirical analysis and performance comparison of the two AutoML frameworks. The experimental results show that AutoML frameworks can identify hyper-parameter settings that significantly outperform the best known setting, improving the true positive rate of a LightGBM classifier at a false positive rate of 0.1% from 86.8% to 90% on the EMBER data set and from 80.8% to 87.4% on the private data set.
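The abstract reports results as the true positive rate at a fixed false positive rate of 0.1%. The following is a minimal sketch of how such a metric can be computed from classifier scores; it is not the authors' evaluation code. The function name `tpr_at_fpr`, the label convention (0 = benign, 1 = malicious), and the strict "score above threshold" decision rule are all assumptions made for illustration.

```python
def tpr_at_fpr(labels, scores, max_fpr=0.001):
    """Return (tpr, threshold) for the strictest budget of false positives.

    Assumed convention (not from the paper): label 0 = benign, 1 = malicious,
    and a sample is flagged as malware when its score is strictly above
    the threshold.
    """
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")

    # Split the scores by class; benign scores are sorted from high to low.
    benign = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    malicious = [s for y, s in zip(labels, scores) if y == 1]
    if not benign or not malicious:
        raise ValueError("need at least one benign and one malicious sample")

    # Number of benign samples that may be flagged without exceeding max_fpr
    # (rounded down, so the budget is never exceeded).
    allowed_fp = int(max_fpr * len(benign))

    # Using the (allowed_fp + 1)-th highest benign score as the threshold means
    # at most `allowed_fp` benign samples score strictly above it.
    threshold = benign[allowed_fp]

    tpr = sum(s > threshold for s in malicious) / len(malicious)
    return tpr, threshold
```

With a metric like this as the objective, a hyper-parameter search (for example one driven by an AutoML framework) can compare candidate settings at the same operating point, which matters when false alarms are costly.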

Year	DOI	Venue
2021	10.1145/3445970.3451155	CODASPY
DocType	Citations	PageRank
Conference	1	0.39
References	Authors
0	3

Authors (3 rows)

Name	Order	Citations	PageRank
Partha Pratim Kundu	1	1	0.39
Lux Anantharaman	2	1	0.39
Tram Truong-Huu	3	1	0.39
