Unknown malcode detection and the imbalance problem

Robert Moskovitch, Dima Stopel, Clint Feher, Nir Nissim, Nathalie Japkowicz, Yuval Elovici

Journal in computer virology 5, 295-308, 2009

The recent growth in network usage has motivated the creation of new malicious code for various purposes. Today’s signature-based antiviruses are very accurate for known malicious code, but can not detect new malicious code. Recently, classification algorithms were used successfully for the detection of unknown malicious code. But, these studies involved a test collection with a limited size and the same malicious: benign file ratio in both the training and test sets, a situation which does not reflect real-life conditions. We present a methodology for the detection of unknown malicious code, which examines concepts from text categorization, based on n-grams extraction from the binary code and feature selection. We performed an extensive evaluation, consisting of a test collection of more than 30,000 files, in which we investigated the class imbalance problem. In real-life scenarios, the malicious file content …