ALDOCX: Detection of Unknown Malicious Microsoft Office Documents Using Designated Active Learning Methods Based on New Structural Feature Extraction Methodology

Nir Nissim ; Aviad Cohen ; Yuval Elovici

IEEE Transactions on Information Forensics and Security ( Volume: 12, Issue: 3, March 2017 ) Page(s): 631 - 646

Attackers increasingly take advantage of innocent users who tend to casually open email messages assumed to be benign, carrying malicious documents. Recent targeted attacks aimed at organizations utilize the new Microsoft Word documents (*.docx). Anti-virus software fails to detect new unknown malicious files, including malicious docx files. In this paper, we present ALDOCX, a framework aimed at accurate detection of new unknown malicious docx files that also efficiently enhances the framework’s detection capabilities over time. Detection relies upon our new structural feature extraction methodology (SFEM), which is performed statically using meta-features extracted from docx files. Using machine-learning algorithms with SFEM, we created a detection model that successfully detects new unknown malicious docx files. In addition, because it is crucial to maintain the detection model’s updatability and incorporate new malicious files created daily, ALDOCX integrates our active-learning (AL) methods, which are designed to efficiently assist anti-virus vendors by better focusing their experts’ analytical efforts and enhance detection capability. ALDOCX identifies and acquires new docx files that are most likely malicious, as well as informative benign files. These files are used for enhancing the knowledge stores of both the detection model and the anti-virus software. The evaluation results show that by using ALDOCX and SFEM, we achieved a high detection rate of malicious docx files (94.44% TPR) compared with the anti-virus software (85.9% TPR)-with very low FPR rates (0.19%). ALDOCX’s AL methods used only 14% of the labeled docx files, which led to a reduction of 95.5% in security experts’ labeling efforts compared with the passive learning and the support vector machine (SVM)-Margin (existing active-learning method). Our AL methods also showed a significant improvement of 91% in number of unknown docx malware acquired, compared with the passive learning and the SVM-Margin, thus providing an improved updating solution for the detection model, as well as the anti-virus software widely used within organizations.