Sec-lib: Protecting scholarly digital libraries from infected papers using active machine learning framework

Nir Nissim, Aviad Cohen, Jian Wu, Andrea Lanzi, Lior Rokach, Yuval Elovici, Lee Giles

IEEE Access 7, 110050-110073, 2019

Researchers from academia and the corporate-sector rely on scholarly digital libraries to access articles. Attackers take advantage of innocent users who consider the articles’ files safe and thus open PDF-files with little concern. In addition, researchers consider scholarly libraries a reliable, trusted, and untainted corpus of papers. For these reasons, scholarly digital libraries are an attractive-target and inadvertently support the proliferation of cyber-attacks launched via malicious PDF-files. In this study, we present related vulnerabilities and malware distribution approaches that exploit the vulnerabilities of scholarly digital libraries. We evaluated over two-million scholarly papers in the CiteSeerX library and found the library to be contaminated with a surprisingly large number (0.3–2%) of malicious PDF documents (over 55% were crawled from the IPs of US-universities). We developed a two layered detection …