Detecting adversarial perturbations through spatial behavior in activation spaces

Ziv Katzir, Yuval Elovici

2019 International Joint Conference on Neural Networks (IJCNN), 1-9, 2019

Although neural network-based classifiers outperform humans in a range of tasks, they remain prone to manipulation through adversarial perturbations. Prior research has identified effective defense mechanisms for many reported attack methods; however, a defense against the C&W attack, as well as a holistic defense mechanism capable of countering multiple different attack methods, is still missing. All attack methods reported so far share a common goal: they aim to avoid detection by limiting the allowed perturbation magnitude while still triggering incorrect classification. As a result, small perturbations cause classification to shift from one class to another. We coined the term activation spaces to refer to the hyperspaces formed by the activation values of the different network layers. We then use activation spaces to capture the differences in spatial dynamics between normal and adversarial …
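To make the activation-space notion concrete (this is an illustrative sketch, not the authors' implementation), the toy network below records the point that each input occupies in every layer's activation space; the two-layer architecture, weights, and perturbation magnitude are all arbitrary assumptions chosen only for demonstration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Toy two-layer network with fixed (arbitrary) random weights.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # layer 1: 4 inputs  -> 8 hidden units
W2 = rng.normal(size=(3, 8))   # layer 2: 8 hidden  -> 3 outputs

def activation_spaces(x):
    """Return the point x occupies in each layer's activation space."""
    a1 = relu(W1 @ x)          # point in layer-1 activation space (R^8)
    a2 = W2 @ a1               # point in layer-2 activation space (R^3)
    return {"layer1": a1, "layer2": a2}

x_normal = rng.normal(size=4)
x_adv = x_normal + 0.01 * rng.normal(size=4)   # small mock "perturbation"

acts_normal = activation_spaces(x_normal)
acts_adv = activation_spaces(x_adv)

# Compare how far the two inputs sit from each other in each space;
# layer-wise distances like these are the raw material a spatial
# detection scheme would analyze.
for layer in acts_normal:
    d = np.linalg.norm(acts_normal[layer] - acts_adv[layer])
    print(f"{layer}: distance {d:.4f}")
```

A real deployment would capture activations from a trained deep network (e.g. via framework hooks) rather than from toy random weights, but the geometric picture is the same: each layer defines its own hyperspace, and every input traces a trajectory of points through those spaces.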