Gradients cannot be tamed: Behind the impossible paradox of blocking targeted adversarial attacks

Ziv Katzir, Yuval Elovici

IEEE Transactions on Neural Networks and Learning Systems 32 (1), 128-138, 2020

Despite their accuracy, neural network-based classifiers are still prone to manipulation through adversarial perturbations. Such perturbations modify valid inputs so that the resulting examples are misclassified by the neural network while remaining perceptually identical to the originals. The vast majority of such attack methods rely on white-box conditions, where the attacker has full knowledge of the attacked network’s parameters. This allows the attacker to compute the network’s loss gradient with respect to some valid input and use that gradient to craft an adversarial example. The task of blocking white-box attacks has proved difficult to address. While many defense methods have been suggested, they have had limited success. In this article, we examine this difficulty and try to understand it. We systematically explore the capabilities and limitations of defensive distillation, one of the most promising defense mechanisms against …
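The white-box attack pattern the abstract describes can be illustrated with a minimal sketch of the fast gradient sign method (FGSM) on a toy logistic model. The model, weights, and inputs below are hypothetical examples, not taken from the paper; the gradient of the loss with respect to the input is computed analytically via the chain rule.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Cross-entropy loss of a logistic model and its gradient w.r.t. the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # Chain rule through z = w.x + b gives dL/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return loss, grad

def fgsm(w, b, x, y, eps):
    """White-box FGSM: step each input coordinate by eps in the sign of the gradient."""
    _, grad = loss_and_input_grad(w, b, x, y)
    sign = lambda g: 1 if g > 0 else (-1 if g < 0 else 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# A toy "network" and a correctly classified input with label 1.
w, b = [2.0, -1.0], 0.1
x, y = [0.8, 0.3], 1

loss_clean, _ = loss_and_input_grad(w, b, x, y)
x_adv = fgsm(w, b, x, y, eps=0.25)
loss_adv, _ = loss_and_input_grad(w, b, x_adv, y)
print(loss_adv > loss_clean)  # the bounded perturbation raises the loss
```

Note that the attack needs nothing beyond the loss gradient with respect to the input, which is exactly why white-box access to the network's parameters is sufficient to mount it.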