Explaining Image Misclassification in Deep Learning via Adversarial Examples

Authors: Haffar R, Jebreel NM, Domingo-Ferrer J, Sánchez D
Contributors: Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili
Year of publication: 2021
Subject:
Convolutional neural networks
Deep learning
Artificial intelligence
Adversarial examples
Image classification
Explainability
Computer science, artificial intelligence
Computer science, theory & methods
General computer science
Theoretical computer science
Computer science (all)
Source: Lecture Notes in Computer Science, vol. 12898 LNAI, pp. 323-334
DOI: 10.1007/978-3-030-85529-1_26
Description: With the increasing use of convolutional neural networks (CNNs) for computer vision and other artificial intelligence tasks, the need arises to interpret their predictions. In this work, we tackle the problem of explaining CNN misclassification of images. We propose to construct adversarial examples that identify the regions of the input images with the largest impact on the CNN's wrong predictions. More specifically, for each image that was incorrectly classified by the CNN, we implement an inverted adversarial attack that consists in modifying the input image as little as possible so that it becomes correctly classified. The changes made to the image to fix the classification error explain the causes of the misclassification and make it possible to adjust the model and the data set to obtain more accurate models. We present two methods. The first employs the gradients of the CNN itself to create the adversarial examples and is meant for model developers. However, end users only have access to the CNN model as a black box. Our second method is therefore intended for end users: it employs a surrogate model to estimate the gradients of the original CNN model, which are then used to create the adversarial examples. In our experiments, the first method achieved a 99.67% success rate at finding misclassification explanations and needed on average 1.96 queries per misclassified image to build the corresponding adversarial example. The second method achieved a 73.08% success rate at finding the explanations, with 8.73 queries per image on average.
Database: OpenAIRE
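
The following is a minimal, illustrative sketch of the white-box variant described above: starting from a misclassified image, the input is nudged as little as possible toward its correct class using the CNN's own gradients, and the resulting perturbation highlights the regions that drove the misclassification. It is written in PyTorch, which the record does not specify; the function name, step size, stopping rule and query budget are assumptions for illustration, not the authors' exact algorithm or settings.

import torch
import torch.nn.functional as F


def explain_misclassification(model, image, true_label,
                              step_size=1e-2, max_queries=50):
    """Return (corrected_image, perturbation, n_queries), or None if the
    query budget runs out before the image becomes correctly classified.

    image: a single input of shape (1, C, H, W) with values in [0, 1].
    true_label: the ground-truth class index the model currently misses.
    """
    model.eval()
    x = image.clone().detach()

    for query in range(1, max_queries + 1):
        x.requires_grad_(True)
        logits = model(x)                          # one query to the CNN
        if logits.argmax(dim=1).item() == true_label:
            # The image is now correctly classified; the accumulated change
            # is the explanation of the original misclassification.
            perturbation = (x - image).detach()
            return x.detach(), perturbation, query

        # Gradient of the loss w.r.t. the input for the *true* class.
        target = torch.tensor([true_label], device=x.device)
        loss = F.cross_entropy(logits, target)
        grad, = torch.autograd.grad(loss, x)

        # Small signed-gradient step toward the true class (the inverse of an
        # FGSM-style attack), keeping the modification as small as possible.
        x = (x.detach() - step_size * grad.sign()).clamp(0.0, 1.0)

    return None  # no explanation found within the query budget

The absolute value of the returned perturbation, summed over colour channels, can be rendered as a heatmap over the input image to visualize which regions had to change for the prediction to become correct. The black-box variant for end users would follow the same loop but take the gradients from a surrogate model trained to mimic the original CNN.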