Formal Interpretability with Merlin-Arthur Classifiers

Authors: Wäldchen, Stephan; Sharma, Kartikey; Zimmer, Max; Turan, Berkant; Pokutta, Sebastian
Language: English
Year of publication: 2022
Subject:
Description: We propose a new type of multi-agent interactive classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of bounds on the mutual information of the features selected by this classifier. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we do not rely on optimal agents or on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation, which captures the precise kind of correlations that make interpretability guarantees difficult. We test our results through numerical experiments on two small-scale datasets where high mutual information can be verified explicitly.
26 pages, 14 figures, 2 tables, 1 algorithm
Database: OpenAIRE