Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

Author:	Costarelli, Anthony; Allen, Mat; Field, Severin; Clymer, Joshua
Year of publication:	2024
Subject:	Computer Science - Machine Learning; Computer Science - Artificial Intelligence
Document type:	Working Paper
Description:	As Large Language Models (LLMs) become increasingly integrated into our daily lives, the potential harms from deceptive behavior underscore the need for faithfully interpreting their decision-making. While traditional probing methods have shown some effectiveness, they remain best suited to narrowly scoped tasks, and more comprehensive explanations are still needed. To this end, we investigate meta-models: an architecture in which a "meta-model" takes activations from an "input-model" and answers natural language questions about the input-model's behaviors. We evaluate the meta-models' ability to generalize by training them on selected task types and assessing their out-of-distribution performance in deceptive scenarios. Our findings show that meta-models generalize well to out-of-distribution tasks and point towards opportunities for future research in this area. Comment: 11 pages, 2 figures
Database:	arXiv
External link:	http://arxiv.org/abs/2410.02472