Evaluating large language models in theory of mind tasks.

Author: Kosinski M; Graduate School of Business, Stanford University, Stanford, CA 94305.
Language: English
Source: Proceedings of the National Academy of Sciences of the United States of America [Proc Natl Acad Sci U S A] 2024 Nov 05; Vol. 121 (45), pp. e2405460121. Date of Electronic Publication: 2024 Oct 29.
DOI: 10.1073/pnas.2405460121
Abstract: Eleven large language models (LLMs) were assessed using 40 bespoke false-belief tasks, considered a gold standard in testing theory of mind (ToM) in humans. Each task included a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. An LLM had to solve all eight scenarios to solve a single task. Older models solved no tasks; Generative Pre-trained Transformer (GPT)-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of 6-y-old children observed in past studies. We explore the potential interpretation of these results, including the intriguing possibility that ToM-like ability, previously considered unique to humans, may have emerged as an unintended by-product of LLMs' improving language skills. Regardless of how we interpret these outcomes, they signify the advent of more powerful and socially skilled AI, with profound positive and negative implications.
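The abstract's all-or-nothing scoring rule can be made concrete with a minimal sketch. This is not the paper's code; the data layout and function name below are hypothetical. It only illustrates the stated protocol: each task comprises one false-belief scenario, three true-belief controls, and reversed versions of all four (eight scenarios), and a task counts as solved only when all eight are answered correctly.

```python
from typing import Dict, List

def fraction_of_tasks_solved(results: Dict[str, List[bool]]) -> float:
    """results maps a task id to 8 booleans, one per scenario
    (1 false-belief + 3 true-belief controls + 4 reversed versions).

    A task is solved only if every one of its 8 scenarios is correct;
    returns the fraction of tasks solved.
    """
    solved = sum(
        1 for scenarios in results.values()
        if len(scenarios) == 8 and all(scenarios)
    )
    return solved / len(results)

# Example: with 40 tasks, solving 30 of them in full yields 0.75,
# matching the 75% reported for ChatGPT-4 in the abstract.
```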
Competing Interests: Competing interests statement: The author declares no competing interest.
Database: MEDLINE