Challenging large language models' "intelligence" with human tools: A neuropsychological investigation in Italian language on prefrontal functioning.
Author: | Loconte R; Molecular Mind Lab, IMT School of Advanced Studies Lucca, Lucca, Italy., Orrù G; University of Pisa, Pisa, Italy., Tribastone M; Molecular Mind Lab, IMT School of Advanced Studies Lucca, Lucca, Italy., Pietrini P; Molecular Mind Lab, IMT School of Advanced Studies Lucca, Lucca, Italy., Sartori G; Department of General Psychology, University of Padova, Padova, Italy. |
---|---|
Language: | English |
Source: | Heliyon [Heliyon] 2024 Oct 03; Vol. 10 (19), pp. e38911. Date of Electronic Publication: 2024 Oct 03 (Print Publication: 2024). |
DOI: | 10.1016/j.heliyon.2024.e38911 |
Abstract: | The Artificial Intelligence (AI) research community has used ad-hoc benchmarks to measure the "intelligence" level of Large Language Models (LLMs). In humans, intelligence is closely linked to the functional integrity of the prefrontal lobes, which are essential for higher-order cognitive processes. Previous research has found that LLMs struggle with cognitive tasks that rely on these prefrontal functions, highlighting a significant challenge in replicating human-like intelligence. In December 2022, OpenAI released ChatGPT, a new chatbot based on the GPT-3.5 model that quickly gained popularity for its impressive ability to understand and respond to human instructions, suggesting a significant step towards intelligent behaviour in AI. Therefore, to rigorously investigate LLMs' level of "intelligence," we evaluated the GPT-3.5 and GPT-4 versions through a neuropsychological assessment using tests in the Italian language routinely employed to assess prefrontal functioning in humans. The same tests were also administered to Claude2 and Llama2 to verify whether similar language models perform similarly in prefrontal tests. When using human performance as a reference, GPT-3.5 showed inhomogeneous results on prefrontal tests, with some tests well above average, others in the lower range, and others frankly impaired. Specifically, we have identified poor planning abilities and difficulty in recognising semantic absurdities and understanding others' intentions and mental states. Claude2 exhibited a similar pattern to GPT-3.5, while Llama2 performed poorly in almost all tests. These inconsistent profiles highlight how LLMs' emergent abilities do not yet mimic human cognitive functioning. The sole exception was GPT-4, which performed within the normative range for all the tasks except planning. Furthermore, we showed how standardised neuropsychological batteries developed to assess human cognitive functions may be suitable for challenging LLMs' performance.
Competing Interests: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. (© 2024 The Author(s).) |
Database: | MEDLINE |