Abstract

As Large Language Models (LLMs) become increasingly integrated into everyday life as general-purpose multimodal AI systems, their capacity to simulate human understanding has come under examination. This study investigates LLMs’ ability to interpret linguistic pragmatics, which involves context and implied meanings. Using Grice’s communication principles, we evaluated both LLMs (GPT-2, GPT-3, GPT-3.5, GPT-4, and Bard) and human subjects (N = 147) on dialogue-based tasks. Human participants comprised 71 students, primarily from Serbia, and 76 native English speakers from the United States. Findings revealed that LLMs, particularly GPT-4, outperformed humans. GPT-4 achieved the highest score of 4.80, surpassing the best human score of 4.55. Other LLMs also performed well: GPT-3.5 scored 4.10, Bard 3.75, and GPT-3 3.25; GPT-2 had the lowest score of 1.05. The average LLM score was 3.39, exceeding the human cohorts’ averages of 2.80 (Serbian students) and 2.34 (U.S. participants). In the ranking of all 155 subjects (LLMs and humans combined), GPT-4 secured the top position, while the best human ranked second. These results highlight significant progress in LLMs’ ability to simulate understanding of linguistic pragmatics. Future studies should confirm these findings with more dialogue-based tasks and more diverse participants. This research has important implications for advancing general-purpose AI models in communication-centered tasks, including potential future applications in humanoid robots.

Details

Title
Does GPT-4 surpass human performance in linguistic pragmatics?
Pages
794
Publication year
2025
Publication date
Dec 2025
Publisher
Springer Nature B.V.
e-ISSN
2662-9992
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3217384541
Copyright
Copyright Palgrave Macmillan Dec 2025