Abstract

As Large Language Models (LLMs) become increasingly integrated into everyday life as general-purpose multimodal AI systems, their capacity to simulate human understanding has come under examination. This study investigates LLMs’ ability to interpret linguistic pragmatics, which involves context and implied meanings. Using Grice’s communication principles, we evaluated both LLMs (GPT-2, GPT-3, GPT-3.5, GPT-4, and Bard) and human subjects (N = 147) on dialogue-based tasks. Human participants comprised 71 students, primarily from Serbia, and 76 native English speakers from the United States. Findings revealed that LLMs, particularly GPT-4, outperformed humans. GPT-4 achieved the highest score of 4.80, surpassing the best human score of 4.55. Other LLMs also performed well: GPT-3.5 scored 4.10, Bard 3.75, and GPT-3 3.25; GPT-2 had the lowest score of 1.05. The average LLM score was 3.39, exceeding the human cohorts’ averages of 2.80 (Serbian students) and 2.34 (U.S. participants). In the ranking of all 155 subjects (LLMs and humans combined), GPT-4 secured the top position, while the best human ranked second. These results highlight significant progress in LLMs’ ability to simulate understanding of linguistic pragmatics. Future studies should confirm these findings with more dialogue-based tasks and more diverse participants. This research has important implications for advancing general-purpose AI models in communication-centered tasks, including potential future applications in humanoid robots.

Details

Title
Does GPT-4 surpass human performance in linguistic pragmatics?
Pages
794
Publication year
2025
Publication date
Dec 2025
Publisher
Springer Nature B.V.
e-ISSN
2662-9992
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3217384541
Copyright
Copyright Palgrave Macmillan Dec 2025