Introduction
In this paper, we compare language patterns in text from LLM-powered chatbot products (henceforth “AI-generated text”) to those found in human writing (henceforth “human-written text”). Much of the research comparing these two concerns research ethics and academic integrity (e.g., [1–4]) or broader issues of societal change or misuse (e.g., [5–9]). There is some ambiguity in this literature over the degree to which AI-generated text and human writing can be differentiated. Obviously, the commercial success of tools like ChatGPT, Gemini, and Claude is premised on a close resemblance, and some published research suggests this is true in certain ways. For example, Casal and Kessler [1] found that AI-generated abstracts for Applied Linguistics research articles, generated from the content of the papers themselves, were not reliably distinguishable from the published, human-composed abstracts by researchers in the field. In a different vein, Gorenz and Schwarz [10] found that humans rated human- and AI-generated satirical joke headlines (based on The Onion) to be similarly funny, with some AI-generated jokes performing even better than human jokes. These works suggest that differentiating AI-generated text from human-written text can be challenging in at least some cases, in spite of humans’ potential for creativity [11].
In contrast, research on signals that allow for the detection of AI-generated text suggests that differentiating between the two may actually be straightforward. This research generally consists of two kinds of approaches: “black-box” methods that rely on linguistic or statistical patterns in text and “white-box” methods that require access to the language model for techniques like watermarking [12]. A huge variety of signals have been used in published studies of black-box methods, including lexico-grammatical [5] and stylistic or macro-structural features [6]. While linguistic patterns “... serve as valuable features for detecting LLM-generated text”, it appears that black-box methods “will gradually become less viable as language model capabilities advance and ultimately become infeasible” [9]. Recent innovations like “SynthID-Text” [13] suggest that efficient, accurate and low-latency watermarking may soon become commonplace. These may further enhance the detectability of AI-generated text.
While these two currents may appear contradictory, we suggest that AI-generated text both accurately mimics some communicative aspects of human-written text and differs from it in readily detectable ways when responding to brief prompts. Berber Sardinha [5] uses brief prompts, such as “write a conversation between people of about 1000 words in length,” such that the vast majority of the final context is generated text. In comparison, Casal and Kessler [1] use a complicated prompting scheme that involves using the AI to summarize various sections of the article and then combining these summaries into a prompt that asks for an AI-generated abstract with particular parameters. In this scheme the generated text makes up a much smaller portion of the final context. LLM-powered chatbots rely on contextualized co-occurrence probabilities of tokens, which are likely to resonate with collocational, colligational, and other phraseological relationships found in contextualized human language. This allows the models to mimic formal and, to a lesser degree, functional linguistic competence [14] by imitating human language production. Sometimes this is achieved by reproducing chunks of the models’ training data [15]. In most cases, it involves conforming to very specific goals, such as the instantiation of a goal to “follow the user’s instructions helpfully and safely” via instruction tuning and reinforcement learning from human feedback [16].
From a theoretical linguistic point of view, this perspective aligns better with models that emphasize the role of phraseological relationships in representing semantic information than with those which divide lexicon and grammar into separate systems. That is, explaining the similarity of AI-generated and human-written texts is straightforward under a Construction Grammar (CxG) perspective, which theorizes language knowledge as a series of form-meaning mappings generalized over time through use. In this framework, the schematic meanings of constructions are theorized as being meaningful in their own right, reflecting regularities in language use patterns. Though the local selection of tokens or words by models like ChatGPT may not match the choices of humans, if that selection is based on actual language usage, it is likely to resonate with broader constructional patterns. Accounting for such patterns is not straightforward in linguistic theories that downplay the role of experience and usage with language, instead emphasizing innate knowledge and a strict boundary between lexical items and syntactic rules. While research showing that transformer-based language models contain representations of construction-like verbal patterns (e.g., [17]) suggests that LLMs can ‘tap into’ these regularities, there is a dearth of research on how LLM-powered chatbots produce such constructions. Such research would allow us to better understand how humans and LLM-powered chatbots use language to achieve different ends.
In this paper we adopt a Usage-Based Construction Grammar perspective, particularly focusing on Verb-Argument Constructions, to compare human and AI-generated language. To fully explore the topic, this comparison includes three GPT models (3.5, 4, and 4o) and, for GPT-3.5, two interfaces (Web vs. API) in a single broad genre (providing solicited advice) across two domains (Finance and Medicine). We use a widely used tool to identify Verb-Argument Constructions statistically in the data and probe the verbs which occupy the most frequent of such constructions to compare human-written text to AI-generated text on matched queries.
Literature review
The language of large language models
The majority of research on AI-generated language focuses on lexico-semantic differences from text written by humans. These studies have found, for instance, that AI-generated text tends to have more positive or neutral sentiment [7,9,18] and to use less varied discourse markers [19]. Other studies have shown that AI-generated text contains more passive voice [20] and longer sentences with less variable length [6] in comparison to human writing. These studies offer some evidence of how AI-generated text differs from human-produced language, but they do not provide concrete insights into the extent to which AI-generated texts resonate with human conceptions of the meaning potentials associated with linguistic resources. Much of the research involves very short prompts, which contrasts with the mega prompts that are currently favored in real-world applications [21].
In a step towards this goal, an emerging body of research has investigated the capacity of LLMs to reproduce context-appropriate language behaviors. One example is communicative competence, the language behaviors that humans use to achieve particular goals in different interactive contexts. For example, a speaker may say “it’s cold in here” with the intent of getting a listener to turn on the heat or close a window. A response like “Yes” with no further action may suggest linguistic competence, but in some cultural contexts it misses the more subtle intent of the speaker that a communicatively competent hearer will likely notice. The ability of LLMs to respond to such instances of “conversational implicature” was assessed in [22]. Testing different foundation models, tunings, and prompting strategies showed that while LLM-powered chatbots were generally poor at resolving implicature, fine-tuning via instructions dramatically improved performance.
More directly, there is evidence for construction-like verb-argument structures in the representations stored in transformer-based language models [17,23,24]. These studies argue that such evidence lends credence to the view that lexico-semantic representations alone are inadequate for capturing all the information stored in LLMs and, by extension, the human language capacity. These constructions range in frequency from common constructions that occur in sentences like “Rita passed the note to the teacher,” to much rarer, more idiomatic constructions like the English Article + Adjective + Numeral + Noun (AANN) construction (e.g., “a discerning several thousand judgements”; “a lovely five days”) [25]. The models are even able to productively encode constructions that occur only a handful of times in a corpus of billions of tokens [26]. In the next section, we discuss in detail a particular type of construction which plays a prominent role in human language.
Overall, this research provides an intriguing glimpse into a similarity between language representations in LLMs and human language production. It also raises the question of how LLM-powered chatbots produce the constructions captured in these representations when responding to user queries at scale. This is a difficult question to approach for several reasons. Perhaps most importantly, the output of these applications may involve additional pre- and post-processing steps that obscure what input the underlying model receives and, possibly, what output it produces. Furthermore, the underlying models have undergone alignment, which typically involves supervised finetuning (additional training on curated language data) and often reinforcement learning from human feedback, the latter of which explicitly shifts output away from that of a strict language model.
Verb-Argument constructions
Usage-based theories of language posit that language use is the result of context-dependent cognitive processing and that linguistic knowledge is a set of form-meaning mappings that emerge and coalesce over time through use. These form-meaning mappings, referred to as constructions, range from sequences of morphemes to broader, more schematic structures. They are theorized as meaningful units in and of themselves, and they form the core of human language in Construction Grammar [27,28]. Humans learn constructions over time by encountering and using them for particular functional potentials in specific social contexts. In this sense, constructions are built up from and anchored to words in constructional contexts in use. Therefore, a usage-based CxG approach to linguistics does not attempt to sever lexicon from grammar, as other linguistic theories do. Rather, CxG posits a continuum of constructions with varying degrees of complexity.
Verb-Argument Constructions (VACs) focus on verbs and their corresponding arguments in usage-based constructions. Common examples include more abstract constructions, such as the transitive ([verb][direct object]) and ditransitive ([verb][indirect object][direct object]), as well as more narrowly defined constructions, such as ‘[verb] across [noun phrase]’. As an illustration of the form-meaning mappings at the core of these constructions, a ditransitive construction has a core meaning of transfer of an object that is both built up from and reinforced through recurring instances of the construction in usage. Simultaneously, the meaning that is sedimented in the VAC allows for emergent production and interpretation of novel forms in context. Thus, a meaning can be deduced from the sentence ‘Paula crutched Tony an apple’ (example based on [29]), where the verb “inherits its interpretation from the echoes of the verbs that occupy this VAC” [30].
Considerable research has been conducted on first and second language VAC knowledge and learning (e.g., [31–40]). Other research has provided key insights into the potential psychological reality of VACs through corpus-based and experimental evidence of verb-construction associations and online construction meaning access (e.g., [35,41–43]). Findings related to VAC characteristics highlight that the most frequent verbs in many VACs are semantically somewhat generic and highly predictable, but that strong associations between VACs and lower frequency verbs also represent prominent communicative resources [31,33]. From a learning perspective, second language learners at lower levels of proficiency demonstrate reliance on more prototypical verbs in high frequency VACs, with more proficient learners demonstrating more dynamic and varied use of such resources [36]. Together, this scholarship provides evidence that VACs are psychologically real for both first and second language users.
Importantly, this research also resonates with broader usage-based claims that linguistic knowledge develops through sensitivity to frequency, contingency, and formulaicity in input [32,38]. In this sense, contextual aspects of usage (such as genre and community) are essential components of an individual’s knowledge of a construction, as constructional knowledge reflects a language user’s linguistic experiences, which are always embedded within a social situation. To probe the context-sensitivity of VACs and VAC profiles, Casal et al. [44] extended Ellis et al.’s [33] analysis of VACs in the British National Corpus by comparing general domain usage patterns to usage in an academic research corpus and disciplinary subcorpora. Casal et al. [44] found notable consistency in the VACs themselves across general domain and academic English contexts and across disciplines, but considerable variation in the verbs used within such constructions. That is, the language produced across contexts demonstrated consistent use of core schematic VAC resources, but the local-level instantiations of these VACs were context specific. Once more, this underscores the central role of VACs in human language, as well as the prominent effect of context and purpose on linguistic choices.
The present study
The present study compares VACs in AI-generated text to those in human-written text on matched queries, adding a usage-based lens of specific language behaviors to the existing literature. Theoretically, VACs simultaneously capture consistency, through the recurrence of schematic frames and prototypical forms, and variation, through the productivity of such constructional resources in usage. This affords an important window into both formal and functional patterns of language across AI-generated and human-written text.
At the same time, given what empirical research suggests regarding constructions, VAC distributions and instantiations are highly contextual, reflecting the communicative demands, expectations, and conventions of community genre practices. Thus, a VAC perspective affords researchers windows into how effectively AI-based sources of language vary their production based on purpose, topic, and domain. To our knowledge, this is the first study comparing VACs in AI-generated vs. human-written text on matched prompts, as well as the first to trace changes in AI-generated language across models.
RQ1: To what extent do frequency ranks and distributions of Verb-Argument Constructions in AI-generated text in the Medicine and Finance domains reflect the language produced by humans in similar tasks?
RQ2: To what extent do the verb frequency ranks and distributions of the top Verb-Argument Constructions in AI-generated texts in these domains reflect the language produced by humans in similar tasks?
Methods
The corpus
The materials used for these analyses are a subset of the Human ChatGPT Comparison Corpus (HC3) [7], which contains “nearly 40K questions and their corresponding answers from humans and ChatGPT covering… open-domain, computer science, Finance, Medicine, law, and psychology” (p. 2). It was compiled when GPT-3.5 was the LLM powering ChatGPT. Question-answer pairs from human experts were taken from publicly available datasets and Wikitext. To obtain answers to these same questions from ChatGPT, the authors presented the same questions manually via OpenAI’s site. While HC3 has both English and Chinese data, we draw only from the English-language data. Our analysis includes language from the Finance and Medical domains and was conducted on a version of the corpus we accessed on May 10, 2024. Individual participants were not identifiable based on the information present in the human responses.
The corpus required some cleaning due to the presence of system-level responses, e.g., “too many requests in 1 hour”, and multiple AI-generated responses for some queries. We eliminated the system-level responses and included only the first ChatGPT response in instances with multiple entries. This yields 3,932 question-answer pairs (human and ChatGPT web) in the Finance domain and 1,237 question-answer pairs (human and ChatGPT web) in the Medical domain. The questions were diverse, but we provide examples here. One relatively short Finance question is: “My medical bill went to a collection agency. Can I pay it directly to the hospital?” The Medical domain questions often contained more narrative details and structure, for example: “What could it be if child is having intermittent cough inspite of taking medication? 18 month old boy has a recurring dry cough, mainly at night. It will last 3-4 days then loosen and disappear only to return about 5-7 days later. I have tried everything including vaporizers, humidifiers, vaporub, glycerin and honey etc. What could this be?!”
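As a concrete illustration, this cleaning step amounts to a filter and a de-duplication. The sketch below uses pandas; the file name and column names are hypothetical stand-ins, since the HC3 release stores answers in its own schema (e.g., as lists per question).

```python
import pandas as pd

# Hypothetical flat layout: one row per (question, ChatGPT answer) pair.
df = pd.read_json("hc3_finance.jsonl", lines=True)

# Drop system-level responses such as rate-limit messages.
mask = df["chatgpt_answer"].str.contains(
    "too many requests in 1 hour", case=False, na=False
)
df = df[~mask]

# Where a question received multiple ChatGPT responses, keep only the first.
df = df.drop_duplicates(subset="question", keep="first")
```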
We then presented the same question text to ChatGPT 3.5, GPT-4, and GPT-4o via the OpenAI API. We set the temperature to 0.7, the OpenAI “default” and the suspected temperature setting for the web interface. We therefore have two responses from ChatGPT 3.5, which differ in the way that the question text was presented to the model. We hypothesize that queries made through the web interface receive extensive preprocessing before being presented to the model. Moreover, though the original paper did not note the particular variant of the GPT-3.5 model that was used, we are certain that the GPT-3.5 variant (gpt-3.5-turbo-0125) we used for our GPT-3.5 API corpus is newer than that behind the GPT-3.5 Web data. The cleaned and expanded dataset can be accessed via the Harvard Dataverse through Stewart, Windsor, and Casal [45].
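A minimal sketch of this querying step with the current OpenAI Python client follows; the helper function is our illustration rather than the authors’ exact script, though the model identifier and temperature match those reported above.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def generate_answer(question: str, model: str = "gpt-3.5-turbo-0125") -> str:
    """Send an HC3 question to a chat model verbatim, with no system prompt."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.7,  # the OpenAI default, as used in this study
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# The same call can be repeated with model="gpt-4" and model="gpt-4o".
```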
Tables 1–3 provide the summary statistics for the two corpora. Table 1 summarizes the questions across Medicine and Finance domains. Table 2 provides information about the Medicine responses and Table 3 does the same for the Finance responses.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
We note that the questions are significantly longer in the Medical domain, with both more and longer sentences. In contrast, questions in the Finance domain are typically a single short sentence. Therefore, comparisons across domains are not a focus of the analysis.
In Guo et al.’s [7] analysis of the language patterns in AI-generated text and human experts in this corpus, they noted that it was easier to identify text output by ChatGPT when naïve raters could compare it to a similar text written by a human expert. They observe that although raters generally found answers from ChatGPT to be more “helpful”, this was not the case in the Medical domain, which the authors hypothesized was due to ChatGPT’s preference for lengthy responses in contrast to Medical professionals’ more succinct, straightforward answers. The text written by experts was less literal in its interpretation of the question and showed more individuality than output by ChatGPT.
When comparisons across models and interfaces (API vs. Web) are made (see Table 4), we see that responses are dramatically longer for GPT-4 and GPT-4o than for GPT-3.5. GPT-4o is the most verbose, though it favors slightly shorter sentences than GPT-4. The differences within GPT responses are consistent across the two corpora, and all are significant at the 0.001 level in a one-sided paired sample t-test.
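Such a comparison can be reproduced with a standard paired test; in the sketch below, the two arrays are hypothetical stand-ins holding per-question response lengths for two models.

```python
from scipy import stats

# gpt4_lengths and gpt35_lengths: hypothetical paired arrays of word counts
# for GPT-4 and GPT-3.5 responses to the same questions.
result = stats.ttest_rel(gpt4_lengths, gpt35_lengths, alternative="greater")
print(result.statistic, result.pvalue)
```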
[Figure omitted. See PDF.]
Verb-Argument construction analysis
Our goal is a descriptive and comparative analysis across AI-generated and human-written text in two domains of discourse: Finance and Medicine. Verb-Argument Construction types, frequencies, and variants were identified automatically using the Tool for the Automatic Analysis of Syntactic Sophistication and Complexity 1.3.8 [46]. TAASSC is a freely available text complexity analyzer that includes traditional holistic and global measures of syntactic complexity (e.g., length of production unit measures), fine-grained measures of both phrasal and clausal complexity, and a number of measures aligned with usage-based approaches to language, including those associated with VACs. In computing VAC-based indices in relation to VAC profiles in the Corpus of Contemporary American English [47], TAASSC generates “clause database” files with each finite clause labeled with a VAC and the lemmatized verb occupying the verb slot indicated. The authors used a modified version of TAASSC that indicated both the verb and the lemmatized verb (the modification is only necessary for the ConcordanceCompass tool). These “clause database” files were combined and tallied to produce a summary that listed each VAC’s proportion and raw frequency, as well as a lemmatized verb list for each VAC targeted for follow-up analysis. Identified VACs in the TAASSC output were not further collapsed or manually grouped in any way.
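The tallying step is straightforward. A minimal sketch in pandas follows, assuming the clause database files are tabular with one row per finite clause and hypothetical columns "construction" and "lemma" (the actual TAASSC column names may differ).

```python
import glob
import pandas as pd

# Combine the per-text "clause database" files for one subcorpus.
clauses = pd.concat(
    pd.read_csv(path) for path in glob.glob("clause_db/finance_human/*.csv")
)

# Raw frequency and proportion for each VAC.
vac_counts = clauses["construction"].value_counts()
summary = pd.DataFrame({
    "raw_frequency": vac_counts,
    "proportion": vac_counts / vac_counts.sum(),
})

# Lemmatized-verb list for a VAC targeted for follow-up analysis.
v_dobj_verbs = clauses.loc[clauses["construction"] == "v-dobj", "lemma"].value_counts()
```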
Results
Verb-Argument constructions across subcorpora
Table 5 presents the top VACs analyzed in this paper with definitions and examples drawn from human-produced and ChatGPT-generated text. Tables 6 and 7 present the top ten VACs extracted from prompt responses to ChatGPT and written by humans across Financial and Medical domains of discourse. Also presented is the percent of overall VACs that each VAC represents in the corresponding subcorpus. There were at least 250 VAC types which occurred 5 times or more in each subcorpus for the Medicine domain and 759 which met this condition in the Finance domain. VAC types were not combined or modified from the TAASSC output.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Considering the number of diverse VACs present in the dataset, it is striking that only 16 unique types account for the top ten VACs in all ten subcorpora (100 possible VAC slots) and that six VACs occur in all ten lists (v-dobj, nsubj-vcop-acomp, nsubj-v-dobj, nsubj-vcop-ncomp, m-nsubj-v-dobj, and v-ccomp). Among these, transitive VACs, where an action is directed toward and affects the object (v-dobj and nsubj-v-dobj), surface as key structures across contexts, with v-dobj occupying the number one slot in all subcorpora except the 3.5 Web Medical corpus. In the 3.5 Web Medical corpus the number one slot is nsubj-vcop-acomp, and 45.8% of these are uses of the form “it is important” or “it’s important”, many of which direct the user to a medical professional. In the human-to-human comparison across Medical and Financial domains of discourse, only two top ten VACs are not shared across lists, although the relative importance varies, with somewhat more variation observable in the human-to-GPT comparisons and across GPT versions. Nevertheless, the consistency and pervasiveness of these VACs points to the communicative value of these linguistic structures and their meaning potentials across discourse, the extent to which AI-generated text broadly resembles human writing, and the usefulness of a VAC perspective for examining structural resources in linguistic data.
Examining the VAC distributions more closely (see also Fig 1), the greater reliance on top VACs in AI-generated responses is striking. While human responses top out at 6.9% (Finance) and 8.4% (Medicine) of discourse represented by the top VAC (v-dobj), the top construction in the AI-generated data represents over 10.6% of all verbal constructions (and over 12.6% in three cases). Examination of these figures also reveals that the VAC frequencies in AI-generated text approximate the distributions in human-written text by around rank ten, with Finance beginning to align somewhat earlier than Medicine. This means that while AI-generated text makes use of largely the same VAC resources as humans do in both contexts, it tends to rely more heavily on a small number of highly prototypical forms. Once more, this reinforces the idea that the text generated by LLM-powered chatbots structurally resembles human writing, even in the relative importance of schematic constructions such as VACs, but it is different in measurable and meaningful ways.
[Figure omitted. See PDF.]
We note that these comparisons are made without respect to any normalizations for the vastly different lengths of text. At the sample sizes and observed proportions of the top VACs, these figures are insensitive to corpus size (the standard error of a proportion $p$ estimated from $n$ clauses is $\sqrt{p(1-p)/n}$, which is bounded from above by $1/(2\sqrt{n})$). Investigating these observations, while important and interesting, is beyond the scope of this paper.
Two comparisons of the VACs themselves are made here in brief: across domains of discourse (Medicine and Finance) and models/interfaces. The human-written text demonstrates a relative consistency in the VACs used across Medical and Financial domains, with eight of ten VACs appearing in both lists with differences in the frequency ranks and percent of overall verbal constructions they represent. One key difference is that the transitive constructions (v-dobj and nsubj-v-dobj) account for a notably larger percent of overall VACs in Medical domains than they do in Finance (perhaps due to the physical nature of actions and entities in medicine, in comparison to the abstraction inherent to financial concepts), with some of the other constructions representing nearly identical percentages. In contrast, the transitive constructions are markedly more common in GPT-produced discourse, with a high reliance on the transitive v-dobj construction. And while the VACs in ChatGPT 3.5 Web present more consistent frequency ranks across discourse domains, there is clearly more domain-based variation in the corpora produced by the other GPT conditions than in the human-written text. Across underlying LLMs and interfaces (and as compared to humans), two key observations are present at the broad VAC level. One is that, as previously stated, AI-generated text shows abundant presence of a similar set of core verbal structures present in human texts, albeit with notably greater density of many such structures and some key differences. For example, many of the subcorpora of AI-generated text commonly contain the v construction, which is relatively infrequent in human discourse overall. Also, the nsubj-v-ccomp VAC, which is ranked sixth in human-written text for both domains, only occurs in the top ten with the most recent 4o model. This relates to the second key observation: texts produced by the different ChatGPT models differ from each other in frequency ranks and relative distributions at the top.
Verbs in Verb-Argument constructions across corpora
The verbs which occur within each VAC were also identified and compared across AI-generated and human-written text. Interestingly, the clear pattern of ChatGPT’s overreliance on top structures, which are nonetheless similar to structures in human-produced discourse, is not as consistent in the verb-type profile, with more VAC-specific patterns surfacing. Some, such as the VACs examined from a verb-frequency perspective in Tables 8 and 9, demonstrate small and variable differences across AI-generated and human-written text subcorpora comparisons, while others, such as those examined from a verb-frequency perspective in Tables 10 and 11, show marked disparities.
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
[Figure omitted. See PDF.]
Lemmatized verbs in v-dobj
Tables 8 and 9 present the top 15 verb types for the most prevalent VAC overall: v-dobj. A cross-domain examination highlights that, with the notable exception of ChatGPT 3.5 Web, the verbs which most commonly communicate transitivity in v-dobj vary by domain. Restricting ourselves to VACs in human writing, only six of the top 15 verbs are common to the Finance and Medicine domains, roughly consistent with the AI-generated text conditions.
AI-generated text in the Finance domain makes use of roughly two-thirds of the same verbs in the top portion of the frequency lists for this VAC, but a few interesting differences are present. First, the verbs “have,” “pay,” and “do” are not nearly as frequent in AI-generated text for this domain. Second, it appears that ChatGPT and humans tend to place verbs with different semantic connotations in this transitive construction. The more frequent verbs in AI-generated text appear to be more associated with analysis and reflection (e.g., “determine,” “consider”) or managing behaviors (e.g., “improve,” “follow,” “manage,” “ensure,” “avoid”). ChatGPT 3.5 Web, and to a lesser extent its API homologue, present many verbs that uniquely appear in the top 15, but both 4 and 4o rely on six of the top fifteen verbs, each of which is less common in human discourse. We note that the verb “make”, which is the most frequent verb in v-dobj for all the GPT Finance subcorpora and the second most frequent in human text, shows some strong distinctions. Of the 3,500 uses of “make” in AI-generated text, 571 are “before making any,” 338 are “making a decision”, and 245 are “make informed,” whereas these same patterns cover only three of the 301 instances of “make” in human-written text. Contextually, these all place conditions on decision-making and serve a hedging and distancing function.
The lemmatized verbs used in the v-dobj VAC in AI-generated text in the Medical domain are notably distinct from human discourse in terms of the transitive verbs used. The top six verbs in the human corpus are either not used in the top 15 or are used drastically less in all of the AI-generated text subcorpora (e.g., “take” is found in 8% of the VACs in human writing vs. roughly one third to one quarter as often in VACs in all AI-generated text), and other key frequency differences are found throughout. We see 14.1% of the uses of “take” in human text being “take care regards” and “ok and take care,” which frame the advice within a human interaction. These same constructions never appear in AI-generated text. On the other hand, we see 8.1% of the AI-generated uses being “before taking any”, which once more places conditions on advice. This construction does not appear in human text. Within v-dobj the lemmatized verb “consult” is used 35 times in human text vs. 714 times in AI-generated text. Of those 714 occurrences, 368 (51.5%) occur in the phrase “consult a healthcare [profession word]”, which never appears in human-written text. Humans similarly recommended seeking the care of specialists, but in dynamic and locally responsive ways (e.g., “Honestly speaking, this is serious health issue and this can worsen with the age. So consult cardiologist as soon as possible for this”), rather than through formulaic language. A further 77 of the 714 instances are covered by either “consult your healthcare” or “consult a specialist”, which again never appear in human-written text. Similar to the AI-generated vs. human-written text comparisons in Finance, the GPT models show a reliance on more analytical or managerial verbs than those related to actions. This is reflected not only in ChatGPT’s lower interactional framing and lower provision of direct advice, but also in ChatGPT’s frequent hedging and distancing of advice through discussion of the processes of decision-making, rather than concrete verbs of action.
Considering the top VAC in the corpus overall, the findings indicate that AI-generated texts are generated using many of the same macro-level VAC resources as those that humans used, and these VACs construe many similar actions and meanings. However, the key differences in the types of meanings and relationships represented in AI-generated texts suggest important divergences in the way that ChatGPT provides advice and presents information.
Lemmatized verbs in v-ccomp
This is further evidenced by an examination of the v-ccomp VAC. This VAC appeared at a rank between four and seven in all ten subcorpora, indicating a remarkable consistency in the importance of this schematic pattern. However, the top verbs in AI-generated texts do not resemble those in human-written texts in either the Medical or Financial domain. In human writing in the Medicine domain, the lemmatized verb “let” represents 51.2% of all occurrences of the v-ccomp VAC, with the trigram “let me know” accounting for 98.8% of those occurrences. In human writing in the Finance domain, “let” represents 7.1% of the verbs that occur in this VAC. However, it is relatively rare (a maximum of 1.7%) in all other subcorpora except ChatGPT 3.5 Web Finance, where it accounts for 3.8% of the lemmatized verbs occurring in v-ccomp. In contrast, the verbs “help” and “ensure” occupy a combined 41.5% to 55.7% of all instances of v-ccomp in all the AI-generated text subcorpora except ChatGPT 3.5 Web, where they together account for only 30.1%, compared with just 3.9% of human-written instances of v-ccomp in Medicine and 4.5% in Finance. Interestingly, while ChatGPT is positioned and marketed as a chatbot, the abundant use of “let me know” reflects a more overtly dialogic and ongoing interaction in human discourse that contrasts with the more monologic tone which is pervasive in GPT-generated text and reinforced by GPT’s verbosity.
In Medicine, the trigrams “to help manage”, “to help reduce”, and “to help alleviate” account for 43% of ChatGPT’s uses of the word help in the v-ccomp VAC. These same collocations never occur in the human-written text, where a third of the occurrences are “to help you.” The pattern in the Finance domain is not quite as stark, but 20.8% of all uses of help by GPT (412 of 1,982 uses) are covered by “to help reduce,” “to help manage,” “considerations to help you make,” “to help alleviate,” “to help determine,” and “considerations to help you decide”. None of these appear in the human-written text.
Usage of nsubj-vcop-acomp
Even in VACs that afford little room for variation in verbs, such as those based on the copula be (that is, the use of ‘be’ to connect a subject and a complement), a usage-pattern analysis reveals key differences in how schematic meaning-making resources are employed. For example, nsubj-vcop-acomp occurs in the top three VACs of all subcorpora nearly exclusively with the lemmatized verb “be”, but ConcordanceCompass reveals key linguistic differences. In the Medicine domain, 30.5% of all uses of the copula be in nsubj-vcop-acomp in AI-generated texts were either “it is important” or “it’s important”, whereas for human-written texts these covered only 1.1% of the uses of the copula be in nsubj-vcop-acomp. Of the AI-generated “it is important” or “it’s important” sentences, 36.3% are covered by “important to speak”, “important to seek”, “important to see”, “important to consult”, or “important to discuss”, which are all directions to seek medical expertise, but these never appear in the human-written text. However, 10.6% of the uses of the copula be in nsubj-vcop-acomp by humans in the Medical domain are covered by “will be happy”, with 53 of the 57 uses being either “I will be happy to help” or “I will be happy to answer”, which resonates with other findings of politeness and dialogic framing in human responses. ChatGPT’s uses of the copula be in nsubj-vcop-acomp in the Medical domain never contain “will be happy”.
Discovering contextual comparisons
Even restricting to the level of a VAC and lemmatized verb pair, our corpus is large enough that there could be hundreds or even thousands (in the case of the copula verb) of sentences containing the pair. This makes manually examining the key words in context, a traditional approach to understanding word usage, problematic. In order to provide insight into the differences in how words are used between corpora, we have developed a tool called “ConcordanceCompass” that compares the relative frequencies of contexts (windows around a target word) within the two corpora and ranks them according to a variant of the Log Ratio score introduced by Andrew Hardie in a 2014 blog post. This metric can be derived as the difference in the Pointwise Mutual Information (PMI) between the indicators of context presence and the indicator of corpus membership. If a term is present in only one corpus, the metric is infinite; Hardie resolves this by assigning an absolute frequency of 0.5 to terms which do not appear. As with usual rankings involving PMI, this scoring tends to select rare terms. With PMI rankings, one possible solution is to use the PMI^k family of metrics, but when maximizing the difference these lead to a scalar multiple of the existing score and do not affect the rankings. We adopt an approach that does two things:
1. It bounds the ratio from above: a term that is present in only one corpus is assigned this maximum ratio rather than infinity.
2. It multiplies the ratio by the maximum proportion raised to a power α, which in our case is 0.2.
This approach leads us to the modified metric

$$\mathrm{score}(c) = \min\!\left(\log_2\frac{p_1(c)}{p_2(c)},\; L_{\max}\right)\cdot\max\big(p_1(c),\,p_2(c)\big)^{\alpha},$$

where $p_i(c)$ is the relative frequency of context $c$ in corpus $i$ and $L_{\max}$ is the cap described in point 1. This modified metric has the property that, for a fixed ratio, more frequent contexts are preferred.
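A minimal sketch of this scoring function in Python, assuming the context is drawn from corpus 1 (so f1 > 0); the cap value is a placeholder, since the specific bound used by ConcordanceCompass is not stated above.

```python
import math

def modified_log_ratio(f1: int, n1: int, f2: int, n2: int,
                       l_max: float = 16.0, alpha: float = 0.2) -> float:
    """Score how much more frequent a context is in corpus 1 than corpus 2.

    f1, f2 are the context's absolute frequencies; n1, n2 are corpus sizes.
    l_max caps the log ratio for contexts absent from corpus 2 (the cap used
    by ConcordanceCompass is not stated here, so 16.0 is a placeholder).
    """
    p1, p2 = f1 / n1, f2 / n2  # relative frequencies; assumes f1 > 0
    if f2 == 0:
        log_ratio = l_max  # present in only one corpus: assign the cap, not infinity
    else:
        log_ratio = min(math.log2(p1 / p2), l_max)
    # Weighting by the larger proportion means that, at a fixed ratio,
    # more frequent contexts outrank rarer ones.
    return log_ratio * max(p1, p2) ** alpha
```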
Discussion
This study adopts a Usage-Based Construction Grammar perspective to compare human- and AI-generated medical and financial advice through the most common VACs, the most common verbs within them, and prominent phraseological patterns of local use. Our analysis demonstrates the potential richness of using a usage-based construction approach in comparisons between human-written and AI-generated text. Across GPT models and interfaces, the similarity in VAC ranks between ChatGPT and human writers suggests that the way ChatGPT produces language reflects constructions like VACs that humans draw on to communicate. This suggests that transformer-based LLMs not only contain construction-like representations, but also that they can reproduce them at scale on a variety of topics, at least within this genre. We also see more subtle similarities. For example, while telehealth queries tend to be longer, answers to finance questions contain more words and sentences for both ChatGPT and human experts. This topic domain effect does not hold for sentence length: neither ChatGPT nor human experts tend to vary sentence length notably in their responses to medical and financial queries. We hypothesize that the broad genre of giving advice may impose requirements about sentence length that ChatGPT has captured from its training data, possibly as part of its mandate to “follow the user’s instructions helpfully and safely” [16]. Importantly, we also observe clear evidence that both humans (as has been widely demonstrated through genre and register analysis) and, to some extent, ChatGPT vary their language to meet the demands of a communicative task.
We also observe both obvious and subtle differences in the language used by human writers and produced by ChatGPT. The most salient is verbosity. Leaving aside broad similarities in sentence length, ChatGPT produces significantly more words per response than human experts (perhaps becoming long-winded), with successive models growing ever more wordy (see Table 3). The VAC analysis shows a more subtle, though quite pervasive, difference. Although VAC ranks are broadly similar between human writers and ChatGPT, which suggests that ChatGPT uses similar constructional resources as humans do, there are clear differences in the prominence of individual VACs within these ranks (see Tables 6 and 7). For one thing, ChatGPT’s distribution of top VACs is significantly more head-heavy. In the Finance domain, the most common VAC in human-written text (v-dobj) accounts for 6.9% of all VACs, compared to ChatGPT’s 10% (3.5 Web), 11.9% (3.5 API), 11.5% (4 API) and 12.7% (4o API), a trend which becomes even more pronounced when we normalize for differences in text length. The top VAC in text written by telehealth experts (v-dobj) accounts for 8.4% of all VACs vs. ChatGPT’s 12.6% (3.5 Web), 12.8% (3.5 API), 10.6% (4 API) and 11.6% (4o API). This reliance on prototypical construction patterns may result from the requirement that ChatGPT’s output be simple and clear. That goal may translate to a narrow set of rhetorical stances compared to telehealth providers and financial advisers, resulting in a preference for a smaller array of core meaning-making resources. Highly frequent, prototypical constructional resources are hypothesized to reoccur and be reinstantiated because they are entrenched in the minds of language users and afford important and recognizable communicative functions [35,41–43], which also supports clarity in ChatGPT-produced language.
More closely examining specific VACs and related verbs fleshes out these differences more starkly. For example, in ChatGPT 3.5 Web’s medical responses, provisos like “it is important to consult a healthcare provider for…” are so frequent that the nsubj-vcop-acomp VAC is the top-ranked VAC, surpassing the most common VAC in every other subcorpus (the transitive v-dobj). In stark tonal contrast, a common production of the same VAC in human medical advice is “Don’t worry, you will be alright.” Intriguingly, this pattern did not extend to the Finance domain, where nsubj-vcop-acomp only accounted for 5.8% of all VACs, compared to 12.6% for Medicine. Nor did the pattern hold when the same medical queries were submitted to subsequent models: ChatGPT 3.5 API (9.5%), 4 API (7.4%) and 4o API (6.6%). It is possible that the proliferation of this pattern was picked up on by human annotators at some point, resulting in efforts to curb this behavior in subsequent models. Regardless, it is a clear example of an important pattern: ChatGPT often produces language which contains important VAC resources of human language in its own language production, although with distinct meanings and forms. It would be unscientific and personifying to call this a ‘voice’, but similar behavior by an agentive language user could be discussed as such.
In the two VACs examined in detail, whereas humans tend to use shorter lemmatized verbs with Germanic roots (e.g., “have,” “make,” “buy,” “do,” “find,” “get”), ChatGPT selects for longer verbs with Latinate origins (e.g., “determine,” “manage,” “reduce,” “consult,” “consider,” “provide,” “ensure”). ChatGPT’s use of these verbs may suggest avoidance of more informal language more broadly, and it also underscores the complexity of categorizing telehealth interactions as a genre practice, as humans showed regular use of a variety of speech-like features in written form. Examination of the verbs used also captures that ChatGPT’s advice tends to be more indirect, focusing on planning and other cognitive processes, compared to human experts’ preference for direct, action-oriented feedback. Our keyword-in-context analysis provides examples of these tendencies and broader insights into the nature of the differences in purpose and framing that may account for them.
For instance, in the most common VAC (v-dobj), “pay” is the third most common verb used by financial advisors, but virtually non-existent in AI-generated text. On the other hand, we saw ChatGPT directing participants to ‘manage’ and ‘consider’ problems, while human experts focused more directly on what managing and resolving a problem may look like. In medical queries, the verb “consult” appeared in AI-generated text in v-dobj more than 20 times as often as in human-written text, mostly to advise users to “consult a healthcare professional”. Considering the implications more broadly, telehealth professionals logically do not typically need to tell patients to consult a healthcare professional, as the patients are already doing just that. At the same time, they provide considerable dialogic framing to these interactions, which reflects broader conventions of in-person human-human interactions even in a quick written exchange. For instance, in v-ccomp, the most frequent human verb is ‘let’, most commonly in phrases like “let me know,” which invite follow-up. This expression is nearly absent from all ChatGPT language across models. Likewise, while the verb ‘help’ is common in this VAC across subcorpora, humans tend to use “to help you [verb]”, which does not occur a single time in ChatGPT output in this VAC.
Conclusion
Overall, we believe that this study points towards the considerable potential of utilizing Usage-Based lenses, and in particular Construction Grammar, for analyzing ChatGPT language production, as well as other forms of AI-generated text. Verb-Argument Constructions proved to be a meaningful point of linguistic comparison, both at the construction level and at the level of the verbs distributed within constructions. VACs encapsulate not only conventionalized structural relationships, but also semantic and pragmatic potentials that shape discourse. While it is clear that there is considerable structural similarity across humans and ChatGPT, we also see that the AI-generated texts in these corpora exhibit markedly different semantic and pragmatic goals than those we observe in human-written texts that use the same resources. The VAC analysis allows us to collect data points in an empirically robust manner that highlights similarities as well as these important differences. Moving beyond whether or not human-written and AI-produced text are distinguishable allows us to see how they accomplish similar, but critically different, rhetorical ends. By combining VAC and Key Word in Context analysis, researchers can see at scale the recurrent meanings being expressed across texts in the subcorpora.
In a sense, this suggests that ChatGPT successfully leverages its training data and utilizes pattern-data that resembles constructions to accomplish goals aligned with the product. Despite the use of the word “chat” in the name, the communicative purpose of these ChatGPT systems when prompted with a short-form question is to provide a long-form informative answer, not to initiate a chat. In our corpora, faced with identical questions, humans typically gave shorter answers using less formal language, and frequently left rhetorical space for follow-on communication and more overt chat framing. We notice that the tendency toward lengthy and more elaborated answers increases with successive models.
Building on the findings of this research, we think that considerable future opportunities exist to employ a VAC perspective in examining ChatGPT and other AI-generated text. In particular, while we focused on the most prototypical patterns and the consistency and variation within those patterns, other research can adopt more statistically-driven comparisons of VAC use overall or otherwise examine creativity more directly. Likewise, while our comparison of two topic domains, medical and financial, highlighted important variation in both human and ChatGPT language, future research can more directly investigate the effects of topic and, perhaps more importantly, register and genre, to examine whether ChatGPT varies language production as much as humans do in relation to their communicative situation. As a final recommendation, we call for future research which adopts a VAC perspective on non-English text.
References
1. Casal JE, Kessler M. Can linguists distinguish between ChatGPT/AI and human writing?: a study of research ethics and academic publishing. Res Methods Appl Linguist. 2023;2(3):100068.
2. Chemaya N, Martin D. Perceptions and detection of AI use in manuscript preparation for academic journals. PLoS One. 2024;19(7):e0304807. pmid:38995880
3. Liang W, Zhang Y, Wu Z, Lepp H, Ji W, Zhao X, et al. Mapping the Increasing Use of LLMs in Scientific Papers [Internet]. arXiv; 2024 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2404.01268
4. Khalil M, Er E. Will ChatGPT get you caught? Rethinking of plagiarism detection. In: Zaphiris P, Ioannou A, editors. Learning and collaboration technologies. Cham: Springer Nature Switzerland; 2023. p. 475–87.
5. Berber Sardinha T. AI-generated vs human-authored texts: A multidimensional comparison. Applied Corpus Linguistics. 2024;4(1):100083.
6. Desaire H, Chua AE, Isom M, Jarosova R, Hua D. Distinguishing academic science writing from humans or ChatGPT with over 99% accuracy using off-the-shelf machine learning tools. Cell Rep Phys Sci. 2023;4(6):101426. pmid:37426542
7. Guo B, Zhang X, Wang Z, Jiang M, Nie J, Ding Y, et al. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection [Internet]. arXiv; 2023 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2301.07597
8. Hayawi K, Shahriar S, Mathew S. The imitation game: detecting human and AI-generated texts in the era of ChatGPT and BARD. J Inf Sci. 2024.
9. Tang R, Chuang YN, Hu X. The science of detecting LLM-generated text. Commun ACM. 2024;67(4):50–9.
10. Gorenz D, Schwarz N. How funny is ChatGPT? A comparison of human- and AI-produced jokes. 2024 [cited 2025 Jan 21]. Available from: https://osf.io/5yz8n/download
11. Charness G, Grieco D. Creativity and AI [Internet]. Rochester, NY: Social Science Research Network; 2024 [cited 2025 Apr 9]. Available from: https://papers.ssrn.com/abstract=4686415
12. Tang N, Yang C, Fan J, Cao L, Luo Y, Halevy A. VerifAI: Verified Generative AI [Internet]. arXiv; 2023 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2307.02796
13. Dathathri S, See A, Ghaisas S, Huang P-S, McAdam R, Welbl J, et al. Scalable watermarking for identifying large language model outputs. Nature. 2024;634(8035):818–23. pmid:39443777
14. Mahowald K, Ivanova A, Blank I, Kanwisher N, Tenenbaum J, Fedorenko E. Dissociating language and thought in large language models. Trends Cogn Sci. 2024;28(6):517–40.
15. McCoy RT, Smolensky P, Linzen T, Gao J, Celikyilmaz A. How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. Trans Assoc Comput Linguist. 2023;11:652–70.
16. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst. 2022;35:27730–44.
17. Madabushi HT, Romain L, Divjak D, Milin P. CxGBERT: BERT meets Construction Grammar [Internet]. arXiv; 2020 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2011.04134
18. Muñoz-Ortiz A, Gómez-Rodríguez C, Vilares D. Contrasting linguistic patterns in human and LLM-generated news text. Artif Intell Rev. 2024;57(10):265.
19. Herbold S, Hautli-Janisz A, Heuer U, Kikteva Z, Trautsch A. A large-scale comparison of human-written versus ChatGPT-generated essays. Sci Rep. 2023;13(1):18617. pmid:37903836
20. Seals SM, Shalin VL. Long-form analogies generated by chatGPT lack human-like psycholinguistic properties [Internet]. arXiv; 2023 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2306.04537
21. Ng A. OpenAI’s Rules for Model Behavior, Better Brain-Controlled Robots, and more. 2024 [cited 2025 Jan 21]. Available from: https://www.deeplearning.ai/the-batch/issue-249//
22. Ruis L, Khan A, Biderman S, Hooker S, Rocktäschel T, Grefenstette E. The Goldilocks of pragmatic understanding: fine-tuning strategy matters for implicature resolution by LLMs. Adv Neural Inf Process Syst. 2023;36:20827–905.
23. Li B, Zhu Z, Thomas G, Rudzicz F, Xu Y. Neural reality of argument structure constructions. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics; 2022. p. 7410–23.
24. Weissweiler L, Hofmann V, Köksal A, Schütze H. The Better Your Syntax, the Better Your Semantics? Probing Pretrained Language Models for the English Comparative Correlative [Internet]. arXiv; 2022 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2210.13181
25. Mahowald K. A Discerning Several Thousand Judgments: GPT-3 Rates the Article + Adjective + Numeral + Noun Construction [Internet]. arXiv; 2023 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2301.12564
26. Potts C. Characterizing English preposing in PP constructions. J Linguist. 2024;1–39.
27. Goldberg A. Constructions at work: the nature of generalization in language. New York: Oxford University Press; 2006.
28. Goldberg AE. Constructions: A Construction Grammar Approach to Argument Structure [Internet]. Chicago, IL: University of Chicago Press; 1995 [cited 2025 Jan 23]. 271 p. (Cognitive Theory of Language and Culture Series). Available from: https://press.uchicago.edu/ucp/books/book/chicago/C/bo3683810.html
29. Goldberg AE. Explain Me This: Creativity, Competition, and the Partial Productivity of Constructions. Princeton University Press; 2019. 209 p.
30. Ellis NC, O’Donnell MB. Statistical construction learning: Does a Zipfian problem space ensure robust language learning? In: Rebuschat P, Williams JN, editors. Statistical Learning and Language Acquisition [Internet]. De Gruyter Mouton; 2012 [cited 2025 Jan 21]. p. 265–304. Available from: https://www.degruyter.com/document/doi/10.1515/9781934078242.265/pdf?licenseType=restricted
31. Ellis N, Ferreira-Junior F. Constructions and their acquisition: islands and the distinctiveness of their occupancy. Annu Rev Cogn Linguist. 2009;7(1):188–221.
32. Ellis NC, Wulff S. Usage-based approaches to SLA. In: Theories in second language acquisition [Internet]. Routledge; 2014 [cited 2025 Jan 21]. p. 87–105. Available from: https://api.taylorfrancis.com/content/chapters/edit/download?identifierName=doi&identifierValue=10.4324/9780203628942-10&type=chapterpdf
33. Ellis NC, Römer U, O’Donnell MB. Usage-Based Approaches to Language Acquisition and Processing: Cognitive and Corpus Investigations of Construction Grammar. Malden: Wiley-Blackwell; 2016.
34. Goldberg AE, Casenhiser DM, Sethuraman N. Learning argument structure generalizations. Cogn Linguist. 2004;15(3):289–316.
35. Gries S, Wulff S. Do foreign language learners also have constructions? Annu Rev Cogn Linguist. 2005;3(1):182–200.
36. Kyle K. Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication [Doctoral dissertation]. Georgia State University; 2016 [cited 2025 Jan 23]. Available from: https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1035&context=alesl_diss
37. Römer U. A corpus perspective on the development of verb constructions in second language learners. IJCL. 2019;24(3):268–90.
38. Römer U, O’Donnell MB, Ellis NC. Second Language Learner Knowledge of Verb–Argument Constructions: Effects of Language Transfer and Typology. The Modern Language Journal. 2014;98(4):952–75.
39. Römer U, Skalicky SC, Ellis NC. Verb-argument constructions in advanced L2 English learner production: insights from corpora and verbal fluency tasks. Corpus Linguist Linguist Theory. 2020;16(2):303–31.
40. Römer U, Garner J. The development of verb constructions in spoken learner English: tracing effects of usage and proficiency. Int J Learn Corpus Res. 2019;5(2):207–30.
41. Bencini GML, Goldberg AE. The contribution of argument structure constructions to sentence meaning. J Mem Lang. 2000;43(4):640–51.
42. Chang F, Bock K, Goldberg A. Can thematic roles leave traces of their places? Cognition. 2003;90(1):29–49.
43. Johnson MA, Goldberg AE. Evidence for automatic accessing of constructional meaning: Jabberwocky sentences prime associated verbs. Lang Cogn Process. 2013;28(10):1439–52.
44. Casal JE, Shirai Y, Lu X. English verb-argument construction profiles in a specialized academic corpus: variation by genre and discipline. Engl Specif Purp. 2022;66:94–107.
45. Stewart C, Windsor A, Casal JE. Replication data for: “It is important to consult” a linguist: Verb-Argument Constructions in ChatGPT and human experts’ medical and financial advice. Harvard Dataverse, V1. https://doi.org/10.7910/DVN/RINBPR
46. Kyle K. Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-Based Indices of Syntactic Sophistication. Appl Linguist Engl Second Lang Diss. 2016 May 9. Available from: https://scholarworks.gsu.edu/alesl_diss/35
47. Davies M. The Corpus of Contemporary American English (COCA) [Internet]. 2008. Available from: https://www.english-corpora.org/coca//
Citation: Casal JE, Stewart CM, Windsor AJ (2025) “It is important to consult” a linguist: Verb-Argument Constructions in ChatGPT and human experts’ medical and financial advice. PLoS One 20(5): e0324611. https://doi.org/10.1371/journal.pone.0324611
About the Authors:
J. Elliott Casal
Contributed equally to this work with: J. Elliott Casal, Christopher M. Stewart, Alistair J. Windsor
Roles: Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing
E-mail: [email protected]
Affiliations: Department of English, University of Memphis, Memphis, Tennessee, United States of America, Institute for Intelligent Systems, University of Memphis, Memphis, Tennessee, United States of America
ORCID: https://orcid.org/0000-0002-8920-9120
Christopher M. Stewart
Contributed equally to this work with: J. Elliott Casal, Christopher M. Stewart, Alistair J. Windsor
Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing
Affiliation: Institute for Intelligent Systems, University of Memphis, Memphis, Tennessee, United States of America
Alistair J. Windsor
Contributed equally to this work with: J. Elliott Casal, Christopher M. Stewart, Alistair J. Windsor
Roles: Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing
Affiliations: Institute for Intelligent Systems, University of Memphis, Memphis, Tennessee, United States of America, Department of Mathematical Sciences, University of Memphis, Memphis, Tennessee, United States of America
References
1. Casal JE, Kessler M. Can linguists distinguish between ChatGPT/AI and human writing?: a study of research ethics and academic publishing. Res Methods Appl Linguist. 2023;2(3):100068.
2. Chemaya N, Martin D. Perceptions and detection of AI use in manuscript preparation for academic journals. PLoS One. 2024;19(7):e0304807. pmid:38995880
3. Liang W, Zhang Y, Wu Z, Lepp H, Ji W, Zhao X, et al. Mapping the Increasing Use of LLMs in Scientific Papers [Internet]. arXiv; 2024 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2404.01268
4. Khalil M, Er E. Will ChatGPT get you caught? Rethinking of plagiarism detection. In: Zaphiris P, Ioannou A, editors. Learning and collaboration technologies. Cham: Springer Nature Switzerland. 2023. p. 475–87.
5. Berber Sardinha T. AI-generated vs human-authored texts: A multidimensional comparison. Applied Corpus Linguistics. 2024;4(1):100083.
6. Desaire H, Chua AE, Isom M, Jarosova R, Hua D. Distinguishing academic science writing from humans or ChatGPT with over 99% accuracy using off-the-shelf machine learning tools. Cell Rep Phys Sci. 2023;4(6):101426. pmid:37426542
7. Guo B, Zhang X, Wang Z, Jiang M, Nie J, Ding Y, et al. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection [Internet]. arXiv; 2023 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2301.07597
8. Hayawi K, Shahriar S, Mathew S. The imitation game: detecting human and AI-generated texts in the era of ChatGPT and BARD. J Inf Sci. 2024.
9. Tang R, Chuang YN, Hu X. The science of detecting LLM-generated text. Commun ACM. 2024;67(4):50–9.
10. Gorenz D, Schwarz N. How funny is ChatGPT? A comparison of human- and AI-produced jokes. 2024 [cited 2025 Jan 21]; Available from: https://osf.io/5yz8n/download
11. Charness G, Grieco D. Creativity and AI [Internet]. Rochester, NY: Social Science Research Network; 2024 [cited 2025 Apr 9]. Available from: https://papers.ssrn.com/abstract=4686415
12. Tang N, Yang C, Fan J, Cao L, Luo Y, Halevy A. VerifAI: Verified Generative AI [Internet]. arXiv; 2023 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2307.02796
13. Dathathri S, See A, Ghaisas S, Huang P-S, McAdam R, Welbl J, et al. Scalable watermarking for identifying large language model outputs. Nature. 2024;634(8035):818–23. pmid:39443777
14. Mahowald K, Ivanova A, Blank I, Kanwisher N, Tenenbaum J, Fedorenko E. Dissociating language and thought in large language models. Trends Cogn Sci. 2024;28(6):517–40.
15. McCoy RT, Smolensky P, Linzen T, Gao J, Celikyilmaz A. How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. Trans Assoc Comput Linguist. 2023;11:652–70.
16. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst. 2022;35:27730–44.
17. Madabushi HT, Romain L, Divjak D, Milin P. CxGBERT: BERT meets Construction Grammar [Internet]. arXiv; 2020 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2011.04134
18. Muñoz-Ortiz A, Gómez-Rodríguez C, Vilares D. Contrasting linguistic patterns in human and LLM-generated news text. Artif Intell Rev. 2024;57(10):265.
19. Herbold S, Hautli-Janisz A, Heuer U, Kikteva Z, Trautsch A. A large-scale comparison of human-written versus ChatGPT-generated essays. Sci Rep. 2023;13(1):18617. pmid:37903836
20. Seals SM, Shalin VL. Long-form analogies generated by chatGPT lack human-like psycholinguistic properties [Internet]. arXiv; 2023 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2306.04537
21. Ng A. OpenAI’s Rules for Model Behavior, Better Brain-Controlled Robots, and more. 2024 [cited 2025 Jan 21]. Available from: https://www.deeplearning.ai/the-batch/issue-249/
22. Ruis L, Khan A, Biderman S, Hooker S, Rocktäschel T, Grefenstette E. The Goldilocks of pragmatic understanding: fine-tuning strategy matters for implicature resolution by LLMs. Adv Neural Inf Process Syst. 2023;36:20827–905.
23. Li B, Zhu Z, Thomas G, Rudzicz F, Xu Y. Neural reality of argument structure constructions. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics; 2022. p. 7410–23.
24. Weissweiler L, Hofmann V, Köksal A, Schütze H. The Better Your Syntax, the Better Your Semantics? Probing Pretrained Language Models for the English Comparative Correlative [Internet]. arXiv; 2022 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2210.13181
25. Mahowald K. A Discerning Several Thousand Judgments: GPT-3 Rates the Article + Adjective + Numeral + Noun Construction [Internet]. arXiv; 2023 [cited 2025 Jan 21]. Available from: http://arxiv.org/abs/2301.12564
26. Potts C. Characterizing English preposing in PP constructions. J Linguist. 2024;1–39.
27. Goldberg A. Constructions at work: the nature of generalization in language. New York: Oxford University Press. 2006.
28. Goldberg AE. Constructions: A Construction Grammar Approach to Argument Structure [Internet]. Chicago, IL: University of Chicago Press; 1995 [cited 2025 Jan 23]. 271 p. (Cognitive Theory of Language and Culture Series). Available from: https://press.uchicago.edu/ucp/books/book/chicago/C/bo3683810.html
29. Goldberg AE. Explain Me This: Creativity, Competition, and the Partial Productivity of Constructions. Princeton University Press; 2019. 209 p.
30. Ellis NC, O’Donnell MB. Statistical construction learning: Does a Zipfian problem space ensure robust language learning? In: Rebuschat P, Williams JN, editors. Statistical Learning and Language Acquisition [Internet]. De Gruyter Mouton; 2012 [cited 2025 Jan 21]. p. 265–304. Available from: https://www.degruyter.com/document/doi/10.1515/9781934078242.265/pdf?licenseType=restricted
31. Ellis N, Ferreira-Junior F. Constructions and their acquisition: islands and the distinctiveness of their occupancy. Annu Rev Cogn Linguist. 2009;7(1):188–221.
32. Ellis NC, Wulff S. Usage-based approaches to SLA. In: Theories in second language acquisition [Internet]. Routledge; 2014 [cited 2025 Jan 21]. p. 87–105. Available from: https://api.taylorfrancis.com/content/chapters/edit/download?identifierName=doi&identifierValue=10.4324/9780203628942-10&type=chapterpdf
33. Ellis NC, Römer U, O’Donnell MB. Usage-Based Approaches to Language Acquisition and Processing: Cognitive and Corpus Investigations of Construction Grammar. Malden: Wiley-Blackwell; 2016.
34. Goldberg AE, Casenhiser DM, Sethuraman N. Learning argument structure generalizations. Cogn Linguist. 2004;15(3):289–316.
35. Gries S, Wulff S. Do foreign language learners also have constructions? Annu Rev Cogn Linguist. 2005;3(1):182–200.
36. Kyle K. Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication [Internet] [Doctoral]. Georgia State University; 2016 [cited 2025 Jan 23]. Available from: https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1035&context=alesl_diss
37. Römer U. A corpus perspective on the development of verb constructions in second language learners. IJCL. 2019;24(3):268–90.
38. Römer U, O’Donnell MB, Ellis NC. Second Language Learner Knowledge of Verb–Argument Constructions: Effects of Language Transfer and Typology. The Modern Language Journal. 2014;98(4):952–75.
39. Römer U, Skalicky SC, Ellis NC. Verb-argument constructions in advanced L2 English learner production: insights from corpora and verbal fluency tasks. Corpus Linguist Linguist Theory. 2020;16(2):303–31.
40. Römer U, Garner J. The development of verb constructions in spoken learner English: tracing effects of usage and proficiency. Int J Learn Corpus Res. 2019;5(2):207–30.
41. Bencini GML, Goldberg AE. The contribution of argument structure constructions to sentence meaning. J Mem Lang. 2000;43(4):640–51.
42. Chang F, Bock K, Goldberg A. Can thematic roles leave traces of their places? Cognition. 2003;90(1):29–49.
43. Johnson MA, Goldberg AE. Evidence for automatic accessing of constructional meaning: Jabberwocky sentences prime associated verbs. Lang Cogn Process. 2013;28(10):1439–52.
44. Casal JE, Shirai Y, Lu X. English verb-argument construction profiles in a specialized academic corpus: variation by genre and discipline. Engl Specif Purp. 2022;66:94–107.
45. Stewart C, Windsor A, Casal JE. Replication data for: “It is important to consult” a linguist: Verb-Argument Constructions in ChatGPT and human experts’ medical and financial advice. https://doi.org/10.7910/DVN/RINBPR, Harvard Dataverse, V1.
46. Kyle K. Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-Based Indices of Syntactic Sophistication. Appl Linguist Engl Second Lang Diss [Internet]. 2016 May 9; Available from: https://scholarworks.gsu.edu/alesl_diss/35
47. Davies M. The Corpus of Contemporary American English (COCA) [Internet]. 2008. Available from: https://www.english-corpora.org/coca/
© 2025 Casal et al. This is an open access article distributed under the terms of the Creative Commons Attribution License: http://creativecommons.org/licenses/by/4.0/ (the “License”), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
This paper adopts a Usage-Based Construction Grammar perspective to compare human- and AI-generated language, focusing on Verb-Argument Constructions (VACs) as a lens for analysis. Specifically, we examine solicited advice texts in two domains (Finance and Medicine) produced by humans and by ChatGPT across different GPT models (3.5, 4, and 4o) and interfaces (3.5 Web vs. 3.5 API). Our findings reveal broad consistency in the frequency and distribution of the most common VACs across human- and AI-generated texts, though ChatGPT exhibits a slightly higher reliance on the most frequent constructions. A closer examination of the verbs occupying these constructions uncovers significant differences in the meanings conveyed: newer models drift further from human-like production at the macro level (e.g., text length) while moving toward more human-like verb-VAC patterns. These results underscore the potential of VACs as a powerful tool for analyzing AI-generated language and tracking its evolution over time.
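For readers unfamiliar with VAC profiling, the sketch below illustrates the general shape of such an analysis. It is emphatically not the authors’ pipeline (the study builds on Kyle’s usage-based indices [36, 46] and COCA reference frequencies [47]); the dependency-label frames, function names, and sample sentences here are all illustrative assumptions, using spaCy’s English dependency scheme as a rough stand-in for proper VAC identification.

```python
# Minimal, illustrative sketch -- NOT the paper's actual method. It
# approximates verb-argument construction (VAC) frames with spaCy
# dependency labels; the frame inventory, helper names, and sample
# texts below are hypothetical.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

# Core argument relations in spaCy's English scheme (a simplification of VACs).
CORE = {"nsubj", "nsubjpass", "dobj", "dative", "attr", "prep", "xcomp", "ccomp"}

def vac_profile(text: str) -> Counter:
    """Count coarse verb-argument frames such as 'nsubj-V-dobj'."""
    frames = Counter()
    for token in nlp(text):
        if token.pos_ != "VERB":
            continue
        # Collect core arguments on each side of the verb, in linear order.
        left = [c.dep_ for c in token.lefts if c.dep_ in CORE]
        right = [c.dep_ for c in token.rights if c.dep_ in CORE]
        frames["-".join(left + ["V"] + right)] += 1
    return frames

def top_shares(profile: Counter, n: int = 5) -> list[tuple[str, float]]:
    """Relative frequency of the n most common frames in a profile."""
    total = sum(profile.values()) or 1
    return [(frame, count / total) for frame, count in profile.most_common(n)]

if __name__ == "__main__":
    # Hypothetical one-sentence stand-ins for the human and ChatGPT corpora.
    human = "You should consult a doctor before you change your medication."
    model = "It is important to consult a qualified professional for advice."
    for label, text in [("human", human), ("model", model)]:
        print(label, top_shares(vac_profile(text)))
```

At corpus scale, the comparison would run over the full sets of human and ChatGPT advice texts, contrasting both the frame distributions and the verbs filling each frame; purpose-built tools such as those described in [36, 46] operationalize this far more carefully than the toy frames above.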