1. Introduction
Advances in artificial intelligence (AI) research have transformed online e-commerce from passive, static decision-support tools into increasingly interactive, adaptive decision-support systems and early forms of delegated automation that can, in some cases, autonomously carry out limited multi-step decisions on behalf of consumers [1,2,3]. This transition from decision support to decision delegation marks a significant paradigm shift in human–AI interaction, in which algorithms and models increasingly plan, negotiate, and transact in online environments as a technological trajectory; however, most consumer-facing empirical studies still examine assistive conversational systems and intention-based delegation rather than fully delegated execution [4,5]. Agentic e-commerce systems, fueled by advances in large language models (LLMs), multi-agent frameworks, and reinforcement learning, expand the application frontiers of commercial AI while heightening psychological conflicts over autonomy, agency, and accountability [6,7,8]. In this review, we therefore distinguish between (i) capabilities claimed or demonstrated in technical systems and (ii) behaviors empirically studied in consumer commerce contexts.
On the technological front, innovation in multi-layered agentic systems with end-to-end decision workflows is proceeding at a robust pace. Insight Agents, described by Bai et al. [9], are a good example, using a hierarchical manager–worker multi-agent approach to produce autonomous business insights for Amazon sellers. The same trend appears in the approaches of Nie et al. [10] (Hybrid-MACRS) and Huang et al. [11] (InteRecAgent), where teams of specialized agents are leveraged to jointly mitigate latency, enhance scalability, and improve context interpretation in conversational recommenders. Taken together, these systems represent a shift toward multi-agent “societies” for “hybrid reasoning,” moving beyond standalone chatbots toward settings in which a single conversation may involve a network of agents that decompose tasks in shared environments, integrate memories, and reason together [12,13,14,15,16,17,18]. However, consumer/marketing evaluations of these systems most often focus on recommendation and interaction quality rather than on observed execution-level delegation (e.g., checkout/payment) in realistic settings. Notably, Stummer et al. [19] and Čavoški et al. [20] illustrate, using agent-based simulations, how autonomy serves not only as a source of efficiency but also as a means of simulating bounded rationality, social interaction, and competition, all of which are crucial for adoption-process research.
With the rise in machine agency, behavioral and moral complexity has become more salient in the relationship between consumers and platforms [8,21,22]. Delegation to AI agents has changed how control, trust, and emotional engagement operate in online spaces. According to Yoon et al. [23], personalized chatbots can threaten freedom of choice, prompting psychological reactance and avoidance, particularly when product involvement is high. Pizzi et al. [24] found that non-anthropomorphic, computer-initiated assistants elicit greater reactance yet can enhance satisfaction through improved performance certainty, indicating that delegation may carry psychological costs alongside its expected benefits. Schindler et al. [8] showed that communication modality (text vs. voice) affects cognitive focus and, in turn, choice satisfaction [25,26,27,28,29], further illustrating that the interface is a critical mechanism through which experienced freedom is shaped rather than a superficial add-on. Importantly, these effects are largely documented in assistive or scenario-based operationalizations, leaving open how they translate to execution-capable agents once control rights are transferred for consequential actions.
The socio-technical tension that results from agentic commerce centers on trust and control [30,31,32,33]. While task-oriented conversational agents can mimic “physical store-like” interactions and even facilitate integrative negotiation, thereby fostering relational trust when designed for transparency and mutual gain [2,34,35], greater autonomy can also erode trust when oversight is weak, a risk amplified by documented vulnerabilities such as prompt injection and memory poisoning (Agent Security Bench; attack success rates above 80%). Because such vulnerabilities are primarily evidenced in technical/security evaluations, we treat them as governance-relevant risks rather than as consumer-validated behavioral effects within the mapped empirical corpus. At the same time, research on simulation and modeling shows that delegation is part of a broader adoption and diffusion process shaped by heterogeneous preferences, network effects, and social influence [19,36]. Applied work on autonomous delivery likewise shows that such systems must scale and adapt to changing conditions [37]. In summary, these streams frame agentic commerce as both a technical architecture and a psychological ecosystem. The literature nevertheless remains fragmented: technical studies improve performance without examining consumer responses, while consumer studies examine trust, control, and reactance without specifying how autonomy should be designed. As a result, there is no clear link from autonomy level, task scope, interaction mode, and transaction capability to delegation and downstream outcomes. Moreover, technical “end-to-end” capability does not automatically imply that execution-level delegation has been empirically studied with consumers in realistic transaction contexts.
Consistent with the empirical profile mapped in this review, we position the manuscript as an evidence-mapping systematic review rather than a new adoption theory. The empirical contribution is to clarify how “agentic commerce” is operationalized and evidenced across studies by jointly coding autonomy level, task scope, interaction mode, and transaction/evidence realism, thereby separating claimed technical capability from empirically studied delegation behavior. The conceptual contribution is integrative and boundary-conditioning: we show why utility/assurance pathways (e.g., usefulness–trust–intention) dominate in assistive settings, and we specify when governance conditions (control-rights transfer, oversight scaffolding, recourse, accountability) become decisive as autonomy increases. Accordingly, the two-lane framing is presented as a testable organizing framework that structures the mapped evidence and directs future work toward behavioral delegation, execution realism, and calibration outcomes rather than intention-only acceptance.
A further limitation of the existing corpus is its evidentiary profile. Many studies rely on simulated systems or intention data, whereas observed delegation behavior, transaction execution, and longitudinal dynamics remain comparatively scarce. Constructs central to safe and robust delegation (trust calibration, verification behavior, recovery after error, and post-delegation regret/blame) are frequently theorized but inconsistently operationalized. Contextual boundary conditions (e.g., involvement, stakes, and regulatory setting) are also recognized but rarely examined systematically across designs and samples. These patterns constrain generalizability and make it difficult to align consumer evidence with emerging technical claims about execution-capable agents.
To address these issues, we conduct a socio-technical evidence-mapping review that integrates consumer, marketing, and HCI evidence with conservative coding of autonomy and evidentiary realism. Specifically, this review answers three linked questions:
RQ1—Phenomenon mapping. How is agentic commerce operationalized in the literature (agent autonomy level, task scope, interaction mode, and transaction capability)?
RQ2—Decision delegation. What factors explain consumers’ willingness to delegate shopping decisions to AI agents (e.g., trust, perceived control, risk/privacy, effort reduction, empowerment vs. loss of power)?
RQ3—Outcomes. What behavioral and attitudinal effects does agentic commerce (and decision delegation to AI agents) have on marketing outcomes (e.g., purchase/conversion, spend, loyalty, satisfaction) and consumer outcomes (e.g., trust calibration, perceived autonomy/control, fairness perceptions, regret/reliance)?
2. Conceptual Background
2.1. Defining Agentic Commerce and Its Boundaries
Recent breakthroughs in generative and conversational artificial intelligence (AI) have increased the overlap among traditional chatbots, recommendation systems, and fully autonomous agents that can perform multi-step commercial transactions [23,38,39]. While end-to-end “agentic” systems are increasingly discussed as a technological trajectory, the consumer-facing empirical literature has primarily evaluated assistive conversational systems and only rarely tested execution-level delegation. Agentic commerce refers to AI systems that exhibit autonomy, initiative, and transaction capacity at various stages of the purchasing process, from simply providing information to actively processing and carrying out actions on the buyer’s behalf. In this review, “agentic commerce” is treated as an umbrella concept spanning a continuum from assistive decision support to delegated execution, but we reserve the strongest “agentic” claims for systems that demonstrate workflow-level initiative and/or execution capacity beyond conversational guidance. These capabilities include autonomous initiation, memory use, tool use, and transaction execution [40,41]. Crucially, “transaction capacity” is defined as evidence that an agent can initiate or complete a commercial action (e.g., checkout/payment) in a functioning system or validated sandbox, rather than merely being described as capable in principle.
Much current research still treats such systems as traditional conversational agents, emphasizing interaction quality, perceived usefulness, and satisfaction. Le et al. [42] find that utilitarian, hedonic, and social gratifications outweigh privacy concerns in driving satisfaction among Vietnamese users, while Chakraborty et al. [33] combine the Elaboration Likelihood Model and Status Quo Bias Theory to demonstrate how interaction quality, credibility, and threat perceptions shape trust in generative AI chatbots. While these studies confirm functional acceptance, their empirical focus largely reflects assistive autonomy (information and persuasion); true autonomy and end-to-end task completion are rarely considered or empirically validated.
Interface and design factors further complicate the user experience of agency. Pizzi et al. [24] show that non-anthropomorphic, computer-initiated assistants increase psychological reactance yet can improve satisfaction when perceived performance gains are salient. Chua et al. [43] add that visual, identity, and conversational features jointly influence human-likeness and social presence, implying that users infer “agentic-ness” from expressive rather than purely functional cues. Consistent with findings from Puertas et al. [4] and Vebrianti et al. [44], well-designed assistants appear to increase loyalty when they balance entertainment, information provision, and usability while addressing privacy-risk concerns. However, this research focuses almost entirely on satisfaction and purchase intentions rather than on what follows once agency is exercised autonomously.
This creates a boundary problem: many studies address conversational commerce and decision support (often measured via intentions), whereas fewer studies test observable delegation behaviors or execution outcomes. From an automation perspective, this distinction is not merely semantic: adoption-oriented constructs (e.g., perceived usefulness and trust) do not fully capture whether reliance is appropriate once systems can act on the user’s behalf. In trust-in-automation research, trust is analytically distinct from reliance, and the normative benchmark is calibration—aligning reliance with actual system capability and uncertainty—because miscalibration yields misuse (overreliance) or disuse (avoidance). In agentic commerce, this implies that the central boundary is not only “chatbot versus agent,” but whether studies observe calibrated reliance, oversight, and recoverability when execution fails. With the growing use of large-language-model architectures (as in ChatGPT models; [5]) in enhancing transactional functionality, it has become critical to differentiate assistive from agentic systems to provide clarity on the concept. Accordingly, this review differentiates (i) assistive systems that inform and recommend, (ii) semi-agentic systems that guide multi-step workflows without executing transactions, and (iii) high-autonomy systems that alter control rights and/or execute actions—while tracking whether evidence is behavioral, system-demonstrated, or scenario-based. This positioning also motivates our governance focus: execution-capable systems raise accountability and error-recovery requirements (e.g., approval gates, action traceability, and recourse), which are not addressed by intention-only evidence.
2.2. Decision Delegation as the Core Mechanism
Decision delegation consistently serves as the conduit linking system autonomy and consumer outcomes across studies [37,45,46,47]. Delegation occurs when users grant AI agents the authority to make decisions or execute actions on their behalf. In this review, delegation is treated as an automation-reliance mechanism: it ranges from willingness to accept advice (low-stakes reliance) to willingness to cede control rights for consequential actions (execution-level delegation). This distinction matters because trust can increase favorable attitudes without producing calibrated reliance in contexts where errors, uncertainty, and accountability are salient. Delegation is influenced by trust, control, privacy, and perceived usefulness. Recent research, grounded in psychological and behavioral theories, elucidates the facilitators and obstacles to delegation in commercial contexts.
Trust is an antecedent to delegation. Gao et al. [48] show that problem-solving proficiency, confirmation, and perceived ease of use increase user satisfaction and trust, which in turn mediate chatbot usage. Likewise, Qi et al. [6] show that credibility and image ubiquity positively influence perceived usefulness and ease of use, thereby boosting attitudes and intentions to use visual-search AI chatbots. These findings confirm prior research indicating that performance-based trust and perceived diagnosticity contribute to successful engagement, particularly in assisting users with search and comparison. However, much of this evidence operationalizes delegation indirectly via satisfaction or intention constructs rather than via observed reliance behaviors, which limits inference about calibrated delegation under real execution risk.
However, delegation does not have entirely positive effects. Yoon et al. [23] demonstrate that personalized recommendations from AI-driven chatbots stimulate reactance and avoidance, mediated by freedom threat and negative affect. Their results are consistent with those of Frank et al. [40], who propose that high autonomy lowers adoption intentions because it evokes feelings of powerlessness, a conflict between human and machine autonomy. Scarcity conditions showed no significant effects, implying that external conditions (e.g., time pressure and sense of urgency) shape the relationship between autonomy and adoption. Finally, drawing on innovation resistance theory, Chang et al. [49] confirm that value and risk barriers evoke negative emotions and innovation resistance, affirming that functional and psychological barriers remain fundamental determinants of user responses to automation.
These findings align with broader automation research showing that user responses are dynamic and error-sensitive: salient failures can shift users toward avoidance (disuse), whereas well-scaffolded control and accountability cues can support continued reliance even when performance is imperfect. In agentic commerce, the implication is that delegation is shaped not only by initial trust formation, but also by how systems support error recovery (undo/cancel, escalation), communicate uncertainty, and clarify responsibility when outcomes are negative.
Delegation decisions are influenced by psychological factors beyond trust and control, such as hedonic motivation, social influence, and gratifications. While Puertas et al. [4] identified social presence and entertainment as powerful antecedents of positive customer experience and purchase intention, Le et al. [42] emphasize the significance of peer validation and social dynamics, demonstrating that social influence can override privacy concerns. Collectively, these studies indicate that delegation is best understood as a multifaceted assessment process that balances instrumental advantages against ethical and emotional costs. At the same time, the dominance of cross-sectional, self-reported intention designs creates a risk of theoretical self-confirmation: usefulness–trust–intention pathways are repeatedly tested and “validated” with similar instruments, while key delegation constructs (appropriate reliance, overreliance, accountability judgments, and recovery after agent errors) remain under-measured. Moreover, most research remains self-reported and cross-sectional, offering little insight into how delegation changes with repeated interaction or actual transactional exposure.
2.3. Outcomes of Agentic Commerce
The empirical research on the effects of agentic commerce is still dispersed but suggests an interrelated set of commercial and psychological effects [7,44,50,51]. However, the mapped outcomes are better interpreted through an “appropriate reliance” lens rather than a simple adoption lens: as systems become more autonomous, the central outcome is not only whether users trust the agent, but whether they rely on it in a calibrated way (i.e., aligning reliance with demonstrated capability, uncertainty, and error risk). Within the marketing discipline, chatbot performance has been identified as an effective predictor of satisfaction, loyalty, and purchase intention. Vebrianti et al. [44] showed that response time, information quality, usefulness, and ease of use jointly increase satisfaction and loyalty, while Puertas et al. [4] found that customer experience mediates the effect of chatbot design on purchase intention. These results align with a utility/assurance pathway, but they also illustrate a measurement concentration: many “positive outcomes” are inferred from satisfaction and intention measures in assistive settings, rather than from observed delegation choices, verification behavior, or post-error updating.
Consumer-level outcomes are more mixed, showing that there are conflicts between convenience and freedom. Yoon et al. [23] identify perceived freedom threats as key mediators of avoidance, while [40,41] demonstrate that elevated AI autonomy engenders a sense of powerlessness that hinders adoption. In reliance terms, these patterns map onto “disuse” dynamics (avoidance of automation) when autonomy cues imply loss of agency or responsibility without safeguards. Conversely, as systems become more persuasive or friction-reducing, an opposing risk emerges: “misuse” (overreliance/automation bias), where users may accept recommendations or delegated actions without sufficient verification—an outcome that is rarely measured directly in the reviewed studies. Concerns about privacy, accountability, and fairness endure even when productivity and effort reduction increase satisfaction. While Chang et al. [49] show that value and risk barriers produce negative emotions and resistance, highlighting that technological innovation alone does not guarantee acceptance, Chakraborty et al. [33] emphasize how perceived threat and regret avoidance shape trust formation. These findings imply that governance outcomes (perceived accountability, blame/regret, and recourse after errors) are not peripheral but central to delegation: in execution-capable settings, “trust” must be evaluated together with error recovery (undo/cancel/escalate), responsibility partition, and the ability to inspect and contest the agent’s rationale. Moderators related to culture and context further complicate results. According to Le et al. [42], Romanian consumers’ acceptance varies by age, with younger cohorts displaying greater openness to AI, while Vietnamese consumers prioritize hedonic and social gratifications over privacy risk. Therefore, culture, product involvement, and perceived stakes all influence autonomy’s effects. Importantly, these moderators may also shape calibration: high-stakes contexts and prior negative experiences can heighten verification, trigger algorithm aversion after salient errors, or alternatively increase appreciation when performance advantages are reliably demonstrated.
There are still significant gaps despite encouraging evidence. Most of the research emphasizes self-reported intentions, examines assistive rather than truly agentic systems, and ignores the socio-technical relationships between autonomy, task scope, and transaction capability. By addressing these limitations, the current study provides a unified framework for understanding when delegation to AI agents improves or degrades consumer experience and market performance by connecting design parameters to psychological mechanisms—trust, control, privacy, and effort reduction—as well as to marketing and consumer outcomes (Table 1).
3. Research Method
3.1. Search Strategy
We conducted a systematic literature search in Scopus and the Web of Science Core Collection using a single theory-driven Boolean query designed to capture (i) agentic AI/assistant terminology, (ii) consumer commerce and shopping contexts, and (iii) delegation or acting-on-behalf concepts. Searches were limited to English-language records published from 2015 to 2026 to capture the evolution from early conversational agents to contemporary LLM-enabled agentic systems. To prioritize primary evidence, we excluded records indexed as Review at the database-filter level. Web of Science and Scopus were last searched on 20 December 2025. We retained review articles only for background framing and citation chasing. Because “agentic commerce” is used inconsistently across fields (marketing, HCI, and computing), we intentionally included adjacent terms (e.g., conversational AI, chatbot) and relied on full-text eligibility rules and coding thresholds (Section 3.2 and Section 3.4) to distinguish assistive systems from higher-autonomy, execution-capable agents. To reduce the risk of terminology-driven omission from the adjacent literature (e.g., algorithmic advice, automation bias, reliance, and decision delegation), we complemented the primary query with targeted backward/forward citation chasing of included studies and relevant reviews, and a focused keyword check using adjacent terms combined with commerce constraints (e.g., “automation bias” AND shopping/e-commerce; “algorithmic advice” AND purchase/checkout; “decision delegation” AND e-commerce). Records retrieved via these supplementary routes were screened under the same eligibility criteria and coded using the same autonomy and evidence thresholds (Section 3.4). We did not systematically search the grey literature (e.g., industry white papers, product documentation, blogs, or non-peer-reviewed preprints) because our aim was to map peer-reviewed consumer-facing empirical and evaluative evidence. This scope choice reduces coverage of practitioner implementations but increases interpretability and comparability of the included evidence. The final search string was:
(agentic OR (AI AND agent) OR (autonomous AND (agent OR assistant*)) OR “conversational AI” OR chatbot*) AND (“online shopping” OR “e-commerce” OR ecommerce OR checkout OR cart OR “purchase decision” OR “shopping assistant”) AND (delegat* OR reliance OR “act on behalf” OR “acting on behalf” OR “user control” OR approval OR confirm* OR autonomy OR planning OR “tool use” OR tools OR “multi step” OR multistep OR memory OR “task completion”).
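For transparency, the following minimal sketch (in Python; an illustration added for reproducibility, not part of the registered protocol) shows how the three-facet structure of this query can be assembled from auditable term lists; the term lists simply mirror the string above.

```python
# Illustrative sketch: rebuilding the three-facet Boolean query from term lists,
# so each facet (agent, commerce, delegation) can be audited or extended separately.

AGENT_TERMS = ['agentic', '(AI AND agent)', '(autonomous AND (agent OR assistant*))',
               '"conversational AI"', 'chatbot*']
COMMERCE_TERMS = ['"online shopping"', '"e-commerce"', 'ecommerce', 'checkout', 'cart',
                  '"purchase decision"', '"shopping assistant"']
DELEGATION_TERMS = ['delegat*', 'reliance', '"act on behalf"', '"acting on behalf"',
                    '"user control"', 'approval', 'confirm*', 'autonomy', 'planning',
                    '"tool use"', 'tools', '"multi step"', 'multistep', 'memory',
                    '"task completion"']

def facet(terms):
    """Join one facet's terms with OR and wrap the facet in parentheses."""
    return '(' + ' OR '.join(terms) + ')'

# The full query is the AND-conjunction of the three facets.
query = ' AND '.join(facet(t) for t in (AGENT_TERMS, COMMERCE_TERMS, DELEGATION_TERMS))
print(query)
```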
3.2. Eligibility Criteria
Studies were included if they: (a) examined an AI agent/assistant with agentic features relevant to autonomy or acting-on-behalf (e.g., planning, approval/confirmation logic, tool use, memory, or delegation-related control); (b) were situated in a consumer commerce context (e.g., online shopping/e-commerce, cart/checkout, purchase decisions); and (c) reported empirical or evaluative evidence related to decision delegation and/or downstream outcomes, including marketing outcomes (e.g., purchase intention/behavior, conversion, satisfaction, loyalty) and/or consumer outcomes (e.g., trust, perceived control/autonomy, reliance). We included articles from peer-reviewed journals and papers from peer-reviewed conferences and proceedings. Studies were excluded if they: (a) concentrated on non-commercial contexts lacking purchase-decision relevance (e.g., general customer service, education, healthcare); (b) represented solely technical or architectural contributions devoid of consumer-facing evaluation or quantifiable consumer/marketing outcomes; or (c) constituted non-primary publication types (reviews, editorials, commentaries).
To address conceptual ambiguity in “agentic” claims, we applied an evidentiary threshold at inclusion and coding; in particular, studies were eligible if they provided at least one of the following forms of delegation-relevant evidence: (i) demonstrated workflow initiative (multi-step orchestration) or (ii) explicit delegation/acting-on-behalf control logic (e.g., approval gates, confirmation steps, reversible actions) or (iii) transaction execution or a validated proxy (e.g., order placement, checkout completion, payment initiation in a sandbox) or (iv) consumer measures explicitly tied to delegation (e.g., willingness to delegate, delegation choice, proportion of tasks delegated), rather than generic chatbot satisfaction alone. Accordingly, conversational assistants remained eligible when they were analyzed as part of the broader continuum, but their autonomy was coded separately as assistive unless execution-level autonomy was demonstrated (Section 3.4).
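As a hedged illustration of how this threshold operates, the sketch below encodes the four evidence forms as boolean fields on a hypothetical study record; the field names are our own illustrative labels, not the codebook’s actual variable names.

```python
# Minimal sketch of the Section 3.2 evidentiary threshold as an inclusion predicate.
# Field names are illustrative assumptions, not the review's actual codebook labels.
from dataclasses import dataclass

@dataclass
class StudyRecord:
    workflow_initiative: bool       # (i) demonstrated multi-step orchestration
    delegation_control_logic: bool  # (ii) approval gates, confirmations, reversible actions
    transaction_evidence: bool      # (iii) execution or validated proxy (e.g., sandbox checkout)
    delegation_measures: bool       # (iv) consumer measures explicitly tied to delegation

def meets_delegation_threshold(study: StudyRecord) -> bool:
    """Eligible if the study provides at least one form of delegation-relevant evidence."""
    return any([study.workflow_initiative, study.delegation_control_logic,
                study.transaction_evidence, study.delegation_measures])
```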
3.3. Screening and Study Selection
The database search returned 226 records in Scopus and 71 in Web of Science (total n = 297, prior to deduplication). After merging exports and removing duplicates (n = 135), 162 unique records remained. Screening followed two stages: (1) title/abstract screening and (2) full-text assessment against the eligibility criteria. Because agentic capability and delegation/acting-on-behalf features are frequently under-specified in titles and abstracts, we adopted an intentionally liberal title/abstract screening approach and carried all 162 records forward to full-text assessment (i.e., records excluded at title/abstract = 0) to avoid false exclusions at the abstract stage. Final inclusion/exclusion decisions were made by the author using pre-specified criteria and documented reasons for exclusion at the full-text stage (e.g., “agent” referring to generic customer service without commerce decision tasks; outcomes not consumer- or marketing-relevant; purely technical papers without user-facing evaluation). The complete selection process and exclusion reasons are reported in the PRISMA flow diagram (Figure 1).
We followed PRISMA 2020 (see Supplementary Materials) reporting items where applicable to an evidence-mapping and narrative synthesis design. Screening and coding were conducted using a pre-specified codebook with decision rules and examples, supported by an audit-oriented verification procedure for high-leverage classifications. Specifically, we maintained a full-text screening log with explicit inclusion/exclusion reasons and performed a time-separated second-pass verification of autonomy level, transaction capability, evidence type, and delegation operationalization, resolving discrepancies by re-checking the original text and applying the same rules. We additionally conducted a purposive boundary-case audit for ambiguous agentic claims (e.g., autonomy framed as perception only, transaction capability asserted but not evidenced) and applied conservative default rules (lower autonomy/evidence grade) unless the full text provided sufficient operational detail to justify a higher classification.
All included studies were coded using a structured scheme aligned with the three research questions. For RQ1 (phenomenon mapping), we coded agent design and implementation features including autonomy level, task scope (search/compare/recommend/cart/checkout/postpurchase), interaction mode (text/voice/multimodal; initiative; explanation), and transaction capability (simulated vs. real; order placement/payment execution). In Table 2, “transaction-capable” denotes the presence of an execution pathway described or instantiated; when execution was not observed (behavior = No) and context was not real/realistic, capability was treated as claimed rather than demonstrated and was not sufficient for Level 3 coding. For RQ2 (delegation determinants), we categorized key drivers and constraints (trust, perceived control/agency, powerlessness/reactance, risk/privacy concern, usefulness/convenience) and the operationalization of delegation (e.g., delegation intention, delegation choice, proportion of tasks delegated). For RQ3 (outcomes), we coded marketing outcomes (purchase intention/behavior, conversion, loyalty, satisfaction) and consumer outcomes (trust calibration, reliance/overreliance, perceived autonomy/control, fairness, regret/blame), along with evidence type and whether outcomes were intention-based, simulated, or observed in a real or realistic transaction context (Table 3).
To prevent conflating conversational assistance with execution-level delegation, we implemented a two-axis coding approach: (i) Autonomy Level (0–3) and (ii) Evidence Grade for delegation/transaction capability. Autonomy Level captures functional control rights and workflow initiative (Table 2), while Evidence Grade captures whether autonomy/transactions are demonstrated versus perceived or hypothetical (Section 3.4).
Evidence Grade was coded to clarify evidentiary thresholds:
E0 (conceptual/claimed capability): agentic/transaction capability asserted but not evaluated with users or not instantiated.
E1 (perceived/scenario-based): autonomy/delegation measured as perceptions or intentions (e.g., “willingness to delegate,” perceived autonomy) without observed delegation behavior or transaction execution.
E2 (behavioral in simulated tasks): delegation choices or workflow behaviors observed in controlled tasks, but transactions are simulated or proxied (e.g., mock checkout, hypothetical purchase, limited-stakes environment).
E3 (demonstrated execution): observed execution-level outcomes in a realistic or real transaction context (e.g., order placement, checkout completion, payment initiation), or a validated transaction proxy with clear control-rights transfer.
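To make the grading rules concrete, the following sketch (illustrative only; the actual coding relied on full-text judgment) expresses the E0–E3 thresholds as a conservative decision function that defaults to the lower grade when higher-grade conditions are not met.

```python
# Illustrative sketch of the E0-E3 rules; inputs are simplified flags, and
# ambiguous cases fall through to the lower (more conservative) grade.
from enum import Enum

class EvidenceGrade(Enum):
    E0 = "conceptual/claimed capability"
    E1 = "perceived/scenario-based"
    E2 = "behavioral in simulated tasks"
    E3 = "demonstrated execution"

def grade_evidence(behavior_observed: bool, execution_demonstrated: bool,
                   realistic_context: bool, user_evaluated: bool) -> EvidenceGrade:
    if behavior_observed and execution_demonstrated and realistic_context:
        return EvidenceGrade.E3  # observed execution in a real/realistic transaction context
    if behavior_observed:
        return EvidenceGrade.E2  # behavior observed, but transactions simulated or proxied
    if user_evaluated:
        return EvidenceGrade.E1  # perceptions/intentions only, no observed behavior
    return EvidenceGrade.E0      # capability asserted, not instantiated or user-evaluated
```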
Transaction capability (RQ1) was coded using the same conservative evidentiary logic: “realistic/real” required explicit description of execution or a validated transaction proxy; otherwise, transaction capability was coded as simulated or absent. Delegation operationalization (RQ2) was coded as intention/perception versus observed choice/behavior, to avoid overstating maturity of decision delegation evidence.
This review follows PRISMA 2020 (see Supplementary Materials) as a reporting framework; however, some items are not fully applicable to the present evidence-mapping and narrative synthesis design. Specifically, no formal study-level risk-of-bias tool was applied and no certainty grading (e.g., GRADE/CERQual) was performed because the evidence base is methodologically heterogeneous and the review did not aim to estimate pooled effects. Instead, we implemented a transparent study-level appraisal checklist and a pre-specified evidence-direction coding rule set, reported in Section 3.4, to ensure that synthesis claims remain proportional to evidentiary realism and to the maturity of decision-delegation evidence.
3.4. Quality Appraisal and Evidence-Direction Coding
Because the included studies are methodologically heterogeneous (surveys, experiments, platform evaluations, and system demonstrations), no single formal risk-of-bias tool was appropriate for this evidence-mapping design. Instead, to support robustness and replicability while preserving the review’s evidence-mapping aim, we applied a transparent, study-level quality appraisal checklist and a pre-specified rule set for evidence-direction coding.
For each study, we coded quality-relevant features that affect interpretability of delegation and governance claims: (i) evidentiary realism (scenario/simulated vs. real purchase context; and whether behavior was observed vs. self-reported), (ii) autonomy instantiation (manipulated control rights vs. perceived autonomy only), (iii) transaction capability evidence (none; partial order only; full order + payment; and whether execution was demonstrated or merely asserted), (iv) construct measurement rigor (validated scales vs. ad hoc single items; common method risk signals), and (v) design transparency (sufficient description to replicate coding decisions). These fields were used to contextualize synthesis claims and to avoid treating intention-only evidence as equivalent to execution-level delegation.
To improve replicability of synthesis claims, we coded evidence direction for key RQ2/RQ3 relationships using explicit rules. Evidence direction was coded as positive, negative, null, or mixed/contingent based on the primary reported tests: positive/negative required a reported effect in the expected/opposite direction (statistical test, effect estimate, or clear qualitative evaluation outcome); null required an explicitly reported non-effect; and mixed/contingent was used when effects differed by condition (e.g., autonomy cues increasing utility but decreasing perceived control) or when results were inconsistent across outcomes within the same study. When evidence was intention-based only, it was flagged as intention-only and was not treated as execution-level delegation evidence.
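The rule set can be summarized in the following sketch, which maps the set of primary reported tests for one relationship to a direction code; the string labels are illustrative stand-ins for the full-text judgments.

```python
# Sketch of the pre-specified evidence-direction rules. Each entry in `tests`
# summarizes one primary reported test as 'positive', 'negative', or 'null'.

def code_direction(tests):
    kinds = set(tests)
    if not kinds:
        return "not codable"        # no primary test reported
    if len(kinds) > 1:
        return "mixed/contingent"   # effects differ by condition or across outcomes
    (kind,) = kinds
    return kind                     # 'positive', 'negative', or explicitly reported 'null'

# Example: autonomy cues increasing utility but decreasing perceived control.
print(code_direction(["positive", "negative"]))  # -> mixed/contingent
```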
To contextualize external validity and the maturity of the evidence base, we summarize where the included studies come from (geography), what kind of evidence they provide (evidence maturity), and who they sample (sample composition). As shown in Table 4, the mapped corpus (N = 47 studies) is geographically concentrated in East and Southeast Asia (combined 42.5%), with China representing the single most common country setting (19.1%). In parallel, the evidence base is dominated by intention-only/cross-sectional designs (55.3%), while behavioral evidence in realistic contexts remains comparatively scarce (14.9%). Sample composition further reflects typical generalizability constraints in this literature, with a substantial share of online panels (23.4%) and student samples (21.3%), alongside customer/user samples (36.2%). This context profile is used throughout the synthesis to keep interpretation proportional to evidentiary realism and to distinguish assistive, intention-based evidence from more mature behavioral evaluations of delegation and execution. We return to these external-validity constraints explicitly in Section 7.3.4 as a limitation of the reviewed evidence base.
4. Results Part I—RQ1 Phenomenon Mapping
4.1. Agent Identity/Labels and GenAI vs. Pre-GenAI Framing
In the studies reviewed, “agentic commerce” is described using a variety of labels that often obscure important differences in capability. Most papers focus on chatbots or virtual assistants in retail or service settings (e.g., [4,50,51,52,53]), while a smaller group uses broader terms such as “conversational agent” or “digital assistant” [8,24]. Only a small number of studies explicitly define the system as a multi-agent or agentic recommender architecture [10] or as a role-defined “algorithmic agent” whose independence is tested [46]. The GenAI transition is evident but inconsistent: certain studies explicitly assess ChatGPT- or LLM-associated systems [5,46,54,55] or propose LLM-driven retail assistants [56], while numerous papers use “AI chatbot” as a general term without clarifying whether the system is generative or rule-/ML-based [31,44,57]. Consequently, “agentic commerce” in the literature often signifies the presence of conversational AI rather than confirmed end-to-end autonomy.
4.2. Autonomy Levels (0–3) and How Autonomy Is Instantiated
The distribution of autonomy levels tilts strongly toward low autonomy. A minor subset aligns with Level 0 (reactive, Q&A-like) service agents, lacking evidence of autonomous goal pursuit (e.g., [43,49]). The predominant category is Level 1 (assistive autonomy), wherein the agent offers information, suggestions, or support, while the user maintains decision-making authority and performs actions (e.g., [23,39,42,46,48,50,53]). Level 2 (workflow/guided or semi-agentic) is less common and is usually defined as (a) structured, guided flows that cover multiple subtasks without executing transactions [32,58,59] or (b) systems framed as more agentic through architecture/tool use or fewer LLM calls in conversational recommendation [10].
To reduce conceptual ambiguity, Level 2 is coded only when studies show workflow-level initiative (multi-step orchestration) beyond single-turn assistance, even if transactions remain user-executed. Level 3 (high autonomy with execution) is infrequent and predominantly realized through experimental frameworks that distinctly alter control rights and execution [40,41,46]. Level 3 is reserved for cases with explicit delegation of control rights and demonstrated execution (or validated transaction proxy), rather than purely conversational completion of a task narrative. Significantly, even when “autonomy” is pivotal, it is frequently implemented as a perceived characteristic [60] or as a contrived scenario condition [40,46] rather than as actual task-level delegation in a functioning system. Therefore, autonomy in the evidence base is often “perceived autonomy” rather than “demonstrated autonomy,” and the literature currently provides stronger support for acceptance and governance perceptions than for mature, real-world decision delegation at execution level.
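Restated as decision logic, the autonomy-level coding applied here is conservative by construction; the sketch below is illustrative only, with simplified flags standing in for the full-text judgments described in Section 3.4.

```python
# Sketch of the conservative autonomy-level rules: higher levels require
# explicit evidence, and ambiguous cases default to the lower level.

def code_autonomy_level(reactive_only: bool, workflow_initiative: bool,
                        control_rights_delegated: bool, execution_evidenced: bool) -> int:
    if control_rights_delegated and execution_evidenced:
        return 3  # high autonomy: delegated control rights + demonstrated execution/proxy
    if workflow_initiative:
        return 2  # semi-agentic: multi-step orchestration, user still executes
    if not reactive_only:
        return 1  # assistive: informs/recommends; user decides and acts
    return 0      # reactive, Q&A-style service agent
```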
4.3. Task Scope Patterns Across the Commerce Journey
Task scope clusters around pre-purchase decision support and post-purchase service, with limited coverage of true execution. Most studies incorporate search and/or recommendation functionalities (e.g., [10,33,46,48,54,56]), whereas comparison is observed sporadically [5,8,24,46]. Structured ordering flows, such as cart-building, are uncommon and usually limited to narrow workflows, such as listing/build tasks [7,61] or messenger-based ordering scripts [62]. The checkout/payment stage delineates the distinction between “assistive chatbots” and agentic commerce, manifesting in a limited number of instances, typically via experimental execution frameworks [40,41] or through more generalized “ordering/payment support” assertions rather than documented autonomous completion [45]. Negotiation is virtually nonexistent in the coded set (no N = Yes cases), suggesting that a capability commonly addressed in theoretical discourse is inadequately reflected in empirical commerce research. In general, the literature focuses on S/R/A (search–recommend–support) rather than end-to-end shopping delegation (Table 3).
4.4. Interaction Modes, Initiative, Anthropomorphism, and Explanation
The majority of interaction is text-based chat, usually user-initiated (e.g., [42,46,48,55]); only a few studies demonstrate system-guided initiation [61] or autonomy-contingent initiative [40,46]. Voice is relatively uncommon and primarily manifests as an anthropomorphic cue [52] or as a modality manipulation [8]. Multimodal interaction is also rare and is mainly operationalized as image + text search in fashion shopping [6]. Anthropomorphism is common but conceptually distinct from autonomy; numerous studies use it as a manipulation or measured cue, frequently through avatars, naming, warmth cues, or social presence [24,38,63,64,65]. By contrast, XAI and explanations are rarely explicit. When they do exist, they are usually procedural or minimal [10] or presented as a design element rather than a validated oversight mechanism [5,61]. As a result, the interaction layer frequently increases human-likeness and engagement while under-specifying governance-relevant features such as explanations, overrides, and confirmations (Table 5).
4.5. Transaction Capability and Evidentiary Realism
Most systems are not transaction-capable: even when “commerce” is the setting, the agent usually does not place orders or execute payments (e.g., [4,46,50,51,54,66]). Partial transaction capability, such as placing an order without executing payment, appears in a small subset [33,62]. The high-autonomy manipulation studies [40,41] focus on full execution (order + payment), which makes them crucial but also shows how infrequently execution is observed in naturalistic commerce use. Numerous studies assert a genuine purchase context through mechanisms such as consumer recall or field surveys (e.g., [38,42,48,56,64]). However, direct observation of actual behavior is relatively scarce, primarily manifesting in controlled experiments or platform-integrated evaluations [8,10,24,38,41,55,61]. The phenomenon map thus shows that the empirical literature remains predominantly assistive conversational commerce, with only limited evidence for true delegation and autonomous transaction execution, the defining characteristics of agentic commerce in its strictest sense (Table 5).
Table 5. Core mapping table per study (label, autonomy, task scope, interface, transaction capability).
| Study | Agent Identity (Label; GenAI) | Autonomy (0–3; Basis) | Task Scope | Interaction Mode | Transaction Capability |
|---|---|---|---|---|---|
| Gharib et al. [52] | Virtual assistant/chatbot; pre-GenAI | 1; implied | R, A | Mixed; user-initiated; moderate interactivity; anthropomorphic cues Yes (voice); explanation NR | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Pizzi et al. [24] | Digital assistant; pre-GenAI | 2; manipulated | S, C | Embedded site/app; mixed initiative; moderate interactivity; anthropomorphic cues Yes (avatar); explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed Yes |
| Chakraborty et al. [33] | GAI chatbots; GenAI | 1; implied | S, R, P, A | Text; user-initiated; high interactivity; anthropomorphic cues Yes (type NR); explanation NR | Transaction-capable Claimed; order Claimed; payment No; real context No; behavior observed No |
| Gao et al. [48] | Chatbot; GenAI | 1; implied | S, R, A | Text; user-initiated; high interactivity; anthropomorphic cues No; explanation NR | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Le et al. [42] | AI-driven chatbot; pre-GenAI | 1; implied | S, R, A | Text; user-initiated; moderate interactivity; anthropomorphic cues No; explanation NR | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Martínez Puertas et al. [4] | Chatbot; pre-GenAI | 1; implied | R | Text; initiative NR; moderate interactivity; anthropomorphic cues No; explanation NR | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Vebrianti et al. [44] | “AI chatbot” (customer service); Gen NR | 1; reactive service | S, A | Text chat (web/app); user-initiated; moderate interactivity; anthropomorphic cues No; explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Chua et al. [43] | Chatbot (human-likeness cues); pre-GenAI | 0; cues manipulated (not autonomy) | A | Text; user-initiated; moderate interactivity; anthropomorphism Varied/manipulated; explanation No | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Nie et al. [10] | Multi-agent conversational recommender system; GenAI | 2; implied | S, R | Embedded site/app; user-initiated; high interactivity; anthropomorphic cues No; explanation Minimal | Transaction-capable Claimed; order No; payment No; real context Yes; behavior observed Yes |
| Park et al. [61] | Closed-domain listing assistant (“CSO”); GenAI: No | 1; scripted/guided | B | Mobile app; system-initiated guided; moderate interactivity; anthropomorphic cues No; explanation Yes | Transaction-capable No; order No; payment No; real context No; behavior observed Yes |
| Ionescu et al. [67] | “AI/chatbots” in online shopping; Gen NR | 1; user-perceived assist | S, R | Text (implied); user-initiated; moderate; anthropomorphic No; explanation No | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Ranjan et al. [45] | AI-driven/conversational chatbot; Gen NR | 1; assistive | S, P, A (others NR) | Text (in-app/WhatsApp noted); user-initiated; high; anthropomorphic Yes (framing); explanation No | Transaction-capable Claimed; order Claimed; payment Claimed; real context No; behavior observed No |
| Qi et al. [6] | AI chatbot image search service; multimodal (Gen NR) | 1; implied | S, R (C/P/N/A NR/No) | Multimodal (image + text); user-initiated; moderate; anthropomorphic NR; explanation No | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Chang et al. [49] | Online customer service chatbot; Gen NR | 0; reactive Q&A | A | Text chat; user-initiated; moderate; anthropomorphic No; explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Yoon et al. [23] | AI service assistant/chatbot (personalized recs); GenAI: No | 1; recommendation-only | R | Text chat; mixed initiative; high; anthropomorphic Yes (name/tone); explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Frank et al. [40] | AI shopping assistant (high vs. low autonomy); non-GenAI | 3; manipulated (auto-purchase vs. notify) | S, R, P | Text scenarios; AI-initiated (high autonomy) vs. user-initiated (low); low interactivity; anthropomorphic No; explanation No | Transaction-capable Simulated; order Simulated; payment Simulated; real context No; behavior observed Yes |
| Sadiq et al. [5] | ChatGPT “Green Evangelist”; GenAI | 1; info/recommend | S, C, R | Text chatbot; mixed initiative; moderate; anthropomorphic Yes (generic cues); explanation Yes | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Yue et al. [46] | ChatGPT-driven e-commerce chatbot (“LazzieChat”); GenAI | 1; assistive | S, R | Text chat in app; user-initiated; high; anthropomorphic Yes (language realism); explanation NR | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Ovalle [55] | ChatGPT + content-based RS integration; GenAI | 1; tool-triggered RS call | S, R, A | Text chatbot; user-initiated; moderate; anthropomorphic Yes (persona); explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed Yes |
| Enriquez et al. [60] | “AI chatbots”/AI-powered chatbots; Gen NR | 1; perceived attribute | C, R | Conversational (text implied); user-initiated; high; anthropomorphic No; explanation NR | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Becan et al. [68] | AI-driven consumer chatbot; Gen NR | 1; implied assistive | S (others NR) | Text chat; user-initiated; high; anthropomorphic NR; explanation NR | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Jia et al. [69] | AI chatbots/intelligent customer service; non-GenAI (NLP/ML) | 1; implied | S, R, A | Text chat; user-initiated; high; anthropomorphic Yes (human-like language/friendliness); explanation NR | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Fan et al. [46] | Algorithmic agents (roles: substitute/co-assistant/performer); non-GenAI | 3; manipulated | S, C, R | Scenario vignettes (text); system-initiated varies by autonomy; moderate; anthropomorphic No; explanation NR | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Hu et al. [38] | Avatar-based shopping chatbot; non-GenAI (scripted) | 1; implied | S, C, R | Text chat (Messenger); user-initiated; high; anthropomorphic Yes (visual avatar); explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed Yes |
| Liu et al. [39] | Product recommendation chatbot; Gen NR | 1; assistive | R | Text conversational UI; user-initiated; high; anthropomorphic No; explanation NR | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Ooi et al. [56] | Virtual retail assistant chatbot; GenAI implied (LLM) | 1; implied | R | Text conversational UI; user-initiated; high; anthropomorphic Yes (appearance/tone/social presence); explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Yatawara et al. [22] | AI-driven chatbot (NLP); Gen NR | 1; affordance-based | S, R, A | Chat interface (modality NR); user-initiated; high; anthropomorphic NR; explanation NR | Transaction-capable Not reported; order No; payment No; real context Yes; behavior observed No |
| Khan et al. [21] | AI-based chatbot service agent; Gen NR | 1; implied (weak AI framing) | S, R, A | Text chat; user-initiated; medium; anthropomorphic Yes (design cues mentioned); explanation NR | Transaction-capable Not reported; order Not reported; payment Not reported; real context No; behavior observed No |
| Sundjaja et al. [50] | Customer service chatbot; non-GenAI (AI/NLP) | 1; reactive | A | Text chatbot; user-initiated; medium; anthropomorphic No; explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Cheng et al. [51] | Text-based AI chatbot; Gen NR | 1; reactive service | S, R, A | Text; user-initiated; high; anthropomorphic Yes (relational cues); explanation No | Transaction-capable Not reported; order Not reported; payment Not reported; real context Yes; behavior observed No |
| Kim et al. [54] | ChatGPT vs. search engine; GenAI | 1; prompt-response | S, C, R | Scenario/query task; user-initiated; low; anthropomorphic Yes (conversational tone); explanation No | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Akdemir et al. [70] | Dialog chatbots/service assistants; Gen NR | 1; reactive | A (others NR) | Interface NR; user-initiated; medium; anthropomorphic NR (measured); explanation No | Transaction-capable Not reported; order Not reported; payment Not reported; real context Yes; behavior observed No |
| Nikolov et al. [57] | AI chatbot/virtual assistant; Gen NR | 1; reactive | NR | Text chat implied; user-initiated; high; anthropomorphic Yes (warmth/personality measured); explanation No | Transaction-capable Not reported; order Not reported; payment Not reported; real context Yes; behavior observed No |
| Kamoonpuri et al. [59] | GenAI chatbots; GenAI | 2; positioned as broader support | S, C, R, A | Text conversational (survey context); user-initiated; high; anthropomorphic Yes (human-like ability framing); explanation No | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Klein et al. [7] | Rule-based chatbot (“Luigi” vs. standard); non-GenAI | 1; scripted decision tree | S, R, B | Text web chat; mixed (guided flow); high; anthropomorphic Yes (multi-cue design); explanation No | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Frank et al. [41] | AI shopping assistant (“AI-Sal”); non-GenAI | 3; manipulated (auto vs. notify) | S, C, R, B, P | Web ad + assistant description; mixed initiation; moderate; anthropomorphic No; explanation No | Transaction-capable Observed; order Observed; payment Observed; real context Yes; behavior observed Yes |
| Matosas-López et al. [63] | AI chatbot (human-like vs. assistant naming); Gen NR | 1; scripted scenarios | S, R | Text chat shown via video; user-initiated; high; anthropomorphic Yes (name/avatar/tone); explanation NR | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Schindler et al. [8] | Conversational agents (voice vs. text); non-GenAI | 1; interface manipulation | S, C (others NR) | Voice vs. text; system-initiated guided prompts; high; anthropomorphic Yes (named agents/voice); explanation No | Transaction-capable No; order No; payment No; real context No; behavior observed Yes |
| Li et al. [64] | AI-enabled/anthropomorphic chatbots; Gen NR | 1; reactive service | S | Text chat; user-initiated; medium; anthropomorphic Yes (warmth/competence cues); explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed Yes |
| Ding et al. [31] | AI chatbot; Gen NR | 1; decision support | S, R | Text chat; user-initiated; high; anthropomorphic Yes (human-like cues); explanation NR | Transaction-capable Not reported; order Not reported; payment Not reported; real context Yes; behavior observed No |
| Kumar et al. [65] | Chatbot/humanized bot; Gen NR | 1; reactive | S, R | Text chat implied; user-initiated; moderate; anthropomorphic Yes (mind/agency attribution); explanation No | Transaction-capable Not reported; order Not reported; payment Not reported; real context Yes; behavior observed No |
| Hassan et al. [30] | AI chatbot personalized recommendations; Gen NR | 1; decision support | R | Mixed recommender/chatbot described; system-initiated recs; moderate; anthropomorphic No; explanation No | Transaction-capable Not reported; order Not reported; payment Not reported; real context Yes; behavior observed No |
| Han [62] | Chatbot commerce (mobile messenger); non-GenAI (AI/NLP) | 1; scripted ordering flow | B, A | Text messenger; user-initiated; high; anthropomorphic Yes (conversational human-likeness); explanation No | Transaction-capable Claimed; order Claimed; payment No; real context Yes; behavior observed No |
| Christopher et al. [53] | AI chatbot embedded in platform; Gen NR | 1; reactive | S, R, A | Chat interface (text implied); user-initiated; high; anthropomorphic Yes (scale); explanation No | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Kim et al. [66] | Chatbot services (text customer service); Gen NR | 1; assistive | S, R, A | Text chat; mixed/user-initiated recall; high; anthropomorphic Yes; explanation No | Transaction-capable No; order No; payment No; real context No; behavior observed No |
| Song et al. [32] | Chatbot/virtual assistant (sales delegate); non-GenAI scripted | 2; guided recommender | S, C, R | Text chat UI + avatar; mixed guided; high; anthropomorphic Yes (avatar/familiarity); explanation NR | Transaction-capable No; order No; payment No; real context Yes; behavior observed No |
| Foroughi et al. [58] | E-commerce chatbot (2nd-gen AI; rule-based + NLP); pre-GenAI | 2; adaptive responses | S, C, R, A | Text chat; reactive-interactive; moderate; anthropomorphic Yes (linguistic warmth); explanation No | Transaction-capable No; order No; payment No; real context No; behavior observed No |
Abbreviations (Task scope): S = search; C = compare; R = recommend; B = cart build/listing creation; P = checkout/pay; N = negotiate; A = postpurchase support. NR = not reported.
Figure 2 illustrates the evidence map for agentic commerce by cross-tabulating the level of autonomy (0–3) with the maximum task-scope stage an agent covers. Bubble size represents the number of studies in each cell, and bubble shape distinguishes real purchase contexts (circles) from simulated or scenario designs (squares). The map shows that the literature is heavily clustered rather than spread evenly across the autonomy–scope space. Most studies operate at low autonomy (level 1), particularly at the recommendation and postpurchase/support stages. The primary concentrations are recommendation at autonomy 1 (both real-context and scenario-based studies) and postpurchase support at autonomy 1 (mostly real-context studies). This pattern indicates that most of the empirical base operationalizes “agentic commerce” as conversational decision support rather than delegated execution. Put differently, the mapped evidence is richer for assistance (advice, explanation, support) than for delegation (acting on the consumer’s behalf with execution authority): the mainstream evidence base still mostly demonstrates assistive agents that recommend, explain, or support, not agents that independently carry out end-to-end transactions.
According to Figure 2, the modal task-scope ceiling lies at recommendation and postpurchase support, and empirical operationalizations of agentic commerce cluster at low autonomy (level 1). There is a gap between system capabilities and consumer-facing evaluation: evidence is thin for higher autonomy (levels 2–3), later-stage execution (cart-building; checkout/pay), and negotiation. Outside these predominant clusters, the evidence grows markedly sparse. Search and compare appear, but mostly at autonomy levels 1–2 and with much smaller counts, suggesting that “upstream” decision support is studied less often than recommendation-based interaction. Cart-building is infrequently observed, primarily within scenario-based designs, consistent with the idea that intermediate delegation (basket assembly) is more frequently discussed than empirically tested in realistic contexts. High-autonomy (level 3) evidence is scarce and appears only as isolated instances tied to later-stage tasks (e.g., checkout/payment), underscoring that fully delegated execution is an anomaly in the empirical record. Importantly, even within the few “execution” cases, the evidentiary basis often depends on explicit experimental manipulation (e.g., auto-purchase vs. notify/approve) rather than on naturally occurring, instrumented delegated transactions. Lastly, the map contains no negotiation studies at all, exposing a clear gap between new technical claims about agentic negotiation and the consumer/marketing evaluation literature that this review covers.
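For transparency about how such an evidence map can be constructed, the following is a minimal, self-contained Python sketch. It assumes a hypothetical coding sheet with columns autonomy, max_scope, and context; these names and the example rows are illustrative placeholders, not the review’s actual extraction data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Ordered task-scope stages (S, C, R, B, P, N, A in the table's coding).
SCOPE_ORDER = ["search", "compare", "recommend", "cart", "checkout",
               "negotiate", "postpurchase"]

# Hypothetical coding sheet: one row per study, with its autonomy level (0-3),
# the maximum task-scope stage it reaches, and its purchase context.
studies = pd.DataFrame({
    "autonomy":  [1, 1, 1, 2, 3, 1],
    "max_scope": ["recommend", "postpurchase", "recommend",
                  "cart", "checkout", "search"],
    "context":   ["real", "scenario", "real", "scenario", "scenario", "real"],
})

# Cross-tabulate autonomy x scope separately for real vs. scenario designs;
# bubble area scales with the number of studies in each cell.
for context, marker in [("real", "o"), ("scenario", "s")]:
    counts = (studies[studies["context"] == context]
              .groupby(["autonomy", "max_scope"]).size())
    xs = [SCOPE_ORDER.index(scope) for _, scope in counts.index]
    ys = [autonomy for autonomy, _ in counts.index]
    plt.scatter(xs, ys, s=counts.values * 120, marker=marker,
                alpha=0.6, label=context)

plt.xticks(range(len(SCOPE_ORDER)), SCOPE_ORDER, rotation=45)
plt.yticks(range(4))
plt.xlabel("Maximum task-scope stage")
plt.ylabel("Autonomy level (0-3)")
plt.legend(title="Purchase context")
plt.tight_layout()
plt.show()
```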
Figure 3 maps the distribution of studies by interaction mode and autonomy level (0–3), highlighting how the empirical literature operationalizes agentic commerce primarily through text-based conversational interfaces. The evidence is heavily concentrated in the Text × Autonomy-1 cell (n = 31), indicating that most studies examine low-autonomy systems that support consumers via dialog, information provision, and recommendations rather than through independent task execution.
Only a small number of studies appear at higher autonomy levels within text interfaces (Autonomy-2: n = 3; Autonomy-3: n = 3), suggesting that moderate- and high-autonomy agents remain comparatively under-tested in mainstream consumer-commerce settings. Modalities other than text are uncommon: voice interfaces occur only at Autonomy-1 (n = 3), and multimodal systems (such as image + text) are similarly sparse (Autonomy-1: n = 2). Hybrid or unclearly defined interaction modes are represented by a small “Mixed” category (n = 1 at Autonomy-1). Interestingly, embedded/app-based agent implementations are rare and mostly occur at Autonomy-2 (n = 2), which is consistent with systems that use specialized components or guided workflows (e.g., multi-agent recommenders or constrained in-app assistants). Overall, the heatmap supports a major finding of RQ1: the empirical foundation of the field presently rests on conversational, text-first, low-autonomy prototypes, with comparatively few evaluations of higher-autonomy agents in commerce-relevant tasks and little coverage of richer modality designs.
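The underlying count matrix can be reconstructed directly from the cell frequencies reported above; the sketch below does so and renders a comparable heatmap. Cells not mentioned in the text are assumed to be zero.

```python
import numpy as np
import matplotlib.pyplot as plt

# Study counts per interaction mode x autonomy level, as reported in the text
# for Figure 3; unmentioned cells are assumed to be zero in this sketch.
modes = ["Text", "Voice", "Multimodal", "Embedded/app", "Mixed"]
autonomy_levels = ["0", "1", "2", "3"]
counts = np.array([
    [0, 31, 3, 3],   # Text
    [0,  3, 0, 0],   # Voice
    [0,  2, 0, 0],   # Multimodal (e.g., image + text)
    [0,  0, 2, 0],   # Embedded/app-based
    [0,  1, 0, 0],   # Mixed/unclear
])

fig, ax = plt.subplots()
im = ax.imshow(counts, cmap="Blues")
ax.set_xticks(range(len(autonomy_levels)))
ax.set_xticklabels(autonomy_levels)
ax.set_yticks(range(len(modes)))
ax.set_yticklabels(modes)
ax.set_xlabel("Autonomy level")
# Annotate every cell with its study count for readability.
for i in range(len(modes)):
    for j in range(len(autonomy_levels)):
        ax.text(j, i, counts[i, j], ha="center", va="center")
fig.colorbar(im, label="Number of studies")
plt.tight_layout()
plt.show()
```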
5. Results Part II—RQ2 Determinants of Delegation
5.1. Core Drivers: Usefulness, Effort Reduction, and Trust
The most consistently measured drivers across the mapped evidence are perceived usefulness (n = 11), effort reduction/convenience (n = 11), and trust in the agent (n = 11). Trust in the platform/brand, on the other hand, appears less often (n = 3). The prevailing pattern illustrates a sequence of performance leading to usefulness/satisfaction, then fostering trust, and ultimately resulting in reliance/continuance. Systems designed to minimize friction (e.g., expedited information access, simplified input, structured guidance, enhanced navigation) augment perceived utility and user satisfaction, thereby reinforcing trust and subsequent intention outcomes (e.g., retention, purchase intention, reliance-oriented metrics) [6,44,45,48,50,52,67,69]. When trust is explicitly modeled, it often acts as a mediator that connects perceived usefulness/ease (and broader service-quality or “technical characteristics” signals) to continuance or reliance, which is in line with extended TAM/ECT-style pathways [48,50,67,69]. In numerous studies, “delegation” is operationalized indirectly—through adoption, purchase intention, or continuance intention—rather than through a direct delegation choice or verified hand-off/execution (e.g., [44,45,46,50,52,67]). Evidence with more behavior-proximal outcomes is available but is relatively scarce in the mapped set (e.g., [10,38,53,55]).
5.2. Control and Agency: Empowerment Versus Reactance/Powerlessness
A second cluster centers on perceived control (n = 4), perceived autonomy/agency (n = 5), and reactance/powerlessness (n = 4). This stream illustrates that delegation encompasses both a utility assessment and a governance perspective: when an agent transitions from “advisor” to “decider/executor,” users may perceive diminished agency, eliciting threat-to-freedom reactions (reactance) or feelings of powerlessness, which inhibit adoption and delegation intentions [23,24,40,58]. Experimental autonomy manipulations (e.g., auto-purchase vs. notify) demonstrate that increased autonomy can diminish acceptance via pathways of powerlessness [40,58]. Simultaneously, interface and interaction design decisions can influence these responses: the type of assistant and the initiation mode (user- vs. system-initiated) affect reactance and subsequent satisfaction/choice certainty outcomes [24], whereas personalization-based recommendation cues can induce avoidance through threats to freedom and negative affect pathways, contingent upon user engagement [23]. The control/agency evidence indicates a trade-off: delegation increases when agents are perceived as “capable yet still governable,” and decreases when autonomy cues suggest displacement rather than support [23,24,40,58].
5.3. Risk and Privacy: Barrier Pathways and the Limited Role of Transparency Testing
Privacy concern (n = 4) and perceived risk (n = 3) manifest less frequently than usefulness/trust and exhibit more heterogeneous effects. Numerous models conceptualize risk and privacy as impediments that amplify adverse emotions and resistance or directly inhibit purchase or adoption intentions [4,49,67]. Nonetheless, additional evidence suggests that risk/privacy may be insignificant compared to gratification and utility drivers in specific contexts/samples [42], and that effects may differ based on usage frequency/experience [4]. Importantly for “agentic” delegation governance, while transparency and disclosure are frequently examined conceptually, there is a scarcity of empirical studies that operationalize transparency as a significant governance factor (e.g., approval gates, explainability, accountability cues). When disclosure is modeled, it typically functions as an antecedent or contextual factor rather than a direct delegation-control mechanism; for instance, disclosure can influence perceived quality in an ECT chain or moderate attribute-to-trust relationships [50,51]. Consequently, the available evidence is insufficient regarding when transparency effectively adjusts trust and mitigates over-delegation or regret as systems transition from assistive chatbots to transaction-executing agents.
5.4. Boundary Conditions: When Delegation Drivers Strengthen, Weaken, or Reverse
Moderation and boundary conditions are reported inconsistently (moderators tested vary across determinants), yet several recurring themes surface. First, beliefs about one’s ability (such as perceived task-solving competence) strengthen the usefulness and trust routes: delegation is more likely when the agent is seen as competent for the task [51,52,69]. Second, task and contextual pressures influence autonomy responses: scarcity and time constraints can diminish the negative autonomy-to-powerlessness trajectory, rendering elevated autonomy more acceptable in urgent situations [40,41]. Third, user characteristics and experience (technological readiness/AI literacy n = 5; self-efficacy n = 5; prior usage/experience) consistently function as acceptance “facilitators,” enhancing positive pathways to persistence or reliance [4,7,39,46,68,69]. Fourth, design and channel factors (voice vs. text; initiation mode; anthropomorphic cues) influence perceived control and social presence, occasionally altering the direction of effects through reactance or satisfaction mechanisms [7,24,31,32,52]. Lastly, culture and demographics are often mentioned but rarely tested rigorously. Descriptive reports often discuss generation/age effects [7,67], collectivist peer influence is linked to social-influence pathways [42], and cultural dimensions like power distance seem to act as moderators in certain streams [46]. However, robust multi-country causal tests are still lacking in the mapped set, resulting in a continuing deficiency in generalizability (Table 6).
Three interdependent dimensions are used in the mapped literature to explain “delegation”: (i) utility/friction reduction (usefulness + convenience), (ii) relational assurance (trust in agent/platform), and (iii) governance (control/agency versus reactance/powerlessness), with risk/privacy acting as secondary constraints. As systems move toward truly agentic, transaction-executing commerce [24,40,44,45,50,52,67], the main limitation is measurement: a significant portion of the evidence deduces delegation from adoption or continuation intentions, rather than from direct observation of explicit delegation choices or verified execution.
Table 4 outlines the different boundary conditions that make delegation-related pathways stronger, weaker, or reverse. Instead of treating moderators as separate variables, the table shows how each moderator changes a specific link in the main mechanism chains (for example, friction-reduction → trust → reliance versus autonomy cues → powerlessness/reactance → avoidance). This clarifies that “delegation” is contingent. For example, a design choice such as higher autonomy, anthropomorphic framing, or voice interaction can increase reliance in one context while suppressing it in another, depending on users’ capability beliefs, situational stakes, and prior experience.
Within studies, moderators tend to group into three main families. First, capability and experience enablers (e.g., task-solving competence, self-efficacy, prior chatbot/tech familiarity, eHL) generally enhance the usefulness/trust pathway, increasing the likelihood of intention-based delegation when users perceive the agent as reliable. Second, cues of urgency or stakes, such as scarcity and time pressure, can lower governance resistance; this weakens the autonomy → powerlessness pathway and makes higher autonomy more acceptable in situations where people need it. Third, interface and social-design contingencies (such as type of interaction, initiation, anthropomorphism, avatar familiarity, and modality–product matching) primarily function through reactance, eeriness, fluency, or satisfaction, influencing whether an agent is perceived as supportive or controlling. It is important to note that while privacy and risk are frequent topics of discussion, few studies test transparency or accountability mechanisms as moderators of delegation, leaving a clear evidence gap for truly agentic commerce settings.
The mapped determinants of delegation are integrated into two interrelated pathways in Figure 4. The “utility/assurance lane” captures the dominant friction-reduction logic: delegation-proximal outcomes (e.g., reliance, continuance, purchase intention, or observed acceptance/execution) are enabled by interface and performance cues (e.g., ease, speed, guided steps) that raise perceived usefulness and/or satisfaction and, in turn, trust. The “governance lane” captures the agency trade-off: as autonomy shifts from advisory support toward decision-making and execution, perceived control can erode, triggering reactance or powerlessness that suppresses delegation unless mitigated by safeguards; risk and privacy function primarily as constraints that can amplify resistance or weaken assurance. The strength, and at times the direction, of these links varies with boundary conditions (capability beliefs, prior experience, task pressure and stakes, modality/design cues, and cultural/demographic heterogeneity).
Figure 4 is intended as a testable organizing model rather than a post hoc claim. It can be evaluated in a single experimental or longitudinal design by manipulating autonomy escalation (advisor → co-assistant → executor) and governance safeguards (approval gate, undo/recourse, transparency/accountability cues), while measuring (i) utility/assurance mediators (usefulness, satisfaction, trust), (ii) governance mediators (perceived control/agency, reactance/powerlessness), and (iii) delegation outcomes at increasing realism (intention, simulated delegation choice, and—where feasible—transaction-proxied execution). Estimation can be implemented as a moderated mediation/conditional process specification (e.g., safeguards and urgency moderating the autonomy → governance-threat link and downstream delegation), with boundary conditions in Table 6 treated as moderators of specific links rather than generic “context variables.”
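As a concrete illustration of such a conditional process specification, the sketch below estimates conditional indirect effects and a bootstrap index of moderated mediation on simulated data. All variable names (autonomy, safeguards, control_threat, delegation) are hypothetical placeholders, and the regression-plus-bootstrap approach shown is one common implementation, not a prescription from the mapped studies.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Simulated stand-in data: autonomy (0 = advisor, 1 = co-assistant,
# 2 = executor), safeguards (0/1: approval gate present), a governance
# mediator (perceived control threat), and a delegation outcome.
autonomy = rng.integers(0, 3, n)
safeguards = rng.integers(0, 2, n)
control_threat = 0.8 * autonomy - 0.6 * autonomy * safeguards + rng.normal(0, 1, n)
delegation = -0.5 * control_threat + 0.3 * safeguards + rng.normal(0, 1, n)
df = pd.DataFrame({"autonomy": autonomy, "safeguards": safeguards,
                   "control_threat": control_threat, "delegation": delegation})

def indirect_effects(data):
    """Return the autonomy -> threat -> delegation indirect effect
    at each safeguard level (a-path moderated by safeguards)."""
    med = smf.ols("control_threat ~ autonomy * safeguards", data).fit()
    out = smf.ols("delegation ~ control_threat + autonomy * safeguards", data).fit()
    a0 = med.params["autonomy"]                  # a-path without safeguards
    a1 = a0 + med.params["autonomy:safeguards"]  # a-path with safeguards
    b = out.params["control_threat"]             # b-path
    return a0 * b, a1 * b

# Bootstrap the index of moderated mediation (difference between the two
# conditional indirect effects) to obtain a percentile confidence interval.
boot = []
for _ in range(1000):
    sample = df.sample(len(df), replace=True)
    ie0, ie1 = indirect_effects(sample)
    boot.append(ie1 - ie0)
lo, hi = np.percentile(boot, [2.5, 97.5])

ie0, ie1 = indirect_effects(df)
print(f"indirect effect, no safeguards:  {ie0:.3f}")
print(f"indirect effect, safeguards on:  {ie1:.3f}")
print(f"index of moderated mediation 95% CI: [{lo:.3f}, {hi:.3f}]")
```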
6. Results Part III—RQ3 Outcomes: What Delegation/Agentic Commerce Does
6.1. Marketing Outcomes: Purchase, Conversion, Spend, Loyalty, Satisfaction
In the studies that were mapped, “agentic” or assistant-enabled commerce most often leads to positive marketing-relevant outcomes when the system is set up to reduce friction and speed up decision-making (for example, by giving users faster access to relevant options, fewer steps, and clearer navigation). These benefits are seen as higher purchase/adoption intention, higher observed conversion/acceptance, higher spend/GMV, and stronger retention/continuance/loyalty. Studies that focus on intention usually follow the extended TAM/ECT/TPB logic, which says that quality/usefulness and confirmation lead to higher satisfaction and/or attitudes, which then predict outcomes like continued use, purchase intention, or loyalty (e.g., [21,30,44,46,48,50,66,69,70]). Field and quasi-field designs demonstrate that enhancements in performance can lead to quantifiable business results, such as increased option acceptance and GMV per user [10], heightened click-through rates under scarcity–autonomy framing [41], and elevated repurchase rates following an experimental recommendation treatment in a live-store environment [55].
However, marketing outcomes are not uniformly positive. Some studies show that design elements granting agents more autonomy or a more personal feel can reduce purchase likelihood [23,40], and in some settings “trust” or “capability affordances” do not translate into purchase intention (e.g., CA → trust not leading to PI in Yatawara et al. [22]). The best marketing results occur when systems continue to be seen as helpful, capable, and efficient while still giving users control, rather than taking away their freedom.
6.2. Consumer Outcomes: Trust Calibration, Reliance/Overreliance, Autonomy/Control, Fairness, Regret/Blame
Consumer outcomes are not as consistently covered as marketing outcomes, and measurement is often indirect. “Trust” is frequently conceptualized as a mediator or downstream belief; however, formal trust calibration (distinguishing appropriate trust from miscalibrated overtrust) is seldom implemented within the mapped framework. When “trust calibration” is indicated, the implementation typically mirrors trust levels within satisfaction/experience frameworks rather than calibration tests under varying accuracy/accountability conditions (e.g., [42,68]). Evidence regarding reliance is more definitive than that concerning overreliance: reliance generally escalates with trust and diminishes with resistance, while task complexity and disclosure influence the translation of empathy and friendliness cues into trust [51], and trust directly forecasts reliance in simplified models [53]. Across studies, calibration indicators such as confidence–accuracy alignment, verification frequency, and reliance adjustment after errors are rarely reported; thus, ‘trust’ should be interpreted primarily as an attitudinal belief rather than an appropriateness-of-reliance measure.
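To make these rarely reported indicators concrete, the sketch below computes three candidate calibration metrics from a hypothetical per-trial interaction log; the field names and the log structure are assumptions for illustration, not measures taken from the reviewed studies.

```python
import pandas as pd

# Hypothetical per-trial log from a delegated-shopping task: the agent's
# stated confidence, whether its suggestion was correct, whether the user
# verified before deciding, and whether the user relied on the suggestion.
log = pd.DataFrame({
    "confidence": [0.90, 0.60, 0.80, 0.95, 0.50, 0.85],
    "correct":    [1,    0,    1,    1,    0,    0],
    "verified":   [0,    1,    0,    0,    1,    0],
    "relied":     [1,    0,    1,    1,    1,    1],
})

# 1) Confidence-accuracy alignment: mean absolute gap between stated
#    confidence and observed correctness (0 = perfectly calibrated).
calibration_gap = (log["confidence"] - log["correct"]).abs().mean()

# 2) Verification frequency: how often users cross-check before deciding.
verification_rate = log["verified"].mean()

# 3) Reliance adjustment after errors: reliance on trials following an
#    agent error vs. trials following a success.
after_error = log["correct"].shift(1) == 0
after_success = log["correct"].shift(1) == 1
reliance_after_error = log.loc[after_error, "relied"].mean()
reliance_after_success = log.loc[after_success, "relied"].mean()

print(f"calibration gap:        {calibration_gap:.2f}")
print(f"verification rate:      {verification_rate:.2f}")
print(f"reliance after error:   {reliance_after_error:.2f}")
print(f"reliance after success: {reliance_after_success:.2f}")
```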
Outcomes related to control and autonomy/agency exhibit the most pronounced “agentic-commerce” signature. Several studies indicate that when systems signal greater autonomy (e.g., auto-execution), users may experience diminished control or agency due to feelings of powerlessness or perceived threats to freedom, thereby inhibiting adoption intentions [40]. Other research, however, shows that design can also boost perceived autonomy and psychological ownership, which can boost trust and sometimes attitudes toward the product [38]. This indicates that governance is a key consumer outcome domain: agentic systems can either facilitate agency (ownership, controllability) or undermine it (powerlessness), impacting downstream acceptance. Fairness constructs primarily function as boundary-strengtheners (e.g., price fairness enhancing the engagement → continuous purchase intention relationship [65]), whereas explicit regret/blame attribution outcomes are predominantly absent in the coded set (despite their conceptual significance in delegation contexts).
6.3. Net-Positive vs. Net-Negative Regimes: When Outcomes Flip
The mapped results indicate that, depending on how agency and efficiency are perceived, delegation-related effects cluster into recognizable “regimes” where impacts change from net-positive to net-negative. Systems that provide clear utility—higher perceived usefulness, confirmation, and satisfaction—while maintaining user control over consequential decisions yield the best results in a net-positive regime. According to numerous studies [44,46,48,50,69], this pattern is evident in dominant satisfaction/continuance pathways. It is further supported by field data demonstrating that performance enhancements can result in increased option acceptance and revenue metrics [10]. On the other hand, when autonomy cues suggest a transition from “advisor” to “executor,” a net-negative regime appears. Particularly in regular, non-urgent shopping contexts, the user experience can shift in these situations toward helplessness and threat-to-freedom reactions, resulting in lower adoption intentions and higher avoidance [23,40].
The mapped evidence reveals a contextual override regime that mitigates the negative autonomy pathway: within scarcity or urgency framing, users demonstrate a greater willingness to embrace higher autonomy, as the perceived value of speed and assistance escalates while the autonomy → powerlessness mechanism diminishes [40,41]. Finally, several findings correspond with an intermediate-optimum regime, indicating a governance “sweet spot” where mid-level autonomy (a co-assistant role) can excel beyond both low and high autonomy by optimizing assistance without indicating displacement [46]. The RQ3 evidence suggests that agentic commerce is most effective when perceived as competent assistance under user governance, and least effective when regarded as a transfer of control without sufficient perceived safeguards, an asymmetry that grows more significant as systems evolve from intention-based “chat assistance” to transaction-executing delegation.
Table 7 arranges marketing-related outcomes by type of outcome, type of evidence, and main mechanisms. Two patterns stand out. First, most positive results are based on intentions (like plans to buy, adopt, or keep using something) and follow a consistent pattern of utility → satisfaction/attitude → intention, which is common in TAM/ECT/TPB extensions. Behavioral outcomes (like conversion, click-through, repurchase, and GMV) are less common but provide stronger external validity (e.g., [10,41,55]). Second, the direction of effects is not consistently positive: designs that emphasize autonomy or personalization can lead to avoidance, especially in non-urgent situations. This demonstrates that marketing lift depends on maintaining the user’s sense of control, not just on improving capability cues (e.g., [22,23,40]). Consequently, Table 7 should be interpreted as evidence that “agentic commerce” produces the most dependable marketing benefits when enhancements in efficiency are coupled with manageable decision-making authority.
Table 8 presents the consumer-facing outcomes that are theoretically important to delegation but empirically uneven in coverage. Satisfaction/CX is the most consistently measured consumer outcome. It generally improves through confirmation/usefulness and interaction/service-quality routes, but it can worsen when designs cause reactance or overload. In contrast, trust calibration is a clear gap: most studies examine trust level (often as a mediator) instead of calibration (appropriate reliance under uncertainty/accountability), so the “trust calibration” conclusions in the current mapped set should be treated as provisional (e.g., [42,68]). The most distinctive agentic-commerce signature lies in governance outcomes (post-interaction control and autonomy/agency), which depend on the design and regime: autonomy can either support agency (e.g., via psychological ownership) or undermine it (via powerlessness). This helps explain why marketing outcomes change from one context to another (for example, [38,40,46]). Finally, fairness is mostly positioned as a boundary-strengthener (for example, price fairness can strengthen the link between engagement and continued purchase intention), and explicit regret or blame attribution is mostly missing as a measured outcome, even though it is conceptually important in delegation settings.
The direction of empirical research on important marketing and consumer outcomes related to agentic commerce and delegation-enabled systems is summarized in Figure 5. Consumer outcomes (bottom row) and marketing outcomes (top row) are the two categories of outcomes. Cell shading shows the direction of evidence that is most prevalent throughout the mapped studies: POS stands for primarily positive effects; MIX for contingent or mixed effects based on design, context, or moderators; NEG for primarily negative effects; and GAP for inadequate or absent direct empirical evidence in the reviewed literature. Hatched cells denote research gaps, highlighting outcome domains that remain under-examined despite their conceptual relevance to agentic delegation.
Figure 5 provides a high-level outcomes map that summarizes what agentic commerce does, displaying effects on marketing-level performance and on consumer-level experiences. On the marketing side, the evidence is mostly positive for outcomes related to spending (like GMV/WTP) and loyalty/repurchase, whereas purchase/adoption intentions, observed conversion, and brand choice show mixed and context-dependent effects, in line with autonomy-sensitivity and governance conditions. From the consumer’s perspective, satisfaction and customer experience outcomes are largely favorable, while perceptions of control and autonomy following interactions display inconsistent trends, indicating the dual capacity of agentic systems to either enhance or diminish user agency. Several theoretically significant outcomes—such as trust calibration, reliance versus overreliance, perceptions of fairness, and the attribution of regret or blame—remain empirically under-examined, highlighting critical research gaps for future investigations.
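For reference, the direction-of-evidence coding described above can be transcribed into a simple lookup structure; the cell list below follows the outcomes named in the text, and any cells not explicitly mentioned are omitted.

```python
# Direction-of-evidence coding for Figure 5, transcribed from the text:
# POS = mostly positive, MIX = contingent/mixed, NEG = mostly negative,
# GAP = insufficient direct empirical evidence (hatched cells).
outcomes_map = {
    "marketing": {
        "spend (GMV/WTP)": "POS",
        "loyalty/repurchase": "POS",
        "purchase/adoption intention": "MIX",
        "observed conversion": "MIX",
        "brand choice": "MIX",
    },
    "consumer": {
        "satisfaction/CX": "POS",
        "perceived control/autonomy (post-interaction)": "MIX",
        "trust calibration": "GAP",
        "reliance vs. overreliance": "GAP",
        "fairness perceptions": "GAP",
        "regret/blame attribution": "GAP",
    },
}

for row, cells in outcomes_map.items():
    for outcome, direction in cells.items():
        print(f"{row:9s} | {outcome:45s} | {direction}")
```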
7. Discussion
7.1. Integrative Interpretation: “When Delegation Works/When It Fails”
This review sought to connect the concept of agentic commerce systems (RQ1) with the rationale behind delegation (RQ2) and the subsequent outcomes (RQ3). A consistent pattern of integration emerges from the mapped evidence: delegation is most probable—and advantageous—when autonomy is perceived as effective assistance under user governance, and least probable (or detrimental) when autonomy is regarded as control transfer without safeguards. In other words, consumers do not see “autonomy” as just a functional ability; they evaluate it as a governance condition that reshapes perceived agency, accountability, and trust. This interpretation is strengthened by foundational work on trust in automation and reliance: the relevant benchmark is not merely “high trust,” but appropriate reliance, calibration of reliance to actual capability and uncertainty, because miscalibration yields misuse (overreliance/automation bias) or disuse (avoidance/algorithm aversion). In agentic commerce, where systems can act on behalf of the user, this calibration criterion becomes decisive for both performance and consumer protection.
7.1.1. The Net-Positive Regime: “Capable Help + Preserved Agency”
Delegation is most effective in a system where the agent minimizes friction (time/effort, search costs, cognitive load) while maintaining the user’s sense of authorship and ability to override. In the main empirical cluster (Autonomy-1, recommendation/support scope), positive outcomes usually follow a utility/assurance chain: performance cues and interaction quality make people feel more useful and satisfied, which builds trust and leads to downstream adoption/continuance or purchase-intent outcomes (e.g., [44,46,48,50,69]). In instances where behavioral or platform-embedded evidence is present, the same rationale can lead to quantifiable performance enhancement (e.g., increased acceptance and revenue/GMV through improved conversational recommendation) instead of being solely intention-driven [10,55]. The integrative implication is that the “success condition” for delegation is not maximal autonomy—it is credible competence plus governability: systems are welcomed when they feel like they are making the user more effective rather than displacing the user’s decision rights. From a reliance perspective, this regime corresponds to supported reliance: users accept assistance when competence cues are credible and when governance cues (override, preview, confirmation) keep reliance within the user’s comfort boundary. This also clarifies why algorithm appreciation can emerge in commerce contexts: when the agent demonstrably reduces search and coordination costs without obscuring control rights, reliance becomes instrumentally justified rather than identity-threatening.
7.1.2. The Net-Negative Regime: “Autonomy Cues → Agency Threat → Reactance/Powerlessness”
Delegation “fails” when autonomy cues trigger perceived loss of control or freedom. Two mechanisms consistently emerge across studies: psychological reactance (perceived threat to autonomy) and powerlessness (experienced dislocation). Personalized or heavily directed recommendations can provoke reactance and avoidance, especially in contexts of heightened involvement or when such recommendations are viewed as limiting choice [23]. Similarly, manipulations that shift the agent from advisor to executor (like auto-purchase vs. notify/assist) can lower acceptance through powerlessness pathways [40]. Design elements that suggest system-initiated control, particularly in the absence of explicit user veto or approval, exacerbate this phenomenon, even when downstream satisfaction may remain favorable due to certainty and performance improvements [24]. The main idea is that the same capability that reduces effort can also raise governance costs: when users think the agent is “taking over,” delegation becomes psychologically burdensome and outcomes flip from net-positive to net-negative. In automation terms, this regime resembles disuse (avoidance) driven by perceived agency loss, and it helps reconcile “competent but rejected” systems: even accurate systems can be resisted if their operation threatens authorship, identity, or responsibility boundaries. It also highlights a second failure mode—misuse—when persuasive interface cues inflate trust beyond competence, producing overreliance (automation bias) that may not be visible in intention-only outcomes but becomes consequential once transactions are executed.
7.1.3. A Contextual Override Regime: Urgency/Stakes Can Blunt Governance Resistance
The realization that negative autonomy effects are not fixed is a significant boundary insight. The autonomy to powerlessness pathway can weaken in high-urgency situations, which are most obviously operationalized by scarcity/time pressure. This makes higher autonomy more acceptable because speed and assistance become more valuable in comparison to autonomy concerns [40,41]. This helps resolve conflicting results in RQ3: when the opportunity cost of delay is significant, the same autonomy design that causes avoidance in routine shopping may become advantageous. Delegation is best modeled as context-sensitive trade-off optimization rather than a monotonic “more autonomy is better/worse” relationship, which has implications for both platforms and theory. This regime is also consistent with reliance research showing that users shift thresholds under time pressure: they may accept higher automation despite lower transparency, increasing the importance of safeguards that prevent “fast” from becoming “reckless,” such as bounded authority, staged execution, and clear escalation policies.
7.1.4. An Intermediate-Optimum Regime: The “Co-Assistant” Sweet Spot
The mapped evidence also supports a regime that fits with an inverted-U intuition: intermediate autonomy can do better than both minimal and maximal autonomy by maximizing help and minimizing displacement. In role-based autonomy frameworks (e.g., substitute vs. co-assistant vs. performer), the co-assistant role can attain the optimal equilibrium between support and perceived agency, corroborating the overarching conclusion that delegation is most effective when users maintain significant authorship and control rights [46]. In a similar vein, modality and interface cues determine whether autonomy is perceived as supportive or intrusive (e.g., the effects of voice versus text on cognition and satisfaction; assistant initiation patterns affecting reactance) [8,24,52]. If we examine these results collectively, they suggest that there is no “best” level of autonomy that works universally; instead, it depends on whether the design communicates partnership (shared control) rather than replacement (control transfer). Framed through appropriate reliance, the “co-assistant” regime functions as a calibration architecture: it enables reliance while maintaining verification opportunities, thereby reducing both misuse (overreliance) and disuse (avoidance). It also provides a natural locus for error recovery—users can inspect intermediate steps (candidate items, rationale, constraints) before consequential execution.
7.1.5. What Is Missing (And Why It Matters): Calibration, Accountability, and True Execution
The regime framework underscores a structural deficiency in the evidence base: a significant portion of what the literature refers to as “delegation” is derived from intentions of adoption or continuation, or levels of satisfaction, rather than from direct observations of delegation decisions, execution behaviors, or trust calibration (appropriate reliance under uncertainty). Even in contexts where trust is paramount, it is typically assessed in terms of trust level rather than calibration (e.g., confidence–accuracy alignment, verification behavior, response to errors) [42,68]. In the same way, true agentic execution (placing an order and paying for it) is still rare and mostly limited to scenario manipulations rather than embedded transaction observation [40,41]. This measurement gap is significant because governance failures in agentic commerce are essentially post-delegation occurrences (overreliance, blame/regret, auditability following error). Without calibration and accountability measures, the field risks overgeneralizing from “chat satisfaction” to “safe delegation.” In particular, foundational automation findings suggest that trust is path-dependent and error-sensitive: users update reliance after failures and successes, and these dynamics are central to long-run delegation. From a human factors perspective, the core construct is not global trust but appropriate reliance—whether users monitor, verify, and intervene in ways that match the system’s competence and uncertainty. The reviewed corpus rarely measures (i) verification behavior (cross-checking, seeking second opinions), (ii) reliance under stated uncertainty, (iii) recovery actions (undo/cancel/escalate), or (iv) accountability judgments (who is blamed, how responsibility is partitioned, and whether recourse is available). These behaviors are the practical markers of calibrated reliance (monitoring, intervention, and recovery), yet they are largely absent from the commerce evidence base. These constructs are especially consequential for agentic commerce because execution errors have direct monetary and reputational consequences and can trigger “algorithm aversion” after salient failures, even when average performance is high. Moreover, as autonomy increases, users’ oversight demands typically rise (supervisory control), but situation awareness can deteriorate when intermediate steps, uncertainty, and constraints are not externalized; this makes governance mechanisms (visible state, confidence cues, approval gates, and reversible actions) central to safe delegation rather than optional UX features.
Delegation is successful across domains and designs when systems provide felt competence under user governance; it fails when autonomy cues suggest a loss of agency in the absence of safeguards, although urgency and mid-level “co-assistant” designs can tip the scales in favor of acceptance. The strongest unresolved frontier is empirical: as systems advance beyond assistive chat into transaction-capable agents, we need more studies that observe real delegation, real execution, and calibration/oversight behaviors (monitoring, intervention, recovery). Accordingly, the “delegation regimes” synthesis should be read as mapping acceptance dynamics under largely assistive or scenario-based autonomy, not as definitive evidence about safe execution-level delegation in the wild.
7.2. Implications for Design (What to Build: Autonomy, Controls, Explanations, Privacy)
A fundamental implication of the mapped evidence is that “more autonomy” is not a universal enhancement; rather, it constitutes a governance decision that alters the consumer’s experience of agency, accountability, and risk. Negative reactions arise when autonomy cues suggest a loss of control (reactance/threat-to-freedom) or displacement (powerlessness); conversely, positive outcomes are observed when systems facilitate friction reduction while maintaining perceived authorship and recourse [23,24,40]. Design should, therefore, be based on a straightforward principle:
Default to “co-assistant” autonomy for consequential steps, and use escalation (to executor) only with explicit user consent, strong recourse, and clear accountability.
This principle is also consistent with human factors research on levels of automation and supervisory control: as automation increases, users shift from doing to monitoring, and therefore require stronger support for situation awareness and intervention. In agentic commerce, this means that higher autonomy must be paired with “oversight scaffolding”—clear visibility into what the agent is doing, the option to pause/override, and reliable recovery paths when errors occur—otherwise, autonomy cues are likely to be interpreted as control transfer rather than assistance. In practice, this implies that autonomy design and oversight design must be specified together: a higher “automation level” without corresponding monitoring and recovery affordances increases the risk of both misuse (overreliance) and disuse (avoidance after failure).
This principle aligns with the review’s “regimes” structure. In everyday shopping situations, executor-like behavior can stop people from adopting something by making them feel powerless [40] and can make reactance worse when recommendations seem to be steering or limiting [23]. Conversely, co-assistant designs can maintain satisfaction and choice certainty when users recognize performance advantages without a reduction in governance [24], and modality/interaction selections can additionally influence whether autonomy is perceived as supportive or intrusive [8]. The literature seldom evaluates transparency and oversight mechanisms as first-class levers, rather than mere background “quality” indicators. Consequently, the design agenda must delineate not only what to construct but also what to assess to guarantee secure delegation. Specifically, governance should be treated as a calibration toolkit: its role is to keep reliance aligned with competence and uncertainty, thereby preventing both overreliance (misuse/automation bias) and avoidance after failure (disuse/algorithm aversion). Practically, this requires measurement that goes beyond “trust level” toward calibration outcomes (e.g., verification and intervention rates conditional on uncertainty, error-following reliance, and recovery success), because these are the behaviors that determine whether delegation remains safe over repeated use. Accordingly, the most policy-relevant design question is not whether users “trust” the agent, but whether they rely appropriately—monitor when needed, intervene when stakes are high, and recover effectively when errors occur.
Table 9 and Table 10 employ a governance levers × risk zones approach to translate the review’s socio-technical synthesis into actionable design guidance. The framework does not treat autonomy as a single “more vs. less” setting. Instead, it breaks agentic commerce down into a select few consumer-harm failure modes that are already visible—either directly or indirectly—across the mapped studies. First, feelings of powerlessness and psychological reactance occur when cues of autonomy or personalization are perceived as threats to freedom or displacement, leading to decreased acceptance and heightened avoidance [23,24,40]. Second, because most studies measure trust as a level rather than as calibration (i.e., appropriate reliance under uncertainty and error), overreliance and miscalibrated trust remain empirically under-measured despite representing a conceptually central risk in delegated decision-making—automation bias and blind acceptance. Third, privacy may function as a conditional constraint rather than a uniform inhibitor because privacy risk and surveillance concerns—including creepiness and fear of data misuse—appear in resistance and barrier pathways and exhibit heterogeneous significance across settings and user groups [4,42,49]. Fourth, the evidence currently available treats fairness and price fairness risks (such as opaque rankings, discriminatory offer selection, and hidden markups) as boundary conditions or conversion amplifiers. However, as agents move from recommending to executing and optimizing offers, these risks become structurally more significant [65]. Fifth, despite being crucial for long-term trust and willingness to delegate, regret and blame attribution—who is accountable when delegated decisions go wrong—are rarely measured directly in the mapped set. Lastly, security-induced governance failures, such as compromised instructions or memory that can reroute agent behavior, are highlighted by the broader framing of agentic systems. Although this risk has not yet been thoroughly examined in consumer studies, it emphasizes the necessity of auditability, confirmation gates, and recourse mechanisms in transaction-capable designs. Viewed through a human factors lens, these risk zones correspond to predictable failure modes in supervised automation: reduced perceived control (reactance/powerlessness), inappropriate reliance (misuse), inappropriate avoidance (disuse), and weak recovery after error—each of which can be mitigated by explicit oversight and intervention mechanisms.
Across all risk zones, the same six governance “building blocks” recur (a configuration sketch follows the list):
Autonomy defaults and escalation (advisor → co-assistant → executor)
Override/undo/recourse
Confirmation gates (human-in-the-loop)
Explanation depth (what/why/uncertainty; inputs used)
Data and privacy controls
Accountability signals (responsibility partition and audit trail)
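As a design-facing illustration (not a specification drawn from the reviewed studies), these six levers can be bundled into a single reviewable agent policy object; every type and field name below is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class AutonomyRole(Enum):
    ADVISOR = 0        # recommends only
    CO_ASSISTANT = 1   # prepares actions; user confirms consequential steps
    EXECUTOR = 2       # executes within bounds after explicit user consent

@dataclass
class GovernancePolicy:
    """Bundles the six recurring governance levers into one reviewable object."""
    autonomy: AutonomyRole = AutonomyRole.CO_ASSISTANT  # 1) default + escalation
    undo_window_hours: int = 24                         # 2) override/undo/recourse
    confirm_spend_above: float = 50.0                   # 3) confirmation gate (HITL)
    explanation_depth: str = "what+why+uncertainty"     # 4) explanation depth
    share_data_with_merchants: bool = False             # 5) data/privacy controls
    audit_log_enabled: bool = True                      # 6) accountability signals

    def requires_confirmation(self, amount: float) -> bool:
        """Executor actions under the spend gate may proceed; all else asks."""
        if self.autonomy is not AutonomyRole.EXECUTOR:
            return True
        return amount > self.confirm_spend_above

policy = GovernancePolicy()
print(policy.requires_confirmation(amount=19.99))  # True: co-assistant confirms
```

The design choice worth noting is that the policy is data, not behavior scattered across the interface: a single object like this can be versioned, audited, and escalated with explicit consent, which is precisely what the accountability lever requires.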
The design implication is that governance should not be an afterthought. If the field continues to define “agentic commerce” mainly through satisfaction or intention, it will miss the conditions that determine whether delegation is safe, trusted, and resilient, especially as systems move from assistive chat to execution. Therefore, future evaluations should treat oversight, intervention, and recovery as first-class outcome variables, not merely interface features.
7.3. Limitations of Reviewed Studies
The mapped literature provides a preliminary evidence base regarding delegation in commerce-oriented AI assistants; however, it is limited by (i) the insufficient realism of “agentic” execution, (ii) an excessive dependence on cross-sectional self-reporting, and (iii) inconsistent focus on governance-relevant constructs (e.g., calibration, recourse, accountability). Table 11 and Table 12 summarize these limitations by separating higher-evidence designs (field/quasi-field/experiments/usability) from observational and mixed-method studies, thereby clarifying where inference is comparatively stronger and where it remains provisional. In addition, two review-process constraints should be interpreted transparently and proportionately: screening/coding were performed by a single reviewer, and we did not apply a single formal risk-of-bias instrument (e.g., a clinical-style RoB tool) because the evidence base spans heterogeneous designs and the aim is evidence mapping rather than pooled causal estimation. Instead, we implemented audit-oriented safeguards (time-separated second-pass verification; boundary-case audits; conservative default rules) and a transparent study-level appraisal checklist and evidence-direction rule set (Section 3.4). These procedures reduce interpretive drift but do not eliminate the possibility of residual selection/coding error; accordingly, our conclusions are framed as a structured evidence map and mechanism synthesis rather than definitive causal claims.
We did not report inter-rater reliability statistics because a single reviewer made all full-text screening and high-leverage coding decisions (autonomy level, transaction capability, evidence grade, and delegation operationalization). To reduce single-reviewer bias, we employed a pre-specified codebook with explicit decision rules, maintained a full-text exclusion log, performed purposive boundary-case audits and time-separated second-pass verification, and applied conservative default classifications unless the full text offered adequate operational detail (Section 3.3). These steps improve the transparency and replicability of the mapping, but they cannot fully substitute for dual independent screening. Likewise, we did not apply a single formal risk-of-bias tool because the corpus contains surveys, experiments, platform evaluations, and system demonstrations whose design features cannot be appraised meaningfully under one RoB category, and the review did not calculate pooled effects. Instead, we present a structured evaluation of evidentiary realism, autonomy instantiation, transaction evidence, measurement rigor, and reporting transparency (Section 3.4), using these criteria to keep synthesis claims commensurate with evidentiary maturity. Consequently, our most robust assertions pertain to (i) the concentration versus scarcity of evidence and (ii) the mechanisms that consistently emerge within particular autonomy/evidence frameworks, whereas assertions regarding execution-level delegation and governance outcomes remain preliminary in the absence of substantial behavioral evidence.
7.3.1. Execution Realism and External Validity Remain Thin
A primary constraint is that a significant portion of the evidence categorized as “agentic” continues to represent assistive conversational commerce rather than validated delegated execution. Even studies with stronger evidence often rely on scenarios, mock-ups, or limited tasks instead of real purchases. For instance, lab experiments often manipulate assistant design or autonomy, but measure the results as intentions or simulated choice instead of actual checkout/payment [7,8,24,38,43,52,54,63]. The autonomy-to-powerlessness mechanism is well supported in controlled experiments, but it is rarely tested in real-life transactions [40,46]. Field evidence is uncommon, and when present, it is often reported with little methodological detail (for example, short A/B reports with unclear sampling/randomization) [10]. Quasi-field studies enhance ecological validity but encounter challenges including ambiguous randomization, limited site contexts, and short follow-up intervals [55]. For these reasons, the literature can identify plausible psychological mechanisms, but it remains less clear how those mechanisms operate under repeated use, real monetary stakes, and platform governance constraints.
7.3.2. Causal Identification Is Uneven Even in “Higher-Evidence” Designs
Table 11 demonstrates that experimental research frequently substantiates causal assertions when randomization is evident and manipulation checks are documented; nonetheless, limitations endure, characterized by scenario-based autonomy, restricted domains, and intention-laden outcomes [40,41,46]. Numerous studies exhibit robust internal validity while functioning within simulated purchasing contexts [7,8,24,63]. Some studies incorporate execution framings but are constrained by contextual scope or sample composition (e.g., experiments with a predominance of students) [32,38,54]. Field and quasi-field contributions enhance inferences regarding market outcomes; however, methodological limitations—including ambiguous assignment, platform specificity, abbreviated observation periods, and analytical simplifications—constrain generalizability and the testing of mechanisms [10,55]. Pilot usability studies provide detailed understanding of interface comprehension and control requirements; however, small-N prototypes are inadequate for substantiating extensive behavioral assertions [61].
7.3.3. The Observational Core Is Dominated by Cross-Sectional SEM and Single-Source Self-Report
Table 12 indicates that the bulk of the literature relies on cross-sectional survey designs (often SEM/PLS-SEM) that usually measure usefulness, satisfaction, trust, and purchase/continuance intention in one wave. This establishes foreseeable limitations on causal interpretation and heightens susceptibility to common method bias, notwithstanding the reporting of statistical validations (e.g., VIF heuristics, single-factor tests, procedural remedies) [39,42,44,45,46,50,56,58,67,68,69]. Time-lag designs are infrequent and typically confined to intention-based outcomes rather than actual behavior [31]. Mixed-method contributions (e.g., interviews plus surveys) enhance contextual richness but generally rely on self-reporting and fail to address gaps in behavioral verification [33]. As a result, the main evidence base supports correlational “acceptance-style” pathways (utility → satisfaction/trust → intention) but offers little information about actual delegation choices, verification behavior, or recovery after delegation.
7.3.4. Sampling and Context Concentration Constrain Generalizability
Empirically, the mapped corpus is geographically concentrated and demographically skewed. In the included studies (N = 47), East Asia and Southeast Asia account for 42.5% of settings, and China is the single most common country context (19.1%) (Table 4). Sample frames also reflect common generalizability constraints: students (21.3%) and online panels (23.4%) represent a substantial share of samples, while behavioral evidence in realistic contexts remains limited (14.9%) relative to intention-only/cross-sectional designs (55.3%) (Table 4). Accordingly, cultural/regulatory boundary conditions are frequently acknowledged but rarely tested systematically, and synthesis claims are interpreted as most applicable to digitally intensive, early-adopter, platform-specific contexts rather than as universal effects across regulatory regimes and consumer segments.
The majority of the samples in both tables are convenience-based or skewed toward students and young adults. Most studies are also limited to e-commerce in one country, such as China, Vietnam, Indonesia, Turkey, Malaysia, Sri Lanka, Egypt, India, Singapore, or the Philippines. This concentration makes cross-cultural inference difficult and may overrepresent early adopters, digitally fluent users, and platform-specific norms [22,30,39,42,46,48,49,50,51,56,60]. Panels are often not random samples, and their topical range can be very narrow (for example, a single product category, one platform, or a few task types) [7,8,41]. These limits matter especially for agentic commerce because acceptance of autonomy is likely shaped by stakes, norms, literacy, and regulatory expectations—factors that are difficult to generalize from geographically and demographically concentrated samples.
7.3.5. Measurement Gaps: Delegation, Calibration, and Accountability Are Under-Specified
This limitation is not peripheral: it constrains what can be claimed about execution-level delegation and governance outcomes. A final limitation pertains to the construct level: numerous studies operationalize “delegation” indirectly through adoption, continuance, or purchase intention, instead of through explicit delegation choice or the proportion of tasks delegated. In the same way, “trust” is usually thought of as a level (often a mediator) instead of a calibration (the right amount of trust when things are uncertain or mistakes happen). Governance-critical outcomes—regret, blame attribution, dispute behavior, recourse utilization, and auditability perceptions—are seldom measured directly, despite their significance in high-autonomy execution environments. This disparity is evident in both experimental and survey methodologies: experiments confirm autonomy → powerlessness/reactance mechanisms [23,40] and interface-driven reactance shifts [24], whereas surveys focus on usefulness/satisfaction/trust chains [42,44,69]. However, there is a scarcity of studies employing calibration tasks, error-and-recovery paradigms, or behavioral logging of oversight (e.g., checking, undo, gate dropout).
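To indicate what behavioral logging of oversight could look like in practice, the sketch below defines a hypothetical event schema for an execution-capable shopping agent and derives checking, gate-dropout, and undo rates; none of these fields or event names come from the mapped studies.

```python
from collections import Counter

# Hypothetical oversight events from an execution-capable shopping agent.
# Each tuple is (session_id, event), where event is one of:
#   "gate_shown"    - confirmation gate presented before execution
#   "gate_approved" / "gate_abandoned" - user decision at the gate
#   "detail_check"  - user opened item/rationale details before deciding
#   "undo"          - user reversed a completed agent action
events = [
    (1, "gate_shown"), (1, "detail_check"), (1, "gate_approved"),
    (2, "gate_shown"), (2, "gate_abandoned"),
    (3, "gate_shown"), (3, "gate_approved"), (3, "undo"),
]

counts = Counter(event for _, event in events)
gates_shown = counts["gate_shown"]

# Checking: how often users inspect details before deciding at a gate.
check_rate = counts["detail_check"] / gates_shown
# Gate dropout: gates shown that were abandoned rather than approved.
dropout_rate = counts["gate_abandoned"] / gates_shown
# Undo rate: recovery actions per approved execution.
undo_rate = counts["undo"] / counts["gate_approved"]

print(f"check rate:   {check_rate:.2f}")
print(f"gate dropout: {dropout_rate:.2f}")
print(f"undo rate:    {undo_rate:.2f}")
```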
Overall, the evidence tables (Table 11 and Table 12) demonstrate a body of research that is strong at identifying acceptance mechanisms under assistive autonomy but still falls short of capturing the empirical hallmarks of genuinely agentic commerce: actual execution, continuous human oversight, and accountability following mistakes. This pattern underpins the intended agenda and drives its emphasis on design and measurement.
7.4. Implications for Theory (Delegation as Mediator; Autonomy as Boundary Condition)
The evidence compiled from RQ1 to RQ3 substantiates a distinct theoretical assertion: agentic commerce ought to be conceptualized not as “AI presence → outcomes,” but as “design capability → delegation (mechanism) → outcomes,” with autonomy determining when and how the mechanism operates. In other words, delegation is the main link between system design and marketing and consumer outcomes, while autonomy is best seen as a boundary condition that changes the meaning of the interaction and the sign and strength of key effects.
7.4.1. Delegation Is the Mechanism That Links Agent Design to Outcomes
Customers’ favorable reactions to commercial chatbots follow a well-known acceptance pathway in which usability/service quality signals boost perceived usefulness and satisfaction, which in turn strengthen trust and continuation/purchase intentions, according to a large portion of the survey-dominant evidence. However, rather than measuring delegation explicitly, this chain frequently makes assumptions about it. The theoretical improvement is to make the “handoff” unambiguous: perceived utility and trust are transformed into behavior-proximal outcomes through delegation (transfer of decision rights/actions). Perceived usefulness/satisfaction mediating design cues → trust/continuance and trust mediating performance cues → intention are two examples of mediation structures frequently reported in the mapped studies [44,46,48,50,69]. Additionally, it is consistent with research demonstrating that design decisions and interface modalities function through psychological pathways that ultimately determine a user’s willingness to rely on the agent [8,24].
Higher-autonomy studies critically illuminate the mechanism: as systems transition from advising to executing, outcomes increasingly hinge on users’ willingness to delegate significant actions rather than their “attitude toward the chatbot.” This is why negative pathways often come up through governance-related mediators, like powerlessness and reactance, which make adoption and delegation less likely even when the benefits are clear [23,40]. Consequently, theory ought to regard delegation as a separate construct (decision-right transfer) instead of a synonym for adoption or continuance.
Future theoretical models must explicitly incorporate (i) a delegation construct (choice to delegate; proportion delegated; or observed execution acceptance), (ii) mediators that facilitate delegation (trust, perceived control, usefulness/effort reduction), and (iii) downstream consequences resulting from delegation (conversion, expenditure, loyalty; alongside post-delegation autonomy/control, regret/blame, and trust calibration).
7.4.2. Autonomy Is Not Just an Antecedent—It Changes the Regime of Effects
Another implication is that autonomy is not just an “intensity” variable; it is a regime switch. Autonomy is often weakly defined in the mapped record (for example, “perceived autonomy” and vague “AI chatbot” labels), but the most prominent differences in outcomes occur when autonomy is manipulated as a change in control rights and execution authority [40,46]. These studies show that more autonomy can make people less accepting by making them feel powerless. On the other hand, “co-assistant” roles that are in between low and high autonomy can do better than both. This suggests that autonomy should be seen as a boundary condition that affects the delegation mechanism rather than as a simple main effect [40,46].
7.4.3. A Two-Lane Theoretical Synthesis: Utility/Assurance vs. Governance
Putting the above together, the evidence supports a two-lane theoretical framing consistent with Figure 4. In the utility/assurance lane, design quality increases usefulness and reduces effort, which raises satisfaction and trust and, in turn, delegation and proximal outcomes such as continuance and, where possible, purchase or conversion; this lane predominates in the cross-sectional SEM studies within the corpus [44,46,48,50,69]. In the governance lane, autonomy and steering cues shape perceived control or agency versus powerlessness or reactance, which in turn affects willingness to delegate or to avoid; this lane becomes prominent when agents appear to decide or act "on behalf," and is most evident in experimental autonomy manipulations and reactance studies [23,24,40].
Theoretically, both lanes are active at any given time, and autonomy determines which lane dominates. Under low autonomy, benefits typically take precedence and delegation is inferred loosely from adoption and continuance. Under higher autonomy, delegation becomes a more brittle mechanism because governance costs may take precedence unless safeguards are prominent.
7.4.4. Positioning Agentic Commerce Relative to Adjacent Theories
Ultimately, these implications explain the disparity between the positive outcomes frequently observed in traditional technology acceptance frameworks (TAM/ECT-like chains) and the mixed or negative results reported in autonomy-centric studies. Acceptance theories cover the utility/assurance lane, while reactance/control frameworks address the governance lane. The agentic-commerce contribution is to integrate the two around delegation: whereas acceptance models usually leave delegation implicit, the present framing treats delegation as the action-level construct and autonomy as the boundary condition that makes governance costs consequential [23,24,40].
7.5. Research Agenda: Design-Ready Gaps and Testable Recommendations
The research agenda prioritizes design-testable questions over further descriptive mapping, building on the evidence map (RQ1), the determinants synthesis (RQ2), and the outcomes review (RQ3). The agenda is motivated by two cross-cutting patterns. First, genuinely agentic execution (cart/checkout/payment) and negotiation remain uncommon, and the empirical base is still concentrated on low-autonomy, text-based assistants with outcomes dominated by intention and satisfaction constructs [10,40]. Second, the effects of manipulating autonomy frequently shift through governance mechanisms, such as perceived control and powerlessness/reactance, indicating that future research should treat governance and delegation as causal constructs rather than downstream correlates [23,24,46]. As a result, the following agenda is structured around gaps that prevent safe design and generalizable theory: behavioral identification, autonomy regimes, external validity and replication, governance levers, calibration/overreliance, and accountability/fairness.
7.5.1. Strengthen External Validity and Replicability of Field Evidence
The review includes a small number of field or quasi-field studies that show the strongest link to real outcomes (e.g., acceptance/conversion, spend/GMV), but these studies are often hard to replicate across platforms and agent architectures because they under-report assignment procedures, exposure windows, and guardrails [10,55]. This is an "existence proof" problem: performance demonstrably improves in a platform setting, but whether the effect depends on a particular interface, user base, product category, or autonomy level remains unclear. The agenda therefore prioritizes transparent, pre-registered field experiments and multi-platform replications, along with mechanism measurement (e.g., acceptance → GMV mediation; heterogeneity for new vs. returning users). This recommendation reflects the fact that much of the strongest causal evidence in the corpus comes from experiments rather than natural settings, leaving the real-world persistence of effects under-identified [8,40].
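To make the mechanism-measurement recommendation concrete, the sketch below illustrates one way an acceptance → GMV mediation could be estimated in a pre-registered field A/B dataset. It is a minimal illustration on simulated data: the column names (treat, accepted, gmv), the effect sizes, and the simple product-of-coefficients estimator are all assumptions; a real analysis would add covariates, clustering, and robustness checks.

```python
# Minimal sketch (simulated data): acceptance -> GMV mediation in a field A/B.
# All column names and effect sizes are hypothetical assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
treat = rng.integers(0, 2, n)                       # 1 = exposed to the agent
accepted = rng.binomial(1, 0.20 + 0.15 * treat)     # acceptance of agent actions
gmv = 20 + 10 * accepted + rng.gamma(2.0, 5.0, n)   # spend, lifted by acceptance
df = pd.DataFrame({"treat": treat, "accepted": accepted, "gmv": gmv})

def indirect_effect(d: pd.DataFrame) -> float:
    """Product-of-coefficients estimate of treat -> accepted -> gmv."""
    a = smf.ols("accepted ~ treat", data=d).fit().params["treat"]
    b = smf.ols("gmv ~ accepted + treat", data=d).fit().params["accepted"]
    return a * b

# Nonparametric bootstrap for a percentile confidence interval.
boot = [indirect_effect(df.sample(frac=1, replace=True)) for _ in range(500)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect: {indirect_effect(df):.2f} (95% CI {lo:.2f}, {hi:.2f})")
```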
7.5.2. Move from Intention-Centric Models to Behavioral Identification and Mechanism Testing
One major drawback is that many studies use cross-sectional SEM/PLS-SEM pathways (e.g., usefulness/satisfaction → trust → intention) to infer "delegation" from adoption/continuance or purchase intention [44,46,48,50]. However, the advantages and disadvantages of autonomy are essentially behavioral: whether users accept, verify, override, revert, or abandon when the agent acts. The agenda therefore calls for randomized field A/B designs (or stepped-wedge rollouts) with behavioral outcomes modeled appropriately (e.g., logistic/Poisson/negative binomial for conversions and counts), supplemented by minimal post-task surveys to recover governance mediators (trust, control, and powerlessness) and longer retention windows (30/60/90 days). This directly addresses findings that personalization can trigger avoidance through threat-to-freedom mechanisms [23] and that high autonomy can reduce adoption through powerlessness in routine contexts [40], pathways that are rarely validated against actual purchase behavior and repeat usage.
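The following minimal sketch shows the kind of outcome-appropriate models the agenda refers to: a logistic regression for binary conversion and a negative binomial model for overdispersed order counts, fitted with statsmodels on simulated data. The variable names (arm, converted, orders_90d) are hypothetical placeholders, not constructs from any mapped study.

```python
# Minimal sketch (simulated data): outcome-appropriate models for a rollout.
# Variable names (arm, converted, orders_90d) are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 4000
arm = rng.integers(0, 2, n)                     # 0 = control, 1 = agent
converted = rng.binomial(1, 0.10 + 0.03 * arm)  # binary conversion
orders_90d = rng.negative_binomial(2, 1 / (1.5 + 0.3 * arm))  # overdispersed counts
df = pd.DataFrame({"arm": arm, "converted": converted, "orders_90d": orders_90d})

# Logistic regression for the binary conversion outcome.
logit_fit = smf.logit("converted ~ arm", data=df).fit(disp=False)
# Negative binomial for order counts over a 90-day retention window.
nb_fit = smf.negativebinomial("orders_90d ~ arm", data=df).fit(disp=False)

print(logit_fit.params)
print(nb_fit.params)
```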
7.5.3. Test Autonomy Regimes as Boundary Conditions, Not Merely Main Effects
The mapped evidence suggests that autonomy effects are contingent upon context, especially urgency and scarcity, and may not extend to routine shopping without education, safeguards, and trust recalibration [40,41]. A central agenda item is therefore to broaden boundary conditions and to test persistence and learning: do users become more comfortable with autonomy after repeated successful interactions, or does governance anxiety accumulate? Multi-context experiments (routine vs. urgent), with longitudinal follow-up and manipulations of autonomy and governance (override availability, confirmations), can determine whether autonomy is effective because of contextual pressure or because users acquire the skills to delegate safely. The agenda also extends the "autonomy sweet spot" claim: evidence that an intermediate co-assistant role may outperform both advisor and executor modes is theoretically significant, yet it lacks robust behavioral validation and task classification (Fan & Liu). Future designs should examine three tiers of autonomy (advisor/co-assistant/executor) in conjunction with task type (objective versus preference-based) and explicitly model nonlinearity (quadratic/splines) in behavioral outcomes and regret, as sketched below.
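As a minimal illustration of the nonlinearity test proposed above, the sketch below fits a quadratic autonomy term to a simulated willingness-to-delegate outcome; a significant negative quadratic coefficient would be consistent with a mid-autonomy ("co-assistant") optimum. The three-level coding and all simulated effects are assumptions for demonstration only.

```python
# Minimal sketch (simulated data): quadratic test for an autonomy sweet spot.
# Autonomy coding (1=advisor, 2=co-assistant, 3=executor) and all effects
# are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 900
autonomy = rng.integers(1, 4, n)
# Simulated inverted U: willingness to delegate peaks at the co-assistant level.
delegate = -1.0 * (autonomy - 2) ** 2 + 3.0 + rng.normal(0, 1, n)
df = pd.DataFrame({"autonomy": autonomy, "delegate": delegate})

# A significant negative quadratic coefficient is consistent with a
# mid-autonomy optimum; patsy spline bases (bs()) would relax the form.
fit = smf.ols("delegate ~ autonomy + I(autonomy**2)", data=df).fit()
print(fit.params)
print(fit.pvalues)
```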
7.5.4. Bridge Lab Cue Effects to Conversion and Retention Under Governance Constraints
A significant body of literature examines anthropomorphism, modality, warmth/competence cues, and initiation patterns, frequently demonstrating impacts on satisfaction, social presence, or reactance in controlled environments [8,24]. Nonetheless, these cue effects are not reliably linked to conversion and retention, particularly in transaction-capable environments. The agenda thus emphasizes "lab-in-the-field" methodologies that connect controlled manipulations (voice versus text; anthropomorphic cues; explanation formats) to quantifiable platform outcomes (add-to-cart, conversion, repeat usage), while maintaining manipulation checks and incorporating governance safeguards (privacy disclosure, control toggles). This bridging logic follows the review's broader conclusion that interface design is not superficial appearance: cues can shift agency perceptions and reactance, but the market relevance of these cue pathways remains poorly understood [23,24].
7.5.5. Treat Governance as a Design Variable: Overrides, Confirmations, Transparency, and Revocation
In RQ2–RQ3, governance-related constructs (perceived control/agency, powerlessness/reactance, risk/privacy) consistently influence whether delegation is perceived as empowerment or displacement [23,24]. However, governance is typically assessed as a perception rather than manipulated through experimental design. The agenda therefore calls for governance-focused experiments that vary override/undo availability, confirmation gates, explainability depth, and permission revocation, and then test mediation through perceived control and powerlessness on adoption, avoidance, and retention. This is especially important because most of the corpus is still at autonomy level 1 (assistive), where governance costs are low; as systems move toward execution, governance design will determine whether autonomy reads as empowerment or displacement.
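A hedged sketch of what such a governance-focused experiment could look like analytically: a 2 × 2 between-subjects design crossing override availability with a confirmation gate, followed by a simple mediation check through perceived control. Factor names, the mediator, and the simulated effects are illustrative assumptions.

```python
# Minimal sketch (simulated data): 2x2 governance experiment with a
# mediation check. Factors, mediator, and outcome names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1200
override = rng.integers(0, 2, n)   # 1 = override/undo available
confirm = rng.integers(0, 2, n)    # 1 = approval gate before purchase
control = 3 + 0.8 * override + 0.5 * confirm + rng.normal(0, 1, n)  # perceived control
adopt = 2 + 0.6 * control + rng.normal(0, 1, n)                     # adoption intention
df = pd.DataFrame({"override": override, "confirm": confirm,
                   "control": control, "adopt": adopt})

# Factorial effects (including the interaction) on the governance mediator...
med_fit = smf.ols("control ~ override * confirm", data=df).fit()
# ...and whether perceived control carries those effects to adoption.
out_fit = smf.ols("adopt ~ control + override + confirm", data=df).fit()
print(med_fit.params)
print(out_fit.params)
```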
7.5.6. Build a Measurement Standard for Trust Calibration and Miscalibrated Reliance
A recurrent deficiency is that “trust” is typically assessed as a level (frequently a mediator) instead of as calibration—suitable reliance amidst uncertainty, fluctuating accuracy, and error exposure. In the mapped set, miscalibrated reliance and overreliance are conceptually central but empirically sparse, which constrains the theory regarding the detrimental effects of delegation (automation bias) despite high satisfaction levels. The agenda thus advocates for calibration paradigms that adjust agent accuracy and confidence, monitor reliance based on correct versus incorrect advice, and measure trust–accuracy alignment (calibration indices), including trust modification following errors. This direction aligns with the review’s characterization of governance-lane outcomes (control/agency) as the unique hallmark of agentic commerce, thereby transforming “trust” from a mere acceptance facilitator into a construct pertinent to safety.
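To indicate what such a calibration paradigm could measure, the sketch below computes two illustrative indices on simulated trial logs: a Brier-style score for confidence–correctness alignment, and the gap in reliance rates on correct versus incorrect advice (a small or negative gap would flag miscalibrated reliance). All variables are hypothetical.

```python
# Minimal sketch (simulated logs): two illustrative calibration indices.
# 'advice_correct', 'agent_conf', and 'relied' are hypothetical variables.
import numpy as np

rng = np.random.default_rng(4)
trials = 2000
advice_correct = rng.binomial(1, 0.7, trials)   # agent is right 70% of the time
agent_conf = np.clip(0.5 + 0.3 * advice_correct + rng.normal(0, 0.15, trials), 0, 1)
relied = rng.binomial(1, np.clip(0.2 + 0.6 * agent_conf, 0, 1))  # user reliance

# Brier-style score: how well stated confidence predicts correctness.
brier = np.mean((agent_conf - advice_correct) ** 2)
# Appropriate reliance: reliance rate on correct vs. incorrect advice;
# a small or negative gap flags miscalibrated reliance (automation bias).
rely_correct = relied[advice_correct == 1].mean()
rely_incorrect = relied[advice_correct == 0].mean()
print(f"Brier = {brier:.3f}; rely|correct = {rely_correct:.2f}; "
      f"rely|incorrect = {rely_incorrect:.2f}; gap = {rely_correct - rely_incorrect:.2f}")
```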
7.5.7. Expand Accountability and Fairness Outcomes as Autonomy Increases, and Address Population Validity
Fairness/price fairness appears in the mapped evidence primarily as a boundary condition or conversion amplifier, but agentic execution—such as substitutions, dynamic pricing, and opaque rankings—makes fairness and accountability outcome-critical. Similarly, despite being crucial to post-delegation trust and platform switching, regret and blame attribution—who is at fault when the agent makes a mistake—are mainly absent as measured outcomes. Therefore, the agenda calls for experiments that include price/offer optimization and dynamic recommendation contexts, measure fairness as a moderator and an outcome, and include accountability endpoints such as switching, regret intensity, blame allocation (agent vs. platform vs. self), complaint behavior, and recourse preferences. This directly addresses the noted discrepancy between conceptual importance and empirical measurement coverage and expands the review’s risk-zone framing.
Finally, context concentration (single-country/platform studies; young, skewed samples) limits population validity, and in the survey-heavy core, CMV mitigation is often procedural rather than design-based. The agenda prioritizes multi-country harmonized designs (with invariance testing), segmentation by technology readiness, age, and risk sensitivity, and the inclusion of regulatory expectations as context variables. Methodologically, it requires multi-wave measurement, marker variables, and, crucially, the integration of survey constructs with subsequent usage logs to evaluate predictive generalization. This addresses the review's conclusion that numerous "works in practice" assertions lack out-of-sample evaluation, and that behavioral endpoints are underused despite the potential support from platform data.
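As a minimal sketch of the proposed predictive-generalization check, the code below links (simulated) survey constructs to a later log-derived reuse label and reports held-out discrimination; the constructs, the label, and the effect sizes are assumptions, not findings from the corpus.

```python
# Minimal sketch (simulated data): out-of-sample check linking survey
# constructs to later usage logs. Constructs and label are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 3000
X = rng.normal(size=(n, 3))   # columns: trust, usefulness, perceived control
logits = 0.9 * X[:, 0] + 0.6 * X[:, 1] + 0.4 * X[:, 2] - 0.5
reused_90d = rng.binomial(1, 1 / (1 + np.exp(-logits)))  # log-derived reuse label

X_tr, X_te, y_tr, y_te = train_test_split(X, reused_90d, test_size=0.3,
                                          random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
# Held-out discrimination is the predictive-generalization evidence the
# agenda asks for, rather than in-sample fit alone.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"out-of-sample AUC: {auc:.3f}")
```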
Taken as a whole, this agenda shifts the question from "what predicts intentions toward chatbots" to "what changes delegation behavior, under which autonomy regime, and with which governance safeguards" (Table 13). It also makes the next empirical step concrete: designs that jointly manipulate autonomy × governance, capture behavioral outcomes, and measure calibration, accountability, and fairness, the outcome domains that are least tested in the current body of evidence but most theoretically central [8,23,24,40,41,46].
8. Conclusions
This review integrated evidence on agentic commerce by linking the operationalization of autonomy (RQ1) to the reasons consumers delegate decisions to AI agents (RQ2) and to the resulting outcomes (RQ3). In the corpus, most empirical work still concerns low-autonomy, text-based assistants that support search, recommendations, and post-purchase interactions; full delegation of execution (cart/checkout/payment) and negotiation remains rare. In this evidence profile, delegation is most probable, and outcomes most advantageous, when agents provide explicit friction reduction and perceived competence while maintaining user governance (control, oversight, and reversibility). Conversely, when autonomy cues suggest a transfer of control rather than a partnership, effects can reverse through governance mechanisms such as powerlessness and psychological reactance, producing avoidance even when performance benefits are clear. The primary contribution is a comprehensive socio-technical framework linking design parameters (autonomy, scope, modality, transaction capability) to delegation mechanisms (trust, usefulness, control versus reactance, risk/privacy) and to marketing and consumer outcomes, elucidating the circumstances in which delegation facilitates value creation as opposed to diminishing agency and trust.
The evidence base examined is limited by inadequate execution realism, excessive dependence on cross-sectional single-source self-reporting, and inconsistent measurement of governance-relevant outcomes. Although we mitigated terminology risk through citation chasing and adjacent-term checks, it remains possible that some relevant work outside commerce-scoped keywords (e.g., general automation bias studies without shopping context) was not captured. In numerous studies, “delegation” is derived from intentions of adoption or continuation rather than from direct observation of handoff behavior, and “trust” is generally assessed as a level rather than as calibration, indicating suitable reliance in the face of uncertainty and subsequent to errors [62,63,67]. Samples are frequently concentrated demographically and geographically, which restricts their generalizability across cultures, regulatory expectations, and product-stakes contexts [23,39,46]. Future research should (i) transition from intention-centric models to behavioral identification through field A/B tests or platform rollouts with conversion, expenditure, and retention metrics; (ii) evaluate autonomy frameworks (advisor vs. co-assistant vs. executor) in conjunction with task type and stakes, explicitly incorporating nonlinearity and contextual pressures (e.g., scarcity/urgency); and (iii) consider governance as a manipulable design variable by experimentally altering confirmation gates, override/undo options, explanation depth, and permission controls, while integrating standardized metrics for trust calibration, overreliance, regret/blame attribution, and recovery behavior [62,64,67]. These steps would close the main gap that the review identified: understanding not only whether consumers like AI shopping assistants, but when—and under what safeguards—they will safely delegate consequential commercial decisions.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Dataset available upon reasonable request from the author.
Conflicts of Interest: The author declares no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1 PRISMA flow diagram.
Figure 2 Evidence map of agentic commerce operationalizations across the reviewed studies (RQ1). The x-axis represents the maximum task-scope stage covered by the agent; the y-axis shows autonomy level (0–3). Bubble size reflects the number of studies per cell (#). Circles denote real purchase contexts, and squares denote simulated/scenario-based designs.
Figure 3 Heatmap of interaction mode × autonomy counts. The number of studies is shown per cell (#).
Figure 4 Two-lane mechanism model of delegation in agentic commerce.
Figure 5 Impact of Agentic Commerce: Evidence Direction on Key Outcomes. POS = mostly positive evidence; MIX = contingent or mixed evidence; GAP = research gap (no direct outcome evidence in the mapped studies). No outcomes were coded as mostly negative (NEG) in the mapped evidence set.
Construct dictionary and typical measures.
| Construct | What It Means | Typical Operationalizations | Example Indicators (Illustrative) |
|---|---|---|---|
| Agent autonomy (objective) | Degree to which the system initiates, plans, and executes without continuous user instruction | autonomy level (0–3); initiative (user vs. agent); approval requirement; tool-use/memory | “Actions required my approval”; observed: #steps executed without user input |
| Task scope | Breadth of commerce tasks supported across the journey | task checklist (search/compare/recommend/cart/checkout/postpurchase); scope index | “Supports cart building/returns”; count of tasks supported |
| Transaction capability | Ability to execute transactions rather than only suggest | simulated vs. real; order placement (Y/N); payment execution (Y/N) | “The agent placed an order”; logged transaction completion |
| Decision delegation | Willingness to let the agent decide and/or act on one’s behalf | delegation intention; agent vs. manual choice; % tasks delegated | “I would delegate checkout”; choosing “Auto” mode |
| Trust (agent/platform) | Confidence in agent competence/intent and platform governance | trust scales; competence/integrity; platform credibility/safeguards | “I trust the agent/platform to act in my interests” |
| Perceived control/agency | Felt ability to oversee, intervene, and retain ownership of choices | controllability; override/undo; agency/autonomy scales | “I can override decisions”; “I feel I’m still deciding” |
| Powerlessness/reactance | Aversive response to perceived restriction of freedom | reactance/powerlessness scales; negative affect pathways | “The agent restricts my freedom”; “I feel powerless” |
| Risk/privacy concern | Anticipated loss or harm from errors, misuse, or data practices | financial/performance/privacy risk; creepiness/surveillance concern | “This could lead to a costly mistake”; “I worry about my data” |
| Usefulness/convenience | Perceived performance gains and effort/time savings | PU + convenience/ease; reduced friction/decision fatigue | “It improves outcomes”; “It saves me time/effort” |
| Reliance quality (incl. overreliance and calibration) | Whether reliance matches true competence (appropriate vs. excessive use) | reliance/verification behavior; automation bias; confidence–accuracy alignment | observed: acceptance rate + checking; reliance decreases after errors |
Operational rules for autonomy levels and transaction evidence.
| Dimension | Code | Operational Criterion |
|---|---|---|
| Autonomy | Level 0 (reactive) | Reactive Q&A/service responding; no multi-step orchestration; no acting-on-behalf logic; user performs all actions. |
| Autonomy | Level 1 (assistive) | Provides recommendations/support; may personalize; but user retains decision authority and executes actions (cart/checkout). “Autonomy” can be discussed, but no control-rights transfer is instantiated. |
| Autonomy | Level 2 (workflow/guided) | Agent orchestrates a multi-step workflow across subtasks (e.g., search → compare → recommend → cart build) and/or uses tools/memory; however, transaction execution is not performed by the agent or is not evidenced beyond claims. |
| Autonomy | Level 3 (execution-level) | Delegated control rights for at least one consequential action (e.g., order placement, checkout completion, payment initiation) in a functioning system, validated sandbox, or explicit experimental manipulation of execution authority (with or without approval gates). |
| Transaction evidence | T0 (none/NR) | No transaction action described or assessed. |
| Transaction evidence | T1 (claimed) | Paper asserts “transaction-capable” without observed execution, without clear proxy, or without sufficient procedural detail. |
| Transaction evidence | T2 (simulated/proxied) | Mock checkout, scenario/vignette, sandbox proxy, or controlled task that represents execution but is not a real transaction. |
| Transaction evidence | T3 (demonstrated) | Observed order/checkout/payment execution in a real or realistic transaction setting, or a validated transaction proxy with clear control-rights transfer. |
Note: Transaction-capable indicates the presence of an execution pathway described or instantiated; when execution was not observed/proxied and the context was scenario-based, capability was coded as claimed rather than demonstrated. Level 3 autonomy was reserved for studies that explicitly manipulate or implement execution authority (e.g., auto-purchase vs. notify/approve gates) rather than for studies that only describe autonomy as a perceived attribute or interaction cue.
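For transparency, the operational criteria above can also be expressed as a small coding function. This is an illustrative codification only; the field names are hypothetical and the function is not part of the published codebook.

```python
# Illustrative codification of the autonomy rubric above; field names are
# hypothetical and this function is not part of the published codebook.
def code_autonomy(recommends: bool,
                  orchestrates_workflow: bool,
                  executes_transaction: bool,
                  execution_evidenced: bool) -> int:
    """Return the autonomy level (0-3) implied by the operational criteria."""
    if executes_transaction and execution_evidenced:
        return 3   # delegated control rights for a consequential action
    if orchestrates_workflow:
        return 2   # multi-step orchestration without evidenced execution
    if recommends:
        return 1   # assistive: user retains decision authority and executes
    return 0       # reactive Q&A only

# Example: a conversational recommender that plans search -> compare ->
# recommend but never places orders would be coded Level 2.
print(code_autonomy(recommends=True, orchestrates_workflow=True,
                    executes_transaction=False, execution_evidenced=False))
```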
Inclusion/exclusion criteria.
| Criterion Domain | Inclusion Criteria | Exclusion Criteria |
|---|---|---|
| Publication type and quality | Peer-reviewed journal articles and peer-reviewed conference/proceedings papers. | Reviews, editorials, commentaries, conceptual/opinion pieces, theses, book chapters, non-peer-reviewed preprints (unless later peer-reviewed). |
| Language and time window | English-language publications, 2015–2026. | Non-English publications; outside the date range. |
| Context (commerce relevance) | Consumer commerce/shopping context with purchase-decision relevance (e.g., e-commerce, online shopping, cart/checkout, shopping assistant, purchase decisions). | Non-commerce domains without purchase-decision relevance (e.g., general customer service with no purchase outcomes, education, healthcare, public services). |
| System type (agentic features) | AI agent/assistant with agentic features linked to autonomy or acting-on-behalf (e.g., planning, multi-step task completion, memory/tool use, approval/confirmation logic, delegation-related control). Eligible even when labeled “chatbot” or “assistant,” provided delegation-relevant features are present or evaluated. | Systems without agentic features (e.g., static recommenders with no interactive/agent behavior), or purely informational FAQ bots with no delegation/acting-on-behalf component. |
| Empirical/evaluative evidence | Empirical or evaluative study reporting measurable evidence (experimental, survey, field/observational, or mixed methods). At minimum, the study must report delegation-relevant constructs (e.g., willingness to delegate, perceived control/autonomy, reliance) and/or downstream consumer/marketing outcomes. | Purely technical/architectural papers without user-facing evaluation or without measurable consumer/marketing outcomes; purely theoretical frameworks without data or evaluation. |
| Outcomes (RQ2/RQ3 relevance) | Reports at least one delegation-relevant determinant and/or downstream outcome: (a) delegation (intention/choice/% tasks delegated) and/or (b) consumer outcomes (trust, control/agency, reactance/powerlessness, risk/privacy, reliance) and/or (c) marketing outcomes (purchase intention/behavior, conversion, satisfaction, loyalty, spend). Generic “chatbot satisfaction” studies were included only when linked to delegation/control logic or autonomy cues; otherwise they were excluded at full text. | No delegation-related constructs and no consumer/marketing outcomes (e.g., offline system benchmarks only; model accuracy only; latency only). |
| Unit of analysis (consumer-facing) | Consumer/user responses or behaviors in commerce interactions (including platform users, shoppers, customers). | Business-internal agent systems with no consumer-facing evaluation and no implications tested for consumer delegation or commerce outcomes. |
| Duplicate handling | When the same study appears in multiple records, retain the most complete/peer-reviewed version. | Duplicate records; earlier versions superseded by final publication. |
Context profile of included studies (geography, evidence maturity, samples).
| Panel | Category | N | % |
|---|---|---|---|
| A1. Region distribution | East Asia | 11 | 23.4 |
| | Southeast Asia | 9 | 19.1 |
| | Europe | 7 | 14.9 |
| | North America | 7 | 14.9 |
| | South Asia | 6 | 12.8 |
| | Latin America | 1 | 2.1 |
| | North America/Europe (multi-country) | 2 | 4.3 |
| | Global/Unspecified | 3 | 6.4 |
| | Other | 1 | 2.1 |
| A2. Top countries | China | 9 | 19.1 |
| | USA | 7 | 14.9 |
| | India | 5 | 10.6 |
| | Indonesia | 3 | 6.4 |
| | USA–UK (multi-country) | 2 | 4.3 |
| | Other countries (combined) | 18 | 38.3 |
| | Global/Unspecified | 3 | 6.4 |
| B1. Evidence maturity (collapsed) | Intention-only/cross-sectional (SEM, surveys) | 26 | 55.3 |
| | Behavioral—simulated/proxied (lab tasks, mock checkout, scenarios) | 14 | 29.8 |
| | Behavioral—realistic context (platform/field/transaction proxies) | 7 | 14.9 |
| B2. Sample composition (collapsed) | Customers/Users | 17 | 36.2 |
| | Online panels | 11 | 23.4 |
| | Students | 10 | 21.3 |
| | Platform users | 4 | 8.5 |
| | Other/NR | 5 | 10.6 |
Note: Percentages are calculated within the full set of included studies (N = 47). “Evidence maturity” collapses designs into intention-only/cross-sectional evidence versus behavioral evidence in simulated/proxied tasks versus behavioral evidence in realistic contexts (platform/field/transaction proxies).
Moderators and boundary conditions shaping delegation-related pathways (RQ2).
| Moderator (How It Is Framed) | What It Changes (Strengthens/Weakens) | Which Link Is Moderated (From Mechanism Pathways) | Example Study/Studies | Context/Boundary Notes |
|---|---|---|---|---|
| Perceived task-solving competence (high vs. low) | Strengthens the downstream delegation/continuance pathway when competence is high | Indirect effect from interaction mode to continuance via optimism → cognitive trust → customer care | [ | Effects stronger at high competence; scenario-based; voice vs. text manipulation |
| Assistant type × initiation (plus gender covariate) | Changes reactance intensity (lower vs. higher), cascading to satisfaction/certainty | Assistant type → Reactance → choice difficulty → choice certainty → performance → satisfaction; and Type × Initiation → Reactance | [ | Reactance lowest for anthropomorphic and user-initiated; highest for non-anthropomorphic and system-initiated |
| Perceived anthropomorphism | Changes strength of quality cues shaping attitudes (direction depends on cue mix; reported as moderation) | PIQ/PCI/PET → Attitude (upstream of Trust) | [ | Anthropomorphism moderates key quality→attitude links; regret avoidance path ns |
| Relationship quality | Conditions familiarity effects on outcomes (moderated mediation) | Familiarity → (search preference, self-disclosure) (moderated) | [ | Relationship quality with source company moderates familiarity → outcomes |
| Personalization and hedonic value | Changes shape/strength of anthropomorphism → privacy concerns (curvilinear moderated effect) | Anthropomorphism → privacy concerns (curvilinear) | [ | “Too low/too high anthropomorphism” regime depends on personalization and hedonic value |
| Age/gender/shopping frequency/prior interaction | Group differences in perceived cues and downstream trust/satisfaction (direction varies by outcome) | Anthropomorphic design cues → enjoyment/attitude/trust → satisfaction | [ | Age differences reported; low-frequency shoppers lower ratings; prior experience relates to perceived cues |
| Product type (hedonic vs. utilitarian) and brand equity | Strengthens “matching” effects when modality fits product and equity is low | Modality → verbalizing focus → preference fluency → choice outcomes | [ | Matching stronger for fit + low equity; weaker for high equity/mismatch |
| Task complexity (low vs. high) | Conditions humanness → brand associations route | Humanness (anthropomorphism + conv. intelligence) → brand associations → purchase intention | [ | Effects differ under low vs. high complexity conditions |
| Perceived enjoyment | Amplifies trust’s impact on adoption | Interactivity/humanness → trust → adoption; Trust × Enjoyment → adoption | [ | Working professionals; China e-commerce users |
| PPF and TRT (median split MGA) | Strengthen engagement → purchase intention when fairness/trust is high | Engagement → purchase intention; moderated by PPF and TRT | [ | India-only setting; boundary limits noted; MGA via median splits |
| Personalization in platform context | Changes strength of trust → outcome link (reported negative moderation; exact link wording inconsistent) | Trust → satisfaction → loyalty (partial mediation) + personalization moderation | [ | Boundary: effect sizes shift when personalization included; moderation coefficient negative; link definition unclear in report |
| Gender (on overload pathway) | Moderates competence → overload (affecting downstream evaluation/intent) | Warmth/competence → (trust + overload) → service evaluation → purchase intention | [ | Gender moderates competence → overload (reported in additional analysis) |
| Avatar familiarity (celebrity vs. unknown) | Reduces eeriness (thus increasing trust) when familiarity is high | Human likeness → eeriness → trust → (purchase/reuse); familiarity moderates likeness→eeriness | [ | Student sample; laptop scenario; confounds noted (attractiveness/expertise) |
Note: Some records mark moderators as “tested” but actually report only descriptive subgroup differences (e.g., generation via χ2; demographics as controls) rather than an explicit interaction term/MGA. Those should be described as “subgroup comparisons” or “exploratory heterogeneity checks”, not formal moderators.
Marketing outcomes synthesis (RQ3).
| Outcome | Evidence Type | Direction (Overall) | Mechanism Notes (What Drives It) | Studies |
|---|---|---|---|---|
| Purchase intention/adoption intention | Survey; experiments | Mostly positive via utility/trust, but can decrease under high autonomy | Utility chain dominates: quality/usefulness → attitude/satisfaction → intention. High-autonomy cues can reduce intention via powerlessness/threat-to-freedom, unless urgency/scarcity raises benefit salience and attenuates resistance. | [ |
| Observed conversion/purchase behavior | Field; experiments; survey (self-reported behavior) | Positive when friction drops or recommendations fit, but not universal | Behavioral lift appears when systems reduce friction (latency/steps) or improve option fit (recommendation relevance). In many studies, “conversion” is proxied via self-reported ordering/behavioral intention rather than verified execution. | [ |
| Spend/GMV/WTP | Field; survey | Positive in field and some survey settings | Spend increases when performance improvements raise acceptance/engagement (e.g., fewer failures, faster results). WTP aligns with usefulness + trust but is constrained by perceived risk and governance concerns. | [ |
| Brand choice/choice share | Experiments | Context-dependent; “matching” improves choice outcomes | Choice effects are context-dependent: design/modality shifts preference fluency and reason-vs-feeling focus, changing brand selection and satisfaction. Effects are stronger when modality matches product type and when brand equity is low. | [ |
| Loyalty/repurchase/continuance/reuse | Survey; field | Mostly positive via satisfaction/attitude, but varies by governance perceptions and user traits | Most evidence supports quality/usefulness → satisfaction/attitude → continuance/loyalty, with trust sometimes feeding satisfaction/loyalty. Repurchase effects emerge when recommendations increase relevant exposure; governance threats can weaken retention. | [ |
Consumer outcomes synthesis (RQ3).
| Outcome | Evidence Type | Direction (Overall) | Mechanism Notes (What Drives It) | Studies |
|---|---|---|---|---|
| Satisfaction/CX/perceived service quality | Survey; lab/usability; experiments | Mostly positive, but design can introduce negative affect (reactance, overload) | Satisfaction rises via problem-solving, confirmation, usefulness, and interaction/service quality. It declines when designs trigger reactance or cognitive/affective overload, which depress downstream evaluations. | [ |
| Trust (level) vs. trust calibration (rarely tested) | Survey | Thin/under-specified | Most studies model trust level (often as mediator/outcome), not calibration. Calibration evidence (appropriate trust under varying accuracy/accountability) is thin, so “trust” findings should not be read as mis-/over-trust tests. | [ |
| Reliance/resistance/overreliance | Survey | Reliance increases with trust; resistance decreases with trust (overreliance rarely measured) | Reliance generally increases with trust and resistance decreases. Moderators (e.g., task complexity, disclosure) mainly reshape attribute → trust formation; overreliance is rarely operationalized as a distinct outcome. | [ |
| Perceived control (post) | Experiments; survey; lab/usability | Design-contingent | Control is design-contingent: system-driven execution or constrained choice lowers perceived control; supportive interface choices can preserve it. Measures vary (TPB-style PBC vs. “control” service-use items), limiting comparability. | [ |
| Autonomy/agency (post) | Experiments; survey | Mixed (can increase psychological ownership/agency, or decrease via powerlessness depending on autonomy regime) | Effects are mixed: agency can rise via psychological ownership/controllability, but high autonomy can lower agency via powerlessness. Some evidence points to a mid-autonomy “co-assistant” optimum. | [ |
| Fairness/price fairness | Survey | Fairness strengthens conversion from engagement to purchase intention | Fairness is mostly treated as a conversion amplifier: it strengthens whether engagement/experience translates into purchase/continuance intentions, rather than being measured as a post-delegation outcome. | [ |
| Regret/blame attribution | — | Not directly measured | Largely not measured as an outcome in the coded set. Related constructs appear only indirectly (e.g., “regret avoidance” as an antecedent/path), leaving accountability and post hoc attribution effects under-tested. | — |
Governance levers × risk zones in agentic commerce: what to build and what to measure.
| Risk Zone | Autonomy Level Design | Override/Undo/Recourse | Confirmation Gates (Human-in-the-Loop) |
|---|---|---|---|
| Powerlessness/reactance | Build: default to co-assistant for consequential steps; avoid auto-execute as default. Measure: powerlessness + reactance; “felt control” items; TTFF to controls/undo. | Build: always-visible “Cancel/Undo/Revert” + fast rollback. Measure: undo use rate; “recovery confidence”; perceived reversibility. | Build: explicit “Approve before purchase” and “Review changes” step. Measure: opt-in to auto; approval latency; drop-off at gate. |
| Overreliance/miscalibrated trust | Build: constrain executor mode to low-stakes/low-variance tasks; require training period. Measure: reliance vs. performance; error-following rate. | Build: “Why this could be wrong” + “Show alternatives” + easy revert. Measure: alternative-view rate; correction rate after errors. | Build: mandatory review when confidence is low/high impact (price, brand, returns). Measure: calibration curve; acceptance conditional on confidence. |
| Privacy risk/surveillance concern | Build: keep personalization optional; avoid “overly knowing” cues by default. Measure: creepiness; privacy concern; opt-out rates. | Build: one-click delete, “forget this,” and consent revision. Measure: deletion requests; perceived control over data. | Build: consent gates for sensitive data and cross-app actions. Measure: consent acceptance by data type; abandonment at consent. |
| Fairness/price fairness | Build: avoid fully automated price/offer selection without fairness checks. Measure: perceived price fairness; differential outcomes across groups. | Build: dispute/appeal + price breakdown + “report unfair price.” Measure: appeal rate; perceived procedural fairness. | Build: confirmation gate for price changes, upsells, substitutions. Measure: acceptance of price changes; regret after purchase. |
| Regret/blame attribution | Build: executor mode requires explicit accountability framing + user choice. Measure: regret intensity; blame allocation (agent/platform/user). | Build: guaranteed remediation path (refund/return assistance) + logs. Measure: complaint handling time; satisfaction with resolution. | Build: “Final review” summary (price, brand, constraints, return policy). Measure: post-purchase regret; “I noticed errors” rate. |
Governance levers × risk zones in agentic commerce: what to build and what to measure (explanation depth, data controls and accountability signals).
| Risk Zone | Explanation Depth | Data and Privacy Controls | Accountability Signals |
|---|---|---|---|
| Powerlessness/reactance | Build: “What I’m about to do” + “You can stop me anytime.” Measure: comprehension check; perceived transparency; reduced reactance. | Build: “Use minimal data” toggle; session-only mode. Measure: privacy comfort; willingness-to-delegate under privacy constraints. | Build: “You’re in charge” framing + clear escalation to human/support. Measure: perceived responsibility clarity; reduced avoidance. |
| Overreliance/miscalibrated trust | Build: show evidence strength (sources, constraints, uncertainty). Measure: trust calibration tasks (good/bad advice discrimination). | Build: data provenance display + permission boundaries (what it can/cannot access). Measure: boundary awareness; misuse attempts. | Build: “Agent recommendation” vs. “Merchant/Platform claim” labeling. Measure: attribution accuracy; trust placement (agent vs. platform). |
| Privacy risk/surveillance concern | Build: “Why I asked for this data” + retention duration. Measure: understanding of data use; perceived legitimacy. | Build: granular permissions + local processing where possible. Measure: permission choices; delegation willingness under privacy-safe mode. | Build: audit trail (“what data used, when, by whom”) + policy link. Measure: perceived accountability; trust under high transparency. |
| Fairness/price fairness | Build: explain ranking/offer logic (e.g., “best value vs. fastest”). Measure: perceived fairness; understanding of ranking criteria. | Build: allow “no personalization” pricing and comparator view. Measure: fairness perceptions under personalization off vs. on. | Build: disclose incentives/affiliations (“sponsored”, “commission”). Measure: trust erosion vs. transparency benefit; choice shifts. |
| Regret/blame attribution | Build: rationale + constraint satisfaction (“matched your constraint X”). Measure: counterfactual regret (“should’ve chosen Y”); perceived justifiability. | Build: privacy-safe logs for dispute resolution (minimization). Measure: acceptance of logging trade-off; perceived fairness. | Build: clear responsibility partition (agent suggestion vs. platform execution). Measure: attribution clarity; reduced hostility/avoidance. |
Higher-evidence designs (field/quasi-field/experiments/usability).
| Study | Design | Sample | N | Analysis | Evidence | Causal? | Manip. Check? | CMB Risk | Effect Size? | Limits | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [ | experiment | not stated | 200 | mediation | Lab experiment | Y | Y | low | Y | Scenario-based experiment; single region; self-reported continuance (no actual booking/behavior). | Moderate |
| [ | experiment | panel | 400 | mediation | Lab experiment | Y | N | NA | Y | Mock-up scenarios; simulated choice context; limited external validity; no real purchasing. | High |
| [ | experiment (factorial) | mixed/convenience | 100 | ANOVA/factorial | Lab experiment | Y | N | medium | Y | Perceived human-likeness ratings only; convenience sample; simulated setting; no real transactions. | Moderate |
| [ | field_AB | platform customers | NA | A/B testing | Field AB | Y | NA | NA | Y | Short report; limited methods detail; sample size/randomization unclear; platform-specific; short window. | High |
| [ | quasi_experimental_field | transaction users/records | 160 | t-tests (binary outcomes) | Field quasi experiment | Y | N | low | Y | Randomization unclear; single store/city; small N; analytic choice (t-test on binary); no mediators; short follow-up. | Moderate |
| [ | prototype usability | older adults (≥65) | 10 | descriptive + Wilcoxon | pilot usability | N | N | low | Y | Small-N pilot; limited context; calls for larger/diverse samples and real-world deployment. | Small |
| [ | multi-study experiments | online consumer panels | 1981 | t-tests/OLS + PROCESS | multi-study experimental | Y | Y | low | Y | Self-report intentions; no behavioral validation; English-speaking e-comm contexts; limited real purchase behavior. | High |
| [ | multi-study (incl. field) | panels + ad users | 1975 | χ2/t/ANOVA + PROCESS | experimental + field | Y | Y | low–medium | Y | Limited product scope; short-term behavioral proxies; no long-run/retention tracking; calls for replications. | High |
| [ | randomized experiments | online panel adults | 510 | ANOVA + mediation/moderation | causal (randomized) | Y | Y | moderate | Y | Scenario-based outcomes; no field behavior/longitudinal outcomes; task-type boundary conditions suggested. | Moderate |
| [ | experiment | student participants | 105 | PLS-SEM | experimental | Y | Y | low | Y | Single product + simulated setting; Gen Z students; no behavioral/longitudinal outcomes. | High |
| [ | experiment | student sample | 211 | ANOVA + PROCESS | experimental | Y | Y | medium | Y | Static stimuli; student skew; limited interactivity realism; suggests multi-item measures and broader samples. | High |
| [ | randomized experiment | online convenience | 401 | t-tests/ANOVA + PROCESS | experimental | Y | Y | medium | Y | Simulated store; no real checkout/payment; non-probability sample; domain-limited. | Good |
| [ | experiment | MTurk panel | 233 | ANCOVA + PROCESS | randomized experiment | Y | Y | low/mitigated | Y | Scenario/video-based; no real interaction/transaction; under-represents some user groups. | Medium–High |
| [ | 5 experiments | students + Prolific | ~1134 | ANOVA/χ2 + PROCESS | randomized experiments | Y | Y | low | Y | Artificial lab contexts; no real transactions; needs field/offline replications. | High |
| [ | multi-study experiments | mostly students/young adults | 568 | PLS-SEM + checks | experimental (incl. “lab-in-field”) | Y | Y | medium | Y | Homogenous samples; some reliance on self-report; boundary conditions not fully explored. | High |
| [ | experiment | university students | 185 | PLS-SEM + moderation | experimental | Y | partial | medium | Y | Celebrity confounds; limited checks; missing individual-difference controls; student sample. | Medium |
Note: Design = study design (experiment/survey/field/mixed). Sample = recruitment source/type. Analysis = primary analytic method (e.g., regression, SEM, mediation). Evidence = evidence tier (e.g., lab experiment, field/observational). Causal? indicates whether the design supports causal inference. Manip. check? = manipulation check reported (experiments only). CMB risk = common method bias risk (self-report/same-source). Effect size? indicates whether standardized or interpretable effect sizes were reported. Limits summarizes key limitations. Quality = overall quality rating per review criteria. NA = Not applicable by design.
Observational surveys (cross-sectional/time-lag) and mixed-method study designs.
| Study | Design | Sample | N | Analysis | Evidence | Causal? | Manip. Check? | CMB Risk | Effect Size? | Limits | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [ | mixed (interviews + survey) | customers | 394/372 | mixed | mixed survey | N | NA | medium | Y | Cross-sectional self-report; no observed behavior; India-only; residual method bias risk. | Moderate |
| [ | survey | panel | 315 | SEM | survey | N | NA | medium | Y | Convenience cross-sectional; no behavioral data; China e-commerce context limits generalizability. | Moderate |
| [ | (survey; implied) | students/young | 1007 | PLS-SEM | survey | N | NA | medium | Y | Snowball/young users; Vietnam-only; cross-sectional; no behavioral data. | Moderate |
| [ | survey | convenience online | 173 | SEM | survey | N | NA | medium | Y | Small sample; recall-based self-report; Spain-only; no real purchase data. | Moderate |
| [ | survey | panel/purposive | 400 | PLS-SEM | survey | N | NA | high | Y | Self-report; no actual behavior; moderator needs; long-term trust impacts not tested. | Moderate |
| [ | cross-sectional survey | convenience | 250 | SEM + CFA | observational | N | N | high | Y | Demographic skew; correlational SEM; calls for larger/multicultural samples + controls. | Moderate |
| [ | cross-sectional survey | purposive | 150 | CFA + SEM | observational | N | NA | high | Y | Non-experimental; privacy/security concerns noted; calls for comparisons/segmentation. | Moderate |
| [ | scenario survey | consumers (China) | 216 | SEM + moderation | survey (stimulus) | N | N | medium | Y | 18–40 tech-savvy; scenario-based; no real behavior; model omits key drivers. | Moderate |
| [ | cross-sectional survey | online consumers | 419 | PLS-SEM | observational | N | N | medium | Y | Region-limited; 18–30 skew; self-report; not industry-specific. | Moderate |
| [ | scenario-based survey + video | MTurk panel | 186 | AMOS + PROCESS | survey + stimulus | N | Y | medium | Y | Personalization not manipulated; suggests experiments; add traits (privacy/trust/skepticism). | Moderate |
| [ | cross-sectional survey | Lazada users (SG) | 305 | PLS-SEM | observational | N | N | medium | Y | Single platform/country; omitted constructs; no behavior outcomes. | Moderate |
| [ | cross-sectional survey | Gen Z/Millennials (PH) | 370 | PLS-SEM | observational | N | N | medium | Y | Narrow demographic/time window; e-comm only; suggests longitudinal/cross-cultural + AI literacy/ethics. | Moderate |
| [ | cross-sectional survey | chatbot shoppers | 423 | SEM + serial mediation | observational | N | N | medium | Y | Turkey-only; cross-sectional; early-adopter skew; suggests longitudinal replication. | Moderate |
| [ | cross-sectional survey | pharma e-comm users (CN) | 371 | CB-SEM | observational | N | N | medium | Y | 20–40/educated skew; cross-sectional; suggests experiments + broader samples. | Moderate |
| [ | cross-sectional survey | Gen Z (CN) | 480 | PLS-SEM + ANN + NCA | observational | N | N | moderate | Y | China-only Gen Z; male/student skew; intention not behavior; cross-sectional. | Moderate |
| [ | cross-sectional survey | e-comm chatbot users (MY) | 362 | PLS-SEM | observational | N | N | medium | Y | Malaysia-only; cross-sectional; omitted attributes; uncanny-valley interpretation needs testing. | Moderate |
| [ | cross-sectional survey | e-comm customers (LK) | 385 | PLS-SEM | observational | N | N | high (single-source) | Y | Purposive sample; cross-sectional; CMB risk acknowledged; recommends experiments/probability sampling. | Moderate |
| [ | cross-sectional survey | USA online panel | 510 | PLS-SEM | observational | N | N/NA | medium | Y | Non-probability; USA-only; suggests broader contexts + qualitative. | Moderate |
| [ | cross-sectional survey | e-comm users (ID) | 310 | PLS-SEM + predict/IPMA | observational | N | N | medium | Y | Cross-sectional; low out-of-sample predictive power; context-specific; calls for broader domains. | Moderate |
| [ | cross-sectional survey | e-comm users (CN) | 299 | OLS + interactions | observational | N | N | moderate | Y | Limited attributes; e-comm only; suggests human-vs-machine comparisons + trait moderators. | Good/usable |
| [ | cross-sectional survey | Türkiye users | 210 | SEM (AMOS) | observational | N | N/NA | low–moderate | Y | Convenience sample; single country; cross-sectional; self-report; calls for broader samples. | Medium |
| [ | cross-sectional survey | panel shoppers | 366 | regression + nonlinear med | observational | N | NA | medium | Y | Recall-based; cross-sectional; suggests store-specific designs and diverse samples; no behavioral outcomes. | Moderate |
| [ | cross-sectional survey | adult shoppers (IN) | 448 | PLS-SEM | observational | N | N | moderate | Y | Early-adopter context; intentions not behavior; calls for experiments and more factors. | Moderate |
| [ | time-lag survey | e-comm customers (CN) | 343 | SEM/path modeling | observational (time-lag) | N | NA | medium | Y | Context-specific; intention not behavior; sample-size reporting inconsistency noted. | Medium |
| [ | cross-sectional survey | e-comm users (EG) | 729 | SEM (reporting mixed) | observational | N | NA | medium | Y | Cross-sectional; nonprobability; Egypt-only; reporting inconsistencies. | Medium |
| [ | scenario-based survey | student sample | 170 | SEM (AMOS) | correlational (stimulus) | N | N | high | Y | Student sample; scenario realism; self-report single wave. | Moderate |
| [ | cross-sectional survey | e-comm users (ID) | 209 | PLS-SEM | observational | N | N | medium–high | Y | Limited context; single-source self-report; calls for cross-cultural/longitudinal. | Moderate |
| [ | cross-sectional survey | Prolific panel | 300 | CB-SEM | observational | N | N | medium | Y | Recall bias; self-report; suggests future log/behavioral data + broader antecedents. | Moderate–High |
| [ | cross-sectional survey | consumers (VN) | 476 | PLS-SEM + fsQCA + NCA | observational | N | N | medium | N | VN-only; self-report; cross-sectional; CMV check via VIF < 3.3 reported. | Medium |
Note: Design = study design (experiment/survey/field/mixed). Sample = recruitment source/type. Analysis = primary analytic method (e.g., regression, SEM, mediation). Evidence = evidence tier (e.g., lab experiment, field/observational). Causal? indicates whether the design supports causal inference. Manip. check? = manipulation check reported (experiments only). CMB risk = common method bias risk (self-report/same-source). Effect size? indicates whether standardized or interpretable effect sizes were reported. Limits summarizes key limitations. Quality = overall quality rating per review criteria. NA = Not applicable by design.
Future Research Agenda: Design-ready gaps and testable recommendations.
| Research Gap | Future Recommendation(s) | Suggested Design/Measures |
|---|---|---|
| Field evidence exists but is hard to generalize/replicate across platforms and agent types | Replicate with transparent protocols and multiple platforms/agent architectures | Pre-registered field A/B; report allocation, exposure windows, guardrails; include acceptance → GMV mediation; heterogeneous treatment effects (new vs. returning users) |
| Behavioral effects are under-identified and mechanisms untested in natural settings | Re-run with stronger identification and analytics aligned to outcomes | Randomized field A/B or stepped-wedge; logistic/Poisson models for behavioral DVs; add post-purchase surveys for trust/control; longer retention horizon (30/60/90 days) |
| When autonomy “works” is context-bound (scarcity/urgency) and may not translate to routine shopping | Expand boundary conditions and test persistence/learning | Multi-context experiments (routine vs. urgent); longitudinal panel follow-up; manipulate autonomy + governance (override, confirmations); measure powerlessness, perceived safeguards, and delayed reuse |
| Autonomy sweet-spot evidence lacks behavioral validation and task taxonomy | Validate inverted-U in behavioral settings and differentiate task types | 3-level autonomy (advisor/co-assistant/executor) + task type (objective vs. preference) factorial; behavioral choice + downstream regret; analyze nonlinearity (quadratic/splines) |
| UX/satisfaction mechanisms may not hold in real transactions | Re-test in live platforms and high-stakes purchases | Field pilots with clickstream + post-task UX; measure reactance → difficulty → certainty → performance → satisfaction with behavioral time/effort indicators |
| Cue effects (anthropomorphism/modality/warmth/competence) not anchored to real conversion and retention | Bridge lab-to-field by linking cues to measurable outcomes | “Lab-in-the-field” (real store links, real carts); track add-to-cart, conversion, repeat; include manipulation checks + guardrails (privacy disclosure, control toggles) |
| Interface comparisons (search vs. chatbot) need realistic interaction | Test interactive use and real information disclosure behavior | Interactive tasks with real queries; behavioral disclosure metrics; multi-item measures; compare with/without explanation and privacy assurances |
| Accessibility and comprehension constraints underrepresented in delegation research | Scale accessibility research with diverse ability/age groups | Larger usability trials; standardized accessibility measures; task success + NASA-TLX; comprehension checks for agent actions/intent |
| Over-reliance on correlational SEM; weak evidence about actual delegation outcomes and causal mechanisms | Shift from “what predicts intention” to “what changes behavior and under what governance” | Mixed-method programs: (1) randomized experiments manipulating autonomy/governance; (2) field telemetry; (3) longitudinal panels. Use robust CMV controls; report effect sizes; replicate across cultures/platforms |
| Personalization can trigger threat-to-freedom, but causality needs direct tests | Manipulate personalization and autonomy independently | 2 × 2: personalization (low/high) × autonomy (advisor/executor); measure threat-to-freedom, negative affect, avoidance; include involvement moderation |
| Miscalibrated reliance (overtrust/overreliance) largely unmeasured | Add calibration and error-sensitive reliance outcomes | Calibration tasks with varying agent accuracy; measure reliance under correct vs. incorrect recommendations; compute calibration indices (Brier score-style; trust–accuracy alignment) |
| Trust calibration is a missing construct in agentic commerce | Build a calibration measurement standard | Use trust calibration paradigms (appropriate reliance, skepticism under low accuracy); include accountability cues; measure trust updating after errors |
| User governance is central but under-specified (override, confirmations, transparency) | Treat governance as a design variable, not just a perception | Manipulate override availability, confirmation steps, explainability depth, revocation; measure perceived control, agency, powerlessness; test mediation to adoption/avoidance |
| Price/fairness dynamics in delegation under-modeled (especially when agent executes) | Extend fairness to outcomes and accountability | Experiments with dynamic pricing recommendations; measure fairness as outcome + moderator; add blame attribution and complaint intention |
| Accountability outcomes are a blind spot in delegation | Add regret/blame, responsibility, and recourse outcomes | Scenarios with good vs. bad agent outcomes + varying explanations; outcomes: regret, blame attribution (agent vs. platform vs. self), complaint, switching, legal/recourse preferences |
| Rich constructs, limited causal/behavioral validation | Extend mixed-methods into behavioral trials | Sequential design: qualitative → preregistered experiment → field telemetry; incorporate context-specific constraints and adoption stages |
| Nonlinear effects (e.g., anthropomorphism → privacy) need causal tests | Experimentally test nonlinearities and privacy mechanisms | Manipulate anthropomorphism levels; measure privacy concern, disclosure, and adoption; identify turning points; include heterogeneous effects (privacy-sensitive segments) |
| Predictive generalization is weak for “works in practice” claims | Use out-of-sample validation and behavioral endpoints | Train/test splits and rolling validation; link survey constructs to subsequent usage logs; compare model families (SEM vs. ML) on behavioral prediction (see the second sketch after this table) |
| Population validity is limited (age, literacy, culture, regulation) | Systematically test cross-cultural and demographic heterogeneity | Multi-country harmonized designs; invariance testing; segment by tech readiness, age, risk; include policy/regulation as context variables |
| CMV mitigation often procedural, not design-based | Strengthen causal and multi-source measurement | Multi-wave surveys; objective behavioral data; marker variables; experimental manipulations; triangulate with logs/observations |
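To make the calibration indices in the table above concrete, the sketch below shows one way a trial-level reliance log could be scored. It is a minimal illustration, not a procedure taken from any reviewed study; the data schema and field names (trust, correct, relied) are hypothetical assumptions.

```python
"""Minimal sketch (hypothetical data schema): Brier-style trust
calibration and appropriate-reliance indices from a trial log."""
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    trust: float   # participant's stated trust in the agent, scaled to 0..1
    correct: int   # was the agent's recommendation objectively correct? (0/1)
    relied: int    # did the participant follow the recommendation? (0/1)

def brier_calibration(trials):
    """Mean squared gap between stated trust and agent correctness:
    0 = perfectly calibrated, 1 = maximally miscalibrated."""
    return mean((t.trust - t.correct) ** 2 for t in trials)

def appropriate_reliance(trials):
    """Share of trials where behavior matched agent quality: relied on a
    correct recommendation, or overrode an incorrect one."""
    return mean(1 if t.relied == t.correct else 0 for t in trials)

def overreliance(trials):
    """Reliance rate on the subset of trials where the agent was wrong."""
    wrong = [t for t in trials if t.correct == 0]
    return mean(t.relied for t in wrong) if wrong else 0.0

if __name__ == "__main__":
    log = [Trial(0.9, 1, 1), Trial(0.8, 0, 1),  # second trial: overtrust on an error
           Trial(0.4, 0, 0), Trial(0.7, 1, 1)]
    print(f"Brier-style calibration: {brier_calibration(log):.3f}")
    print(f"Appropriate reliance:    {appropriate_reliance(log):.2f}")
    print(f"Overreliance on errors:  {overreliance(log):.2f}")
```

Reporting the three quantities separately matters because a participant can show high overall reliance yet poor calibration once agent accuracy drops.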
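For the out-of-sample validation direction, the second sketch illustrates the basic train/test logic of linking survey constructs to a later behavioral endpoint and comparing a linear baseline (a stand-in for SEM-style structural paths) against a flexible ML model. The data are synthetic and the construct names are assumptions, not results from the mapped literature.

```python
"""Minimal sketch (synthetic data): held-out behavioral prediction,
linear baseline vs. gradient boosting."""
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical standardized survey constructs, e.g., trust,
# perceived control, perceived usefulness.
X = rng.normal(size=(n, 3))
# Synthetic "true" process with an interaction the linear model omits.
logit = 0.8 * X[:, 0] + 0.5 * X[:, 1] - 0.6 * X[:, 0] * X[:, 2]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))  # later delegation (0/1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
models = {
    "linear baseline": LogisticRegression().fit(X_tr, y_tr),
    "gradient boosting": GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr),
}
for name, model in models.items():
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: held-out AUC = {auc:.3f}")
```

The point is not the specific models but the evaluation discipline: constructs are measured first, behavior is observed later, and predictive claims are scored only on held-out cases.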
Supplementary Materials
The following supporting information can be downloaded at:
References
1. Kim, T.; Lee, O.K.D.; Kang, J. Why People Trust AI Software Robots: A Mediated Moderation Perspective on the Interaction between Their Intelligence and Appearance. Ind. Manag. Data Syst.; 2025; 125, pp. 2426-2456. [DOI: https://dx.doi.org/10.1108/IMDS-04-2024-0329]
2. Park, A.; Lee, S.B. Examining AI and Systemic Factors for Improved Chatbot Sustainability. J. Comput. Inf. Syst.; 2024; 64, pp. 728-742. [DOI: https://dx.doi.org/10.1080/08874417.2023.2251416]
3. Baran, Z.; Karaca, S. Factors Affecting Customer Experience, Attitude, and Repurchase Intention on Smart Tourism Applications. Curr. Issues Tour.; 2024; 29, pp. 902-920. [DOI: https://dx.doi.org/10.1080/13683500.2024.2440809]
4. Martínez Puertas, S.; Illescas Manzano, M.D.; Segovia López, C.; Ribeiro Cardoso, P. Purchase Intentions in a Chatbot Environment: An Examination of the Effects of Customer Experience. Oeconomia Copernic.; 2024; 15, pp. 145-194. [DOI: https://dx.doi.org/10.24136/oc.2914]
5. Sadiq, M.W.; Akhtar, M.W.; Huo, C.; Zulfiqar, S. ChatGPT-Powered Chatbot as a Green Evangelist: An Innovative Path toward Sustainable Consumerism in E-Commerce. Serv. Ind. J.; 2024; 44, pp. 173-217. [DOI: https://dx.doi.org/10.1080/02642069.2023.2278463]
6. Qi, L.; Ko, E.; Cho, M. AI Chatbots with Visual Search: Impact on Luxury Fashion Shopping Behavior. J. Glob. Sch. Mark. Sci. Bridg. Asia World; 2025; 35, pp. 99-117. [DOI: https://dx.doi.org/10.1080/21639159.2024.2429497]
7. Klein, K.; Martinez, L.F. The Impact of Anthropomorphism on Customer Satisfaction in Chatbot Commerce: An Experimental Study in the Food Sector. Electron. Commer. Res.; 2023; 23, pp. 2789-2825. [DOI: https://dx.doi.org/10.1007/s10660-022-09562-8]
8. Schindler, D.; Maiberger, T.; Koschate-Fischer, N.; Hoyer, W.D. How Speaking versus Writing to Conversational Agents Shapes Consumers’ Choice and Choice Satisfaction. J. Acad. Mark. Sci.; 2024; 52, pp. 634-652. [DOI: https://dx.doi.org/10.1007/s11747-023-00987-7]
9. Bai, J.; Zhang, Z.; Zhang, J.; Zhu, J. Insight Agents: An LLM-Based Multi-Agent System for Data Insights. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025); Association for Computing Machinery, Inc.: New York, NY, USA, 2025; pp. 4335-4339.
10. Nie, G.; Zhi, R.; Yan, X.; Du, Y.; Zhang, X.; Chen, J.; Zhou, M.; Chen, H.; Li, T.; Cheng, Z.
11. Huang, X.; Lian, J.; Lei, Y.; Yao, J.; Lian, D.; Xie, X. Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations. ACM Trans. Inf. Syst.; 2025; 43, 96. [DOI: https://dx.doi.org/10.1145/3731446]
12. Zhang, H.; Huang, J.; Mei, K.; Yao, Y.; Wang, Z.; Zhan, C.; Wang, H.; Zhang, Y. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-Based Agents. arXiv; 2025; arXiv: 2410.02644
13. Benke, I.; Gnewuch, U.; Maedche, A. Understanding the Impact of Control Levels over Emotion-Aware Chatbots. Comput. Hum. Behav.; 2022; 129, 107122. [DOI: https://dx.doi.org/10.1016/j.chb.2021.107122]
14. Andrzejak, E.G. AI-Powered Digital Transformation: Tools, Benefits and Challenges for Marketers-Case Study of LPP. Procedia Computer Science; Elsevier B.V.: Amsterdam, The Netherlands, 2023; Volume 219, pp. 397-404.
15. Alti, A.; Lakehal, A. AI-MDD-UX: Revolutionizing E-Commerce User Experience with Generative AI and Model-Driven Development. Future Internet; 2025; 17, 180. [DOI: https://dx.doi.org/10.3390/fi17040180]
16. Orzol, M.; Szopik-Depczynska, K. ChatGPT as an Innovative Tool for Increasing Sales in Online Stores. Procedia Computer Science; Elsevier B.V.: Amsterdam, The Netherlands, 2023; Volume 225, pp. 3450-3459.
17. Gružauskas, V.; Gimžauskienė, E.; Navickas, V. Forecasting Accuracy Influence on Logistics Clusters Activities: The Case of the Food Industry. J. Clean. Prod.; 2019; 240, 118225. [DOI: https://dx.doi.org/10.1016/j.jclepro.2019.118225]
18. Vhatkar, M.S.; Raut, R.D.; Gokhale, R.; Kumar, M.; Akarte, M.; Ghoshal, S. Leveraging Digital Technology in Retailing Business: Unboxing Synergy between Omnichannel Retail Adoption and Sustainable Retail Performance. J. Retail. Consum. Serv.; 2024; 81, 104047. [DOI: https://dx.doi.org/10.1016/j.jretconser.2024.104047]
19. Stummer, C.; Kiesling, E.; Günther, M.; Vetschera, R. Innovation Diffusion of Repeat Purchase Products in a Competitive Market: An Agent-Based Simulation Approach. Eur. J. Oper. Res.; 2015; 245, pp. 157-167. [DOI: https://dx.doi.org/10.1016/j.ejor.2015.03.008]
20. Čavoški, S.; Marković, A. Agent-Based Modelling and Simulation in the Analysis of Customer Behaviour on B2C e-Commerce Sites. J. Simul.; 2017; 11, pp. 335-345. [DOI: https://dx.doi.org/10.1057/s41273-016-0034-9]
21. Khan, R.A.; Feng, G.; Zhu, J. Impact of Knowledge and Electronic Word of Mouth About AI-Based Chatbots on Customers’ Intention to Continue Using Chatbots in E-Commerce. J. Organ. Comput. Electron. Commer.; 2025; 36, pp. 78-99. [DOI: https://dx.doi.org/10.1080/10919392.2025.2563463]
22. Yatawara, K.M.B.; Sampath, T.; Kalupahana, P.L.; Rathnayake, S.O.; Jayasuriya, N.; Rathnayake, N. Bridging the Chatbot Connection: The Role of AI-Driven Chatbot Affordances in e-Commerce Purchase Intentions. Int. J. Sociol. Soc. Policy; 2025; pp. 1-21. [DOI: https://dx.doi.org/10.1108/IJSSP-01-2025-0003]
23. Yoon, N.; Choi, D.; Lee, H.K. Avoidance of AI-Empowered Digital Service Assistants in Fashion Shopping: The Negative Side of Personalized Recommendation of Chatbots. J. Glob. Fash. Mark.; 2025; 16, pp. 423-442. [DOI: https://dx.doi.org/10.1080/20932685.2025.2510946]
24. Pizzi, G.; Scarpi, D.; Pantano, E. Artificial Intelligence and the New Forms of Interaction: Who Has the Control When Interacting with a Chatbot? J. Bus. Res.; 2021; 129, pp. 878-890. [DOI: https://dx.doi.org/10.1016/j.jbusres.2020.11.006]
25. Chopra, A.K.; Christie, V.S.H.; Singh, M.P. Interaction-Oriented Programming: An Application Semantics Approach for Engineering Decentralized Applications. Proceedings of the Annual ACM Symposium on Principles of Distributed Computing; Association for Computing Machinery: New York, NY, USA, 2021; pp. 575-576.
26. Patel, V.; Tejani, P.; Parekh, J.; Huang, K.; Tan, X. Developing A Chatbot: A Hybrid Approach Using Deep Learning and RAG. Proceedings of the 2024 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2024); Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2024; pp. 273-280.
27. Dahiyat, E.A.R. Law and Software Agents: Are They “Agents” by the Way? Artif. Intell. Law; 2021; 29, pp. 59-86. [DOI: https://dx.doi.org/10.1007/s10506-020-09265-1]
28. Lieder, M.; Asif, F.M.A.; Rashid, A. Towards Circular Economy Implementation: An Agent-Based Simulation Approach for Business Model Changes. Auton. Agents Multi Agent Syst.; 2017; 31, pp. 1377-1402. [DOI: https://dx.doi.org/10.1007/s10458-017-9365-9]
29. Sabeg, S.; Maarouk, M.; El Habib Souidi, M. A New Formal Multi-Agent Organization Based on the DD-LOTOS Language. J. Inf. Sci. Eng.; 2024; 40, pp. 1273-1295. [DOI: https://dx.doi.org/10.6688/JISE.202411_40(6).0008]
30. Hassan, N.; Abdelraouf, M.; El-Shihy, D. The Moderating Role of Personalized Recommendations in the Trust–Satisfaction–Loyalty Relationship: An Empirical Study of AI-Driven e-Commerce. Future Bus. J.; 2025; 11, 66. [DOI: https://dx.doi.org/10.1186/s43093-025-00476-z]
31. Ding, Y.; Najaf, M. Interactivity, Humanness, and Trust: A Psychological Approach to AI Chatbot Adoption in e-Commerce. BMC Psychol.; 2024; 12, 595. [DOI: https://dx.doi.org/10.1186/s40359-024-02083-z]
32. Song, S.W.; Shin, M. Uncanny Valley Effects on Chatbot Trust, Purchase Intention, and Adoption Intention in the Context of E-Commerce: The Moderating Role of Avatar Familiarity. Int. J. Hum. Comput. Interact.; 2024; 40, pp. 441-456. [DOI: https://dx.doi.org/10.1080/10447318.2022.2121038]
33. Chakraborty, D.; Kumar Kar, A.; Patre, S.; Gupta, S. Enhancing Trust in Online Grocery Shopping through Generative AI Chatbots. J. Bus. Res.; 2024; 180, 114737. [DOI: https://dx.doi.org/10.1016/j.jbusres.2024.114737]
34. Sousa, R.G.; Ferreira, P.M.; Costa, P.M.; Azevedo, P.; Costeira, J.P.; Santiago, C.; Magalhaes, J.; Semedo, D.; Ferreira, R.; Rudnicky, A.I.
35. Park, J.; Rahman, H.A.; Suh, J.; Hussin, H. A Study of Integrative Bargaining Model with Argumentation-Based Negotiation. Sustainability; 2019; 11, 6832. [DOI: https://dx.doi.org/10.3390/su11236832]
36. Xu, J.; Bi, Y. An Agent-Based Modeling Approach for The Diffusion Analysis Of Electric Vehicles With Two-Stage Purchase Choice Modeling. J. Comput. Inf. Sci. Eng.; 2023; 24, 064502. [DOI: https://dx.doi.org/10.1115/1.4064623]
37. Guo, J.; Zhou, Y.; Burlion, L.; Savkin, A.V.; Huang, C. Autonomous UAV Last-Mile Delivery in Urban Environments: A Survey on Deep Learning and Reinforcement Learning Solutions. Control Eng. Pract.; 2025; 165, 106491. [DOI: https://dx.doi.org/10.1016/j.conengprac.2025.106491]
38. Hu, X.; Xu, X.; Chen, C. Empowering Consumers: The Role of Avatar Choice in Consumer-Chatbot Interaction and Psychological Ownership. Int. J. Hum. Comput. Interact.; 2025; 41, pp. 15311-15322. [DOI: https://dx.doi.org/10.1080/10447318.2025.2495845]
39. Liu, J.; Chen, J. Chatbot-Aided Product Purchases among Generation Z: The Role of Personality Traits. Front. Psychol.; 2025; 16, 1454197. [DOI: https://dx.doi.org/10.3389/fpsyg.2025.1454197]
40. Frank, D.A.; Otterbring, T. Autonomy, Power and the Special Case of Scarcity: Consumer Adoption of Highly Autonomous Artificial Intelligence. Br. J. Manag.; 2024; 35, pp. 1700-1723. [DOI: https://dx.doi.org/10.1111/1467-8551.12780]
41. Frank, D.A.; Folwarczny, M.; Otterbring, T. Consumer Acceptance of High-Autonomy AI Assistants Is Driven by Perceived Benefits in Online Shopping Settings Characterized by Scarcity. Psychol. Mark.; 2025; 43, pp. 538-555. [DOI: https://dx.doi.org/10.1002/mar.70074]
42. Le, M.H.; Duong, Q.N.; Nguyen, H.N.; Au, Q.N.; Pham, N.N. The Impact of Digital Innovation on E-Commerce Young Customer Satisfaction In Vietnam. J. Cent. Bank. Law Inst.; 2025; 4, pp. 79-112. [DOI: https://dx.doi.org/10.21098/jcli.v4i1.264]
43. Chua, A.J.; Cardino, J.L.; Filamor, J.K.; Lao, M.R.; Bernardo, E.L.; Tangsoc, J. Human-Likeness Design Combinations That Affect User Trust in M-Commerce Chatbots. Proceedings of the ACM International Conference Proceeding Series; Association for Computing Machinery, Inc.: New York, NY, USA, 2023; pp. 255-260.
44. Vebrianti, R.; Aras, M.; Putri, M.S.S.; Swandewi, I.A. AI Chatbots in E-Commerce: Enhancing Customer Engagement, Satisfaction and Loyalty. PaperASIA; 2025; 41, pp. 248-260. [DOI: https://dx.doi.org/10.59953/paperasia.v41i2b.445]
45. Ranjan, N.; Gupta, A.; Sharma, L.K.; Kapoor, S. Adoption of AI Chatbots: Revolutionizing E-Grocery Shopping in Delhi-NCR. Proceedings of the 2024 International Conference on Artificial Intelligence and Emerging Technology, Global AI Summit 2024; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2024; pp. 1292-1297.
46. Fan, Y.; Liu, X. Exploring the Role of AI Algorithmic Agents: The Impact of Algorithmic Decision Autonomy on Consumer Purchase Decisions. Front. Psychol.; 2022; 13, 1009173. [DOI: https://dx.doi.org/10.3389/fpsyg.2022.1009173]
47. Luo, Y.; Kumar, N.; Yazdanmehr, A. AI Nudging and Decision Quality: Evidence from Randomized Experiments in Online Recommendation Setting. Decis. Support Syst.; 2026; 200, 114565. [DOI: https://dx.doi.org/10.1016/j.dss.2025.114565]
48. Gao, J.; Opute, A.P.; Jawad, C.; Zhan, M. The Influence of Artificial Intelligence Chatbot Problem Solving on Customers’ Continued Usage Intention in e-Commerce Platforms: An Expectation-Confirmation Model Approach. J. Bus. Res.; 2025; 200, 115661. [DOI: https://dx.doi.org/10.1016/j.jbusres.2025.115661]
49. Chang, T.S.; Hsiao, W.H. Understand Resist Use Online Customer Service Chatbot: An Integrated Innovation Resist Theory and Negative Emotion Perspective. Aslib J. Inf. Manag.; 2025; 77, pp. 962-989. [DOI: https://dx.doi.org/10.1108/AJIM-12-2023-0551]
50. Sundjaja, A.M.; Utomo, P.; Colline, F. The Determinant Factors of Continuance Use of Customer Service Chatbot in Indonesia E-Commerce: Extended Expectation Confirmation Theory. J. Sci. Technol. Policy Manag.; 2025; 16, pp. 182-203. [DOI: https://dx.doi.org/10.1108/JSTPM-04-2024-0137]
51. Cheng, X.; Bao, Y.; Zarifis, A.; Gong, W.; Mou, J. Exploring Consumers’ Response to Text-Based Chatbots in e-Commerce: The Moderating Role of Task Complexity and Chatbot Disclosure. Internet Res.; 2022; 32, pp. 496-517. [DOI: https://dx.doi.org/10.1108/INTR-08-2020-0460]
52. Gharib, M.; Trivedi, V.; Trivedi, A. Conversational AI in E-Commerce: Strategic Implications of Voice-Based Chatbots for Consumer Engagement and Trust. Int. Rev. Manag. Mark.; 2025; 15, pp. 1-9. [DOI: https://dx.doi.org/10.32479/irmm.19352]
53. Christopher, Y.; Sundjaja, A.M.; Mulyono. The Role of AI Chatbots on E-Commerce Platforms: Understanding Its Influence on Customer Trust and Dependability. Proceedings of the 2024 9th International Conference on Information Technology and Digital Applications, ICITDA 2024; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2024.
54. Kim, S.; Priluck, R. Consumer Responses to Generative AI Chatbots Versus Search Engines for Product Evaluation. J. Theor. Appl. Electron. Commer. Res.; 2025; 20, 93. [DOI: https://dx.doi.org/10.3390/jtaer20020093]
55. Ovalle, C. Integrated Recommendation System in ChatGPT to Analyze Post-Purchase Behavior of E-Commerce Store Users. Proceedings of the LACCEI International Multi-Conference for Engineering, Education and Technology; Latin American and Caribbean Consortium of Engineering Institutions: Boca Raton, FL, USA, 2024.
56. Ooi, M.Y.; Lo, P.S.; Cheng-Xi Aw, E.; Dastane, O.; Tan, G.W.H. From Chat to Cart: How AI Boosts Online Impulse Buying. J. Comput. Inf. Syst.; 2025; [DOI: https://dx.doi.org/10.1080/08874417.2025.2576214]
57. Nikolov, A.N.; Iyer, P.; Rokonuzzaman, M.; Batra, G.; Eskridge, B.; Sen, S. The AI chatbot anthropomorphism dilemma. Mark. Intell. Plan.; 2025; pp. 1-24. [DOI: https://dx.doi.org/10.1108/MIP-09-2024-0623]
58. Foroughi, B.; Huy, T.Q.; Iranmanesh, M.; Ghobakhloo, M.; Rejeb, A.; Nikbin, D. Why Users Continue E-Commerce Chatbots? Insights from PLS-FsQCA-NCA Approach. Serv. Ind. J.; 2025; 45, pp. 935-965. [DOI: https://dx.doi.org/10.1080/02642069.2024.2371910]
59. Kamoonpuri, S.Z.; Sengar, A. Predicting Consumer Intentions for Shopping Using Generative AI Chatbots. J. Decis. Syst.; 2025; 34, 2521614. [DOI: https://dx.doi.org/10.1080/12460125.2025.2521614]
60. Enriquez, T.; Moreno, D.E.; Mendoza, H.; Claire Roxas, A.J.; Janel Rodriguez, K.; Tesoro, E.M. Do Consumers’ Perceptions of Artificial Intelligence Drive Chatbot Adoption in Shopping? The Impact of Perceived Information Quality on Intent to Use. Proceedings of the 2025 16th International Conference on E-Education, E-Business, E-Management and E-Learning, IC4e 2025; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2025; pp. 143-147.
61. Park, S.; Jung, H.; Ryskeldiev, B.; Lee, J.; Lee, G.; Jeong, H.; Chun, M. Toward Closed-Domain Conversational Item Listing Assistant for Improvement of Experiences of Older Adults in Customer-to-Customer (C2C) Marketplaces. Proceedings of the ASSETS 2024—Proceedings of the 26th International ACM SIGACCESS Conference on Computers and Accessibility; Association for Computing Machinery, Inc.: New York, NY, USA, 2024.
62. Han, M.C. The Impact of Anthropomorphism on Consumers’ Purchase Decision in Chatbot Commerce. J. Internet Commer.; 2021; 20, pp. 46-65. [DOI: https://dx.doi.org/10.1080/15332861.2020.1863022]
63. Matosas-López, L. The Effect of Chatbots Humanness and Brand Associations on Consumer Purchase Intentions: An Experimental Approach. Electron. Commer. Res.; 2025; [DOI: https://dx.doi.org/10.1007/s10660-025-10064-6]
64. Li, Y.; Gan, Z.; Zheng, B. How Do Artificial Intelligence Chatbots Affect Customer Purchase? Uncovering the Dual Pathways of Anthropomorphism on Service Evaluation. Inf. Syst. Front.; 2025; 27, pp. 283-300. [DOI: https://dx.doi.org/10.1007/s10796-023-10438-x]
65. Kumar, N.; Garg, P.; Gupta, N.; Upreti, K. Examining Humanized Bot Interaction for the Retail Revenue Growth. J. Revenue Pricing Manag.; 2025; 25, pp. 25-40. [DOI: https://dx.doi.org/10.1057/s41272-025-00549-2]
66. Kim, J.; Li, Y.; Choi, J. Understanding the Continuance Intention to Use Chatbot Services. Asia Mark. J.; 2023; 25, pp. 99-110. [DOI: https://dx.doi.org/10.53728/2765-6500.1613]
67. Ionescu, I.-M.; Ardelean, A. The Romanian Consumer’s Perspective on the Integration of Artificial Intelligence in E-Commerce. Rom. Stat. Rev.; 2024; Available online: https://www.revistadestatistica.ro/2024/12/the-romanian-consumers-perspective-on-the-integration-of-artificial-intelligence-in-e-commerce/ (accessed on 23 February 2026).
68. Becan, C.; Çeber, B. How Technology Readiness Influences Behavioral and Purchasing Intention: Serial Multiple Mediating Role of Attitude toward AI and AI-Driven Consumer Chatbot Experience. Digit. Transform. Soc.; 2025; 5, pp. 78-104. [DOI: https://dx.doi.org/10.1108/DTS-04-2025-0082]
69. Jia, J.; Chen, L.; Zhang, L.; Xiao, M.; Wu, C. A Study on the Factors That Influence Consumers’ Continuance Intention to Use Artificial Intelligence Chatbots in a Pharmaceutical e-Commerce Context. Electron. Libr.; 2025; 43, pp. 303-321. [DOI: https://dx.doi.org/10.1108/EL-09-2024-0275]
70. Akdemir, D.M.; Bulut, Z.A. Business and Customer-Based Chatbot Activities: The Role of Customer Satisfaction in Online Purchase Intention and Intention to Reuse Chatbots. J. Theor. Appl. Electron. Commer. Res.; 2024; 19, pp. 2961-2979. [DOI: https://dx.doi.org/10.3390/jtaer19040142]
71. Yue, Y.; Ng, S.-I.; Kamal Basha, N. Consumption Values, Attitudes and Continuance Intention to Adopt ChatGPT-Driven E-Commerce AI Chatbot (LazzieChat). Pak. J. Commer. Soc. Sci.; 2024; 18, pp. 249-284.
72. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E. et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. BMJ; 2021; 372, n71. [DOI: https://dx.doi.org/10.1136/bmj.n71]
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Abstract
Agentic AI is increasingly framed as enabling consumers to delegate commerce decisions and actions to digital assistants, yet consumer-facing evidence still centers on assistive chatbots and recommender-like systems, with scarce evaluation of execution-level delegation. This study provides an evidence-mapping review of empirical work on agentic commerce and synthesizes determinants and outcomes of delegation across three questions: (RQ1) how systems are operationalized (autonomy, task scope, interaction mode, and transaction capability/evidence realism), (RQ2) what facilitates or inhibits delegation, and (RQ3) what downstream outcomes follow for marketing performance and consumer experience. We searched Scopus and Web of Science for English-language, peer-reviewed primary studies (2015–2026) and applied conservative coding rules that distinguish claimed capability from simulated or demonstrated execution. The mapped literature is concentrated in text-based, low-autonomy assistants focused on recommendation and post-purchase support; coverage drops sharply for workflow-level autonomy, cart building, checkout/payment execution, and negotiation. Across studies, findings cluster into two motifs: a utility/assurance pathway in which performance cues and interaction quality increase perceived usefulness, satisfaction, and trust, and a governance pathway in which autonomy cues and system-initiated control trigger reactance/powerlessness and reduce acceptance unless mitigated by safeguards; urgency can attenuate governance resistance. Because most outcomes are intention- or vignette-based, calibration, verification, and error-recovery behaviors remain under-measured. Overall, delegation appears to depend less on maximizing autonomy than on coupling capability with user governance (consent, oversight, recourse, accountability), and we outline measurement priorities for evaluating execution-capable agents.