Abstract
Iran’s Konkur exam (the national university entrance test) assesses EFL proficiency solely through multiple-choice items, neglecting writing and speaking despite their academic importance. To address this gap, this 12-week mixed-methods study compared Intelligent Computer-Assisted Language Assessment (ICALA) with traditional assessment, examining how ICALA affected motivation, anxiety, and proficiency in 120 intermediate Iranian EFL learners (CEFR B1–B2). The experimental group (n = 60) used ICALA via DeepSeek, while the control group (n = 60) received traditional instructor-led assessments with identical tasks (250-word essays, 2-min oral responses). Quantitative data from standardized measures (motivation, anxiety, and proficiency scales) and qualitative data from interviews and reflective journals were analyzed. ICALA demonstrated stronger benefits for motivation, anxiety reduction, and proficiency gains than traditional assessment, particularly among upper-intermediate (B2) learners. Qualitative analysis revealed three dominant themes: (1) enhanced competence through specific feedback, (2) reduced evaluation pressure, and (3) systematic skill improvement. While B2 learners thrived with ICALA’s detailed feedback (e.g., cohesion suggestions), some B1 learners required simplified guidance due to cognitive load. Although Konkur omits productive skills, ICALA improves writing and speaking proficiency, bridging the gap between exam preparation and academic needs. Simplified feedback for B1 learners, along with balanced speaking tasks, could further enhance outcomes. These findings inform EFL instruction reform in Konkur-driven contexts and contribute to Asia–Pacific and global AI assessment research.
Introduction
The integration of artificial intelligence (AI) into language assessment has opened transformative possibilities for English as a Foreign Language (EFL) education, particularly through Intelligent Computer-Assisted Language Assessment (ICALA). By offering automated scoring and real-time feedback, ICALA has the potential to reshape high-stakes testing environments, reducing anxiety and enhancing motivation (Xi, 2023). In the Asia–Pacific region, contemporary high-stakes testing systems continue to prioritize summative outcomes despite growing advocacy for communicative competence (Kirkpatrick, 2016; Nunan, 2003; Xi, 2023). However, ICALA implementation varies due to differences in technological infrastructure, cultural attitudes toward AI, and assessment priorities across contexts.
In Iran, the high-stakes Konkur exam—a multiple-choice test covering vocabulary, grammar, and reading—shapes EFL instruction toward receptive skills, leaving writing and speaking unassessed (Parviz, 2024). Yet, academic success demands these neglected competencies. This contrasts with China's robust AI infrastructure (Li et al., 2024) and Saudi Arabia's growing focus on oral skills (Alkhateeb et al., 2025). These regional differences highlight how cultural and technological contexts shape assessment approaches. In Iran specifically, cultural reliance on instructor-led assessments reflects broader skepticism toward AI in education (Parviz, 2024), creating tension with Self-Determination Theory's (SDT; Ryan & Deci, 2000) assumption that autonomy-supportive environments universally enhance motivation. This tension frames our central investigation: whether ICALA's automated feedback can satisfy SDT's psychological needs (autonomy, competence, relatedness) despite cultural preferences for instructor-led evaluation. Notably, ICALA's prompt, neutral feedback may reduce students' fear of instructor scrutiny—a key concern in Konkur preparation—while addressing Iran's unique constraints of limited AI infrastructure and rural–urban disparities (Oskoui et al., 2024; Xi, 2023). Thus, this study offers a model for high-stakes exam systems that prioritizes both immediate feedback and construct validity.
However, despite ICALA's potential to enhance assessment efficiency (Alkhateeb et al., 2025), significant challenges remain. Beyond Iran's infrastructural limitations, concerns about construct validity and ethical issues like algorithmic bias (i.e., systematic errors in AI systems that disadvantage specific learner groups) persist (Voss et al., 2023; Xi, 2023). For instance, while AI tools struggle to assess writing coherence in Persian-influenced English (Mirzaeian, 2025), the psychological effects of ICALA on motivation and anxiety remain understudied in Iran's high-stakes context—a critical gap given anxiety's documented impact on achievement (Xin et al., 2025). Unlike neighboring countries' research on speaking anxiety (Abdulhussein Dakhil et al., 2025) or oral skills (Makhlouf, 2021), Iran lacks studies on ICALA's impact on university-level writing and speaking assessments (Parviz, 2024), making this investigation both timely and necessary.
While Konkur’s format excludes productive skills, writing/speaking practice remains critical for post-exam academic demands (e.g., university coursework). This study investigates whether ICALA can bridge this gap by enhancing skills beyond Konkur’s scope but essential for learners’ long-term success.
To address these challenges and gaps, this study employs a mixed-methods approach to investigate ICALA’s effects on motivation, anxiety, and speaking/writing proficiency among Iranian university EFL students. By focusing on Iran’s Konkur-driven context, it offers novel insights into tailoring AI-mediated assessment to high-stakes environments. The findings aim to inform regional assessment policies, such as potential Konkur reforms in Iran, and contribute to equitable AI-driven assessment practices in the Asia–Pacific and global contexts.
Literature review
AI in language assessment: global trends and applications
This section examines global advancements in artificial intelligence (AI) for English as a Foreign Language (EFL) assessment, highlighting benefits and ethical challenges. AI tools like DeepSeek and ChatGPT provide rapid feedback on writing/speaking, enhancing efficiency and motivation (Jiang, 2022). Studies on other AI writing assistants, such as Grammarly, have also demonstrated efficacy in improving specific linguistic features like grammatical accuracy among EFL learners (Ebadi et al., 2023). However, their validity for assessing authentic proficiency remains debated (Amin, 2023). While AI tools like ChatGPT show promise for teacher training (Farhat & Ouchouid, 2025), their “black box” nature and bias risks (Oskoui et al., 2024; Parviz, 2024; Xi, 2023) complicate adoption in resource-constrained contexts like Iran. Ethical challenges in AI-mediated assessment, such as algorithmic bias, require rigorous oversight (Voss, 2024). These findings underscore AI’s transformative potential, contingent on ethical frameworks and teacher training. While AI tools like DeepSeek show global promise, their psychological impact on learners—particularly in high-stakes contexts—requires closer examination.
Psychological factors in AI-mediated assessment
Iran’s Konkur exam reinforces the ‘Ought-to L2 Self’ (Dörnyei, 2014), prioritizing extrinsic motivation (e.g., exam pressure) over intrinsic growth. ICALA’s feedback (Ryan & Deci, 2000) may instead cultivate an ‘Ideal L2 Self’ through competence-supportive practice. In Iran, reliance on instructor authority may initially overshadow trust in automated feedback (Oskoui et al., 2024), creating tension with SDT’s autonomy principle. This duality suggests ICALA’s success hinges on balancing technological novelty with culturally familiar pedagogical scaffolds. Abdulhussein Dakhil et al. (2025) found that tools like ELSA Speech Analyzer boost self-efficacy via immediate feedback. Conversely, low-proficiency learners may face heightened anxiety with unreliable AI systems (Zaiarna et al., 2024), particularly in Iran’s context (Parviz, 2024). Test-taker engagement, critical for equitable assessment, is often underexplored, with studies noting limited focus on cognitive impacts of AI tools like chatbots (Jin & Fan, 2023). These insights highlight the need to balance AI’s feedback with psychological support in high-stakes contexts.
Language assessment challenges in the Asia–Pacific region
Writing-centric high-stakes exams dominate Asia–Pacific education systems, yet their AI integration strategies vary. China’s Gaokao uses AI for automated essay scoring but retains human raters for coherence (Huang, 2024), while Japan’s Center Test—like Iran’s Konkur—marginalizes speaking skills entirely (Kirkpatrick, 2016). South Korea’s CSAT, however, balances writing and speaking through AI-powered chatbots (Heathco, 2023). Iran’s unique constraints (e.g., sanctions, limited AI infrastructure) make it a critical case study for adapting ICALA in resource-constrained, exam-driven contexts. In Bangladesh, infrastructural barriers hinder AI adoption, a challenge echoed in Iran’s rural universities (Uddin et al., 2024). Konkur’s receptive-skills focus (Kirkpatrick, 2016) is mismatched with university-level language demands. Unlike Iraq’s and Saudi Arabia’s AI-supported speaking assessments (Abdulhussein Dakhil et al., 2025; Alkhateeb et al., 2025), Iran’s Konkur prioritizes writing. Iranian EFL teachers’ unreadiness for AI, driven by economic constraints and limited technological literacy, necessitates tailored training (Ghiasvand et al., 2024). Limited AI access, due to sanctions and internet filtering, exacerbates inequities, underscoring the need for infrastructure and training (Oskoui et al., 2024).
These challenges align with broader systemic barriers identified in EAP contexts. Mirsanjari’s (2025) study of Iranian EAP instructors revealed that high-stakes testing cultures (reported by 91.7% of participants), institutional resistance (83.3%), and resource constraints (80%) hinder effective assessment practices, even after teacher development courses (TDCs). While TDCs improved assessment literacy—particularly in formative feedback (mean difference = +6.5, p < .001)—entrenched norms limited implementation, especially among veteran teachers (21+ years’ experience). Notably, digital tools like Moodle (M = 3.8) and Google Classroom (M = 4.0) showed promise but faced low integration rates (23%) due to infrastructural gaps. These findings underscore the interplay between teacher readiness, institutional policies, and technological access in exam-driven systems like Iran’s Konkur—a gap ICALA must address to achieve scalable impact.
Research gap and current study
Despite global progress in AI-driven EFL assessment, and while the theoretical plausibility of AI integration in Iranian classrooms has been acknowledged (Teimourtash, 2024), few empirical studies examine its psychological impacts in Iran's Konkur-driven context, where motivation, anxiety, and learning outcomes are critical. While AI tools like writing assistants show promise in enhancing proficiency among Iranian high school students, their effects on university learners’ affective states remain underexplored (Delavari & Talebi, 2024). The debate over AI-assisted assessment tools highlights tensions between authenticity and validity, yet Iran-specific insights are scarce (Voss et al., 2023). This study investigates how Intelligent Computer-Assisted Language Assessment (ICALA) influences Iranian university EFL students’ motivation, anxiety, and proficiency, yielding three actionable insights for: (1) pedagogical design (e.g., simplified feedback protocols for B1 learners), (2) technological adaptation (e.g., Persian-English parallel corpora development), and (3) assessment policy (e.g., Konkur rubric modifications to include writing/speaking criteria). These findings directly inform the context-specific integration of AI, while proposing concrete steps for exam reform.
While studies in Iraq (Abdulhussein Dakhil et al., 2025) and China (Jiang, 2022) demonstrate ICALA’s efficacy in reducing speaking anxiety and enhancing motivation, these findings may not generalize to Iran’s Konkur-driven context, where writing proficiency dominates high-stakes outcomes. Furthermore, Iran’s infrastructural barriers (e.g., sanctions, internet filtering) and teacher readiness gaps (Ghiasvand et al., 2024) necessitate tailored ICALA frameworks that address both pedagogical and systemic challenges. This study bridges this gap by examining ICALA’s psychological and proficiency impacts within Iran’s distinct ecosystem.
Research questions
How does ICALA affect Iranian university EFL students’ motivation in speaking and writing assessments?
To what extent does ICALA influence test anxiety among Iranian university EFL students during speaking and writing assessments?
What are the impacts of ICALA on Iranian university EFL students’ speaking and writing proficiency?
Methodology
Research design
This study employed an explanatory sequential mixed-methods design to examine the effects of Intelligent Computer-Assisted Language Assessment (ICALA) on 120 intermediate EFL learners’ (CEFR B1–B2) motivation, anxiety, and speaking/writing proficiency in Iran’s Konkur-driven context. The quantitative phase measured outcomes using validated instruments, followed by a qualitative phase with interviews and reflective journals to explore participant experiences. The experimental group (n = 60) used ICALA via DeepSeek, while the control group (n = 60) used traditional instructor-led assessments, ensuring ecological validity (Parviz, 2024). The study’s sequential phases are summarized in Fig. 1.
Fig. 1 Explanatory sequential mixed-methods design of this study [image not reproduced]
Participants
The study involved 120 intermediate EFL learners (aged 18–22, CEFR B1–B2) from three public Tehran universities, selected via stratified random sampling with equal gender distribution (60 male, 60 female) and balanced representation across STEM (n = 64) and humanities (n = 56) disciplines. Participants reported varying socioeconomic backgrounds (32% from families with monthly incomes < 50 million IRR, 68% ≥ 50 million IRR) and prior AI tool exposure (41% regular users, 59% minimal experience). Participants were stratified by proficiency using the Oxford Placement Test (OPT; Allan, 2004), a standardized assessment aligned with CEFR levels. The OPT’s 50-item grammar/vocabulary/reading battery (score range: 0–100) demonstrated strong reliability in our pilot (α = 0.88). We applied Oxford University Press’s (2023) benchmarks (a score-banding sketch follows the list):
B1 (Intermediate): 41–60 points
B2 (Upper-Intermediate): 61–80 points
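These cut points map directly onto the study’s score banding; a minimal sketch in Python, in which the function name is illustrative:

```python
# CEFR banding from OPT scores, per the Oxford University Press (2023)
# benchmarks quoted above. A minimal sketch; the function name is
# illustrative, and scores outside 41-80 simply fall outside this
# study's B1-B2 sampling frame.
def cefr_band(opt_score: int) -> str:
    """Map a 0-100 OPT score onto the study's CEFR strata."""
    if 41 <= opt_score <= 60:
        return "B1"  # intermediate
    if 61 <= opt_score <= 80:
        return "B2"  # upper-intermediate
    return "outside B1-B2 range"

print(cefr_band(57))  # -> B1
```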
This stratification ensured balanced groups despite natural enrollment disparities: the B1 subsample was smaller (n = 45 vs. B2’s n = 75) because fewer students at these Tehran universities test at the intermediate (B1) level than at the upper-intermediate (B2) level in mandatory English courses—a documented enrollment pattern (Parviz, 2024). Post hoc power analysis (G*Power 3.1) confirmed that the B2 group (n = 75) achieved 0.80 power for medium effects (f = 0.25, α = .05), while the B1 group (n = 45) reached only 0.68—acceptable for exploratory comparisons (Cohen, 2013) but requiring caution when generalizing B1-specific findings. For qualitative data, 30 participants (15 per group) were purposively selected, including extreme and typical motivation/anxiety scores, to capture diverse experiences. For this subsampling, ‘extreme’ scores were defined as ±1.5 SDs from the mean motivation (AMTB) and anxiety (FLCAS) scores and ‘typical’ as ±0.5 SDs, using pre-test survey data. This yielded four strata (high/low motivation × high/low anxiety), from each of which we randomly selected 7–8 participants (total n = 30) to mitigate selection bias while ensuring diverse experience representation. All participants provided written informed consent prior to their involvement in the study and were informed of the study’s purpose, procedures, and their right to withdraw at any time without consequences.
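For readers who want a scriptable check of the reported power figures, the computation can be approximated with statsmodels; this is a minimal sketch assuming a two-group one-way ANOVA with Cohen’s f = 0.25 and α = .05 (the exact G*Power design options are not reported, so the output is illustrative rather than a reproduction):

```python
# Post hoc power check analogous to the G*Power 3.1 analysis reported above.
# Assumes a two-group one-way ANOVA with a medium effect (Cohen's f = 0.25);
# other G*Power design choices would change the result.
from statsmodels.stats.power import FTestAnovaPower

solver = FTestAnovaPower()
for label, n_total in [("B2 subsample", 75), ("B1 subsample", 45)]:
    power = solver.solve_power(
        effect_size=0.25,  # Cohen's f
        nobs=n_total,      # total observations across both groups
        alpha=0.05,
        k_groups=2,        # experimental vs. control
    )
    print(f"{label} (n = {n_total}): power = {power:.2f}")
```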
Ethics approval
This study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board (IRB) of Tehran University. All procedures involving human participants adhered to ethical standards, and written informed consent was obtained from all participants.
Instruments
Five instruments, validated through pilot testing (n = 20, excluded from the main study), ensured methodological rigor, with Persian adaptations reviewed by bilingual linguists (α ≥ 0.79, κ ≥ 0.85):
Intrinsic Motivation Inventory (IMI): An 18-item Likert-scale measure (Ryan & Deci, 2000) assessed intrinsic motivation (interest, competence, value; α = 0.79).
Language Learning Motivation Questionnaire (LLMQ): Adapted from Alrabai (2014), 15 items measured extrinsic motivation (e.g., exam preparation; α = 0.84), validated by EFL experts.
Foreign Language Classroom Anxiety Scale (FLCAS): Ten modified items from Horwitz et al. (1986) targeted AI-mediated assessment anxiety (α = 0.88).
Second Language Writing Anxiety Inventory (SLWAI): Cheng’s (2004) 22-item scale assessed writing-specific anxiety (α = 0.85), tailored to ICALA.
Proficiency Tests: Speaking (2-min responses) and writing (250-word essays), based on Konkur-style prompts, were scored using IELTS/TOEFL rubrics by two raters (κ = 0.85). DeepSeek’s Performance Score supplemented writing scores (r = 0.79 with human ratings).
Qualitative Tools: Semi-structured interviews (30 participants, 60–90 min) followed Jin and Fan’s (2023) framework. Reflective journals (240 entries) used prompts (e.g., “How did AI feedback affect your confidence?”). All qualitative data underwent thematic analysis (Braun & Clarke, 2006) using NVivo 14, with a hybrid inductive-deductive coding approach. Two researchers independently coded 20% of the data, achieving strong inter-coder reliability (κ = .82, 95% CI [.76, .88]) before resolving discrepancies through discussion. Codes were organized into themes through iterative refinement, with member checking (10 participants) and peer debriefing (3 EFL experts) ensuring trustworthiness. Data saturation was confirmed after the 24th interview.
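The reported inter-coder agreement can be computed directly from the two coders’ parallel segment labels; a minimal sketch using scikit-learn, with hypothetical labels standing in for the study’s codes and a simple percentile bootstrap for the 95% CI:

```python
# Inter-coder reliability for the double-coded 20% subset. The label lists
# are hypothetical stand-ins for the two researchers' segment-level codes.
import numpy as np
from sklearn.metrics import cohen_kappa_score

coder_a = np.array(["competence", "pressure", "skill", "competence",
                    "pressure", "skill", "competence", "skill",
                    "pressure", "competence", "skill", "pressure"])
coder_b = np.array(["competence", "pressure", "skill", "pressure",
                    "pressure", "skill", "competence", "skill",
                    "pressure", "competence", "competence", "pressure"])

kappa = cohen_kappa_score(coder_a, coder_b)

# Percentile bootstrap for a 95% CI (nanpercentile guards against rare
# degenerate resamples where kappa is undefined).
rng = np.random.default_rng(42)
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(coder_a), len(coder_a))
    boot.append(cohen_kappa_score(coder_a[idx], coder_b[idx]))
lo, hi = np.nanpercentile(boot, [2.5, 97.5])
print(f"kappa = {kappa:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```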
Writing/speaking tasks were scored using IELTS rubrics, as Konkur lacks a productive skill assessment. This aligns with university proficiency benchmarks. A pre-intervention AI literacy survey (10 items, α = 0.80) assessed participants’ AI familiarity, addressing Iran’s low AI literacy (Oskoui et al., 2024).
Procedure
The 12-week intervention, conducted in spring 2025 at Tehran universities, aligned with prior EFL/AI studies (Delavari & Talebi, 2024). Though Konkur excludes writing/speaking, study tasks (essays, oral responses) were designed to:
Reinforce grammar/vocabulary tested in Konkur (e.g., tense accuracy in essays mirrors MC grammar items),
Prepare learners for post-Konkur academic needs.
Participants completed pre-tests (OPT, motivation/anxiety surveys, proficiency tests) in dedicated computer labs, using institutional devices with standardized technical setups (Windows 10, 8GB RAM). Technical support was provided by trained research assistants during all sessions to ensure consistent platform access (DeepSeek web interface) and to troubleshoot connectivity issues. The experimental group received six bi-weekly ICALA tasks (three writing: argumentative/descriptive/narrative essays; three speaking: opinion summaries/picture descriptions/role-plays), progressively increasing in complexity (e.g., Essay 1: 200 words on familiar topics; Essay 3: 250 words with academic citations). Each automated score from DeepSeek underwent bi-weekly manual adjustment by two trained raters using a Persian-English error typology checklist (e.g., overriding false positives in preposition usage), with final scores reflecting 70% AI and 30% human weighting.
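The 70/30 blending rule reduces to a one-line computation; a minimal sketch, in which the function name and example band scores are illustrative rather than the study’s actual scoring pipeline:

```python
# Final task score under the reported 70% AI / 30% human weighting.
# A minimal sketch; the function name and example band scores are
# illustrative, not the study's actual scoring pipeline.
def final_score(ai_score: float, human_adjusted_score: float) -> float:
    """Blend DeepSeek's automated band score with the raters' manually
    adjusted score, both on the same 1-9 scale."""
    return round(0.7 * ai_score + 0.3 * human_adjusted_score, 1)

# Example: DeepSeek awards 6.0; raters adjust to 5.0 after applying the
# Persian-English error typology checklist.
print(final_score(6.0, 5.0))  # 0.7*6.0 + 0.3*5.0 = 5.7
```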
To mitigate the Hawthorne effect, the study’s focus on anxiety/motivation was concealed until debriefing, and task completion times were standardized (30 min for essays, 2 min for oral responses). The experimental group received bi-weekly ICALA tasks with immediate feedback, while the control group’s instructor feedback was delayed by 48–72 h to mirror typical classroom practices. This design controlled for novelty bias while maintaining ecological validity.
DeepSeek functionality and feedback delivery
DeepSeek is an AI-driven language assessment platform that provides automated scoring and real-time feedback on EFL learners’ writing and speaking tasks. Feedback was delivered immediately after each task submission, with learners able to request clarification through a dedicated query interface (e.g., clicking “Explain this correction” for annotated errors). For this study, DeepSeek was configured to deliver granular feedback on grammar (e.g., tense errors), coherence (e.g., transitional phrases), and lexical variety (e.g., word-choice suggestions), aligned with IELTS/TOEFL rubrics. Its feedback mechanism includes:
Performance Scores (1–9 scale, validated against human raters; r = 0.79 in pilot testing),
Text-based suggestions (3–5 per task) targeting specific errors (e.g., “Use ‘therefore’ for cohesion”), and
Pronunciation/fluency analytics for speaking tasks (e.g., “Improve stress on multi-syllabic words”).
To address Persian-influenced errors (e.g., preposition misuse), a linguist-developed checklist was applied bi-weekly to adjust scores manually, ensuring cultural fairness (e.g., overriding L1 transfer patterns unrelated to proficiency). While effective for discrete errors, DeepSeek has recognized limitations: (1) difficulty detecting nuanced coherence issues in Persian-influenced English writing (e.g., paragraph-level logical flow), and (2) reduced accuracy in assessing Persian-accented English phonemes (e.g., /æ/ vs. /ɑ/ distinctions). These were mitigated through weekly human rater oversight.
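The study used DeepSeek’s web interface, but comparable rubric-aligned feedback can be scripted against DeepSeek’s OpenAI-compatible chat API; in this minimal sketch, the prompt wording, the suggestion cap, and the sample essay are illustrative assumptions, not the study’s actual configuration:

```python
# Scripted approximation of the feedback configuration described above.
# The study used DeepSeek's web interface; this sketch instead calls
# DeepSeek's OpenAI-compatible chat API. The prompt wording, rubric
# framing, and 3-5-suggestion cap are illustrative assumptions.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",
                base_url="https://api.deepseek.com")

essay = "I go to school yesterday because I has exam."  # toy learner draft

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system",
         "content": ("You are an EFL writing assessor. Score the essay on a "
                     "1-9 IELTS-style band scale and give 3-5 text-based "
                     "suggestions targeting grammar, coherence, and lexical "
                     "variety. Do not rewrite the essay.")},
        {"role": "user", "content": essay},
    ],
)
print(response.choices[0].message.content)
```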
Data analysis
Quantitative data were analyzed using SPSS 28:
Proficiency: Paired t-tests compared pre- and post-test speaking/writing scores; independent t-tests examined group differences (Cohen’s d). Missing task data (3.7% of cases) were addressed using pairwise deletion after confirming missingness was completely at random (Little’s MCAR test: χ² = 5.21, p = .27). A scriptable analogue of the within-group tests is sketched after this list.
Motivation and Anxiety: Multiple regression analyzed predictors including feedback specificity (operationalized as number of error-specific suggestions per task), task repetition, proficiency level, and AI literacy on IMI, LLMQ, FLCAS, and SLWAI scores. Three-step hierarchical regression was employed, entering control variables (age, gender) in Block 1, followed by proficiency level in Block 2, and experimental predictors in Block 3. ANOVA assessed Group (ICALA vs. Control) × Proficiency (B1 vs. B2) effects.
AI Literacy: Pearson’s r correlated baseline AI familiarity with outcomes.
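Although the analyses were run in SPSS 28, the core within-group comparison is straightforward to replicate; a minimal sketch using SciPy, in which the file and column names are hypothetical and the paired-samples Cohen’s d convention shown may differ from SPSS’s default:

```python
# Paired t-test and Cohen's d for pre/post writing scores, mirroring the
# SPSS analyses described above. The file and column names are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("experimental_writing_scores.csv")
pre = df["pre_test"].to_numpy()
post = df["post_test"].to_numpy()

t_stat, p_val = stats.ttest_rel(post, pre)  # within-group paired t-test
diff = post - pre
# Cohen's d for paired samples, using the SD of the difference scores
# (one common convention; the paper does not state which formula was used).
d = diff.mean() / diff.std(ddof=1)

print(f"t({len(diff) - 1}) = {t_stat:.2f}, p = {p_val:.3f}, d = {d:.2f}")
```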
Qualitative data underwent thematic analysis (Braun & Clarke, 2006) using NVivo 14. Two researchers independently coded 20% of transcripts, achieving inter-coder reliability (κ = 0.82) before resolving discrepancies through consensus discussions with a third arbitrator. Codes were refined iteratively through constant comparison, with final themes validated via member checking (10 participants) and peer debriefing (3 EFL experts).
Mixed methods integration was achieved through: (1) joint displays comparing effect sizes with theme prevalence, (2) triangulation of regression predictors with qualitative patterns, and (3) explanatory follow-up where quantitative and qualitative results diverged (e.g., B1 learners’ perceived vs. measured gains).
Validity and reliability
Pilot testing ensured cultural appropriateness, with Cronbach’s α (≥ 0.79) and inter-rater reliability (κ ≥ 0.85). Qualitative credibility was enhanced through member checking, peer debriefing, and thick description. Ethical safeguards included IRB approval, voluntary participation, data anonymization, and protocols to prevent AI over-reliance. DeepSeek’s coherence detection limitations (Mirzaeian, 2025) were mitigated through rater triangulation.
Limitations
Three limitations underscore the need for cautious implementation. First, the Hawthorne effect may inflate motivation gains, though task standardization mitigated this. Second, AI-based feedback’s bias toward Western English norms could disadvantage learners with Persian-influenced dialects, a challenge also observed in China’s Gaokao AI raters (Huang, 2024). Third, Iran’s internet filtering and sanctions necessitate locally hosted AI, which limits access to cloud-based advancements (e.g., real-time accent adaptation).
Results
Research Question 1: How does ICALA influence Iranian university EFL students’ intrinsic and extrinsic motivation?
Quantitative findings
Pre- and post-intervention data were collected from 120 intermediate EFL learners (CEFR B1–B2) using the Intrinsic Motivation Inventory (IMI; Ryan & Deci, 2000) and Language Learning Motivation Questionnaire (LLMQ; Alrabai, 2014) to evaluate Intelligent Computer-Assisted Language Assessment (ICALA) effects on motivation. The experimental group (n = 60) used ICALA via DeepSeek, while the control group (n = 60) used traditional instructor-led assessments with identical tasks (250-word essays, 2-min oral responses). Paired t-tests compared within-group changes, independent t-tests assessed group differences, and a 2 (Group: Experimental, Control) × 2 (Proficiency: B1, B2) ANOVA examined interactions. Effect sizes were interpreted using Cohen’s (2013) guidelines, where d = 0.2 indicates a small effect, 0.5 medium, and 0.8 large. Effect sizes (Cohen’s d, η²) and 95% confidence intervals ensured precision. The B1 subsample (n = 45, power = 0.65–0.68) suggests cautious interpretation of proficiency-level effects; future studies with larger B1 samples (n ≥ 60) could enhance statistical power.
Motivation scores by group and scale
Table 1 shows pre- and post-test scores for IMI Interest/Enjoyment, Perceived Competence, and LLMQ Instrumental Goals. The experimental group significantly improved in Interest/Enjoyment (t(59) = 6.12, p < .001, d = 0.79), Perceived Competence (t(59) = 7.45, p < .001, d = 0.96), and Instrumental Goals (t(59) = 4.02, p < .001, d = 0.52). The control group showed smaller gains in Interest/Enjoyment (t(59) = 2.75, p = .008, d = 0.35), non-significant changes in Perceived Competence (t(59) = 1.60, p = .115, d = 0.20), and marginal gains in Instrumental Goals (t(59) = 1.90, p = .062, d = 0.24). Post-test differences were significant (Interest/Enjoyment: t(118) = 4.90, p < .001, d = 0.90; Perceived Competence: t(118) = 5.70, p < .001, d = 1.04; Instrumental Goals: t(118) = 3.30, p = .001, d = 0.60).
Table 1. Pre- and post-test motivation scores by group and scale (N = 120)
| Group | Scale | Pre-Test M (SD) | Post-Test M (SD) | t(59) | p | Cohen’s d [95% CI] |
|---|---|---|---|---|---|---|
| Experimental | IMI: Interest/Enjoyment | 3.10 (0.87) | 3.89 (0.74) | 6.12 | < .001 | 0.79 [0.62, 0.97] |
| Experimental | IMI: Perceived Competence | 2.93 (0.90) | 3.85 (0.80) | 7.45 | < .001 | 0.96 [0.78, 1.15] |
| Experimental | LLMQ: Instrumental Goals | 4.00 (0.84) | 4.42 (0.77) | 4.02 | < .001 | 0.52 [0.35, 0.69] |
| Control | IMI: Interest/Enjoyment | 3.14 (0.91) | 3.42 (0.82) | 2.75 | .008 | 0.35 [0.19, 0.52] |
| Control | IMI: Perceived Competence | 2.97 (0.92) | 3.16 (0.86) | 1.60 | .115 | 0.20 [0.04, 0.37] |
| Control | LLMQ: Instrumental Goals | 4.04 (0.86) | 4.22 (0.83) | 1.90 | .062 | 0.24 [0.08, 0.41] |

Scores range from 1 (low) to 5 (high). All tests met normality (Shapiro–Wilk p > .05) and homogeneity (Levene’s p > .05) assumptions.
IMI = Intrinsic Motivation Inventory; LLMQ = Language Learning Motivation Questionnaire.
Proficiency-level effects
A 2 × 2 ANOVA revealed significant group × proficiency interactions for intrinsic motivation (F(1,116) = 7.95, p = .006, η² = .06) and extrinsic motivation (F(1,116) = 5.50, p = .021, η² = .05). B2 learners (n = 75; experimental: n = 38, control: n = 37) outperformed B1 learners (n = 45; experimental: n = 22, control: n = 23), particularly in the experimental group (Table 2). Figure 2 shows learners’ intrinsic and extrinsic motivation gains by group and proficiency level.
Table 2. Motivation gains by CEFR proficiency level (Experimental Group, n = 60)
| Level | n | Intrinsic ΔM (SE) | Extrinsic ΔM (SE) | η² [95% CI] |
|---|---|---|---|---|
| B1 | 22 | 0.58 (0.11) | 0.27 (0.08) | .16 [.06, .27] |
| B2 | 38 | 0.92 (0.09) | 0.49 (0.06) | .34 [.22, .46] |

ΔM = mean difference (post-test − pre-test); η² = proportion of variance explained.
Fig. 2 Intrinsic and extrinsic motivation gains by group and proficiency level [image not reproduced]
Regression analysis
Feedback specificity (β = .34, p = .001), AI literacy (β = .28, p = .012), and proficiency level (β = .25, p = .017) predicted intrinsic motivation; task repetition (β = .30, p = .003) and proficiency level (β = .22, p = .029) predicted extrinsic motivation (R² = .62 and .49, respectively; see Supplementary Table S1).
Qualitative findings
Thematic analysis of 30 interviews (mean = 63 min, SD = 12.4; 15 per group) and 360 journal entries (240 experimental, 120 control) identified three themes (κ = .82, 95% CI [.76, .88]; α = .89, 95% CI [.85, .93]). Saturation occurred by the 25th interview. Themes reflect Konkur pressures and AI skepticism (Ghiasvand et al., 2024). DeepSeek’s coherence feedback limitations were addressed through weekly instructor-AI calibration sessions in which: (1) human raters reviewed 20% of AI-scored essays to identify systematic coherence assessment gaps, (2) linguists developed supplemental prompts to guide students on Persian-to-English rhetorical transitions, and (3) borderline cases (scores within ±0.5 of band thresholds) received dual human-AI evaluation.
These results align with Jiang (2022), where AI feedback enhanced motivation in Chinese EFL contexts, but highlight Konkur’s unique writing focus.
1. Competence Development (182/240 experimental journals, 75.8%; 51/60 experimental participants, 85%; 72/120 control journals, 60%; 32/60 control participants, 53.3%)
DeepSeek’s specific feedback, such as “Correct ‘I go to school yesterday’ to ‘I went to school yesterday’,” enhanced writing competence, while the control group’s instructor feedback, e.g., “Improve past tense usage,” was less detailed.
Exemplar Quote (Experimental): “My coherence score improved from 4.1 to 5.0, showing clear progress.” (Participant 39, Journal, Week 10)
Exemplar Quote (Control): “Teacher feedback helped, but it was less specific than I needed.” (Participant 78, Journal, Week 9)
2. Exam-Focused Motivation (159/240 experimental journals, 66.3%; 47/60 experimental participants, 78%; 65/120 control journals, 54.2%; 30/60 control participants, 50%)
ICALA’s Konkur-aligned feedback, such as “Use more cohesive devices like ‘therefore’,” boosted exam preparation; control feedback, e.g., “Write clearer arguments,” was less targeted.
Exemplar Quote (Experimental): “AI grammar corrections targeted Konkur errors, helping me prepare.” (Participant 15, Interview)
Exemplar Quote (Control): “Instructor feedback focused on exams but felt repetitive.” (Participant 82, Interview)
3. Proficiency-Dependent Reactions (28/30 interviews, 93.3%; 55/60 experimental participants, 91.7%; 50/60 control participants, 83.3%). B1 learners (experimental: 13/22, 59.1%; control: 14/23, 60.9%) reported frustration with feedback complexity or vagueness; far fewer B2 learners did so (experimental: 6/38, 15.8%; control: 8/37, 21.6%), with most finding the feedback motivating.
Exemplar Quote (Experimental, B1): “Too many corrections overwhelmed me.” (Participant 08, Journal, Week 6)
Exemplar Quote (Experimental, B2): “Detailed feedback refined my fluency.” (Participant 27, Interview)
Exemplar Quote (Control, B1): “Teacher comments were too vague, slowing my progress.” (Participant 66, Journal, Week 7)
B1 learners’ frustration with DeepSeek’s detailed feedback, such as multiple grammar and coherence suggestions, may reflect higher cognitive load, particularly for lower proficiency levels (Sweller, 2011). B2 learners, with stronger linguistic foundations, leveraged specific feedback to enhance fluency and exam preparation, suggesting a need for tailored simplification in AI feedback for B1 learners to reduce cognitive overwhelm.
ICALA’s immediate, specific feedback drove larger motivation gains (d = 0.52–0.96) than the control group’s less detailed instructor feedback (d = 0.20–0.35), explaining the quantitative differences in Perceived Competence and Instrumental Goals. Supplementary Codebook S1 includes coding details.
Research Question 2: How does ICALA influence Iranian university EFL students’ anxiety in speaking and writing assessments?
Quantitative findings
Pre- and post-intervention data were collected from 120 intermediate EFL learners (CEFR B1–B2) using the Foreign Language Classroom Anxiety Scale (FLCAS; Horwitz et al., 1986) and Second Language Writing Anxiety Inventory (SLWAI; Cheng, 2004) to assess Intelligent Computer-Assisted Language Assessment (ICALA) effects on anxiety. The experimental group (n = 60) used ICALA via DeepSeek, while the control group (n = 60) used traditional instructor-led assessments with identical tasks (250-word essays, 2-min oral responses). Paired t-tests evaluated within-group changes, independent t-tests compared group differences, and a 2 (Group: Experimental, Control) × 2 (Proficiency: B1, B2) ANOVA examined interactions. Effect sizes (Cohen’s d, η²) and 95% confidence intervals ensured precision. As with RQ1, the B1 subsample’s limited power (0.65–0.68) warrants cautious interpretation of proficiency-level effects.
Anxiety scores by group and scale
Table 3 presents pre- and post-test scores for FLCAS Overall, FLCAS Speaking, and SLWAI Writing. The experimental group significantly reduced anxiety in FLCAS Overall (t(59) = −4.15, p < .001, d = −0.54), FLCAS Speaking (t(59) = −3.48, p = .001, d = −0.45), and SLWAI Writing (t(59) = −5.36, p < .001, d = −0.70). The control group showed non-significant reductions in FLCAS Overall (t(59) = −1.60, p = .115, d = −0.20), FLCAS Speaking (t(59) = −1.40, p = .167, d = −0.18), and SLWAI Writing (t(59) = −1.80, p = .077, d = −0.23). Post-test differences were significant (FLCAS Overall: t(118) = −4.05, p < .001, d = −0.74; FLCAS Speaking: t(118) = −3.25, p = .002, d = −0.60; SLWAI Writing: t(118) = −4.80, p < .001, d = −0.87).
Table 3. Pre- and post-test anxiety scores by group and scale (N = 120)
| Group | Scale | Pre-Test M (SD) | Post-Test M (SD) | t(59) | p | Cohen’s d [95% CI] |
|---|---|---|---|---|---|---|
| Experimental | FLCAS Overall | 3.45 (0.78) | 3.02 (0.82) | −4.15 | < .001 | −0.54 [−0.67, −0.24] |
| Experimental | FLCAS Speaking | 3.52 (0.81) | 3.18 (0.77) | −3.48 | .001 | −0.45 [−0.57, −0.20] |
| Experimental | SLWAI Writing | 3.68 (0.85) | 3.12 (0.79) | −5.36 | < .001 | −0.70 [−0.84, −0.40] |
| Control | FLCAS Overall | 3.47 (0.80) | 3.29 (0.85) | −1.60 | .115 | −0.20 [−0.36, −0.04] |
| Control | FLCAS Speaking | 3.54 (0.83) | 3.38 (0.81) | −1.40 | .167 | −0.18 [−0.34, −0.02] |
| Control | SLWAI Writing | 3.70 (0.87) | 3.49 (0.83) | −1.80 | .077 | −0.23 [−0.39, −0.07] |

Scores range from 1 (low) to 5 (high). All tests met normality (Shapiro–Wilk p > .05) and homogeneity (Levene’s p > .05) assumptions.
FLCAS = Foreign Language Classroom Anxiety Scale; SLWAI = Second Language Writing Anxiety Inventory.
Proficiency-level effects
As Table 4 indicates, a 2 × 2 ANOVA showed significant group × proficiency interactions for FLCAS Overall (F(1,116) = 6.82, p = .010, η² = .06) and SLWAI Writing (F(1,116) = 8.14, p = .005, η² = .07), but not FLCAS Speaking (F(1,116) = 2.95, p = .089, η² = .03). The non-significant interaction for speaking anxiety may reflect Konkur’s writing-centric focus, which limits speaking practice opportunities (Oskoui et al., 2024). B2 learners (n = 75; experimental: n = 38, control: n = 37) exhibited greater anxiety reduction than B1 learners (n = 45; experimental: n = 22, control: n = 23), particularly in writing (SLWAI η² = .12 for B1 vs. .28 for B2). Figure 3 depicts these results.
Table 4. Anxiety reduction by CEFR proficiency level (N = 120)
| Level | n | FLCAS Overall ΔM (SE) | SLWAI Writing ΔM (SE) | η² [95% CI] |
|---|---|---|---|---|
| B1 | 45 | −0.32 (0.09) | −0.41 (0.10) | .12 [.04, .21] |
| B2 | 75 | −0.48 (0.06) | −0.66 (0.07) | .28 [.17, .40] |

ΔM = mean difference (post-test − pre-test); η² = proportion of variance explained.
Fig. 3 Anxiety reduction by group and proficiency level [image not reproduced]
Research Question 3: How does ICALA influence Iranian university EFL students’ proficiency in speaking and writing assessments?
Quantitative findings
Pre- and post-intervention data were collected from 120 intermediate EFL learners (CEFR B1–B2) using standardized IELTS Speaking and Writing rubrics (British Council, 2023) to evaluate Intelligent Computer-Assisted Language Assessment (ICALA) effects on proficiency. IELTS rubrics were applied by trained raters (κ = .85). The experimental group (n = 60) used ICALA via DeepSeek, while the control group (n = 60) used traditional instructor-led assessments with identical tasks (250-word essays, 2-min oral responses). Paired t-tests assessed within-group changes, independent t-tests compared group differences, and a 2 (Group: Experimental, Control) × 2 (Proficiency: B1, B2) ANOVA examined interactions. Effect sizes (Cohen’s d, η²) and 95% confidence intervals ensured precision. As with RQ1 and RQ2, the B1 subsample’s limited power warrants cautious interpretation of proficiency-level effects.
Proficiency scores by group and skill
Table 5 presents pre- and post-test scores for IELTS Speaking and Writing. The experimental group significantly improved in Speaking (t(59) = 5.88, p < .001, d = 0.76) and Writing (t(59) = 6.62, p < .001, d = 0.85). The control group showed smaller gains in Speaking (t(59) = 2.45, p = .017, d = 0.31) and Writing (t(59) = 3.10, p = .003, d = 0.40). Post-test differences were significant (Speaking: t(118) = 4.75, p < .001, d = 0.87; Writing: t(118) = 5.20, p < .001, d = 0.95).
Table 5. Pre- and post-test proficiency scores by group and skill (N = 120)
| Group | Skill | Pre-Test M (SD) | Post-Test M (SD) | t(59) | p | Cohen’s d [95% CI] |
|---|---|---|---|---|---|---|
| Experimental | IELTS Speaking | 5.10 (0.65) | 5.65 (0.62) | 5.88 | < .001 | 0.76 [0.59, 0.94] |
| Experimental | IELTS Writing | 5.05 (0.70) | 5.70 (0.66) | 6.62 | < .001 | 0.85 [0.67, 1.03] |
| Control | IELTS Speaking | 5.12 (0.67) | 5.32 (0.64) | 2.45 | .017 | 0.31 [0.15, 0.47] |
| Control | IELTS Writing | 5.08 (0.72) | 5.38 (0.69) | 3.10 | .003 | 0.40 [0.23, 0.57] |

IELTS scores range from 1 (low) to 9 (high). All tests met normality (Shapiro–Wilk p > .05) and homogeneity (Levene’s p > .05) assumptions.
Proficiency-level effects
As indicated in Table 6, a 2 × 2 ANOVA revealed significant group × proficiency interactions for IELTS Writing (F(1,116) = 9.35, p = .003, η² = .08) but not Speaking (F(1,116) = 3.20, p = .077, η² = .03). The non-significant interaction for speaking may reflect Konkur’s writing-centric focus and fewer ICALA speaking tasks relative to writing, constrained by curriculum design (Delavari & Talebi, 2024; Oskoui et al., 2024). B2 learners (n = 75; experimental: n = 38, control: n = 37) showed greater writing gains than B1 learners (n = 45; experimental: n = 22, control: n = 23), particularly in the experimental group (η² = .15 for B1 vs. .32 for B2; see Fig. 4).
Table 6. Proficiency gains by CEFR proficiency level (Experimental Group, n = 60)
| Level | n | IELTS Speaking ΔM (SE) | IELTS Writing ΔM (SE) | η² [95% CI] |
|---|---|---|---|---|
| B1 | 22 | 0.42 (0.08) | 0.50 (0.09) | .15 [.06, .25] |
| B2 | 38 | 0.62 (0.06) | 0.78 (0.07) | .32 [.20, .44] |

ΔM = mean difference (post-test − pre-test); η² = proportion of variance explained.
Fig. 4 Proficiency gains by group and proficiency level [image not reproduced]
Regression analysis
Feedback specificity (β = .35, p < .001), task repetition (β = .28, p = .010), and proficiency level (β = .25, p = .017) predicted IELTS Speaking gains; feedback specificity (β = .40, p < .001), AI literacy (β = .30, p = .005), and proficiency level (β = .27, p = .009) predicted IELTS Writing gains (R² = .65 and .70, respectively; see Supplementary Table S3).
Qualitative findings
Thematic analysis of 30 interviews (mean = 63 min, SD = 12.4; 15 per group) and 360 journal entries (240 experimental, 120 control) identified three themes (κ = .82, 95% CI [.76, .88]; α = .89, 95% CI [.85, .93]). Saturation occurred by the 25th interview. Themes reflect Konkur’s high-stakes demands and initial AI skepticism (Ghiasvand et al., 2024), with DeepSeek’s coherence feedback limitations mitigated by instructor oversight (Mirzaeian, 2025). Proficiency gains complemented RQ1’s motivation increases and RQ2’s anxiety reductions, as ICALA’s autonomy-supportive feedback enhanced engagement, reduced evaluation fears, and mediated skill development (Ryan & Deci, 2000). These findings align with Jiang (2022), where AI feedback improved proficiency in Chinese EFL contexts, but highlight Konkur’s unique writing focus.
1. Feedback-Driven Skill Improvement (190/240 experimental journals, 79.2%; 52/60 experimental participants, 86.7%; 75/120 control journals, 62.5%; 34/60 control participants, 56.7%).
DeepSeek’s specific feedback, such as “Replace ‘because’ with ‘since’ for formal tone,” improved writing and speaking skills; instructor feedback, e.g., “Use more formal language,” was less actionable. The control group’s lower prevalence of skill improvement may reflect instructor feedback’s delayed delivery, reducing its impact on skill refinement (Richards & Rodgers, 2014).
Exemplar Quote (Experimental): “AI’s grammar corrections improved my essay from 5.0 to 5.5.” (Participant 45, Journal, Week 10)
Exemplar Quote (Control): “Teacher feedback helped, but it didn’t focus on my weak areas.” (Participant 88, Journal, Week 9)
2. Exam-Aligned Proficiency Gains (165/240 experimental journals, 68.8%; 48/60 experimental participants, 80%; 70/120 control journals, 58.3%; 32/60 control participants, 53.3%)
ICALA’s Konkur-aligned feedback, such as “Use cohesive devices like ‘therefore’,” boosted writing proficiency; instructor feedback, e.g., “Write clearer arguments,” was less exam-specific.
Exemplar Quote (Experimental): “AI’s coherence tips matched Konkur’s scoring, helping my writing.” (Participant 17, Interview)
Exemplar Quote (Control): “Instructor feedback was broad, not Konkur-focused.” (Participant 81, Interview)
3. Proficiency-Dependent Progress (28/30 interviews, 93.3%; 54/60 experimental participants, 90%; 49/60 control participants, 81.7%)
B1 learners struggled with complex feedback (experimental: 63.6%; control: 65.2%), likely due to cognitive overload (Sweller, 2011). In contrast, most B2 learners progressed well (experimental: 81.6%; control: 75.7%). B1 learners’ challenges with DeepSeek’s detailed feedback, such as multiple grammar and coherence suggestions, may reflect high cognitive load; simplifying AI feedback, such as prioritizing one correction per task, could enhance proficiency gains for lower proficiency levels (Sweller, 2011). B2 learners, with stronger linguistic foundations, leveraged specific feedback to improve fluency and exam performance.
Exemplar Quote (Experimental, B1): “AI’s detailed feedback was hard to follow at first.” (Participant 09, Journal, Week 6)
Exemplar Quote (Experimental, B2): “AI’s suggestions sharpened my speaking fluency.” (Participant 29, Interview)
Exemplar Quote (Control, B1): “Teacher’s vague feedback slowed my writing improvement.” (Participant 69, Journal, Week 7)
DeepSeek’s immediate, Konkur-aligned feedback drove larger proficiency gains (d = 0.76–0.85) than the control group’s less granular instructor feedback (d = 0.31–0.40), aligning with CLT’s emphasis on meaningful communication (Richards & Rodgers, 2014). Supplementary Codebook S3 includes coding details.
Discussion
This study investigated the impact of Intelligent Computer-Assisted Language Assessment (ICALA) via DeepSeek on Iranian university EFL students’ motivation, anxiety, and proficiency in a Konkur-driven context, comparing an experimental group (n = 60) using ICALA with a control group (n = 60) receiving traditional instructor-led assessments. While Konkur assesses only receptive skills, ICALA’s writing/speaking feedback improved competencies transferable to exam performance (e.g., grammar accuracy) and academic readiness. The findings across three research questions (RQs) demonstrate ICALA’s superiority in enhancing motivation (RQ1), reducing anxiety (RQ2), and improving proficiency (RQ3), particularly for B2 learners, with implications for Konkur reform and AI-driven EFL assessment in Asia–Pacific and global contexts (Jiang, 2022; Chapelle & Voss, 2021).
RQ1: Motivation
ICALA significantly increased intrinsic and extrinsic motivation (d = 0.52–0.96) compared to the control group’s smaller gains (d = 0.20–0.35), aligning with Self-Determination Theory (SDT; Ryan & Deci, 2000). DeepSeek’s immediate, specific feedback (e.g., “Correct ‘I go to school yesterday’ to ‘I went to school yesterday’”) fostered autonomy and competence, as evidenced by qualitative themes of Competence Development (75.8% of experimental journals) and Exam-Focused Motivation (66.3%). In contrast, the control group’s less granular instructor feedback (e.g., “Improve past tense usage”), often delivered with delays, reduced learners’ sense of agency, limiting motivation (Horwitz et al., 1986). B2 learners benefited most (η² = .34), while B1 learners’ smaller gains (η² = .16) suggest feedback complexity may increase cognitive load (Sweller, 2011), consistent with Jiang’s (2022) findings on AI feedback in Chinese EFL contexts.
Consistent with Ryan and Deci’s (2000) theory, DeepSeek’s feedback fostered autonomy and competence. However, in Iran’s cultural context, this impact was moderated by learners’ skepticism toward technology (Oskoui et al., 2024). While B2 learners embraced AI’s specificity (e.g., grammar corrections), B1 learners’ initial resistance (59.1% frustration) reflected broader tensions with instructor authority. This suggests ICALA’s autonomy-supportive potential requires scaffolding feedback to align with cultural expectations—a nuance SDT must address in high-stakes, instructor-centric environments.
RQ2: Anxiety
ICALA significantly reduced anxiety in speaking and writing assessments (d = −0.45 to −0.87) compared to the control group’s non-significant reductions (d = −0.18 to −0.23). DeepSeek’s non-judgmental feedback (e.g., “Your pronunciation score improved from 4.2 to 4.8”) lowered evaluation pressure (72.9% of experimental journals) and boosted confidence (67.5%), supporting SDT’s autonomy-supportive environments (Ryan & Deci, 2000). The control group’s instructor feedback (e.g., “Speak more clearly”), perceived as subjective, heightened scrutiny fears, explaining the lower prevalence of Reduced Evaluation Pressure (42.5%) (Horwitz et al., 1986). B2 learners showed greater anxiety reduction (η² = .28), while B1 learners’ persistent anxiety (η² = .12) may reflect cognitive overload from complex feedback (Sweller, 2011), underscoring the need for simplified AI feedback.
RQ3: Proficiency
ICALA's proficiency gains (d = 0.76–0.85) surpassed the control group’s (d = 0.31–0.40). These results support Richards and Rodgers’ (2014) CLT framework, which prioritizes meaningful communication. DeepSeek’s Konkur-aligned feedback (e.g., “Use cohesive devices like ‘therefore’”) enhanced writing skills (79.2% of experimental journals), particularly for B2 learners (η² = .32), while the control group’s less exam-specific feedback (e.g., “Write clearer arguments”) limited progress (58.3%). The non-significant speaking interaction (η² = .03) reflects Konkur’s writing-centric curriculum and fewer speaking tasks; ICALA’s speaking feedback, constrained by fewer task iterations, would require adaptive algorithms for pronunciation and fluency (Delavari & Talebi, 2024; Liu & Wang, 2024; Oskoui et al., 2024). B1 learners struggled with detailed feedback, suggesting cognitive overload (Sweller, 2011); simplifying feedback—perhaps one grammar and one coherence suggestion per task—might serve these learners better.
Interplay of motivation, anxiety, and proficiency
The findings reveal a synergistic interplay among motivation, anxiety, and proficiency, mediated by ICALA’s autonomy-supportive feedback (Ryan & Deci, 2000). This triad aligns with SDT’s core needs: motivation gains (competence via granular corrections), anxiety reduction (autonomy through self-paced practice), and proficiency development (relatedness to learning goals). Increased motivation (RQ1) and reduced anxiety (RQ2) facilitated proficiency gains (RQ3), as DeepSeek’s specific, immediate feedback enhanced engagement, lowered evaluation fears, and supported skill development. For example, B2 learners’ motivation gains (d = 0.79–0.96) and anxiety reductions (d = −0.60 to −0.87) correlated with larger writing proficiency improvements (d = 0.85), directly exemplifying SDT’s postulate that satisfaction of psychological needs enhances performance. The control group’s less actionable feedback constrained this interplay, yielding smaller effects across all outcomes.
Comparative lessons from other writing-focused exams highlight ICALA’s scalability. For instance, Japan’s juken preparatory schools have successfully integrated AI writing tutors despite the Center Test’s silence on speaking (Busso & Sanchez, 2024), suggesting ICALA could similarly augment Konkur prep. However, Iran’s lack of AI policy frameworks contrasts with China’s top-down Gaokao reforms (Xi, 2023), underscoring the need for institutional buy-in. Future collaborations with regional partners (e.g., Bangladesh’s nascent AI efforts; Uddin et al., 2024) could accelerate context-sensitive solutions. Beyond exam preparation, these findings suggest ICALA's potential to transform broader EFL education through: (1) curriculum redesign emphasizing writing-as-process with AI formative feedback, (2) flipped classroom models where AI handles routine corrections, freeing teachers for higher-order instruction, and (3) longitudinal assessment systems tracking skill development across university programs.
Implications for practice and policy
Three stakeholder groups should consider these evidence-based recommendations:
For learners:
Provide simplified AI feedback (1 grammar + 1 coherence suggestion per task) to B1 learners to reduce cognitive overload (per d = 0.52–0.76 for B1 vs. B2 gains).
Use Konkur-aligned writing prompts (e.g., grammar-error detection) to bridge exam-academia gaps, as demonstrated by experimental group gains (d = 0.85).
For educators:
Implement teacher training programs to contextualize AI feedback (e.g., distinguishing L1 transfer from errors), addressing skepticism reported by 59.1% of B1 learners.
Combine AI and instructor feedback (e.g., AI for grammar, teachers for discourse coherence), as qualitative themes (79.2% Feedback-Driven Improvement) suggest.
For policymakers:
Develop offline ICALA modules for rural areas (cf. Bangladesh’s model; Uddin et al., 2024), given Iran’s infrastructural disparities (Oskoui et al., 2024).
Audit AI tools for Persian-English bias (e.g., SOV word order penalties) through partnerships with linguists, mirroring China’s Gaokao reforms (Xi, 2023).
Establish institutional frameworks for AI-enhanced assessment literacy, preparing educators for technology-integrated classrooms (cf. Farhat & Ouchouid, 2025).
Limitations and future research
The B1 subsample’s low power (n = 45, power = 0.65–0.68) limits proficiency-level comparisons, warranting future studies with larger B1 samples (n ≥ 60). Konkur’s writing-centric focus constrained speaking gains, suggesting research with balanced speaking and writing tasks (Delavari & Talebi, 2024). DeepSeek’s coherence feedback limitations, mitigated by instructor oversight (Mirzaeian, 2025), indicate a need for advanced AI models to enhance feedback accuracy.
Three prioritized directions for future research emerge from this study:
Longitudinal Konkur Performance: Track ICALA’s sustained impact by comparing Konkur writing/speaking scores over 2–3 years, controlling for instructor variability and AI tool updates (e.g., as done in China’s Gaokao reforms; Jiang, 2022).
Cognitive Load Optimization: Test simplified feedback models for B1 learners (e.g., 1 grammar + 1 coherence suggestion per task) through a 3-phase design-based research (DBR) approach: (a) iterative co-design with instructors to prototype feedback templates, (b) implementation across multiple proficiency-level cohorts, and (c) evaluation using multimodal data (eye-tracking for cognitive load, think-aloud protocols for usability) to reduce overwhelm (Sweller, 2011; Anderson & Shattuck, 2012).
Cross-Context Validation: Replicate this study in rural Iranian universities and comparable writing-centric systems (e.g., Japan's Center Test) with stratified sampling by AI literacy levels (assessed via pre-study surveys) and socioeconomic status (using standardized income/access metrics) to assess ICALA's scalability in low-resource settings.
Additionally, interdisciplinary collaborations with computational linguists could address Persian-specific AI biases (e.g., verb-tense errors) through Farsi-English parallel dataset development while controlling for technological access disparities across urban/rural divides (Oskoui et al., 2024).
Ethical considerations for ICALA in Iran
The integration of ICALA in Iran raises three critical ethical dilemmas. First, algorithmic bias persists in DeepSeek’s training data, which prioritizes Standard American English. This risks penalizing Persian-influenced syntax (e.g., SOV word order) and pronunciation (e.g., vowel length distinctions), disadvantaging rural learners with stronger L1 transfer (Mirzaeian, 2025). Second, Iran’s lack of comprehensive data privacy laws—coupled with U.S. sanctions on AI tools—necessitates local hosting (Oskoui et al., 2024), which may limit feedback quality due to smaller training datasets. Third, urban-rural disparities in internet access could exacerbate inequities if ICALA becomes mandatory for Konkur prep, highlighting a common ethical dilemma in AI-edtech integration where technology can widen existing gaps (Li, 2023). Mitigating these risks requires:
Collaborative development of Farsi-English parallel corpora to reduce bias.
Teacher training programs to contextualize AI feedback (e.g., distinguishing L1 transfer from errors).
Offline ICALA modules for underserved regions, as piloted in Bangladesh (Uddin et al., 2024).
In sum, ICALA via DeepSeek significantly outperforms traditional instructor-led assessments in enhancing motivation, reducing anxiety, and improving proficiency among Iranian EFL learners, particularly B2 learners, in a Konkur-driven context. By fostering autonomy and competence (Ryan & Deci, 2000), ICALA offers a scalable solution for EFL assessment reform, with tailored feedback design critical for B1 learners. These findings contribute to Asia–Pacific and global AI assessment research, underscoring ICALA’s potential to transform high-stakes EFL education.
Conclusion
This study demonstrated that Intelligent Computer-Assisted Language Assessment (ICALA) via DeepSeek significantly enhances motivation, reduces anxiety, and improves proficiency among Iranian university EFL learners in a Konkur-driven context, outperforming traditional instructor-led assessments. The experimental group (n = 60), using ICALA, achieved substantial motivation gains (d = 0.52–0.96), anxiety reductions (d = −0.45 to −0.87), and proficiency improvements (d = 0.76–0.85), particularly for B2 learners, compared to the control group’s (n = 60) smaller outcomes (d = 0.20–0.40). The AI system’s immediate, specific feedback (e.g., “Use cohesive devices like ‘therefore’”) fostered autonomy and competence, as evidenced by qualitative themes such as Competence Development (75.8%), Reduced Evaluation Pressure (72.9%), and Feedback-Driven Skill Improvement (79.2%), aligning with Self-Determination Theory (Ryan & Deci, 2000). In contrast, the control group’s delayed, less granular instructor feedback (e.g., “Write clearer arguments”), which constrained learner agency and progress, highlighted ICALA’s advantages.
B1 learners’ challenges with feedback complexity underscore the need for simplified AI feedback, such as one grammar correction and one coherence tip per task, to reduce cognitive load (Sweller, 2011). These findings extend Asia–Pacific (Jiang, 2022) and global AI assessment trends (Chapelle & Voss, 2021), offering scalable solutions for diverse high-stakes EFL contexts like Konkur. ICALA complements Konkur preparation by targeting skills the exam overlooks (e.g., writing fluency), while reinforcing tested areas (e.g., grammar) through productive practice. However, limitations include the B1 subsample’s low power (n = 45) and Konkur’s writing-centric focus. To maximize ICALA’s potential, future work must address B1 learners’ cognitive load through adaptive feedback designs and validate findings in rural contexts where AI infrastructure is limited. Cross-national collaborations—such as benchmarking Iran’s ICALA model against Japan’s juken schools or Bangladesh’s AI literacy initiatives (Uddin et al., 2024)—could accelerate equitable solutions for high-stakes exam systems globally.
Clinical trial number
Not applicable.
Author’s contributions
Z.M. conducted all phases of the study, including conceptualization, data collection, analysis, figure preparation, and manuscript writing.
Funding
This research was funded by Semnan University (research grant No. 14041138).
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
This study was conducted in accordance with the ethical standards of the Declaration of Helsinki and was approved by the Institutional Review Board (IRB) of Tehran University. Written informed consent was obtained from all participants before their involvement in the study. Participants were informed of their right to withdraw at any time without consequences.
Consent for publication
All participants provided consent for their anonymized data to be published in this study.
Competing interests
The author declares no competing interests.
Abbreviations
EAP: English for Academic Purposes
EFL: English as a Foreign Language
TDC: Teacher Development Courses
AI: Artificial Intelligence
AES: Automated Evaluation System
AMTB: Attitude/Motivation Test Battery
ANOVA: Analysis of Variance
B1: B1 Level (Intermediate)
B2: B2 Level (Upper Intermediate)
CEFR: Common European Framework of Reference for Languages
CET: Common English Test (Saudi Arabia)
CLT: Communicative Language Teaching
DBR: Design-Based Research
FLCAS: Foreign Language Classroom Anxiety Scale
Gaokao: National Higher Education Entrance Examination (China)
ICALA: Intelligent Computer-Assisted Language Assessment
IELTS: International English Language Testing System
IMI: Intrinsic Motivation Inventory
IRB: Institutional Review Board
Konkur: Iranian University Entrance Examination
LLMQ: Language Learning Motivation Questionnaire
MC: Multiple Choice
OPT: Oxford Placement Test
SDT: Self-Determination Theory
SLWAI: Second Language Writing Anxiety Inventory
TOEFL: Test of English as a Foreign Language
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Abdulhussein Dakhil, T., Karimi, F., Al-Jashami, R. A. U., & Ghabanchi, Z. (2025). The effect of artificial intelligence (AI)-mediated speaking assessment on speaking performance and willingness to communicate of Iraqi EFL learners. International Journal of Language Testing, 15.
Allan, D. (2004). Oxford placement test. Oxford: Oxford University Press.
Alkhateeb, A., Hezam, A. M. M., & Almuraikhi, A. A. (2025). Assessing the use of AI tools for EFL exam preparation at Saudi universities: Efficiency, benefits, and challenges. Cogent Education, 12.
Alrabai, F. (2014). Reducing language anxiety & promoting learner motivation: A practical guide for teachers of English as a foreign language. Lulu.com.
Amin, M. Y. M. (2023). AI and ChatGPT in language teaching: Enhancing EFL classroom support and transforming assessment techniques. International Journal of Higher Education Pedagogies, 4.
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3.
British Council. (2023). IELTS Speaking Band Description. https://takeielts.britishcouncil.org/sites/default/files/ielts_speaking_band_descriptors.pdf
Busso, A., & Sanchez, B. (2024). Advancing communicative competence in the digital age: A case for AI tools in Japanese EFL education. Technology in Language Teaching & Learning, 6.
Chapelle, C. A., & Voss, E. (Eds.). (2021). Validity argument in language testing: Case studies of validation research. Cambridge University Press.
Cheng, Y. S. (2004). A measure of second language writing anxiety: Scale development and preliminary validation. Journal of Second Language Writing, 13.
Cohen, J. (2013). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge.
Delavari, H., & Talebi, M. E. (2024). The relationship between AI language tool usage and EFL competence: A correlational study of high school students in Yazd, Iran. Research in English and Education, 9.
Dörnyei, Z. (2014). The psychology of the language learner: Individual differences in second language acquisition (2nd ed.). Routledge.
Ebadi, S., Gholami, M., & Vakili, S. (2023). Investigating the effects of using Grammarly in EFL writing: The case of articles. Computers in the Schools, 40.
Farhat, M., & Ouchouid, J. (2025). An empirical assessment of Moroccan EFL teachers' use of generative AI for EFL formal assessment: An intervention study. Social Sciences, 14(1), 8–21. https://doi.org/10.11648/5.55.20251401.12
Ghiasvand, F., Kogani, M., & Alipoor, A. (2024). "I'm not ready for this metamorphosis": An ecological approach to Iranian and Italian EFL teachers' readiness for artificial intelligence-mediated instruction. Teaching English with Technology, 24(3), 19–40. https://doi.org/10.36297/jwad6841/JRFYD1037J9K10001
Heathco, G. J. (2023). High stakes assessment preparation experiences in South Korea: An interpretive phenomenological study [Doctoral dissertation, The University of West Florida].
Horwitz, E. K., Horwitz, M. B., & Cope, J. (1986). Foreign language classroom anxiety. The Modern Language Journal, 70.
Huang, Y. (2024). The examination-driven education and the examination habitus. In The Bloomsbury handbook of Bourdieu and educational research (p. 89).
Jiang, R. (2022). How does artificial intelligence empower EFL teaching and learning nowadays? A review on artificial intelligence in the EFL context. Frontiers in Psychology, 13, 1049401. https://dx.doi.org/10.3389/fpsyg.2022.1049401
Jin, Y., & Fan, J. (2023). Test-taker engagement in AI technology-mediated language assessment. Language Assessment Quarterly, 20.
Kirkpatrick, A. (2016). The learning and teaching of English as an international language in Asia-Pacific universities: Issues and challenges. In C.-h. C. Ng, B. Fox, & M. Nakano (Eds.), Reforming learning and teaching in Asia-Pacific universities (pp. 267–287). Springer. https://doi.org/10.1007/978-981-10-0431-5_16
Li, H. (2023). AI in education: Bridging the divide or widening the gap? Exploring equity, opportunities, and challenges in the digital age. Advances in Education, Humanities and Social Science Research, 8.
Liu, W., & Wang, Y. (2024). The effects of using AI tools on critical thinking in English literature classes among EFL learners: An intervention study. European Journal of Education, 59.
Makhlouf, M. K. I. (2021). Effect of artificial intelligence-based application on Saudi preparatory-year students' EFL speaking skills at Albaha University. International Journal of English Language Education, 9.
Mirsanjari, Z. (2025). Enhancing assessment literacy in EAP instruction: The role of teacher development courses in overcoming systemic barriers. Language Testing in Asia, 15, 30. https://dx.doi.org/10.1186/s40468-025-00368-7
Mirzaeian, V. R. (2025). A comparative evaluation of artificial intelligence scoring versus human scoring of EFL students' essays. Teaching English as a Second Language Quarterly (TESLQ), 44(1), 97–117. https://doi.org/10.22099/tesl.2025.50852.3313
Nunan, D. (2003). The impact of English as a global language on educational policies and practices in the Asia-Pacific region. TESOL Quarterly, 37(4), 589–613. https://doi.org/10.2307/3588214
Oskoui, K., Mirzaeian, V. R., & Nafissi, Z. (2024). AI-assisted EAP testing: A case of academic IELTS writing by Iranian EFL learners. Journal of English Language Teaching and Learning, 16(34), 307–330. https://doi.org/10.22034/elt.2024.63345.2691
Parviz, M. (2024). "The double-edged sword": AI integration in English language education from the perspectives of Iranian EFL instructors. Complutense Journal of English Studies. https://dx.doi.org/10.5209/cjes.97261
Richards, J. C., & Rodgers, T. S. (2014). Approaches and methods in language teaching (3rd ed.). Cambridge University Press.
Ryan, R. M., & Deci, E. L. (2000). Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. American Psychologist, 55.
Sweller, J. (2011). Cognitive load theory. In J. P. Mestre & B. H. Ross (Eds.), Psychology of learning and motivation (Vol. 55, pp. 37–76). Academic Press. https://doi.org/10.1016/B978-0-12-387691-1.00002-8
Teimourtash, M. (2024). On the plausibility of integrating synthetic vs. analytic artificial intelligence (AI)-powered academic writing tasks into Iranian EFL classrooms: State-of-the-art. Practical and Pedagogical Issues in English Education, 2.
Uddin, M. S., Islam, M. N., Nirjon, M. I. H., Hilaly, M. R., Mazed, M. F. H., & Hasan, M. M. (2024). University EFL teachers' perceptions about the effectiveness of AI-enhanced e-assessments in Bangladesh: A phenomenological study. Bulletin of Advanced English Studies, 9.
Voss, E., Cushing, S. T., Ockey, G. J., & Yan, X. (2023). The use of assistive technologies including generative AI by test takers in language assessment: A debate of theory and practice. Language Assessment Quarterly, 20.
Voss, E. (2024). Language assessment and artificial intelligence. In The concise companion to language assessment (pp. 112–125). Routledge.
Xi, X. (2023). Advancing language assessment with AI and ML: Leaning into AI is inevitable, but can theory keep up? Language Assessment Quarterly, 20.
Xin, Q., Alibakhshi, G., & Javaheri, R. (2025). A phenomenographic study on Chinese EFL teachers' cognitions of positive and negative educational, social, and psychological consequences of high-stake tests. Scientific Reports, 15.
Zaiarna, I., Zhyhadlo, O., & Dumaevska, O. (2024). ChatGPT in foreign language teaching and assessment: Exploring EFL instructors' experience. Information Technologies and Learning Tools, 102(4), 176–191. https://doi.org/10.33407/itlt.v102i4.5716
© The Author(s) 2025. This work is published under the Creative Commons Attribution 4.0 License (http://creativecommons.org/licenses/by/4.0/).