Abstract
International large-scale assessments excel at robust cross-national comparisons, and detailed insight into students' learning progressions emerges when they are paired with complementary diagnostic approaches. This study employed an AI-enhanced approach, the Collective Intelligence Model for Education (CIME), to transform item-level responses from students in Ireland who participated in the PISA 2022 mathematics assessment into multidimensional diagnostic profiles. The profiles highlighted pronounced strengths in chance and probability and identified areas for continued development in geometric relationships and measurement. Students demonstrated proficiency in working with data and established mathematical representations, and showed developing proficiency in devising solution strategies, formalising complex situations, and applying mathematical models to structure and analyse real-world contexts. The findings indicate that AI-enhanced profiling yields fine-grained, policy-relevant diagnostics that complement headline scores and inform curriculum planning and system-level improvement.
1 Beyond traditional assessment approaches
International large-scale assessments: from comparison to diagnosis
International large-scale assessments (ILSAs), most prominently the OECD's Programme for International Student Assessment (PISA) and the IEA's Trends in International Mathematics and Science Study (TIMSS), have become influential instruments in policymaking and international benchmarks of educational quality. Their comparative indicators, such as mathematics scores, are widely disseminated and are often treated by policymakers as evidence of system performance and a basis for developing reform agendas.
ILSAs publish not only reports on performance but also public-use files (PUFs) containing item-level responses, background questionnaire data (OECD, n.d.[1]), calibrated item parameters, and methodological documentation (OECD, 2017[2]; OECD, 2023[3]). These resources enable researchers to investigate relationships between performance and contextual factors in depth. The principal focuses of published work include inequalities associated with socioeconomic status (SES), immigrant status and language background, gender, and school-level factors (Hopfenbeck et al., 2018[4]).
Governments frequently use these data and reports to develop strategies aimed at improving effectiveness and equity, and the OECD's dissemination of PISA results has contributed to a transnational policy space in which data serve as a common reference point for reform (Grek, 2009[5]; Martens and Niemann, 2013[6]). However, headline scores are rarely sufficient on their own. Knowing whether a country/economy sits above or below the international mean may satisfy accountability demands, yet it sheds little light on what learners find difficult or why particular misconceptions persist (Fullan, 2016[7]).
Reliance on broad composite scores risks reducing complex proficiencies and educational processes to simplified rankings. This reduction can narrow policy debate and obscure diagnostic information about learning progressions (Hopfenbeck et al., 2018[4]). Aggregate domain scores, on their own, offer limited insight into learning trajectories.
Access to assessment datasets extends the utility of ILSAs beyond comparisons of scores. In addition to the correlational analyses that dominate current research, close analysis of learners' responses can yield granular diagnostic information. Realising this potential requires analytic approaches capable of extracting multidimensional evidence from raw responses; such approaches were largely impracticable before the advent of large language models (LLMs) because of constraints on human resources (Okubo, 2025[8]). Fine-grained learner diagnostics enable policymakers to move from descriptive cross-country comparisons to analyses that identify learners' strengths and weaknesses.
From composite scores to multidimensional diagnostics
Traditional educational assessments treat each learner's response as a dichotomous (correct or incorrect) or polytomous (partial credit) outcome along a single proficiency dimension. This positions learners on a unidimensional proficiency continuum for a broad construct, such as scientific literacy. Open-ended responses are judged by human raters, whereas multiple-choice and short-answer items are scored automatically. In both cases, scoring rubrics are designed to align with a pre-specified assessment framework (Jolly and Dalton, 2018[9]), and the resulting score is the sole evidence used to infer proficiency. This practice imposes a technical constraint: once a response is reduced to a single categorical decision, further substantive information cannot be recovered.
In ILSAs, assessment items are categorised along several dimensions. For example, in PISA 2022 mathematics, each item was assigned to a category within each of three dimensions: cognitive process, context, and content (OECD, 2023[3]; OECD, 2024[10]). Using these classifications, subdomain scores were estimated alongside the overall mathematics score. At the country or economy level, PISA 2022 reported both the overall mathematics score and the subdomain scores.
If the aim is to obtain richer diagnostic insight rather than a unidimensional estimate of proficiency at the domain level, it is essential to extract the structural information embedded in assessment items and learners' responses. In practice, assessors exercise nuanced judgement when evaluating performance: partial solutions, misconceptions, solution paths, and linguistic cues all inform their judgements (Eva, 2005[11]). In everyday teaching, teachers diagnose learners by drawing on all available evidence from the classroom. Replicating this level of diagnostic detail at scale has, until recently, been infeasible.
The advent of LLMs is beginning to reshape this landscape. Empirical studies have demonstrated the feasibility of LLM-based automated scoring of open-ended responses (Okubo et al., 2023[12]; Lee et al., 2024[13]). Beyond automated scoring, emerging research suggests the possibility of more detailed, multidimensional diagnostic inference. The Collective Intelligence Model for Education (CIME; Okubo, 2025[8]) exemplifies this direction. CIME leverages the representational capacity of LLMs to infer multiple latent traits from a learner response, providing greater analytic granularity than conventional item scoring.
Applying CIME to PISA 2022 mathematics data
This study demonstrates how LLM-enhanced analytics can augment the diagnostic value of ILSA datasets, thereby supporting policy measures such as curriculum design. Headline scores at the domain level provide robust, internationally comparable benchmarks; however, they often lack the conceptual and developmental detail needed to reveal learning trajectories. To address this gap, we apply CIME (Okubo, 2025[8]) to the PISA 2022 mathematics dataset for Ireland. Prior research has established the feasibility of LLM-based automated scoring of open-ended responses (Okubo et al., 2023[12]; Lee et al., 2024[13]), creating the conditions for a shift from scoring to multidimensional diagnostic inference.
CIME is adapted to item-level response data to generate multifaceted, multidimensional proficiency profiles that complement the main and subdomain proficiency metrics published by the OECD (OECD, 2019[14]; OECD, 2023[15]). Construct scores are estimated for each cognitive facet (i.e., cognitive operator) specified in the PISA 2022 assessment and analytical framework (OECD, 2023[3]) and are interpreted through established learning-progression lenses. The resulting evidence is intended to inform curricular decisions while remaining compatible with ILSA psychometric standards.
The empirical scope is limited to the 5 569 fifteen-year-old learners sampled in Ireland for PISA 2022, including English-medium (N = 5 460) and Irish-medium (N = 109) cohorts. Although the present analysis focuses on a single cycle, domain, and jurisdiction, the approach is designed to be transferable to other assessments and contexts.
2 Overview of Ireland's mathematics performance in PISA 2022
PISA 2022 mathematics: assessment overview
In PISA 2022, mathematics is the major domain. The PISA 2022 mathematics assessment and analytical framework defines mathematical literacy as an individual's capacity to reason mathematically and to formulate, employ, and interpret mathematics to solve problems in a variety of real-world contexts (OECD, 2023[3]). Mathematical reasoning includes both deductive and statistical (inductive) forms, and computational thinking is recognised as part of problem-solving practice. The modelling cycle in mathematical literacy presented in earlier frameworks (OECD, 2013[16]) is retained and operationalised through three processes: formulating situations mathematically; employing mathematical concepts, facts and procedures; and interpreting, applying and evaluating mathematical outcomes.
The domain is organised into three interrelated aspects (Table 1): (i) Mathematical Reasoning and the three processes just noted; (ii) Content Area, grouped into four categories: Change and Relationships, Space and Shape, Quantity, and Uncertainty and Data; and (iii) Context (Personal, Occupational, Societal, Scientific), together with a selected set of 21st century skills that support and are developed by mathematical literacy. The context categories guide item development rather than reporting; therefore, PISA 2022 (OECD, 2023[15]) does not report results by Context.
PISA targets balanced coverage: approximately 25% of score points for Mathematical Reasoning, approximately 25% for each of the three cognitive processes, and approximately 25% for each content area. Items are set in real-world contexts and delivered primarily through computer-based assessment of mathematics (CBAM), enabling simulations, spreadsheet-like interactions, and richer response formats while keeping mathematical demand central. Item formats include open constructed-response, closed constructed-response, and selected-response; partial credit is used where appropriate.
Assessment framework for mathematics: content topics and cognitive operators
In the PISA 2022 mathematics assessment and analytical framework, items are classified along two axes relevant to this study. First, each item is assigned to one of four content categories: Change and Relationships, Space and Shape, Quantity, and Uncertainty and Data. These categories are specified at a finer grain by content topics. Second, each item is assigned either to Mathematical Reasoning or to one of the three problem-solving processes: Formulating Situations Mathematically (Formulating), Employing Mathematical Concepts, Facts and Procedures (Employing), and Interpreting, Applying and Evaluating Mathematical Outcomes (Interpreting). For each cognitive process, the framework specifies expected student actions, and for Mathematical Reasoning, it sets out the conceptual foundations that enable reasoning. In this study, we adopt these descriptors verbatim as cognitive operators.
For content topics, the framework provides an illustrative (not exhaustive) set of 19 topics appropriate for assessing 15-year-olds. These include Functions, Algebraic Expressions, Equations and Inequalities, Coordinate Systems, Relationships within and among Geometrical Objects in Two and Three Dimensions, Measurement, Numbers and Units, Arithmetic Operations, Percents, Ratios and Proportions, Counting Principles, Estimation, Data Collection, Representation and Interpretation, Data Variability and its Description, Samples and Sampling, and Chance and Probability. In addition, four topical emphases connect mathematics to contemporary applications: Growth Phenomena (Change and Relationships), Geometric Approximation (Space and Shape), Computer Simulations (Quantity), and Conditional Decision Making (Uncertainty and Data). Table 2 reproduces the framework's topics list. For cognitive operators, the framework details the actions that learners are expected to carry out in mathematical reasoning and in each process: Formulating, Employing, and Interpreting. Table 3 to Table 6 provide the full cognitive operator sets by Cognitive Process.
PISA 2022 mathematics results for Ireland
PISA 2022 Results (OECD, 2023[15]) reports system-level means for the overall mathematics scale, together with described subscales for Mathematical Reasoning and the three problem-solving processes, and summary results for the four content categories. Table 7 presents the Ireland and OECD means, with standard errors in parentheses.
Ireland's mean on the overall mathematics scale is 491.6 points, 19.2 points above the OECD mean of 472.4. Process-score differences are consistent. Ireland records 486.8 on Formulating, 493.6 on Employing, 494.9 on Interpreting, and 489.8 on Mathematical Reasoning, exceeding the corresponding OECD means (which range from approximately 469 to 474) in each case. The largest margin appears on Employing, while the smallest is on Mathematical Reasoning.
A similar pattern is evident for the content categories. Ireland attains 493.6 on Quantity and 491.6 on Change and Relationships, both about 20 points above the OECD means (472.4 and 469.8, respectively). The most pronounced advantage is in Uncertainty and Data (498.6 versus 473.7; +24.9). By contrast, Space and Shape (474.5) is only slightly above the OECD mean, and the difference is not statistically significant given the reported standard errors.
Taken together, these results indicate that learners in Ireland perform above the OECD benchmark overall, across the reasoning and process subscales, and in three of the four content areas. All comparisons should be interpreted with reference to the standard errors reported in Table 7.
3 Analytical methods
Development of multidimensional item-facet rubrics using CIME
The CIME procedure (Okubo, 2025[8]) begins by deconstructing each cognitive item and developing item-facet-specific scoring rubrics. An expert panel, supported by LLMs, identifies the relevant content categories and content topics, the cognitive operators to be evidenced, and any representational demands of a complete solution. On this basis, the panel drafts rubrics that specify, in precise language, the observable evidence in raw item responses required to infer mastery of each targeted construct. Each item is mapped to several rubrics, each corresponding to a distinct facet of performance. The rubrics are designed to detect the presence, and where appropriate the strength, of the underlying skill or concept rather than to deliver a single judgement of correctness.
Provisional rubrics undergo iterative validation to ensure coverage of the full spectrum of learner responses. Using authentic responses, supplemented by targeted synthetic responses to probe rare strategies and misconceptions, expert raters, assisted by LLMs, compare observed response patterns with rubric criteria. Any mismatch prompts focused revision. Through successive cycles, the rubrics are refined until they accommodate the strategies, misconceptions, and partial solutions present in the data.
Once content validity (Messick, 1989[17]; Messick, 1996[18]) is established, the full response set is machine-scored by LLMs using the current rubric version. Human raters then audit a stratified sample. Any systematic divergence between machine and human judgements triggers another revision cycle that fine-tunes descriptors and decision rules. This multiple-coding procedure supports face validity and improves the reliability of machine-scoring.
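To make the machine-scoring step concrete, the following sketch shows how a single item-facet rubric might be applied to one response. It is a minimal illustration only: the `call_llm` function is a hypothetical placeholder for a real model call, and the item and rubric are invented for demonstration rather than drawn from the PISA item pool.

```python
# Minimal sketch of rubric-based machine scoring with a generic text-in, text-out
# LLM client. `call_llm` is a hypothetical stand-in, not the CIME implementation
# or any specific vendor API; the rubric and item are illustrative.

RUBRIC = """Facet: recognising mathematical structure (scored 0/1/2).
2 = the response identifies the proportional relationship and applies it correctly.
1 = the response refers to the relationship but applies it incompletely.
0 = the response shows no evidence of the relationship."""


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; in practice this would invoke a real model
    (low temperature, fixed seed where available) and return its text output."""
    return "1"  # placeholder so the sketch runs end to end


def score_response(item_stem: str, student_response: str) -> int:
    prompt = (
        f"Item:\n{item_stem}\n\n"
        f"Student response:\n{student_response}\n\n"
        f"Score the response against this rubric and reply with 0, 1 or 2 only.\n{RUBRIC}"
    )
    return int(call_llm(prompt).strip())


print(score_response("A recipe uses 3 eggs to make 12 pancakes. How many eggs are needed for 36 pancakes?",
                     "Three times as many pancakes, so 9 eggs."))
```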
Psychometric quality is then examined. Correlational structure is evaluated using confirmatory factor analysis; rubrics must show substantively meaningful and statistically robust loadings on their designated latent traits to be retained. Measures that fail to meet prespecified criteria are revised or discarded. In this study, item-facet rubrics with negative factor loadings were excluded from the analysis. Concurrent validity is assessed by correlating rubric-based scores with recognised external benchmarks, such as the overall mathematics literacy scale. Only constructs that demonstrate adequate reliability and convergent evidence of validity advance to substantive analyses.
The analysis then tests construct validity at scale level. Multidimensional confirmatory factor analyses are applied to the rubric-based scores to evaluate the intended latent architecture. Convergent validity is indicated by strong factor loadings for rubrics designed to tap the same construct and adequate composite reliability. Discriminant validity is indicated by low cross-loadings and modest inter-factor correlations among theoretically distinct skills. The rubric set is accepted as a coherent and defensible representation of the targeted abilities only when both conditions are satisfied.
This cycle is repeated until the rubrics for every item converge, meaning that successive rounds of expert and LLM review yield stable criteria and psychometric indices that no longer require substantive modification.
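The scale-level checks described above can be illustrated with a minimal sketch. The code below computes composite reliability (McDonald's omega) from standardised factor loadings for the rubrics mapped to one construct and applies a simple discriminant-validity screen to the inter-factor correlations; all numerical values are illustrative rather than estimates from the Irish data, and the 0.85 cut-off is a common rule of thumb, not a CIME specification.

```python
import numpy as np

# Composite reliability (McDonald's omega) from standardised loadings of the
# rubrics designed to tap one construct; values are illustrative only.
def composite_reliability(loadings):
    loadings = np.asarray(loadings, dtype=float)
    uniqueness = 1.0 - loadings ** 2          # standardised solution assumed
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + uniqueness.sum())

facet_loadings = [0.71, 0.64, 0.58, 0.69]     # rubrics mapped to one construct
print(f"omega = {composite_reliability(facet_loadings):.2f}")

# Discriminant-validity screen: inter-factor correlations should remain modest.
phi = np.array([[1.00, 0.42, 0.37],
                [0.42, 1.00, 0.51],
                [0.37, 0.51, 1.00]])
upper = phi[np.triu_indices_from(phi, k=1)]
print("discriminant validity ok:", bool(np.all(np.abs(upper) < 0.85)))
```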
Selection and analysis of constructs
In PISA 2022, 234 mathematics items were used to assess mathematical literacy. One item was excluded from the analysis in Ireland (OECD, 2023[15]). Detailed information on item properties and parameters is available on the OECD website (OECD, n.d.[1]).
The PISA 2022 mathematics assessment framework defines 19 topics under the content topic category (Table 2), together with 12 cognitive operators in Employing (Table 3), 12 in Formulating (Table 4), 9 in Interpreting (Table 5), and 6 in Mathematical Reasoning (Table 6). In this study, constructs measured by fewer than 12 items or represented by fewer than 3 000 responses are excluded from further analysis to ensure sufficient validity of the target measures.
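This selection rule can be expressed as a simple filter over per-construct counts, as in the sketch below; the constructs and counts shown are invented purely to illustrate the thresholds.

```python
import pandas as pd

# Illustrative construct-selection filter: keep constructs with at least 12 items
# and at least 3 000 scored responses. The counts below are invented examples.
construct_counts = pd.DataFrame({
    "construct":   ["Chance and Probability", "Computer Simulations", "Measurement"],
    "n_items":     [28, 7, 19],
    "n_responses": [41000, 2100, 26000],
})
retained = construct_counts.query("n_items >= 12 and n_responses >= 3000")
print(retained["construct"].tolist())   # ['Chance and Probability', 'Measurement']
```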
Following the selection of the target constructs, 13 of the 19 content topics, 8 of the 12 cognitive operators in Employing, 6 of the 12 in Formulating, 6 of the 9 in Interpreting, and 6 of the 6 in Mathematical Reasoning are retained for further analysis. These constructs are assessed using the scored data sets generated from the rubrics.
The items, the associated student responses, and the scoring rubrics for multiple constructs per item are used to estimate the country profile at the population level. In total, 1 827 new scoring rubrics were developed for this study, compared with the 233 rubrics used in the conventional PISA 2022 mathematics scoring scheme. Thus, the CIME approach captures almost eight times as much diagnostic information as the standard methodology, with a correspondingly larger number of rubrics per item.
Scoring responses across multiple dimensions
Using the scoring rubrics, and focusing on the constructs retained in the preceding subsection, each student response was scored on R occasions, each using a distinct combination of LLMs and parameters. For each construct, the resulting data array had dimensions N × J × R, where N is the number of responses, J the number of items, and R the number of scoring models; R is set to 4 in this study. Scores were recorded on a three-level ordinal scale: 0 = not satisfactory, 1 = partially achieved, and 2 = fully achieved.
For each construct, generalisability analysis (Brennan and Johnson, 1995[19]; Brennan, 2001[20]) decomposed the observed score variance into components attributable to respondents, item-facets, models, and residual error. Particular attention was paid to the sizes of the model and residual variance components. Scoring was judged sufficiently reliable only when both were negligible relative to the total variance. In those cases, a definitive score for each response-item combination was obtained by taking the statistical mode across the R ratings.
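The sketch below illustrates both steps for one construct, assuming a complete N × J × R array of ordinal scores: a two-facet crossed generalisability analysis (respondents × item-facets × models) that solves the expected-mean-square equations for the variance components, followed by the modal score across the R ratings. The data are simulated and the code is a simplified illustration rather than the production analysis; negative component estimates, which can arise by chance, would conventionally be truncated at zero.

```python
import numpy as np

def g_study(scores):
    """Variance components for a fully crossed persons x item-facets x models design."""
    n_p, n_i, n_r = scores.shape
    grand = scores.mean()
    m_p = scores.mean(axis=(1, 2))            # person means
    m_i = scores.mean(axis=(0, 2))            # item-facet means
    m_r = scores.mean(axis=(0, 1))            # model means
    m_pi = scores.mean(axis=2)                # person x item-facet means
    m_pr = scores.mean(axis=1)                # person x model means
    m_ir = scores.mean(axis=0)                # item-facet x model means

    # Mean squares for each effect
    ms_p = n_i * n_r * np.sum((m_p - grand) ** 2) / (n_p - 1)
    ms_i = n_p * n_r * np.sum((m_i - grand) ** 2) / (n_i - 1)
    ms_r = n_p * n_i * np.sum((m_r - grand) ** 2) / (n_r - 1)
    ms_pi = (n_r * np.sum((m_pi - m_p[:, None] - m_i[None, :] + grand) ** 2)
             / ((n_p - 1) * (n_i - 1)))
    ms_pr = (n_i * np.sum((m_pr - m_p[:, None] - m_r[None, :] + grand) ** 2)
             / ((n_p - 1) * (n_r - 1)))
    ms_ir = (n_p * np.sum((m_ir - m_i[:, None] - m_r[None, :] + grand) ** 2)
             / ((n_i - 1) * (n_r - 1)))
    resid = (scores - m_pi[:, :, None] - m_pr[:, None, :] - m_ir[None, :, :]
             + m_p[:, None, None] + m_i[None, :, None] + m_r[None, None, :] - grand)
    ms_pir = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1) * (n_r - 1))

    # Expected-mean-square solutions for the variance components
    return {
        "residual": ms_pir,
        "pi": (ms_pi - ms_pir) / n_r,
        "pr": (ms_pr - ms_pir) / n_i,
        "ir": (ms_ir - ms_pir) / n_p,
        "persons": (ms_p - ms_pi - ms_pr + ms_pir) / (n_i * n_r),
        "item-facets": (ms_i - ms_pi - ms_ir + ms_pir) / (n_p * n_r),
        "models": (ms_r - ms_pr - ms_ir + ms_pir) / (n_p * n_i),
    }

def modal_scores(scores):
    """Definitive 0/1/2 score per response-item cell: the mode across the R models."""
    counts = np.stack([(scores == k).sum(axis=2) for k in (0, 1, 2)], axis=-1)
    return counts.argmax(axis=-1)

# Toy example: 200 responses, 12 item-facets, R = 4 scoring models
rng = np.random.default_rng(0)
toy = rng.integers(0, 3, size=(200, 12, 4))
print(g_study(toy.astype(float)))
print(modal_scores(toy).shape)   # (200, 12)
```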
Once scoring reliability had been verified under generalisability analysis, the data sets were finalised for subsequent modelling. The resulting files met all predetermined criteria for reliability and validity.
Scaling of scored response data
Item-facet parameters for each construct were estimated using the generalised partial credit model (GPCM; Muraki, 1992[21]) of item response theory (Birnbaum, 1968[22]; Lord and Novick, 1968[23]). To anchor these estimates to the PISA scale, the international and national parameters for the multiple-choice items were fixed, allowing the item-facet parameters of the open-ended items scored under the new scoring rubrics to be freely estimated. Consequently, the proficiency estimates reported in this study are expressed on the PISA scale. However, the conceptual alignment of the newly defined constructs with the original mathematical literacy framework warrants careful scrutiny, as the new scales rely on item parameters derived from the overall construct (i.e., mathematical literacy).
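As an illustration of the scaling model, the sketch below implements the GPCM category-response function in its step-parameter form and evaluates it for an anchored dichotomous item and a new three-category rubric facet. All parameter values are invented for demonstration; in the actual calibration, the anchored parameters are fixed at their published values and the rubric-facet parameters are freely estimated.

```python
import numpy as np

# Generalised partial credit model (GPCM), step-parameter form: the numerator for
# category k is exp(sum over v <= k of a * (theta - b_v)), with an empty sum for k = 0.
def gpcm_probs(theta, a, steps):
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(steps, dtype=float)))))
    z -= z.max()                         # numerical stability
    p = np.exp(z)
    return p / p.sum()

theta = 0.3                                            # ability on the logit scale
# Anchored item: parameters fixed at their published calibration (values invented here).
p_anchor = gpcm_probs(theta, a=1.1, steps=[0.2])       # dichotomous, categories 0/1
# New rubric facet (0/1/2): step parameters would be freely estimated in the calibration.
p_facet = gpcm_probs(theta, a=0.9, steps=[-0.4, 0.6])
print(p_anchor, p_facet, float((np.arange(3) * p_facet).sum()))  # expected facet score
```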
In PISA 2022 mathematics, scores for the content area and cognitive process facets are estimated using the same item parameters as those used for the mathematical literacy scale. The present study follows this practice: the facet scores are estimated using the item parameters established for overall proficiency. This approach ensures that the resulting facet scores are directly comparable with historical PISA mathematics results, although the reservations about construct equivalence noted above still apply.
The principal procedural departure from standard ILSA practice is the omission of cross-language equivalence checks, commonly implemented through differential item functioning (DIF) analysis (Meredith, 1993[24]; Chen, 2007[25]; Asparouhov and Muthén, 2014[26]). ILSAs check parameter invariance across language versions using statistics such as the root mean square deviation (RMSD) between expected and observed item response functions (OECD, 2017[2]; Okubo, 2024[27]). Although Ireland's sample includes assessments administered in both English and Irish, the Irish-medium cohort is small (N = 109); therefore, DIF analyses were not undertaken. International cross-cultural comparability is therefore not fully established for the new constructs, even though the fixed PISA parameters have previously met the relevant invariance criteria. For the purpose of a within-country profile, however, this limitation is not consequential; it is sufficient that the scores are interpretable on the established PISA scale and comparable across constructs within Ireland.
Estimation of country-profile scores
Population-level proficiency distributions were estimated from all scored responses and the calibrated item-facet parameters. In practice, the population-modelling framework was used to implement multiple imputation (Rubin, 1987[28]) in line with ILSA practice. Prior means for each student, by facet, were obtained from a latent regression model (LRM) that included relevant background variables. These prior means, together with their covariance matrix and the likelihood, were used to draw 10 plausible values (PVs) per respondent for each facet. The theoretical basis for this approach is discussed in OECD (2017[2]).
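The sketch below illustrates, for a single student on a single facet, how plausible values can be drawn from the posterior implied by a normal latent-regression prior and the GPCM likelihood of the scored responses. A coarse grid approximation stands in for the operational machinery, and the item parameters, responses, and prior are illustrative.

```python
import numpy as np

def gpcm_probs(theta, a, steps):
    """GPCM category probabilities (step-parameter form); see the previous sketch."""
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(steps, dtype=float)))))
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def draw_pvs(responses, items, prior_mean, prior_sd, n_pv=10, seed=0):
    """Draw plausible values for one student on one facet from a grid posterior.
    responses: observed categories; items: (a, steps) per item-facet (illustrative)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(prior_mean - 5 * prior_sd, prior_mean + 5 * prior_sd, 401)
    log_post = -0.5 * ((grid - prior_mean) / prior_sd) ** 2          # normal LRM prior
    for x, (a, steps) in zip(responses, items):
        like = np.array([gpcm_probs(t, a, steps)[x] for t in grid])
        log_post += np.log(like + 1e-300)                            # GPCM likelihood
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return rng.choice(grid, size=n_pv, p=post)

pvs = draw_pvs(responses=[1, 2, 0],
               items=[(1.1, [0.2]), (0.9, [-0.4, 0.6]), (1.3, [0.0])],
               prior_mean=0.1, prior_sd=0.9)
print(pvs)   # 10 plausible values on the logit scale, prior to transformation to the PISA scale
```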
Population-level proficiency distributions were then computed from the PVs using the final student weights, yielding nationally representative estimates of the mean and variance of proficiency for Ireland on the PISA scale. Subgroup statistics (e.g., gender-specific means and variances) can be computed analogously. All stages of the analysis followed standard ILSA procedures.
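The following sketch shows one way to combine the weighted plausible values into a population estimate and standard error using Rubin's (1987[28]) combining rules. The within-imputation (sampling) variance term is simplified here for illustration; operational PISA analyses derive it from the replicate weights instead, and the data below are simulated.

```python
import numpy as np

def combine_pvs(pv_matrix, weights):
    """Rubin's rules for an N x M matrix of plausible values with final student weights.
    The within-imputation variance is a simplified weighted approximation; operational
    analyses would compute it from replicate weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    m = pv_matrix.shape[1]
    means = pv_matrix.T @ w                              # one weighted mean per PV
    point = means.mean()                                 # combined point estimate
    between = means.var(ddof=1)                          # between-imputation variance
    within = np.mean([np.sum(w ** 2 * (pv_matrix[:, j] - means[j]) ** 2) for j in range(m)])
    total_var = within + (1 + 1 / m) * between
    return float(point), float(np.sqrt(total_var))

rng = np.random.default_rng(1)
pv = rng.normal(492.0, 90.0, size=(5569, 10))            # illustrative PVs on the PISA scale
w = rng.uniform(0.5, 2.0, size=5569)                      # illustrative final student weights
print(combine_pvs(pv, w))                                  # (mean, standard error)
```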
Three consistency checks were conducted. First, the weighted PV mean was required to reproduce the population-level mean of the proficiency distribution estimated via marginal maximum likelihood with the EM algorithm (MMLE-EM; Dempster, Laird and Rubin, 1977[29]; Bock and Aitkin, 1981[30]). Second, subgroup means computed from the PVs were required to match design-based estimates derived from MMLE-EM. Third, within each subgroup, residual distributions from the LRM were examined for bias. All three criteria were satisfied, supporting the validity of the PV generation and the resulting estimates (Okubo, 2022[31]).
4 Country profile of Ireland's mathematics performance
Scale construction and reporting criteria
This section presents the results of an AI-enhanced assessment of Ireland's performance in PISA 2022 mathematics. All scores are reported on the PISA mathematical literacy scale (Mean = 500, SD = 100), defined by the proficiency distribution of OECD member countries (OECD, 2010[32]), to preserve full comparability with OECD benchmarks. National representativeness was ensured through the application of student sampling weights and replicate weights. Consequently, the point estimates and associated standard errors reflect the achievement of fifteen-year-olds enrolled in Irish post-primary schools. The analytic sample comprises 5 569 students: 5 460 in English-medium instruction and 109 in Irish-medium instruction, thereby capturing the full linguistic diversity of the cohort.
Item parameters were estimated in line with ILSA standards. Model fit was evaluated using the root-mean-square deviation (RMSD), following the PISA methodology (OECD, 2017[2]). PVs were generated via population modelling (Rubin, 1987[28]; OECD, 2017[2]; Okubo, 2022[31]) incorporating relevant background variables. Ten PVs per respondent were drawn, and all reported statistics are weighted averages across these multiple imputations.
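As a concrete illustration of the fit check, the sketch below computes an RMSD-type statistic for a single item by comparing observed proportions with model-implied probabilities across ability groups, weighted by group size. The data are invented, and the 0.12 screening cut-off shown is a commonly used threshold rather than a value specific to this study.

```python
import numpy as np

# RMSD-type item-fit statistic: weighted root-mean-square deviation between the
# observed and model-implied item response functions over ability groups.
def rmsd(observed, expected, group_weights):
    w = np.asarray(group_weights, dtype=float)
    w = w / w.sum()
    diff = np.asarray(observed, dtype=float) - np.asarray(expected, dtype=float)
    return float(np.sqrt(np.sum(w * diff ** 2)))

obs = [0.18, 0.35, 0.55, 0.74, 0.88]      # observed proportions correct by ability group
exp = [0.20, 0.34, 0.52, 0.75, 0.90]      # model-implied probabilities at group means
value = rmsd(obs, exp, group_weights=[120, 340, 500, 330, 110])
print(value, value < 0.12)                 # 0.12 is a commonly used screening cut-off
```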
The analyses adopt a fine-grained diagnostic lens. In addition to the overall mathematics literacy scale, separate estimates are produced for two nested sets of subscales: Content Topic (13 topics) and the cognitive operators within four cognitive processes: Formulating Situations Mathematically (Formulating); Employing Mathematical Concepts, Facts and Procedures (Employing); Interpreting, Applying and Evaluating Mathematical Outcomes (Interpreting); and Mathematical Reasoning. Each subscale is reported only when its item bank satisfies the reliability and validity criteria specified in Section 3 (minimum item count and minimum number of student responses). This requirement safeguards against over-interpretation of sparse or poorly targeted scales.
Performance profiles across content topics
Figure 1 displays population-level content-topic scores with 95% confidence intervals. The blue line indicates Ireland's mean mathematical-literacy score (491.6). Table 8 reports the corresponding point estimates, their standard errors (SEs), and the differences (Δ) from Ireland's overall mathematical-literacy mean. Topic descriptions follow the PISA 2022 Assessment and Analytical Framework (OECD, 2023[3]).
High scores in Chance and Probability (Δ = 19.4), Data Collection, Representation and Interpretation (Δ = 4.5), and Data Variability and its Description (Δ = 5.2) indicate comparative strength in the content category Uncertainty and Data. In the assessment framework, Uncertainty and Data "includes recognising the place of variation in the real world including, having a sense of the quantification of that variation, and acknowledging its uncertainty and error in related inferences." It "also includes forming, interpreting and evaluating conclusions drawn in situations where uncertainty is present." Moreover, "The presentation and interpretation of data are key concepts in this category." This category also includes the topic of Conditional Decision Making, which the framework links to two-way tables: "These inter-relationships can often be represented in two-way tables that provide the basis for making conditional decisions (inferences)." Such tables provide "the probabilities of the four joint events, the two marginal, and the conditional probabilities which play the central role in what we have termed conditional decision making."
Low scores in Growth Phenomena (Δ = -7.9), Relationships within and among Geometrical Objects in Two and Three Dimensions (Δ = -18.2), and Measurement (Δ = -12.7) indicate comparatively weaker performance on tasks involving non-linear growth, and on core Space and Shape knowledge of geometric relationships and measurement, as defined in the framework.
For Growth Phenomena, the framework specifies that identifying this as a focal point "is not to signal that there is an expectation that participating students should have studied the exponential function and certainly the items will not require knowledge of the exponential function." It adds that items expect students to "(a) recognise that not all growth is linear, (b) that non-linear growth has particular and profound implications on how we understand certain situations, and (c) appreciate the intuitive meaning of "exponential growth" as an extremely rapid rate of growth, for example in the earthquake scale, every increase by 1 unit on the Richter scale does not mean a proportional increase in its effect, but rather by 10, 100, and 1000 times etc."
For Relationships within and among Geometrical Objects in Two and Three Dimensions, the framework gives this topic description: "Relationships within and among geometrical objects in two and three dimensions: Static relationships such as algebraic connections among elements of figures (e.g. the Pythagorean theorem as defining the relationship between the lengths of the sides of a right triangle), relative position, similarity and congruence, and dynamic relationships involving transformation and motion of objects, as well as correspondences between two- and three-dimensional objects." Lower scores suggest difficulty handling these relations and mappings.
For Measurement, the framework defines the topic as: "Measurement: Quantification of features of and among shapes and objects, such as angle measures, distance, length, perimeter, circumference, area and volume." Low scores, therefore, point to challenges in applying these measurements in context.
Both geometry-related topics belong to Space and Shape, whose description stresses that "Geometry serves as an essential foundation for space and shape, but the category extends beyond traditional geometry in content, meaning and method, drawing on elements of other mathematical areas such as spatial visualisation, measurement and algebra." It also notes that "Measurement formulas are central in this area." Hence, lower performance in these two topics indicates weaknesses in Space and Shape content knowledge and its application.
Performance profiles across cognitive operators
Performance profiles in Formulating Situations Mathematically
Figure 2 displays population-level scores for cognitive operators within Formulating Situations Mathematically, with 95% confidence intervals. The blue line marks Ireland's mean mathematical literacy score (491.6). Table 9 reports the corresponding point estimates, standard errors (SEs), and differences (Δ) from Ireland's overall mathematical literacy mean. Descriptions of the cognitive operators follow the PISA 2022 Assessment and Analytical Framework (OECD, 2023[3]).
High performance on the cognitive operator 'simplifying a situation or problem to make it amenable to mathematical analysis (for example, by decomposing)' (Δ = 7.0) indicates strength in this formulating operator.
By contrast, low performance on 'representing a situation mathematically using appropriate variables, symbols, diagrams, and standard models' (Δ = -6.2) and on 'recognising mathematical structure (including regularities, relationships, and patterns) in problems or situations' (Δ = -9.9) indicates weaknesses in representation and in recognising structure.
Overall, this profile suggests that students are more comfortable reducing complexity than identifying deep structure and encoding it in mathematical representations within the cognitive process of Formulating Situations Mathematically.
Performance profiles in Employing Mathematical Concepts, Facts and Procedures
Figure 3 presents population-level scores for cognitive operators within Employing Mathematical Concepts, Facts, and Procedures, with 95% confidence intervals. The blue line indicates Ireland's mean mathematical literacy score (491.6). Table 10 reports the corresponding point estimates, standard errors (SEs), and differences (Δ) from Ireland's overall mathematical literacy mean. Descriptions of the cognitive operators follow the PISA 2022 Assessment and Analytical Framework (OECD, 2023[3]).
Ireland's high scores on 'making mathematical diagrams, graphs, simulations, and constructions and extracting mathematical information from them' (Δ = 9.9) indicate strength in constructing and interpreting mathematical representations.
By contrast, low scores on 'devising and implementing strategies for finding mathematical solutions' (Δ = -17.6) point to comparatively weaker performance in strategic problem solving.
Overall, these results suggest that, relative to other assessed cognitive operators, students in Ireland are more secure in handling mathematical representations than in devising and implementing strategies for solving mathematical problems.
Performance profiles in Interpreting, Applying and Evaluating Mathematical Outcomes
Figure 4 presents population-level scores for cognitive operators within Interpreting, Applying, and Evaluating Mathematical Outcomes, with 95% confidence intervals. The blue line indicates Ireland's mean mathematical literacy score (491.6). Table 11 reports the corresponding point estimates, standard errors (SEs), and differences (Δ) from Ireland's overall mathematical literacy mean. Descriptions of the cognitive operators follow the PISA 2022 Assessment and Analytical Framework (OECD, 2023[3]).
Scores for the cognitive operators within Interpreting (Interpreting, Applying and Evaluating Mathematical Outcomes) cluster around the overall mean on the mathematics scale, indicating that no cognitive operator is markedly stronger or weaker than overall mathematical proficiency.
For example, differences from the overall mean are small for 'evaluating a mathematical outcome in terms of the context' (Δ = 1.8), 'interpreting a mathematical result back into the real-world context' (Δ = 2.3), and 'explaining why a mathematical result or conclusion does, or does not, make sense given the context of a problem' (Δ = 2.5). This pattern indicates broadly similar proficiency in evaluating, interpreting, and explaining outcomes under the cognitive process of Interpreting, Applying and Evaluating Mathematical Outcomes.
Performance profiles in Mathematical Reasoning
Figure 5 presents population-level scores for cognitive operators within Mathematical Reasoning, with 95% confidence intervals. The blue line indicates Ireland's mean mathematical literacy score (491.6). Table 12 reports the corresponding point estimates, standard errors (SEs), and differences (Δ) from Ireland's overall mathematical literacy mean. Descriptions of the cognitive operators follow the PISA 2022 Assessment and Analytical Framework (OECD, 2023[3]).
High scores on 'appreciating the power of abstraction and symbolic representation' (Δ = 15.5) in mathematical reasoning indicate strength in working with mathematical representations for reasoning. For this cognitive operator, the framework states, "Students use representations - whether text-based, symbolic, graphical, numerical, geometric or in programming code - to organise and communicate their mathematical thinking." It further emphasises that "Representations are also a core element of mathematical modelling, allowing students to abstract a simplified or idealised formulation of a real-world problem." Moreover, "Having an appreciation of abstraction and symbolic representation supports reasoning in the real-world applications of mathematics envisaged by this framework by allowing students to move from the specific details of a situation to the more general features and to describe these in an efficient way."
Ireland's high score on "understanding variation as the heart of statistics" (Δ = 23.2) indicates strong acquisition of the skill emphasised in the framework, namely: "In statistics, accounting for variability is one, if not the central, defining element around which the discipline is based." It also mentions that "Statistics is essentially about accounting for or modelling variation ..." and concludes that "Understanding variation as a central feature of statistics supports reasoning about the real-world applications of mathematics envisaged in this framework in that students are encouraged to engage with data-based arguments with awareness of the limitations of the conclusions that can be drawn."
By contrast, low scores on 'using mathematical modelling as a lens onto the real world' (Δ = -6.1) suggest comparatively weaker performance in model-centred reasoning. The framework notes that "Models are simplifications of reality that foreground certain features of a phenomenon while approximating or ignoring other features." and that "Mathematical models are formulated in mathematical language and use a wide variety of mathematical tools and results (e.g., from arithmetic, algebra, geometry, etc.)." In addition, "Models can be operated - that is, made to run over time or with varying inputs, thus producing a simulation. When this is done, it is possible to make predictions, study consequences, and evaluate the adequacy and accuracy of the models."
Overall, this profile indicates that, relative to other assessed operators, students in Ireland are more secure in using representations and in reasoning about variability than in using mathematical modelling to make sense of real-world problems. They appear less assured when required to use mathematical models to structure real-world situations, to specify assumptions and parameters, to operate models, and to evaluate their adequacy in context.
5 Discussion of findings and limitations
Summary of findings
Ireland's country profile in PISA 2022 mathematics shows areas of strength and targeted areas for development. Performance is strongest in the content category Uncertainty and Data, particularly the topic Chance and Probability, and in cognitive operators associated with working with representations and reasoning about variability. Relative weaknesses appear in tasks involving non-linear growth and geometric relationships, and in recognising mathematical structure and devising strategies for solving mathematical problems. Item difficulties are taken into account in the scoring model; therefore, observed high or low scores in this study should not be attributed to the inherent difficulty of the content topics or cognitive operators.
From a content-topic perspective, the Uncertainty and Data topics Chance and Probability, Data Collection, Representation and Interpretation, and Data Variability and its Description indicate solid competence with variability, data displays, and probabilistic reasoning. In contrast, lower scores in Growth Phenomena, Relationships within and among Geometrical Objects in Two and Three Dimensions, and Measurement point to difficulties with representing and analysing change when growth is non-linear, and with geometric relationships and measurement.
In Formulating Situations Mathematically, students perform well on 'simplifying a situation or problem to make it amenable to mathematical analysis (for example, by decomposing)', while lower scores on 'representing a situation mathematically using appropriate variables, symbols, diagrams, and standard models' and on 'recognising mathematical structure (including regularities, relationships, and patterns) in problems or situations' suggest challenges with identifying deep structure and encoding it in formal representations. In Employing Mathematical Concepts, Facts and Procedures, high performance on 'making mathematical diagrams, graphs, simulations, and constructions and extracting mathematical information from them' contrasts with lower performance on 'devising and implementing strategies for finding mathematical solutions', indicating a relative weakness in strategic problem solving. In Mathematical Reasoning, strengths are evident in 'appreciating the power of abstraction and symbolic representation' and in 'understanding variation as the heart of statistics', whereas 'using mathematical modelling as a lens onto the real world' is lower, signalling less secure use of mathematical models in reasoning.
Overall, the country-profile analysis highlights a distinction between extracting information from given representations and mapping complex contexts into mathematical forms. The pattern suggests that students in Ireland are proficient in working with data and established representations, and less assured when required to devise strategies, formalise complex situations, and use mathematical models to structure and analyse real-world problems.
Implications of the findings
The results of this additional analysis are observations about the relative strengths and weaknesses in mathematics of the PISA 2022 cohort in Ireland. An open question is what actions or policy recommendations follow from these finer-grained results; in essence, the question is, "so what?" We propose that these new results should be considered not by us, but by those with a thorough and deep knowledge of the context and system of school education in Ireland. However, we go so far as to propose a way of reflecting on the results, both in the hope that this will help policymakers in Ireland make use of them and as a way of demonstrating the policy relevance of this additional analysis.
Take, as an example, the relative strengths and weaknesses shown in the results by Content Topic. We propose that the first guiding question for stakeholders when approaching the results ought to be along the lines of, "is this what our curriculum intends?" Every curriculum tries to cover the full range of content of a subject; however, there is only so much time in a school year, and choices must therefore be made about the depth to which a given content area can be covered in any given year. So, does the curriculum delivered in the years leading up to the grade in which most 15-year-old students are enrolled place more emphasis on 'Chance and Probability' than on 'Relationships within and among Geometrical Objects in Two and Three Dimensions'? If so, then these results may speak to the success of teachers and schools in effectively delivering the curriculum as it has been designed. However, if this is not the intended relative emphasis of the curriculum, then further questions naturally follow about why the results appear as they do, and what might be necessary to achieve the intentions of the curriculum.
A similar approach can be taken towards the cognitive operators elaborated within the processes, with the same essential questions being rephrased as, "are the students stronger in the areas that we emphasise in our curriculum and teaching?" With the cognitive operators, though, it is less about content and much more about practice. When students are in their mathematics classes, what sorts of tasks are they given? Are they typically given many more opportunities for "Recognising aspects of a problem that correspond with known problems or mathematical concepts, facts or procedures" than they are for "Recognising mathematical structure (including regularities, relationships, and patterns)"? If the answer is yes, then there is a plausible hypothesis for why students in Ireland have been observed to be stronger in the former than the latter, and this can be examined further. If the answer is no, then perhaps a hypothesis from experts in the education system may emerge. The ensuing questions, again, would revolve around whether this is intentional and what stakeholders expect, or whether there may be a case for closer review and a potential future readjustment.
In short, the policy implication of this additional analysis is that it affords opportunities for critical reflection on whether the observed performance in content areas and skills matches well with what the education system is trying to achieve. It is not about absolute levels of performance; rather, it is about relative strengths and weaknesses, and how these match the different prioritisation and relative importance of different skills and areas of knowledge in the curriculum and in classroom practice.
Limitations of the analysis
A limitation concerns the empirical scope. The study uses the PISA 2022 mathematics dataset for Ireland. Although PISA procedures are internationally recognised and methodologically robust, the sample comprises only 15-year-old students and pertains solely to mathematics. Any inferences about Ireland's performance, therefore, apply to this age cohort and subject area alone.
A further limitation arises from the automated scoring system, which is based on LLMs. Expert review and multiple statistical checks reduced many sources of error; however, residual bias associated with prompt sensitivity or subtle linguistic variation may persist. Additional rounds of human verification could enhance confidence, although the reliability of human coders would itself require monitoring.
Construct coverage is also narrower than the full mathematics framework. To satisfy validity requirements, constructs represented by fewer than 12 items or fewer than 3 000 responses were excluded. This safeguard against unstable parameter estimates removed six content topics and several cognitive operators, thereby constraining the breadth of the resulting proficiency profiles.
The estimation strategy presents an additional methodological constraint. Newly estimated item-facet parameters were anchored to international and national trend-item parameters that were originally estimated under a unidimensional mathematical literacy framework. This approach follows PISA practice and aligns the new scales with the established metric; nevertheless, it assumes that each newly defined construct remains conceptually consistent with the original latent variable. Thus, the possibility of construct drift cannot be dismissed.
International comparability also remains limited. Although item parameters were placed on the international scale, responses from other language versions and cultural contexts were not examined, and standard cross-language invariance checks were not conducted. As a result, the comparability of the newly estimated parameters across countries remains untested, and no claims are made regarding their performance in cross-national analyses.
References
Asparouhov, T. and B. Muthén (2014), "Multiple-group factor analysis alignment", Structural Equation Modeling, Vol. 21, pp. 495-508, https://doi.org/10.1080/10705511.2014.919210. [26]
Birnbaum, A. (1968), Some latent trait models and their use in inferring an examinee's ability, Addison-Wesley. [22]
Bock, R. and M. Aitkin (1981), "Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm", Psychometrika, Vol. 46, pp. 443-459, https://doi.org/10.1007/BF02293801. [30]
Brennan, R. (2001), Generalizability theory, Springer-Verlag, New York. [20]
Chen, F. (2007), "Sensitivity of goodness of fit indexes to lack of measurement invariance", Structural Equation Modeling, Vol. 14, pp. 464-504, https://doi.org/10.1080/10705510701301834. [25]
Dempster, A., N. Laird and D. Rubin (1977), "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society, Series B., Vol. 39, pp. 1-38, https://doi.org/10.1111/j.2517-6161.1977.tb01600.x. [29]
Eva, K. (2005), "What every teacher needs to know about clinical reasoning", Medical Education, Vol. 39/1, pp. 98-106, https://doi.org/10.1111/j.1365-2929.2004.01972.x. [11]
Fullan, M. (2016), The new meaning of educational change, Teachers College Press, Columbia University; Routledge; Ontario Principals' Council. [7]
Grek, S. (2009), "Governing by numbers: the PISA 'effect' in Europe", Journal of Education Policy, Vol. 24/1, pp. 23-37, https://doi.org/10.1080/02680930802412669. [5]
Hopfenbeck, T. et al. (2018), "Lessons Learned from PISA: A Systematic Review of Peer-Reviewed Articles on the Programme for International Student Assessment", Scandinavian Journal of Educational Research, Vol. 62, pp. 333-353, https://doi.org/10.1080/00313831.2016.1258726. [4]
Jolly, B. and M. Dalton (2018), Written Examinations, Wiley-Blackwell, London, https://doi.org/10.1002/9781119373780.ch21. [9]
Lee, G. et al. (2024), "Applying large language models and chain-of-thought for automatic scoring", Computers and Education: Artificial Intelligence, Vol. 6/100213, https://doi.org/10.1016/j.caeai.2024.100213. [13]
Lord, F. and M. Novick (1968), Statistical theories of mental test scores, Addison-Wesley, Menlo Park. [23].
Martens, K. and D. Niemann (2013), "When Do Numbers Count? The Differential Impact of the PISA Rating and Ranking on Education Policy in Germany and the US", German Politics, Vol. 22/3, pp. 314-332, https://doi.org/10.1080/09644008.2013.794455. [6]
Meredith, W. (1993), "Measurement invariance, factor analysis and factorial invariance", Psychometrika, Vol. 58, pp. 525-543, https://doi.org/10.1007/BF02294825. [24]
Messick, S. (1996), "Validity and washback in language testing", Language Testing, Vol. 13, pp. 241-256, https://doi.org/10.1177/026553229601300302. [18]
Messick, S. (1989), Validity, American Council on Education/Macmillan, New York. [17]
Muraki, E. (1992), "A generalized partial credit model: Application of an EM algorithm", Applied Psychological Measurement, Vol. 16, pp. 159-176, https://doi.org/10.1177/014662169201600206. [21]
OECD (2024), PISA 2022 Technical Report, OECD Publishing, https://doi.org/10.1787/01820d6d-en. [10]
OECD (2023), PISA 2022 Assessment and Analytical Framework, PISA, OECD Publishing, https://doi.org/10.1787/dfe0bf9c-en. [3]
OECD (2023), PISA 2022 Results (Volume I): The State of Learning and Equity in Education, PISA, OECD Publishing, https://doi.org/10.1787/53f23881-en. [15]
OECD (2019), PISA 2018 Results (Volume I): What Students Know and Can Do, PISA, OECD Publishing, Paris, https://doi.org/10.1787/5f07c754-en. [14]
OECD (2017), PISA 2015 Technical Report, OECD Publishing, https://www.oecd.org/content/dam/oecd/en/about/programmes/edu/pisa/publications/technical-report/PISA2015_TechRep_Final.pdf. [2]
OECD (2013), PISA 2012 Assessment and Analytical Framework: Mathematics, Reading, Science, Problem Solving and Financial Literacy, OECD Publishing, https://doi.org/10.1787/9789264190511-en. [16]
OECD (2010), PISA 2009 Results: What Students Know and Can Do: Student Performance in Reading, Mathematics and Science (Volume I), PISA, OECD Publishing, Paris, https://doi.org/10.1787/9789264091450-en. [32]
OECD (n.d.), PISA 2022 Database, https://www.oecd.org/en/data/datasets/pisa-2022-database.html (accessed July 2025). [1]
Okubo, T. (2025), "Collective Intelligence Model for Education (CIME)", OECD Education Working Papers, Vol. 325, pp. 1-35, https://doi.org/10.1787/c673cc25-en. [8]
Okubo, T. (2024), "Towards more diverse and flexible international large-scale assessments", OECD Education Working Papers, Vol. 310, pp. 1-59, https://doi.org/10.1787/0417b5ec-en. [27]
Okubo, T. (2022), "Theoretical considerations on scaling methodology in PISA", OECD Education Working Papers, No. 282, OECD Publishing, Paris, https://doi.org/10.1787/c224dbeb-en. [31]
Okubo, T. et al. (2023), "AI scoring for international large-scale assessments using a deep learning model and multilingual data", OECD Education Working Papers, No. 287, OECD Publishing, Paris, https://doi.org/10.1787/9918e1fb-en. [12]
Phillips, G. (ed.) (1995), "Generalizability of Performance Assessments", Educational Measurement: Issues and Practice, Vol. 14, pp. 9-12, https://onlinelibrary.wiley.com/doi/epdf/10.1111/j.1745-3992.1995.tb00882.x. [19]
Rubin, D. (1987), Multiple imputation for nonresponse in surveys, John Wiley and Sons. [28]
Copyright Organisation for Economic Co-operation and Development (OECD) 2025
