Abstract
Educational attainment is vital in social science research for analysing socioeconomic inequalities, labour market outcomes, and health disparities. Harmonisation schemes such as the International Standard Classification of Education (ISCED) and its survey-specific adaptation EDULVLB aim to standardise educational classifications across countries, enabling international comparability. Despite their widespread use, concerns persist regarding the reliability of these harmonised measures, particularly at the individual level and across different survey modes. This study evaluates the reliability of harmonised educational attainment measurements using test-retest data from Estonia, Slovenia, and the United Kingdom. Respondents’ answers from the face-to-face European Social Survey Round 8 (2016) and the online CRONOS Panel Wave 6 (2018) were analysed, with reliability coefficients estimated for both the one-digit ISCED and the more detailed EDULVLB classification. The results reveal notable individual-level inconsistencies, especially in the United Kingdom, challenging assumptions of high reliability in harmonised education data. Inconsistencies were most common between adjacent educational levels, suggesting difficulties in distinguishing similar qualifications. Device effects were also observed, with smartphone users displaying lower consistency than computer or tablet users. While mode effects could not be fully disentangled from measurement error, the findings underscore the need for systematic reliability assessments and improved instrument design to ensure the comparability and validity of educational measures in cross-national survey research.
Introduction
Educational attainment plays a crucial role in social science research. It helps us understand socioeconomic inequality, labour market outcomes, and health disparities. Beyond academia, it influences policy decisions and benchmarks data quality in national and international surveys (Ortmanns, 2020b; Schneider, 2010b). Harmonisation schemes such as the International Standard Classification of Education (ISCED) and its survey-specific adaptations, like the EDULVLB variable of the European Social Survey, strive to standardise educational classifications across different national systems (Schneider, 2009, 2016; UNESCO, 2012). These schemes enable international comparability by classifying educational attainment based on dimensions such as level, orientation, duration, and destination. They offer flexibility for analyses, allowing variables to be treated as nominal or ordinal. Despite their widespread adoption, concerns persist about their reliability and ability to capture consistent data across individuals, time points, and survey modes.
Reliability is crucial for high-quality measurement, yet its levels in harmonised data may not always receive thorough examination. Much of the existing research focuses on aggregated distributions, which might miss inconsistencies at the individual level that can affect data accuracy. For instance, Ortmanns and Schneider (2016) assessed the stability of aggregated educational attainment distributions across large-scale surveys, interpreting stable distributions as indicators of reliable measurements. This approach, however, may not detect errors at the individual level, where discrepancies could be more pronounced. Early studies by Porst and Zeifang (1987) on socio-demographic variables showed limited reliability, suggesting that even factual data like educational attainment can be subject to measurement errors. These observations underscore the importance of conducting systematic test-retest analyses to identify hidden inconsistencies.
Mixed-mode survey designs add further complexity to the reliability of educational measurements. To reduce costs and improve response rates, surveys like the European Social Survey (ESS) are shifting from traditional face-to-face formats to mixed-mode designs that include self-completion via web or mobile devices (European Social Survey, 2024). The COVID-19 pandemic has accelerated this transition. However, this shift introduces mode effects that can interact with measurement errors, making data comparability across modes more challenging. Respondents may report their educational qualifications differently across survey modes due to several factors. For instance, self-completion modes can increase the cognitive burden on respondents, particularly when questions require careful attention to detail without interviewer assistance. Respondents facing demanding or lengthy questions may provide less precise answers or select approximate options to minimise effort (Krosnick, 1991; Roßmann et al., 2018; Tourangeau, 2003). Similarly, the presence of an interviewer can influence how respondents answer by keeping them engaged and attentive throughout the questionnaire. At the same time, it may introduce social desirability bias, leading individuals to adjust their responses based on perceived expectations (de Leeuw, 2005; Kreuter et al., 2008). Another source of inconsistency is the device used in self-completion surveys. Differences in screen size, touchscreen interfaces, and formatting across smartphones and computers can contribute to selection errors or accidental responses, further complicating data comparability (Struminskaya et al., 2015; Tourangeau et al., 2018). These challenges raise concerns about the broader implications of mixed-mode designs for cross-national comparability and reliability.
This study aims to fill these gaps by evaluating the reliability of harmonised educational attainment measurements using test-retest data from Estonia, Slovenia, and the United Kingdom. We analyse responses from two data collection points: the face-to-face ESS Round 8 (2016) and the self-completion online CRONOS Panel Wave 6 (2018). By comparing reliability across these modes, we examine individual-level inconsistencies and their implications for cross-national comparability. We also explore how mode effects interact with measurement errors, contributing to the growing literature on mixed-mode survey design and reliability.
Our findings reveal unexpectedly low reliability in harmonised educational attainment measures, even when using widely adopted schemes like ISCED and EDULVLB. These results challenge the assumption that harmonised data are inherently comparable and highlight the need for more systematic test-retest studies to evaluate measurement consistency. The inability to fully separate mode effects from measurement errors remains an unresolved issue in mixed-mode research. Future studies should focus on developing instruments that can isolate these sources of error, ensuring robust reliability across modes and over time. By addressing these challenges, our research lays the groundwork for improving the comparability and validity of educational measurements in cross-national surveys.
Background and literature review
Importance of harmonising measurements of education for comparability
Educational attainment is defined as the highest level of formal education an individual has completed, capturing key socioeconomic constructs such as human capital (skills and knowledge), cultural capital (cultural competence and resources), and socioeconomic status (position within social and economic hierarchies) (Schneider, 2009). It plays a central role in social science research, serving as an independent or dependent variable, a proxy for other variables, or as a control variable both inside and outside the social sciences. In survey methodology, educational attainment serves to assess survey representativeness and correct for non-response error through post-stratification weights (Lynn and Anghelescu, 2018; Ortmanns, 2020b). Given its importance, large-scale surveys prioritise high-quality educational measures while balancing flexibility and international comparability (Schneider, 2022).
However, achieving international comparability in measuring educational attainment is challenging due to the inherent diversity of educational systems. These systems differ in institutions, certificates, and structures, shaped by historical and cultural factors, complicating direct comparisons (Ortmanns, 2020b; Schneider, 2009). Harmonisation schemes such as ISCED (UNESCO, 2012), CASMIN (Brauns et al., 2003), ES-ISCED from the European Social Survey (Schneider, 2010b, 2016) or GISCED (Schneider, 2022) aim to bridge these differences. These schemes strive to ensure construct validity while enabling meaningful cross-national comparisons of educational attainment.
The importance of these harmonisation efforts is underscored by persistent inconsistencies in the distribution of educational attainment data across surveys. Schneider (2009) identified these issues by comparing data from the European Labour Force Survey, the European Survey of Income and Living Conditions, and the ESS across 26 European countries and multiple years. Ortmanns and Schneider (2016) confirmed similar issues in other international surveys, such as the International Social Survey Programme, the European Values Survey, and the Eurobarometer. Further, Ortmanns (2020a) showed that the observed inconsistencies across the aforementioned and other surveys like the Adult Education Survey, the European Quality of Life Survey, and the European Working Condition Survey, stemmed primarily from mapping errors and differences in response categories, even after accounting for other potential error sources. These findings have prompted revisions to harmonisation protocols and instruments and have shaped how education is measured.
The impact of harmonisation on survey measurement instruments
The introduction of harmonisation frameworks for measuring education has necessitated a close examination of the design of measurement instruments. Comparability in educational measures is enhanced when harmonisation is embedded into the instrument design through ex-ante harmonisation, ensuring consistency across countries before data collection begins (Granda et al., 2010). This approach minimises the challenges associated with harmonising data retrospectively, which often involves reconciling incomplete or ambiguous information (Schneider, 2022). As a result, some surveys have revised or redesigned their instruments to incorporate harmonisation principles effectively.
A prominent example of such adaptation is the European Social Survey (ESS), which undertook a comprehensive revision of its educational measurement for its 5th round of data collection. This process involved collaboration with national and international experts in education, the tailoring of the ISCED to European societies, and the implementation of standardised harmonisation procedures across participating countries. Additionally, national instruments, known as Country-Specific Educational Variables (CSEVs), were refined to align with these harmonised standards (Schneider, 2010a). Post-revision analyses of the distribution of educational attainment in the ESS indicate enhanced comparability, suggesting the effectiveness of ex-ante harmonisation in improving data quality (Ortmanns and Schneider, 2016).
One notable outcome of this harmonisation effort was the development of more nuanced educational measures that capture programme orientation, duration, and educational outcomes. Prior to harmonisation, some educational classifications were overlooked due to their limited relevance within specific national contexts. For instance, vocational education might be deemed less significant in some countries, yet comparative studies of its societal impact require consistent and harmonised measurement across nations (Smyth and Steinmetz, 2015).
Another significant implication of the harmonisation framework has been a shift toward instruments that classify education based on the highest completed qualifications rather than years of education (Ortmanns, 2020b). Empirical evidence supports this approach, with educational classifications exhibiting higher construct validity and cross-national comparability compared to years of education. Frameworks like the ESS version of the ISCED (ES-ISCED) and its more detailed form, EDULVLB, are designed with academic aims in mind, further enhancing the reliability and applicability of these measures in comparative research (Schneider, 2010b).
Figure 1, adapted from Schneider (2009, p. 57), illustrates the process of operationalizing educational attainment for comparable measurement using educational classifications. The measurement instrument serves as a bridge, linking the target indicator of educational attainment (to ensure construct validity) with a harmonisation framework (to achieve comparability in the analytical measure). It captures the highest educational qualification completed by survey respondents, using classifications tailored to the specific country’s educational system. Additionally, it incorporates detailed information on multiple dimensions essential for harmonisation with international standards. These dimensions include level, orientation, destination, and duration, which vary according to the characteristics of each educational system. The outcome is a harmonised educational classification that functions effectively as a nominal or ordinal scale of educational attainment.
Fig. 1 Operationalization of educational variables for comparability. Adapted from Schneider (2009, p. 57).
Changes that concern the reliability of the measurement
Achieving ex-ante harmonised measurement requires capturing detailed information about the highest educational qualification completed. Broad categories, such as “university degree,” are too vague to capture the diversity of qualifications granted by universities. In practice, this often necessitates expanding the list of classifications or adding more detailed descriptions to the educational options presented to respondents. While this increased detail can improve accuracy, it may also complicate the response process, especially if respondents struggle to differentiate between similar levels of education.
Changes in the length and complexity of survey questions can also affect the reliability of educational measurement. Generally, longer questions and questions with a larger number of answer categories are less reliable (Alwin, 2007; Alwin et al., 2018). However, as a factual question, educational attainment tends to exhibit higher reliability compared to subjective or attitudinal questions (Alwin, 2007; Hout and Hastings, 2016). Previous studies have highlighted the challenges posed by instruments with long lists of educational qualifications. While alternative approaches aim to reduce respondents’ burden and maintain validity and comparability, these solutions also encounter limitations (Herzing, 2020; Schneider et al., 2016, 2018).
The European Social Survey (ESS) illustrates the effects of harmonisation on the complexity of educational measurement. In Slovenia, the Country-Specific Educational Variable (CSEV) had 7 answer categories and 38 words in Round 1. By Round 8, this expanded to 13 categories with 340 words. Similarly, Estonia’s initial participation in Round 2 included 13 answer options with a word count of 53 in Estonian and 73 in Russian. By Round 8, the number of options rose to 15, with word counts increasing to 103 in Estonian and 170 in Russian. In the United Kingdom, Round 1 included 5 options with 161 words, while Round 8 had 18 options split across two lists, totalling 322 words. These examples suggest harmonisation efforts have increased the complexity of response options, both in terms of the number of categories and the length of descriptions.
Current and future developments in survey methodology present new challenges for ensuring the comparability of educational measurements. Survey methodologists expect a growing adoption of mixed-mode data collection, which is driven by goals to improve response rates, enhance data quality, and reduce costs (Couper, 2011; DeLeeuw, 2018). The COVID-19 pandemic has accelerated this shift, exposing the vulnerabilities of relying solely on face-to-face data collection and encouraging a broader evaluation of alternative methods (Gummer et al., 2020; Scherpenzeel et al., 2020). Even the ESS, which has long adhered to face-to-face data collection as a strict standard, plans to incorporate self-completion and mixed-device approaches in the near future (European Social Survey, 2024).
As these changes unfold, survey users and social science researchers will expect educational qualification measures to be sufficiently reliable across modes to ensure comparability. Additionally, survey methodologists will rely on educational variables to evaluate sample composition across modes. The comparability of educational measurements across collection modes will directly influence the quality of both substantive and methodological research that depends on these variables for their conclusions. Ensuring robust measurement across modes is, therefore, critical for the continued utility of educational data in cross-national surveys.
How reliable are harmonised survey instruments of educational classifications?
Despite the crucial role of educational attainment data, few studies have examined its reliability in terms of individual-level measurement error. Porst and Zeifang (1987) analysed the reliability of socio-demographic variables using the German General Social Survey (ALLBUS) and found only minimally acceptable reliability for occupational education, with a Kendall’s Tau as low as 0.72 (Porst and Zeifang, 1987, pp. 194–196). Although this study captured the orientation of education by covering vocational education, the classifications of educational attainment were relatively broad compared to those used in internationally harmonised instruments. The results of this study therefore have limited comparability with the ex-ante harmonised instruments used in contemporary surveys.
Hout and Hastings (2016) investigated the reliability of years of education using repeated measurements from the U.S. General Social Survey (GSS), reporting high reliability for this measure. However, prior research has highlighted both conceptual and empirical limitations in using years of education to measure educational attainment for international comparisons (Schneider, 2009, 2010b).
Schneider (2009), Ortmanns and Schneider (2016), and Ortmanns (2020a) provide an extensive evaluation of the aggregated-level stability of educational measurements by looking at multiple large-scale international surveys. These studies compared the distributions of educational attainment using the ISCED across different surveys and rounds. Inconsistencies in distributions were quantified using the Duncan Dissimilarity Index (Duncan and Duncan, 1955), which measures the proportion of cases requiring reclassification to achieve identical distributions.
Ortmanns and Schneider (2016) analysed data from 2002 to 2008 across 34 countries using the European Social Survey (ESS), the Eurobarometer (EB), and the International Social Survey Programme (ISSP). They found that, on average, 8.6% of respondents in each sample would need to change their educational classification to achieve consistency with the distribution of the previous wave. The degree of inconsistency varied across both countries and surveys, with the ISSP showing the highest average dissimilarity. Some countries reached values as high as 16%, while others were as low as 3% (Ortmanns and Schneider, 2016, pp. 573–574). Ortmanns (2020a) compared data from European countries to the European Union Labour Force Survey (EU-LFS) as an external benchmark. The average dissimilarity was around 13% across several survey programmes, which included the European Union Statistics on Income and Living Conditions (EU-SILC), the European Values Study (EVS), the Programme for the International Assessment of Adult Competencies (PIAAC), the Adult Education Survey (AES), the European Quality of Life Survey (EQLS), the European Working Conditions Survey (EWCS), as well as the EB, ISSP, and ESS (Ortmanns, 2020a, p. 385).
The underlying assumption of these aggregated-level studies of educational measurement is that valid and reliable measurements should produce similar educational distributions across representative surveys of the same country and year, as well as over time, with only gradual changes expected. One possible explanation for observed inconsistencies is low reliability at the individual level. However, other factors can also lead to distributional inconsistencies, including errors in harmonisation or mapping procedures, and issues related to representativeness, such as sampling error and non-response bias (Ortmanns, 2020a). Importantly, aggregate-level analyses cannot assess the reliability of the measurement itself, but only the consistency of the resulting category distributions within the sample across surveys or over time.
Aggregated analyses cannot detect individual-level inconsistencies or the effects of random measurement errors, which are better captured through test-retest reliability studies (Schermelleh-Engel and Werner, 2012). While equal distributions at the aggregated level may suggest stability, they can mask unstable individual-level measurements, allowing random errors to go undetected. For instance, even if every respondent were to change their reported educational level between two measurement points, the overall distribution could remain identical if the number of cases in each category stayed the same. In such a scenario, the measurement would be completely unreliable at the individual level, despite appearing consistent in aggregate. This highlights the importance of conducting test-retest studies to evaluate the reliability of harmonised educational measurements, ensuring their robustness for both individual and aggregated analyses.
Research design and methodology
Data collection and sample composition
The study employed a test-retest design using data from the same respondents (n = 1512) across two consecutive surveys conducted by the European Social Survey European Research Infrastructure Consortium (ESS ERIC) in Estonia, Slovenia, and the United Kingdom. Respondents first reported their educational qualifications during the main European Social Survey Round 8—ESS8—in 2016 (European Social Survey European Research Infrastructure, 2023a). They provided a second response to the same question during Wave 6 of the CRONOS Panel in 2018, a follow-up survey for ESS8 participants in the aforementioned countries (European Social Survey European Research Infrastructure, 2023b).
Temporal instability can affect reliability correlations, as observed changes might reflect true score variability rather than measurement error (Danner, 2015; Schermelleh-Engel and Werner, 2012). To address this potential issue, we excluded 70 respondents (Estonia = 26, United Kingdom = 19, Slovenia = 25) who completed their reported educational qualifications after the ESS8 fieldwork began in September 2016. The CRONOS Panel sample overrepresented younger and more highly educated respondents from ESS8, resulting in a larger proportion of these demographics in the analysis.
After applying exclusion criteria, the final sample sizes were 516 respondents in Estonia, 509 in the United Kingdom, and 487 in Slovenia. In the UK sample, we retained cases only if respondents provided information in all three CSEV questions necessary to map to the EDULVLB variable. A detailed breakdown of the sample size at different stages of the study can be found in the Supplementary Table S1.
To assess the representativeness of the analytical subsample used in the main models relative to the full ESS Round 8 sample, we compared key demographic and structural variables across the three countries. Specifically, we calculated the percentage of respondents in the full sample and in the analytical subsample using the unweighted data for the following categories: gender (gndr, female = 2), age groups (in 10-year bands from 15 to 75+), citizenship status (ctzcntr, citizen = 1), paid work status (pdwrk and crpdwk), and highest level of education (based on ISCED 1-digit codes derived from the face-to-face measurement). Percentages and 95% confidence intervals for each category were computed within each country. We then calculated the difference in percentage points between the subset and the full sample for each category. Over- or underrepresentation was flagged when the proportion observed in the subsample fell outside the 95% confidence interval of the corresponding full sample estimate. The comparison between the analytical subsample and the full ESS Round 8 sample (presented in Table S3 of the Supplementary Materials) shows that the subsample overrepresents respondents with higher educational attainment and those in paid employment. Both the youngest and oldest age groups are underrepresented. Some country-specific deviations were also observed: in Estonia, women and citizens were overrepresented in the analytical sample, a pattern not found in Slovenia or the United Kingdom.
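As an illustration of this check, the following R sketch reproduces its logic under assumed inputs: data frames ess8 and cronos_sub (hypothetical names) holding the full sample and the analytical subsample, with a shared categorical column passed as var. It is a sketch of the approach described above, not the original analysis code.

```r
# Illustrative sketch of the representativeness check described above.
# `ess8`, `cronos_sub`, and the column name passed as `var` are hypothetical.
prop_with_ci <- function(x, level) {
  x <- x[!is.na(x)]
  p <- mean(x == level)
  se <- sqrt(p * (1 - p) / length(x))      # normal-approximation standard error
  c(prop = p, lower = p - 1.96 * se, upper = p + 1.96 * se)
}

check_representation <- function(full, subset, var) {
  cats <- sort(unique(stats::na.omit(full[[var]])))
  do.call(rbind, lapply(cats, function(ct) {
    ci    <- prop_with_ci(full[[var]], ct)
    sub_p <- mean(subset[[var]] == ct, na.rm = TRUE)
    data.frame(category = ct,
               full_pct = 100 * ci[["prop"]],
               sub_pct  = 100 * sub_p,
               diff_pp  = 100 * (sub_p - ci[["prop"]]),
               flagged  = sub_p < ci[["lower"]] | sub_p > ci[["upper"]])
  }))
}

# Example call (hypothetical column name):
# check_representation(ess8, cronos_sub, "isced_1digit")
```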
While the CRONOS panel is a subsample of the ESS Round 8 sample, the overall external representativeness of the ESS samples has been independently evaluated. A comparative assessment using external benchmark data from the EU Labour Force Survey (LFS) found consistent patterns of misrepresentation across countries, with some demographic groups systematically over- or underrepresented. Specifically, the ESS tends to underrepresent younger individuals and non-nationals, while overrepresenting females and married persons. These patterns were also observed in Estonia, Slovenia, and the United Kingdom (Koch and Briceno-Rosas, 2021).
Instrument and measurement
The educational qualification question and its answer options were identical in ESS8 (t1) and CRONOS Wave 6 (t2), with consistent wording across both time points. Implementation differed slightly to accommodate the mode of data collection. In ESS8, a face-to-face survey, respondents received a showcard listing the answer options while an interviewer read the question aloud. In the web-based CRONOS survey, both the question and answer options appeared on the screen, including options for “don’t know” and “refusal.” Screenshots of the ESS8 showcards and the CRONOS web interface are available in the Supplementary Material (Figs. S1 to S9).
Educational qualifications were harmonised using the EDULVLB variable, a detailed three-digit ISCED code, and the aggregated 1-digit ISCED 2011 variable. This harmonisation allows for broader comparisons across different educational systems. The same mapping procedures were applied to educational reports from both ESS8 and CRONOS Wave 6 (see Supplementary Table S2 for details).
Analysis methods
We applied a series of statistical methods to assess the reliability and consistency of educational qualification data reported by respondents at two points in time (t1 and t2). Our objective was to understand not only the extent of consistency but also the underlying reasons for any discrepancies observed.
1. Reliability assessment of educational measurements
To evaluate the consistency of respondents’ educational qualifications between t1 and t2, we calculated reliability coefficients for both the one-digit ISCED and the detailed EDULVLB variables, treating each as a nominal and as an ordinal scale. We used Cramer’s V to assess nominal reliability of the educational measurement, recognising that while the ISCED codes are hierarchically ordered, their nominal interpretation also provides important insights. When treated as a nominal variable, the focus is on the qualitative differences between specific educational levels rather than on the ordinal distance between them. This perspective is particularly relevant when examining the impact of distinct educational categories, for example, the effect of post-secondary non-tertiary education relative to other levels, or the differences between lower and upper secondary education for socio-economic outcomes and labour market implications. In these contexts, the absolute ordering is less informative than the substantive, qualitative differences between categories. In contrast, ordinal reliability measures prioritise the overall hierarchical structure and can downplay discrepancies between closely related levels.
We used Kendall’s Tau to assess the reliability of the educational measurement as an ordinal scale, as both ISCED and EDULVLB exhibit an ordinal structure with unequal distances between categories. Unlike Spearman’s Rho, which assumes a monotonic relationship and is more sensitive to large rank differences, Kendall’s Tau is particularly suited for non-numeric ordinal scales where the relative distances between levels are inconsistent. For example, the difference between lower and upper secondary education is not necessarily equivalent to that between a Bachelor’s and a Master’s degree. This issue is even more pronounced for the EDULVLB variable, where changes in the first digit (level of education) are substantively different from changes in the second digit (orientation) or the third digit (duration or destination) of the educational classification.
An advantage of Kendall’s Tau is its robustness to tied ranks, which are frequent in EDULVLB categories. Unlike Spearman’s Rho, which adjusts for ties but relies on rank differences, Kendall’s Tau directly accounts for ties, making it more appropriate for survey data. It is also less sensitive to extreme shifts, providing a more stable reliability estimate over time. Kendall’s Tau has been used in survey test-retest studies of educational measurement, including Porst and Zeifang (1987). Moreover, although test-retest measures can be affected by memory bias, where respondents recall and repeat their previous answers, a comprehensive review by Tourangeau (2021) found that over-time correlation estimates yield results similar to alternative reliability estimation methods, supporting their validity in survey research.
By assessing both nominal and ordinal reliability, our study captures the nuanced qualitative differences as well as the broader hierarchical order of educational attainment, ensuring a thorough evaluation of consistency across different scales of measurement.
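The R sketch below illustrates how such coefficients and their percentile bootstrap intervals (2000 resamples, as reported in the tables in the Results section) can be obtained. The paired vectors edu_t1 and edu_t2 of numeric educational codes are assumed inputs, not the original analysis code.

```r
# Illustrative sketch: test-retest reliability for paired educational codes.
# `edu_t1` and `edu_t2` are assumed numeric vectors (e.g., ISCED 0-8) of equal
# length, one pair of reports per respondent.

cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  k    <- min(nrow(tab), ncol(tab))
  sqrt(as.numeric(chi2) / (sum(tab) * (k - 1)))
}

kendalls_tau <- function(x, y) {
  # cor() with method = "kendall" returns tau-b, which adjusts for tied ranks
  cor(x, y, method = "kendall", use = "complete.obs")
}

# Percentile bootstrap confidence interval for any paired statistic
boot_ci <- function(x, y, stat, R = 2000, probs = c(0.025, 0.975)) {
  reps <- replicate(R, {
    i <- sample(seq_along(x), replace = TRUE)
    stat(x[i], y[i])
  })
  quantile(reps, probs, na.rm = TRUE)
}

# Example calls (hypothetical data):
# cramers_v(edu_t1, edu_t2);    boot_ci(edu_t1, edu_t2, cramers_v)
# kendalls_tau(edu_t1, edu_t2); boot_ci(edu_t1, edu_t2, kendalls_tau)
```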
Finally, we compared the reliability coefficients with Duncan’s Dissimilarity Index (Duncan and Duncan, 1955) as an indicator of aggregate-level consistency between distributions. This approach has previously served as an alternative method for evaluating the reliability of measurement (Ortmanns, 2020a; Ortmanns and Schneider, 2016).
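The Duncan index is half the sum of absolute differences between the two relative distributions; a short sketch in R, using the same assumed vectors as above:

```r
# Duncan Dissimilarity Index between the t1 and t2 category distributions
duncan_index <- function(x, y) {
  cats <- sort(unique(c(x, y)))
  p <- prop.table(table(factor(x, levels = cats)))
  q <- prop.table(table(factor(y, levels = cats)))
  sum(abs(p - q)) / 2
}
# Example: duncan_index(edu_t1, edu_t2)
```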
2. Flow of inconsistencies
To pinpoint where inconsistencies occurred between t1 and t2, we analysed the patterns of respondents’ answers. Mapping the percentage of consistent and inconsistent responses enabled us to identify specific ISCED levels where discrepancies were most prevalent. We utilised cross-tabulation and flow diagrams to visualise the transitions between educational levels reported at the two time points.
This method allowed us to observe the movement of respondents between categories, highlighting any systematic shifts or patterns in reporting. By identifying the ISCED levels with the highest rates of inconsistency, we could investigate potential factors contributing to these discrepancies, such as ambiguities in question wording or differences in respondents’ interpretations of educational categories.
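A minimal sketch of these descriptive summaries is shown below, again assuming numeric ISCED 1-digit codes in edu_t1 and edu_t2 (illustrative names only).

```r
# Illustrative sketch: transition table and simple summaries of the
# inconsistencies between t1 and t2 (numeric ISCED 1-digit codes assumed).
flow      <- table(t1 = edu_t1, t2 = edu_t2)    # cross-tabulation of transitions
prop_flow <- prop.table(flow) * 100             # cell percentages

changed  <- edu_t1 != edu_t2
distance <- abs(edu_t2 - edu_t1)[changed]       # distance in levels, per change
adjacent_share <- mean(distance == 1)           # share of adjacent-level moves
upward_share   <- mean((edu_t2 - edu_t1)[changed] > 0)  # share of upward moves
```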
Understanding the flow of inconsistencies was crucial for diagnosing specific issues within the data collection process. It helped us to target areas where the survey instrument or methodology could be refined to improve data accuracy in future studies.
3. Exploration of potential causes for inconsistencies
Recognising that inconsistencies might stem from factors beyond respondent error, we explored potential explanations related to survey administration and data collection methods.
Interviewer effects
We assessed whether the interviewers conducting the face-to-face ESS8 survey influenced the consistency of responses. Interviewer effects can occur when interviewers’ behaviours, attitudes, or interpretations subtly affect how respondents understand and answer questions.
To estimate interviewer effects on the consistency of responses, we intended to use a logistic regression model with response consistency as the dependent variable and interviewers specified as random intercepts. However, for several combinations of country and educational variable, the models failed to converge due to boundary (singular) fits, indicating that the variance attributable to interviewers was effectively zero. To address this, we re-estimated the models using linear probability models (LPMs) with the same structure. While LPMs are less ideal for binary outcomes, they are a suitable alternative for estimating intra-class correlation coefficients (ICCs) when variance components are near-zero. We clustered responses within interviewers to account for the hierarchical data structure, recognising that respondents are nested within interviewer assignments.
The intra-class correlation coefficient from an empty model provided an estimate of the variance in consistency attributable to differences between interviewers. While this approach helps to examine interviewer-related variance, it does not fully isolate interviewer effects from other sources of clustering, such as geographic or regional factors. In face-to-face surveys like the ESS, interviewers are not randomly assigned to respondents but are typically deployed within specific areas. This introduces potential confounding due to neighbourhood or regional effects, which may overlap with interviewer assignments. As such, ICC estimates in this context must be interpreted with caution and cannot be assumed to reflect interviewer effects alone.
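A sketch of these empty (intercept-only) models using the lme4 package is shown below; the data frame dat and its columns consistent (0/1) and interviewer_id are illustrative names, not the original analysis code.

```r
# Illustrative sketch: interviewer-level ICC from empty (intercept-only) models.
library(lme4)

# Intended specification: multilevel logistic regression with random
# intercepts for interviewers (failed to converge in several cases)
m_logit <- glmer(consistent ~ 1 + (1 | interviewer_id),
                 data = dat, family = binomial)
v_int     <- as.data.frame(VarCorr(m_logit))$vcov[1]
icc_logit <- v_int / (v_int + pi^2 / 3)  # latent-scale ICC for logistic models

# Fallback: linear probability model with the same random-intercept structure
m_lpm   <- lmer(consistent ~ 1 + (1 | interviewer_id), data = dat)
vc      <- as.data.frame(VarCorr(m_lpm))
icc_lpm <- vc$vcov[vc$grp == "interviewer_id"] / sum(vc$vcov)
```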
Device effects
We examined whether the type of device used in the self-completion CRONOS survey affected response consistency. Respondents were allowed to use a range of devices (desktop computers, tablets, and smartphones), introducing potential differences in screen size, interface design, and input methods that could influence how questions are perceived and answered. Research suggests that smaller screens and touchscreen interfaces can increase response errors due to scrolling requirements and accidental selections (Tourangeau et al., 2018).
While we do not have access to paradata on respondents’ locations, mobile devices are more likely to be used in settings that differ from face-to-face surveys, such as public places or while multitasking. This could introduce additional variability in response consistency, as distractions and environmental conditions may impact respondents’ ability to focus (Tourangeau et al., 2018). However, empirical studies have found limited evidence that environmental factors significantly affect data quality (Mavletova and Couper, 2013; Toninelli and Revilla, 2016). Given these considerations, our analysis focused on whether device type itself contributes to inconsistencies in educational attainment reporting, acknowledging that setting differences may also play a role.
We assessed the association between device type (computer/tablet versus smartphone) and the consistency of educational measurement using logistic regression models. Consistency was coded as a binary outcome, indicating whether respondents provided the same educational response across both time points. Device type, recorded during the CRONOS panel wave, was included as a categorical independent variable. Separate models were estimated for each country and for both educational variables (ISCED 1-digit and EDULVLB). Odds ratios and 95% confidence intervals were used to evaluate the strength and direction of the associations. This approach allows for direct estimation of the likelihood of consistent responses based on device type.
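A sketch of one such model in R (one country, one educational variable) is shown below; the data frame and variable names are illustrative, not the original analysis code.

```r
# Illustrative sketch: device effect on response consistency.
# `dat` with a 0/1 outcome `consistent` and a factor `device`
# ("computer_tablet" as reference level vs "smartphone") is assumed.
m <- glm(consistent ~ device, data = dat, family = binomial)

# Odds ratios with Wald 95% confidence intervals
or_table <- exp(cbind(OR = coef(m), confint.default(m)))
or_table
```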
By analysing device effects, we aimed to determine if certain devices were associated with higher rates of inconsistency. A significant association would indicate that device-related factors, such as ease of navigation or readability, might influence respondents’ ability to provide consistent answers. Understanding these effects is essential for designing surveys that yield reliable data across various devices, ensuring that technological factors do not compromise data quality. All analyses were conducted in R, using RStudio (version 2024.04.2+764).
Results
Reliability of educational measurement
Table 1 presents the reliability coefficients for the 1-digit ISCED. In Estonia, 23.6% of respondents showed inconsistencies in the 1-digit ISCED variable, with a Cramer’s V of 0.67 and a Kendall’s Tau of 0.84. Slovenia had 13.1% inconsistencies, with a Cramer’s V of 0.75 and a Kendall’s Tau of 0.88. The United Kingdom exhibited 35.4% inconsistencies, with a Cramer’s V of 0.61 and a Kendall’s Tau of 0.75. The overlap in confidence intervals between Slovenia and Estonia indicates that differences in reliability estimates between these two countries are likely not substantial. In contrast, the United Kingdom shows distinctly lower reliability, with confidence intervals clearly separated from Slovenia and only slightly overlapping with Estonia for Cramer’s V on the nominal scale.
Table 1. Reliability coefficients of measurement of education based on ISCED 1-digit variable, ESS round 8 (t1) and CRONOS wave 6 (t2).
| Country | Nᵃ | Inconsistencies (%) | Cramer’s V (nominal scale) | Kendall’s Tau (ordinal scale) |
|---|---|---|---|---|
| Estonia | 516 | 23.6 | 0.67 (0.62; 0.77) | 0.84 (0.80; 0.87) |
| Slovenia | 487 | 13.1 | 0.75 (0.70; 0.83) | 0.88 (0.85; 0.91) |
| United Kingdom | 509 | 35.4 | 0.61 (0.56; 0.66) | 0.75 (0.71; 0.79) |
ᵃN refers to the number of respondents with valid data recorded at both t1 and t2, who completed their reported education before t1. The 95% confidence intervals are presented in parentheses. Cramer’s V and Kendall’s Tau confidence intervals were estimated using percentile bootstrapping with 2000 resamples. All coefficients are significant at p < 0.001.
For the EDULVLB variable in Table 2, Estonia showed 33.7% inconsistencies, with a Cramer’s V of 0.62 and a Kendall’s Tau of 0.81. Slovenia had 27.3% inconsistencies, with a Cramer’s V of 0.70 and a Kendall’s Tau of 0.80. The United Kingdom exhibited 43.0% inconsistencies, with a Cramer’s V of 0.48 and a Kendall’s Tau of 0.73. Similar to the ISCED 1-digit findings, the confidence intervals between Slovenia and Estonia overlap considerably, suggesting that differences in their reliability estimates may not be substantial. The United Kingdom again displays distinctly lower reliability values, with its confidence intervals clearly separated from Estonia and only slightly overlapping with Slovenia for Kendall’s Tau on the ordinal scale.
Table 2. Reliability coefficients of measurement of education based on EDULVLB variable, ESS round 8 (t1) and CRONOS wave 6 (t2).
| Country | Nᵃ | Inconsistencies (%) | Cramer’s V (nominal scale) | Kendall’s Tau (ordinal scale) |
|---|---|---|---|---|
| Estonia | 516 | 33.7 | 0.62 (0.58; 0.70) | 0.81 (0.78; 0.85) |
| Slovenia | 487 | 27.3 | 0.70 (0.66; 0.77) | 0.80 (0.76; 0.84) |
| United Kingdom | 509 | 43.0 | 0.48 (0.47; 0.55) | 0.73 (0.68; 0.77) |
ᵃN refers to the number of respondents with valid data recorded at both t1 and t2, who completed their reported education before t1. The 95% confidence intervals are presented in parentheses. Cramer’s V and Kendall’s Tau confidence intervals were estimated using percentile bootstrapping with 2000 resamples. All coefficients are significant at p < 0.001.
While the reliability coefficients (Cramer’s V and Kendall’s Tau) were relatively high, they did not reach the expected threshold of 0.9. This suggests the presence of undesired errors in the measurement of educational qualifications. The inconsistencies were most pronounced in the United Kingdom, followed by Estonia and Slovenia. The higher inconsistency rates in the United Kingdom may reflect greater variability in how educational qualifications were reported at both data collection points.
Comparison with Duncan’s dissimilarity index
The comparison between the reliability coefficients and Duncan’s index for the t1 and t2 distributions reveals notable discrepancies between individual-level reliability and aggregated distributional consistency. For ISCED 1-digit, the Duncan index was 7.4% in Estonia, 1.4% in Slovenia, and 8.4% in the United Kingdom. Using EDULVLB, the Duncan index increased slightly to 8.4% for Estonia, 3.5% for Slovenia, and 9.4% for the United Kingdom.
These results indicate that inconsistencies in the aggregated distribution are substantially lower than those observed at the individual level, suggesting that individual-level discrepancies exert only a minimal influence on the overall distribution of educational attainment in the sample. A notable pattern is that ISCED 1-digit appears to perform better than the more detailed EDULVLB across the countries analysed, both at the individual and aggregated levels. However, while inconsistencies at the aggregated level were most pronounced in the United Kingdom, followed by Estonia, the relative magnitude of these inconsistencies differed from those observed at the individual level, underscoring that these methods of reliability assessment are not interchangeable.
Flow of inconsistencies
To further understand the previously observed inconsistencies, we examine their characteristics between t1 and t2 and the relevance of specific combinations of inconsistencies. The flow of inconsistencies for the 1-digit ISCED variable, visualised in Fig. 2, maps the percentage of inconsistent responses between t1 and t2 across the three countries. Figure 3 provides the flow of inconsistencies for the EDULVLB variable. The specific percentages are detailed in Supplementary Tables S4 to S9.
Fig. 2 Flow of inconsistencies between t1 and t2 in percentages for measurement of education harmonised to ISCED 2011 1-digit (levels of education).
Description of ISCED 2011 1-digit codes: 0 = Less than primary education; 1 = Primary education; 2 = Lower secondary education; 3 = Upper secondary education; 4 = Post-secondary non-tertiary education; 5 = Short-cycle tertiary education; 6 = Bachelor’s or equivalent level; 7 = Master’s or equivalent level; 8 = Doctoral or equivalent level.
Fig. 3 Flow of inconsistencies between t1 and t2 in percentages for measurement of education harmonised to EDULVLB (detailed measurement).
Description of EDULVLB codes with reference to ISCED-97: 000 = Not completed ISCED level 1; 113 = ISCED 1, completed primary education; 129 = Vocational ISCED 2C < 2 years, no access ISCED 3; 212 = General/pre-vocational ISCED 2A/2B, access ISCED 3 vocational; 213 = General ISCED 2A, access ISCED 3A general/all 3; 221 = Vocational ISCED 2C ≥ 2 years, no access ISCED 3; 222 = Vocational ISCED 2A/2B, access ISCED 3 vocational; 223 = Vocational ISCED 2, access ISCED 3 general/all; 229 = Vocational ISCED 3C < 2 years, no access ISCED 5; 311 = General ISCED 3 ≥ 2 years, no access ISCED 5; 312 = General ISCED 3A/3B, access ISCED 5B/lower tier 5A; 313 = General ISCED 3A, access upper tier ISCED 5A/all 5; 321 = Vocational ISCED 3C ≥ 2 years, no access ISCED 5; 323 = Vocational ISCED 3A, access upper tier ISCED 5A/all 5; 412 = General ISCED 4A/4B, access ISCED 5B/lower tier 5A; 413 = General ISCED 4A, access upper tier ISCED 5A/all 5; 421 = ISCED 4 programmes without access ISCED 5; 422 = Vocational ISCED 4A/4B, access ISCED 5B/lower tier 5A; 423 = Vocational ISCED 4A, access upper tier ISCED 5A/all 5; 510 = ISCED 5A short, intermediate/academic/general tertiary below; 520 = ISCED 5B short, advanced vocational qualifications; 610 = ISCED 5A medium, bachelor/equivalent from lower tier tertiary; 620 = ISCED 5A medium, bachelor/equivalent from upper/single tier; 710 = ISCED 5A long, master/equivalent from lower tier tertiary; 720 = ISCED 5A long, master/equivalent from upper/single tier tertiary; 800 = ISCED 6, doctoral degree.
First, we examined the distance of inconsistencies in the classification of education. Most inconsistencies occurred between adjacent or subsequent education classifications. In Slovenia, the majority of inconsistencies within ISCED 1-digit classifications were across adjacent levels, followed by Estonia. In contrast, the United Kingdom displayed the least tendency for adjacent-level inconsistencies, with a larger proportion involving differences of two or more ISCED levels. A similar pattern was observed in the more detailed EDULVLB measurement, where inconsistencies were predominantly adjacent across countries. However, EDULVLB also showed a higher proportion of discrepancies spanning two or more classifications, such as the inconsistencies in Slovenia between “313” (General ISCED 3A with access to upper-tier ISCED 5A/all 5) and “323” (Vocational ISCED 3A with access to upper-tier ISCED 5A/all 5). The predominance of short-distance inconsistencies helps explain why the measurements performed better as ordinal scales rather than nominal scales, as the closeness of classifications aligns more effectively with the ordinal nature of the data.
The direction of the inconsistencies is also relevant, as it could indicate biases in measurement. Upward inconsistencies from t1 to t2 would indicate that the face-to-face survey (t1) tends to record respondents’ education lower than the web survey, while downward inconsistencies would indicate the opposite. For ISCED 1-digit, upward inconsistencies occurred in 50.0% of cases in Estonia, 43.8% in Slovenia, and 69.4% in the United Kingdom, with the difference being statistically significant for the United Kingdom at p < 0.001. A similar trend was observed for the EDULVLB variable, where the measurement at t2 was higher than at t1 in 49.2% of cases in Estonia, 46.6% in Slovenia, and 68.0% in the United Kingdom. Again, the difference was statistically significant only in the United Kingdom (p < 0.001).
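The text does not specify which test was used for this comparison; purely as an illustration, a two-sided exact binomial test against an even split of upward and downward moves, with approximate counts reconstructed from the reported UK ISCED 1-digit figures, yields a result consistent with p < 0.001.

```r
# Illustrative only: exact binomial test of the upward/downward asymmetry,
# reconstructing approximate counts from the reported UK percentages
# (35.4% inconsistent cases out of n = 509, 69.4% of them upward).
n_inconsistent <- round(0.354 * 509)          # ~180 inconsistent cases
n_upward       <- round(0.694 * n_inconsistent)
binom.test(n_upward, n_inconsistent, p = 0.5)
```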
Lastly, we examine the largest pairs of inconsistencies. In Estonia, the most frequent inconsistencies for the 1-digit ISCED variable are observed between ISCED levels 3 (Upper secondary) and 4 (Post-secondary non-tertiary), followed by inconsistencies between levels 6 (Bachelor) and 7 (Master). For the EDULVLB variable, inconsistencies often involve Master’s qualifications, specifically between lower-tier tertiary education (710) and upper/single-tier tertiary education (720).
In Slovenia, inconsistencies for the 1-digit ISCED variable are more evenly distributed across adjacent pairs between levels 2 and 7. For the more detailed EDULVLB variable, large pairs of inconsistencies are concentrated within upper secondary education. These primarily occur between general (313) and vocational (323) upper secondary qualifications with access to upper-tier short-cycle tertiary education, as well as between vocational qualifications without access (321) and with access (323) to short-cycle tertiary education. This pattern suggests respondents struggled to accurately differentiate between the orientation and the destination/duration of their qualifications when reporting educational attainment.
In the United Kingdom, the largest pair of inconsistencies for the 1-digit ISCED variable occurred between ISCED levels 3 (Upper secondary education) and 5 (Short-cycle tertiary education), followed by inconsistencies between lower secondary and upper secondary education. For the more detailed EDULVLB variable, inconsistencies are more evenly distributed and do not highlight any particularly large pairs of discrepancies.
Exploring mode effects causing inconsistencies
Given the unexpectedly low levels of reliability in the reported educational qualifications, it is crucial to explore potential explanations for these inconsistencies. Understanding the sources of these inconsistencies can help improve the accuracy of future surveys. Two possible explanations were examined: interviewer effects and device effects.
Interviewer effects
Interviewer effects were assessed to determine whether the individual conducting the face-to-face interviews influenced the consistency of respondents’ answers. As described in the Analysis methods, this was done using multilevel linear probability models in which the dependent variable was the consistency of the measurement between t1 and t2 and responses were clustered within the interviewer who conducted the interview at t1. The distribution of consistent and inconsistent responses per interviewer is presented in the Supplementary Materials (Figs. S10 to S12). An intra-class correlation coefficient (ICC) was calculated from an empty model to estimate the percentage of total variance attributable to the interviewers.
Table 3 summarises the ICC results, which indicate that interviewer effects were negligible across all three countries. The highest observed ICC was 0.010 (for EDULVLB in Slovenia), while the remaining values were at or near zero. These results suggest that the variability in the consistency of responses that can be attributed to differences between interviewers was very low. It should also be noted that in the UK, the reduced sample size resulted from a larger number of interviewers conducting only a few interviews, leading to the exclusion of many with fewer than five completed cases. While this limits the ability to estimate interviewer effects for all interviewers, it also reduces the influence of any single interviewer on the overall results, as each contributes relatively little to the total inconsistencies. Together, these findings support the conclusion that the inconsistencies in the reported educational qualifications are likely due to other factors, such as the mode of survey administration or respondent-related factors, rather than the influence of individual interviewers. For combinations where both logistic and linear models could be estimated, the resulting ICCs were nearly identical, reinforcing the robustness of the findings across model specifications.
Table 3. Interviewer effects on consistency of educational measurement t1-t2 based on intra-class correlation coefficient from multilevel linear regressions, empty model with interviewers at level 2.
| Country | ICC for consistency of EDULVLB | ICC for consistency of ISCED 1-digit |
|---|---|---|
| Estonia | 0.000 | 0.005 |
| Slovenia | 0.010 | 0.000 |
| United Kingdom | 0.000 | 0.000 |
Consistency of answers equals 1 when the same value is recorded at both t1 and t2. The models require at least 5 respondents per interviewer and a minimum of 30 interviewers per country. The sample sizes were: EE (n = 481; m = 45), GB (n = 219; m = 32), SI (n = 470; m = 48); n refers to the number of respondents (level-1 units); m refers to the number of interviewers (level-2 units).
Device effects
The effect of the device used in the self-completion mode (computer/tablets versus smartphone) was examined using logistic regression models, with consistency in educational measurement as the dependent variable. Results are presented in Fig. 4, which shows the odds ratios and 95% confidence intervals for each country and variable.
Fig. 4 Effect of device type on response consistency of educational measurement t1-t2 across countries.
Use of smartphones equals 1; consistency of answers between t1 and t2 equals 1.
In the United Kingdom, smartphone use was significantly associated with lower consistency in reporting educational attainment. Respondents using smartphones had 51% lower odds of consistent responses for EDULVLB (OR = 0.49, 95% CI: 0.34; 0.71, p < 0.001) and 34% lower odds for ISCED 1-digit (OR = 0.66, 95% CI: 0.45; 0.96, p < 0.05) compared to those using computers or tablets. These were the only statistically significant device effects observed across the three countries. In Estonia and Slovenia, the odds ratios also pointed in the same direction, indicating lower consistency among smartphone users, but these associations were smaller in magnitude and not statistically significant.
These findings suggest that the type of device used in self-completion surveys can influence respondents’ ability to report their educational qualifications consistently. Device-related factors such as screen size, navigation interface, and input method may contribute to inconsistencies, particularly in longer or more cognitively demanding questions. This may be especially relevant in the United Kingdom, where the educational measurement includes 18 response categories split across two questions, the most complex format among the countries studied. While statistically significant effects were only observed in one country, the results highlight the importance of accounting for device-related variation in data quality assessments within mixed-mode survey designs.
Summary of results
The analysis highlights notable inconsistencies in the harmonised measurement of education across the three countries. Reliability coefficients for the ISCED 1-digit and EDULVLB variables fell below the expected threshold of 0.9, indicating the presence of measurement errors. Inconsistencies were most pronounced in the United Kingdom, followed by Estonia and Slovenia. These inconsistencies were reflected both at the individual level, where higher rates of discrepancies were observed, and at the aggregated level, where the Duncan Dissimilarity Index indicated lower but notable variability. The results suggest that ISCED 1-digit performed better than EDULVLB in terms of consistency, likely due to its simpler classification structure.
Further examination of the flow of inconsistencies revealed that most discrepancies occurred between adjacent or subsequent education classifications, particularly in Slovenia and Estonia, whereas the United Kingdom exhibited a higher proportion of inconsistencies spanning two or more levels. This pattern suggests challenges in reporting finer distinctions, especially in systems with more detailed classifications like EDULVLB. Additionally, upward inconsistencies (where education levels reported at t2 were higher than at t1) were prevalent in all countries, particularly in the United Kingdom, suggesting potential biases in the mode of data collection.
Mode effects further illuminated the sources of inconsistencies. Minimal interviewer effects were observed, with intra-class correlation coefficients (ICCs) indicating negligible variance attributable to interviewers in Estonia and Slovenia. However, device effects emerged as significant in the United Kingdom, where respondents using smartphones exhibited lower consistency compared to those using computers or tablets. This finding underscores the influence of survey administration modes on data quality and highlights the need for careful consideration of device compatibility in mixed-mode surveys.
Discussion
The findings indicate lower-than-expected reliability for the educational attainment variable, emphasising the need for caution when using this measure. Inconsistencies observed at the sample level may mask even lower reliability at the individual level, as demonstrated by the comparison between the reliability coefficients and Duncan’s index. Previous studies that reported aggregate-level inconsistencies could be affected by much higher individual-level unreliability. Most importantly, aggregated inconsistencies should not be assumed to indicate individual reliability, as the two measures are not interchangeable.
The results also suggest that the measurement performs better as an ordinal scale rather than as a nominal scale. Broader categories, such as low, middle, and high education, could further reduce inconsistencies. However, this comes with trade-offs: broader categories make educational attainment a less precise indicator, and ordinal scales are unsuitable for certain research questions where nominal distinctions are essential.
The findings also have broader relevance beyond the ESS. Aggregated-level inconsistencies in educational measurement have been documented in other comparative survey programmes, including the ISSP, EB, and EVS. While these programmes also rely on harmonised classification schemes such as ISCED or their own variants, the extent of inconsistencies at the individual level remains largely unexplored due to the absence of test-retest data. Given the diversity of educational systems and the variability in how education is operationalised across surveys, it is plausible that these survey programmes face similar or even more pronounced reliability challenges.
Possible mode-specific explanations for lower reliability were explored, though the study design does not allow mode effects to be definitively ruled out (see Limitations). Although no interviewer effects were found, a small but significant correlation between mobile device use and inconsistencies was observed in the country with the lowest reliability. The findings suggest that response burden, driven by the length and complexity of the educational attainment classification, is likely a key source of inconsistency. The demanding nature of the detailed ISCED-based classification may increase cognitive load, leading respondents to provide inconsistent answers across survey waves. This burden is particularly relevant in self-completion modes, where respondents must navigate the question without interviewer support, potentially contributing to misclassification. Device-related factors, such as screen size, scrolling requirements, and response option formatting, may further amplify measurement error. The combination of high cognitive demand and mode effects underscores the importance of designing educational measurement instruments that reduce response burden while ensuring comparability across survey modes.
The particularly low reliability in the UK is unsurprising given its complex educational system, which includes distinct educational pathways across England, Wales, Scotland, and Northern Ireland and a strongly developed vocational training sector. This complexity is reflected in the length of the UK's educational attainment question, which includes 18 response options and is the only measurement in the study that splits educational attainment into two separate questions. Another contributing factor may be that, compared with countries with similarly developed vocational training (e.g., Germany), specific educational qualifications play a lesser role in the UK labour market, making the granular distinctions captured by 3-digit ISCED classifications less salient.
Limitations
This study is based on data from only three countries, which limits the generalisability of the findings. While most reliability studies are conducted within single-country settings, the inclusion of only a few countries in cross-national research raises the possibility that the observed reliability issues are influenced by country-specific factors. Differences in educational systems, such as the complexity of educational pathways, the prevalence of multiple educational tracks, and how national qualifications align with ISCED classifications, could influence measurement reliability. The extent to which these findings apply to other ESS countries remains uncertain.
While the ESS harmonises educational attainment using ISCED and incorporates expert consultation in each survey round, differences in questionnaire design could still impact reliability. The results of this study are therefore specific to the ESS measurement approach and may not generalise to other surveys that use different educational classification frameworks.
A further limitation concerns the estimation of interviewer effects. In the ESS, interviewers are not randomly assigned to sample units but typically operate within geographically defined areas. This lack of a fully interpenetrated design limits the ability to disentangle interviewer effects from other sources of clustering, such as neighbourhood or regional characteristics. Although our ICC estimates were very low, suggesting negligible interviewer effects, the potential confounding between interviewer and geographic clustering remains a methodological constraint and should be considered when interpreting these results.
A key limitation is the inability to separate mode effects from questionnaire design effects. While our analysis of interviewer effects and device effects provides insight into potential mode influences, this evidence remains exploratory and inconclusive. To address this limitation, future research should implement experimental designs that measure educational attainment multiple times within face-to-face and self-completion settings to better isolate mode effects from instrument design effects.
Conclusions
This study examined the reliability of harmonised educational attainment measurements in comparative survey research, using test-retest data from three countries in the European Social Survey. The findings reveal notable inconsistencies in individual-level responses, even when broader classification schemes such as one-digit ISCED are applied. These results challenge the assumption that educational attainment, often considered a highly reliable factual variable, is measured consistently across modes, countries, or time.
Our findings show that measurement reliability is higher when education is treated as an ordinal rather than a nominal scale, and when broader categories are used. However, these strategies come at the cost of analytical precision and may not be appropriate for all research questions. Notably, inconsistencies were more pronounced in the United Kingdom, likely due to the complexity of its educational system and the demanding nature of its measurement instrument. We also found evidence of device-related effects, particularly in the United Kingdom, where respondents using smartphones were less likely to provide consistent answers than those using computers or tablets. This suggests that features of self-completion mode can contribute to inconsistency, especially when measurement instruments are long or complex. These findings point to the need to reduce cognitive burden and adapt instrument design for self-completion formats, particularly in mixed-mode surveys.
The study also contributes to a broader understanding of survey quality by highlighting that aggregate-level stability may obscure substantial individual-level inconsistency. This has implications not only for the ESS but also for other large-scale comparative surveys that aim to provide harmonised measures of educational attainment, including the ISSP, EB, PIAAC, and WVS. Given the absence of test-retest data in many of these programmes, our findings underscore the importance of including such assessments as part of ongoing quality assurance. The same logic applies to other core sociodemographic variables, which are similarly assumed to be reliable but may also be susceptible to hidden inconsistencies across survey contexts.
To support valid cross-national and cross-mode comparisons, survey programmes should consider redesigning educational attainment measures specifically for self-completion settings. The ESS has already begun exploring such redesigns in anticipation of the wider adoption of self-completion modes. In addition, survey designers may benefit from integrating inconsistency checks directly into web and computer-assisted surveys, especially when supported by repeated measurement designs. Furthermore, the use of artificial intelligence offers new possibilities for improving the classification of educational qualifications, for instance by integrating search-based tools that assist respondents in identifying their credentials using natural language processing within computer-assisted questionnaires (Briceno-Rosas et al., 2018; Schneider et al., 2018).
High-quality educational measurement is also critical for evaluating the overall quality of survey data. Educational attainment is often used as a benchmark variable in methodological assessments, such as the imputation of missing values, the evaluation of nonresponse bias, the comparison of sample compositions across survey modes, and the validation of survey estimates against external data sources. If the reliability of this key variable is compromised, the accuracy of such assessments may also be affected. This further highlights the importance of continuous monitoring and refinement of educational measures to ensure they provide a stable foundation for both substantive and methodological research.
Future research should build on the present findings to expand the empirical foundation of reliability assessments. Including additional countries and survey programmes will help assess the broader magnitude of the issue and improve our understanding of contextual effects. Experimental study designs are particularly needed to disentangle questionnaire effects from mode effects and to test the performance of identical versus mode-adapted measurement instruments. Such designs could also inform the development of more robust tools for educational measurement that perform consistently across survey conditions. Beyond the educational variable, future efforts should extend reliability assessments to other key sociodemographic measures and prioritise the design of instruments that are resilient to mode-specific challenges while preserving clarity and international comparability.
Educational attainment measures need to remain stable over time, comparable across national systems, and flexible enough to support diverse scales and analytical purposes. Meeting these expectations requires a high degree of reliability. Yet, as educational systems, survey instruments, and data collection modes continue to evolve, the reliability of even well-established variables cannot be taken for granted. Investigating and monitoring the reliability of education measurement is essential for ensuring the quality and comparability of survey data in the years ahead. By advancing the understanding of reliability in harmonised educational measurements, this paper provides a foundation for improving data quality in survey research, ultimately supporting the validity of social science analyses.
Acknowledgements
I extend my gratitude to GESIS Leibniz Institute for the Social Sciences and the University of Mannheim for supporting this research. I am also grateful to Achim Koch, Ana Villar, Rory Fitzgerald, and the national teams of the European Social Survey from Estonia, the United Kingdom, and Slovenia for their valuable contributions to the data collection processes. Special thanks go to my colleagues at GESIS for their constructive discussions and feedback, which have enriched this work. I also acknowledge the insightful guidance and support of Natalja Menold from the University of Dresden and Silke Schneider from GESIS, whose expertise provided critical perspectives that helped shape the research. The publication of this article was funded by GESIS Leibniz Institute for the Social Sciences. Open Access funding enabled and organized by Projekt DEAL.
Author contributions
Roberto Briceno-Rosas conceptualised and designed the study, conducted the analysis, and wrote the manuscript. Roberto Briceno-Rosas also prepared all figures and tables and reviewed the manuscript thoroughly.
Data availability
The datasets used in this study are freely accessible through the European Social Survey's public data portal (https://www.europeansocialsurvey.org/): ESS Round 8 (edition 2.3), https://doi.org/10.21338/ess8e02_3; CRONOS Wave 6 (edition 1.1), https://doi.org/10.21338/cronos1_wave6_e01_1. To safeguard respondent anonymity, detailed information regarding the date of completion of educational qualifications is available only upon special request for scientific purposes. Such requests must comply with the European Social Survey's data protection policies and adhere to necessary safeguards and restrictions.
Competing interests
The author declares no competing interests.
Ethical approval
This study is based on secondary analysis of survey data collected by the European Social Survey - European Research Infrastructure Consortium (ESS ERIC). The ESS ERIC has a dedicated Research Ethics Board that oversees all data collection activities and ensures compliance with the Declaration on Professional Ethics of the International Statistical Institute. Ethical approval for the survey protocols was obtained prior to data collection. The ESS ERIC ensures compliance with applicable data protection regulations, including the EU General Data Protection Regulation (GDPR) and national laws in participating countries. No additional ethical approval was required for this study, as the analysis relies exclusively on anonymised, publicly available data.
Informed consent
Informed consent was obtained from all survey respondents prior to their participation. Respondents were informed both orally and in writing about the voluntary nature of the survey, the purpose of the research, the anonymity and confidentiality of their responses, and their right to withdraw at any time. For ESS Round 8, consent was documented by interviewers within the questionnaire during the fieldwork period (01 September 2016 - 20 March 2017). For CRONOS panel participants, informed consent to participate in the panel was obtained at two points: first, at the end of the ESS Round 8 interview (2016–2017), when respondents were invited to join the panel; and second, prior to the start of the first CRONOS survey, when respondents explicitly agreed to the collection and use of their data. No additional consent procedures were required for this secondary analysis of anonymised survey data.
Supplementary information
The online version contains supplementary material available at https://doi.org/10.1057/s41599-025-06051-9.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Alwin DF (2007) Margins of error: a study of reliability in survey measurement. John Wiley and Sons
Alwin DF, Baumgartner EM, Beattie BA (2018) Number of response categories and reliability in attitude measurement. J Surv Stat Methodol 6
Brauns H, Scherer S, Steinmann S (2003) The CASMIN educational classification in international comparative research. In: Advances in cross-national comparison. Springer, Boston, MA. pp 221–244 https://doi.org/10.1007/978-1-4419-9186-7_11
Briceno-Rosas R, Liebau E, Ortmanns V, Pagel L, Schneider SL (2018) Documentation on ISCED Generation using the CAMCES tool in the IAB-SOEP Migration Samples M1/M2 (SOEP Survey Papers: Series D – Variable Descriptions and Coding 579) [Methodological Report]. DIW/SOEP. https://www.diw.de/documents/publikationen/73/diw_01.c.608161.de/diw_ssp0579.pdf
Couper MP (2011) The future of modes of data collection. Public Opin Q 75
Danner D (2015) Reliabilität – Die Genauigkeit einer Messung. https://doi.org/10.15465/GESIS-SG_011
de Leeuw ED (2005) To Mix or Not to Mix Data Collection Modes in Surveys. J Offic Stat 21(2):233–255
de Leeuw ED (2018) Mixed-mode: past, present, and future. Survey Res Methods 12:75–89. https://doi.org/10.18148/SRM/2018.V12I2.7402
Duncan OD, Duncan B (1955) A methodological analysis of segregation indexes. Am Sociol Rev 20
European Social Survey (2024) ESS round 12 survey specification for ESS ERIC member, observer, and guest countries, v2. Centre for Comparative Social Surveys, City University London. https://europeansocialsurvey.org/sites/default/files/2024-04/ESS012_projection_specification_v2.pdf
European Social Survey European Research Infrastructure (2023a) CRONOS1 Wave 6 edition 1.1. Sikt - Norwegian Agency for Shared Services in Education and Research. https://doi.org/10.21338/cronos1_wave6_e01_1
European Social Survey European Research Infrastructure (2023b) ESS8—Integrated file, edition 2.3. Sikt - Norwegian Agency for Shared Services in Education and Research. https://doi.org/10.21338/ess8e02_3
Granda P, Wolf C, Hadorn R (2010) Harmonizing survey data. In: Harkness JA, Braun M, Edwards B, Johnson TP, Lyberg LE, Mohler PP, Pennell BE, Smith TW (eds) Survey methods in multinational, multiregional, and multicultural contexts. John Wiley and Sons
Gummer T, Schmiedeberg C, Bujard M, Christmann P, Hank K, Kunz T, Lück D, Neyer FJ (2020) The impact of Covid-19 on fieldwork efforts and planning in Pairfam and FReDA-GGS. Survey Res Methods 14(2), Article 2. https://doi.org/10.18148/srm/2020.v14i2.7740
Herzing JME (2020) Investigation of alternative interface designs for long-list questions – the case of a computer-assisted survey in Germany. Int J Soc Res Methodol 23
Hout M, Hastings O (2016) Reliability of the core items in the general social survey: estimates from the three-wave panels, 2006–2014. Sociol Sci 3:971–1002. https://doi.org/10.15195/v3.a43
Koch A, Briceno-Rosas R (2021) Assessment of socio-demographic sample composition in ESS Round 8 and 9. European Social Survey GESIS
Kreuter F, Presser S, Tourangeau R (2008) Social Desirability Bias in Cati, Ivr, and Web Surveys the Effects of Mode and Question Sensitivity. Public Opin Q 72(5):847–865. https://doi.org/10.1093/poq/nfn063
Krosnick JA (1991) Response strategies for coping with the cognitive demands of attitude measures in surveys. Appl Cogn Psychol 5(3):213–236. https://doi.org/10.1002/acp.2350050305
Lynn P, Anghelescu G (2018) European social survey 8 weighting strategy. Institute for Social and Economic Research, University of Essex
Mavletova A, Couper MP (2013) Sensitive Topics in PC Web and Mobile Web Surveys: Is There a Difference? Surv Res Methods 7(3):3. https://doi.org/10.18148/srm/2013.v7i3.5458
Ortmanns V (2020a) Explaining inconsistencies in the education distributions of ten cross-national surveys – the role of methodological survey characteristics. J Offic Stat 36
Ortmanns V (2020b) Issues in measuring education in cross-national and migration surveys. https://madoc.bib.uni-mannheim.de/55867
Ortmanns V, Schneider SL (2016) Harmonisation still failing? Inconsistency of education variables in cross-national public opinion surveys. Int J Public Opin Res 28
Porst R, Zeifang K (1987) A description of the German general social survey test-retest study and a report on the stabilities of the sociodemographic variables. Sociol Methods Res 15
Roßmann J, Gummer T, Silber H (2018) Mitigating Satisficing in Cognitively Demanding Grid Questions: Evidence from TwoWeb-Based Experiments. J Surv Stat Methodol, 6(3):376–400. https://doi.org/10.1093/jssam/smx020
Schermelleh-Engel K, Werner CS (2012) Methoden der Reliabilitätsbestimmung. In: Moosbrugger H, Kelava A (eds) Testtheorie und Fragebogenkonstruktion. Springer Berlin Heidelberg. pp 119–141 https://doi.org/10.1007/978-3-642-20072-4_6
Scherpenzeel A, Axt K, Bergmann M, Douhou S, Oepen A, Sand G, Schuller K, Stuck S, Wagner M, Börsch-Supan A (2020) Collecting survey data among the 50+ population during the COVID-19 outbreak: the survey of health, ageing and retirement in Europe (SHARE). Survey Res Methods 14, Article 2. https://doi.org/10.18148/srm/2020.v14i2.7738
Schneider SL (2009) Confusing credentials: the cross-nationally comparable measurement of educational attainment [Thesis, University of Oxford]. https://ora.ox.ac.uk/objects/uuid:15c39d54-f896-425b-aaa8-93ba5bf03529
Schneider SL (2010a) ESS5 2010 appendix A1 education edition 4.3.: the measurement of educational attainment in the ESS. European Social Survey
Schneider SL (2010b) Nominal comparability is not enough: (in-)equivalence of construct validity of cross-national measures of educational attainment in the European Social Survey. Res Soc Stratification Mobil 28
Schneider SL (2016) ESS8 2016 Appendix A1 Education edition 2.3.: The measurement of educational attainment in the ESS. European Social Survey
Schneider SL (2022) The classification of education in surveys: a generalized framework for ex-post harmonisation. Qual Quant 56
Schneider SL, Briceno-Rosas R, Herzing JME, Ortmanns V (2016) Overcoming the shortcomings of long list showcards: measuring education with an adaptive database lookup [Scientific Conference]. 9th International Conference on Social Science Methodology (RC33 Conference), Berlin
Schneider SL, Briceno-Rosas R, Ortmanns V, Herzing JME (2018) Measuring migrants’ educational attainment: the CAMCES tool in the IAB-SOEP migration samples. In: Behr D (ed) Surveying the migrant population: consideration of linguistic and cultural issues pp 43–74
Smyth E, Steinmetz, S (2015) Vocational training and gender segregation across Europe. In: Gender Segregation in Vocational Education Vol. 31. Emerald Group Publishing Limited. pp 53–81 https://doi.org/10.1108/S0195-631020150000031003
Struminskaya B, Weyandt K, Bosnjak M (2015) The Effects of Questionnaire Completion Using Mobile Devices on Data Quality. Evidence from a Probability-based General Population Panel. Methods, Data, Analyses 9(2):261–292. https://doi.org/10.12758/MDA.2015.014
Toninelli D, Revilla M (2016) Smartphones vs PCs: Does the Device Affect the Web Survey Experience and the Measurement Error for Sensitive Topics? - A Replication of the Mavletova & Couper’s 2013 Experiment. Surv Res Methods 10(2):2. https://doi.org/10.18148/srm/2016.v10i2.6274
Tourangeau R (2003). Cognitive Aspects of Survey Measurement and Mismeasurement. Int J Public Opin Res, 15(1):3–7. https://doi.org/10.1093/ijpor/15.1.3
Tourangeau R, Sun H, Yan T, Maitland A, Rivero G, Williams D (2018) Web Surveys by Smartphones and Tablets: Effects on Data Quality. Soc Sci Comput Rev 36(5):542–556. https://doi.org/10.1177/0894439317719438
Tourangeau R (2021) Survey Reliability: Models, Methods, and Findings. J Surv Stat Methodol 9(5):961–991. https://doi.org/10.1093/jssam/smaa021
UNESCO (2012) International Standard Classification of Education (ISCED) 2011
© The Author(s) 2025. This work is published under the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).