Abstract
The last two decades have witnessed growing interest in the theoretical understanding of test fairness, and the intricacies of the concept have, on the one hand, led to heated discussion about its main characteristics and, on the other, prompted calls for its reconceptualization. In partial response to this call, this study set out to advance a measurement and structural model of the test fairness concept in the Iranian educational context. A 118-item questionnaire comprising 12 subscales was developed, pilot-tested, and then administered to 600 participants. The reliability indices of the subscales ranged from .87 to .97, and Cronbach's alpha for the full scale was .98. A set of exploratory and confirmatory factor analyses was used to explore the interrelationships among the variables and to gauge the construct validity of the scales, with convergent and discriminant validity being ensured. A model with a general higher-order factor (i.e., test fairness) and twelve lower-order factors of validity, construction and structure, administration, scoring, reporting, decision-making, consequences, security, explicitness, accountability, equality, and rights demonstrated the best fit, with the final solution explaining 73.3 percent of the total variance. The verified interplay between test fairness and the identified related factors confirmed fairness as a broad concept inclusive of test use validity. This finding entails that fairness, as an all-encompassing and broad multi-disciplinary concept (Sen, 2009), might not be confined to the testing context, as is the case with validity. This in turn strengthens the idea that validity might be incapable of accounting for all the intricacies of fairness in general and test fairness in particular.
Introduction
Test fairness has been bound up with much complexity in definition (Davies, 2010; Xi, 2010), with many areas of disagreement and only a few areas of agreement on how to define the concept (Zieky, 2016). Such complexity has led to four main ways of conceptualizing test fairness in educational contexts: (a) access to the test construct(s), (b) lack of measurement bias, (c) equitable treatment during the testing process, and (d) validity of individual test score interpretations for the intended uses (AERA et al., 2014).
Such variation of perspectives on the concept has brought about controversies over the nature of test fairness for over a decade (e.g., Davies, 2010; Kane, 2010; Kunnan, 2004, 2008, 2010, 2018; Xi, 2010), and many aspects of the concept are still left relatively unexplored. For instance, there has been little agreement as to whether test fairness and validity are standalone concepts (AERA et al., 1985; Joint Committee on Testing Practices, 2004; Kane, 2010) or subsume each other (AERA et al., 1999, 2014; Davies, 2010; Educational Testing Service, 2016, 2022; Kunnan, 2008; Messick, 1989; Willingham, 1999; Xi, 2010). Amid such uncertainty, and given the lack of an agreed-upon definition or a generally accepted model of test fairness, an in-depth study of various aspects of the concept is urgently needed for a better understanding of test fairness (e.g., Davies, 2010; Fulcher & Davidson, 2007; Kane, 2010; Kunnan, 2010; Xi, 2010).
Against such a backdrop, and drawing on the Revised Test Fairness Model (Ahmadi Safa & Beheshti, 2022; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023), the present study, which is part of a broader study, aimed to develop an integrative measurement and structural model of the concept. On this basis, attempts were made to first operationally define each category of the Revised Test Fairness Model (RTFM) and then to work out a model of relations among the factors.
Literature Review
Test fairness is a social process in its basic nature (Willingham, 1999). Dorans and Cook (2016) view the definition of fairness as more difficult than its detection, with several conflicting connotations being carried by the term "fair," as judgments on fairness are driven by differing values across people (Darlington, 1971). Notwithstanding the complexities associated with the definition of test fairness, many testing specialists have tried to theorize the concept (e.g., Davies, 2010; Fulcher & Davidson, 2007; Kane, 2010; Kunnan, 2010; Xi, 2010) and have outlined features including validity, construction and structure, administration, scoring, reporting, decision-making, consequences, security, explicitness, accountability, equality, and rights (Beheshti, 2023; Beheshti & Ahmadi Safa, 2023) as associated with the test fairness concept in educational measurement, testing, and assessment.
Most contentiously, the validity of test-based interpretations has been viewed as a precursor of test fairness, or the other way around. Zieky (2016) maintains that fairness is closely associated with validity, with an increase in one leading to an increase in the other and a decrease in one leading to a decrease in the other. Studies have also highlighted the importance of validity and its association with test fairness (e.g., Field, 2013; Haertel & Ho, 2016; Harding, 2012). By way of illustration, Field (2013) cautions that employing foreign language features that test takers are unlikely to have ever encountered in learning contexts might be unfair.
The latest (2014) version of the Standards (AERA et al., 2014) advances fairness to a central concern in testing, demanding noticeable attention to test content and structure. Further, some authors, government agencies, and test construction organizations have issued guidelines on how to generate fair tests (e.g., American Psychological Association, 2020; Educational Testing Service, 2016; Smarter Balanced Assessment Consortium, 2012). Reviewing such guidelines, Stone and Cook (2016) and Zieky (2016) detail the fairness considerations required throughout item writing, test design, test development, and test review that could help ensure fairness for different groups of test takers.
The test administration procedure has also been identified as having a major influence on test fairness (AERA et al., 2014; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Kunnan, 2018; Stone & Cook, 2016; Zieky, 2016). AERA et al. (2014), for instance, call for considerably more attention to the administration procedure as a fundamental issue in test fairness. The revised Standards extend the obligations of test administrators, whose primary responsibility becomes assuring that all test takers receive equal and fair treatment (AERA et al., 2014). Turning to guidelines on how to ensure fair test administration procedures, Kunnan (2008) elaborates on necessary aspects such as suitable physical conditions, geographical access, and equipment access, as well as reasonable time limits (Zieky, 2016).
Appropriate scoring is also underscored as a significant determinant of test fairness. The increasing concern with fairness therefore calls for greater attention to the scoring protocol, as human rating is often fraught with dilemmas over fairness, with background characteristics and individual differences by and large supposed to engender sources of error that potentially bias test scores (Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Wollack & Case, 2016; Zieky, 2016). Thus, a fair test has to ensure that fair evaluations are made about test takers regardless of their background, race, religion, or gender (AERA et al., 2014). Several studies have dealt with how to ensure fairness in test scoring (e.g., Ferbežar & Stanovnik, 2021; Leitner & Kremmel, 2021; McNamara et al., 2019; Zieky, 2016).
The appropriateness of test results reporting has been instrumental in the understanding of test fairness as well (AERA et al., 2014; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Educational Testing Service, 2014). The International Test Commission (2014) and Spolsky (1990) point to the moral obligation of test reporters to control the quality, accuracy, interpretability, and standardization of score reporting procedures for all examinees, recommending the use of multiple reporting profiles as a more valuable reporting method.
Appropriate interpretations and uses of test scores are also commonly considered important features of test fairness, as they are conducive to improved test-based decisions about programs and test takers and could pave the way for more equitable access to employment and education (AERA et al., 2014; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Dorans & Cook, 2016; Joint Committee on Testing Practices, 2004; Kunnan, 2018; Sireci et al., 2016). On this basis, a set of guidelines on various aspects of educational testing has been set out for educational measurement professionals to support fair testing practices and to give policymakers and test users detailed information on how testing practices could result in correct and fair test score interpretations for their intended uses (Dorans & Cook, 2016; Sinharay, 2016).
On the other hand, test consequences constitute another important determinant of test fairness (Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Kunnan, 2018; Tsagari & Cheng, 2017). This is logically rooted in the idea that human ratings might be vulnerable to different sources of bias and that tests are generally accompanied by relatively far-reaching consequences for those being rated (Eckes, 2017; McNamara, 2008; McNamara & Roever, 2006; Shohamy, 2017); therefore, test consequences have to be critically examined and scrutinized, and the principles concerning the social dimensions of injustice and unfair uses of tests need to be discussed in society (Karatas & Okan, 2021).
As the next identified factor relevant to test fairness, the security of the testing process is of paramount significance, as insecurity of the process might pose challenges to test fairness (AERA et al., 2014; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Kunnan, 2008). From Wollack and Case's (2016) perspective, test fairness needs to be guaranteed by observing the security of pre-test, during-test, and post-test administrative practices.
In addition, the explicitness of rules is a further significant contributing factor to test fairness (AERA et al., 2014; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023). Test takers have the right to receive explicit, clear, and detailed information and to be notified of the rules and conditions of the test. Due in part to the importance of prenotification of the standards and clarity of the procedures for test fairness, the significance of standard setting has been widely recognized in recent years (Brooks, 2017). Precision and standardization are necessary in all testing phases, i.e., test development, test administration, test scoring, and score reporting (Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; International Test Commission, 2014; Wollack & Case, 2016).
Moreover, it is now well established that accountability plays an important role in test fairness (AERA et al., 2014; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Kunnan, 2008). In light of the increased emphasis on fairness, and given the growing need for stakeholders to get involved and take on significant responsibilities in different phases of the testing process, accountability in the social context of testing has been considered an important factor (AERA et al., 2014; Inbar-Lourie, 2017; Kremmel & Harding, 2020).
Equality of the testing experience for all examinees is another underscored aspect of fairness, as inequalities lead to test unfairness (Kunnan, 2018). Equality of approach, administration, content, and rating is acknowledged to produce uniformity and consistency across different examinees (AERA et al., 2014; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023). From some experts' perspectives, inequality or lack of standardization is conducive to abundant sources of error that potentially bias test scores, with such bias being considered a source of invalidity and test unfairness (Wollack & Case, 2016).
Finally, safeguarding the rights of test takers is among the key factors involved in test fairness, with natural and civil rights being the two major sorts of rights (Davies, 2017). Legal rights and fairness guidelines tend to be comparable in terms of the topics and the main thrust of their ideas or concepts, differing primarily in level of detail (Zieky, 2016). Sources of construct-irrelevant variance that are likely to distract, insult, offend, annoy, anger, or upset some test-taker groups have to be avoided, as they are highly associated with unfairness in testing (Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Kunnan, 2018; Zieky, 2016). Phillips (2016) discusses certain court cases and legal issues pertaining to the unfairness of testing processes in a variety of educational contexts.
Research Questions
While a number of features are frequently named as associated with the test fairness concept, the literature has remained equivocal and inconsistent (e.g., Davies, 2010; Kane, 2010; Kunnan, 2008; Xi, 2010). Therefore, as a partial attempt to address this concern, this piece of research intended to delineate the underlying factors of test fairness and their interrelationships. To this end, the study sought to answer the following research questions:
RQ1: Is the newly developed measurement model of test fairness valid in the Iranian educational context?
RQ2: Is the newly developed structural model of test fairness valid in the Iranian educational context?
Methodology
Participants
This study was conducted in two phases, i.e., a pilot study and a main study. Prior to the main study, a pilot test was administered to a smaller random sample of 200 participants to refine the questionnaire and assess its clarity and comprehensiveness, adhering to the guideline of having at least five participants per questionnaire item (Tabachnick & Fidell, 2019). The initial pilot study sample consisted of testing experts, English language teachers, and semi-expert test takers (MA students and Ph.D. candidates in language testing programs) who were recruited from various educational institutions across 27 cities in Iran. Of the initial group of 200 participants, 54% were female and 46% male, and all were between 22 and 57 years old. Over two-thirds of the sample (68%) were higher education semi-expert test takers, and the rest were English language teachers (27%) and testing experts (5%), ensuring a comprehensive range of perspectives. The experts were selected based on their qualifications, professional affiliations, and experience in the field of language testing and assessment, with their insights helping to ensure that the questionnaire accurately addressed the multifaceted dimensions of test fairness and aligned with both theoretical frameworks and practical applications in educational assessment. In the main phase of the study, likewise, a random sample of 600 testing experts, English language teachers, and semi-expert test takers (MA students and Ph.D. candidates in language testing programs) was recruited from 54 cities in Iran, and efforts were made to ensure that the number of participants completing the questionnaire was at least five times the number of questionnaire items (Tabachnick & Fidell, 2019). Attempts were made to ensure diversity within the sample by including individuals from different professional backgrounds and varying levels of experience in language testing. The participants were either testing experts (3.5%), higher education semi-expert test takers (65.5%), or English language teachers (31%), with men composing 47.66% and women 52.33% of the sample. The age of the participants ranged from 24 to 59. The detailed demographic information of the participants is presented in Table 1.
Table 1. The participants’ demographic information
| Professional level | Age range | f | N |
|---|---|---|---|
| Testing experts | | | 21 |
| Female | 37–53 | 6 | |
| Male | 38–59 | 15 | |
| English language teachers | | | 186 |
| Female | 34–50 | 85 | |
| Male | 36–58 | 101 | |
| Semi-expert test takers (Ph.D. candidates) | | | 23 |
| Female | 27–32 | 15 | |
| Male | 28–35 | 8 | |
| Semi-expert test takers (MA students) | | | 370 |
| Female | 26–28 | 208 | |
| Male | 24–30 | 162 | |
| Total | | | 600 |

Note. PhD = Doctor of Philosophy; MA = Master of Arts
Instruments
The primary instrument of this study was a test fairness questionnaire, for the development of which several stages of item generation, expert judgment, preliminary pilot testing, and statistical analyses were carried out. The items in the questionnaire were designed on a 5-point Likert scale, ranging from 1 (strongly disagree) through 2 (disagree), 3 (undecided), and 4 (agree) to 5 (strongly agree). The first draft of the scale was developed on the basis of the literature on test fairness and the RTFM (Ahmadi Safa & Beheshti, 2022; Beheshti & Ahmadi Safa, 2023).
The preliminary draft of the scale, once reviewed and carefully revised, was subjected to a rigorous content validation process. A content evaluation panel of 15 experts specialized in language testing with an interest in test fairness was requested to scrutinize the 122-item draft. The main objective of this expert judgment was to ascertain the correspondence between the generated items and the underlying theoretical concepts being measured. Each panel member was requested to comment on the suitability of each item and to decide whether its inclusion in the final draft was (a) essential, (b) useful but not essential, or (c) not necessary. The panel members' choices concerning each questionnaire item were pooled, the total number of "essential" ratings for each item was counted, and on this basis the Content Validity Ratio (CVR) was estimated for each and every item using the relevant formula (Lawshe, 1975). The minimum CVR required to satisfy the five percent significance level was determined following Lawshe's (1975) approach (i.e., .49 for 15 experts in this case). All but 12 items satisfied this requirement. The items with CVR values meeting the minimum requirement were retained in the revised version of the questionnaire. Based on the obtained data and feedback from the experts, adopting both quantitative and qualitative approaches to content validation, irrelevant, double-barreled, and inappropriate items were removed or ameliorated. Using the obtained CVR values for all remaining items, the content validity index (CVI) was estimated to be .80 for the whole questionnaire, which was considered satisfactory (Lawshe, 1975).
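For illustration, Lawshe's (1975) statistics are straightforward to compute from the panel counts. The following minimal Python sketch uses hypothetical "essential" vote counts rather than the study's actual panel data:

```python
# Minimal sketch of Lawshe's (1975) CVR/CVI computation.
# The essential_counts below are hypothetical, not the study's actual ratings.

def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists
    rating the item 'essential' and N is the panel size."""
    half = n_panelists / 2
    return (n_essential - half) / half

N_PANELISTS = 15        # panel size used in the study
CVR_CRITICAL = 0.49     # Lawshe's minimum CVR for 15 experts (p = .05)

essential_counts = [14, 12, 9, 15, 13]   # hypothetical "essential" votes per item

cvrs = [content_validity_ratio(n, N_PANELISTS) for n in essential_counts]
retained = [cvr for cvr in cvrs if cvr >= CVR_CRITICAL]

# The CVI is the mean CVR of the retained items.
cvi = sum(retained) / len(retained)
print(f"CVRs: {[round(c, 2) for c in cvrs]}; CVI of retained items: {cvi:.2f}")
```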
Next, the revised scale was pilot-tested with a representative sample of 200 participants to obtain data concerning the suitability, readability, clarity, and comprehensiveness of the items, on the basis of which unsuitable items were to be removed if necessary. The resulting version of the revised test fairness questionnaire included 118 items (see Appendix A), tapping into the following 12 subscales. Of the 118 items, six were negatively worded (items 2, 11, 15, 24, 35, and 79).
Revised Test Fairness Questionnaire
Validity Subscale
The validity subscale was defined in terms of 13 individual items tapping into both validity and reliability evidence (α = .96; e.g., "A fair test only measures the ability, skill, or knowledge it aims to"). The reliability evidence items addressed stability evidence (2 items), alternate form evidence (1 item), inter-rater evidence (1 item), intra-rater evidence (1 item), and internal consistency evidence (1 item). The validity evidence items measured content validity evidence (3 items), construct validity evidence (2 items), and criterion-related validity evidence (2 items, one each for concurrent and predictive validity evidence).
Construction and Structure Subscale
The construction and structure subscale contains 11 items (α = .93; e.g., "Instructions must be clear and complete in a fair test"). This subscale comprises seven variables that measure the physical appearance of the test (1 item), length (2 items), test item structure (2 items), difficulty level (2 items), score budgeting (1 item), discrimination level (1 item), and question type (2 items).
Administration Subscale
The administration subscale contains 11 items (α = .96; e.g., "The test location should be accessible to all the test-takers in a fair test"). This subscale includes five variables that measure geographical access (1 item), physical setting and equipment (4 items), personal access (1 item), administrators (2 items), and time (3 items).
Scoring Subscale
The scoring subscale is indicated by five items (α = .92; e.g., "The scoring is objective and based on specific criteria in a fair test"), which measure scoring rubrics (2 items) and scoring expertise (3 items).
Reporting Subscale
The reporting subscale comprises four items (α = .87; e.g., "Reporting the results at a predetermined time is a feature of a fair test"). This subscale embodies four variables that measure reporting time (1 item), flawless reporting (1 item), detailed reporting (1 item), and easily accessible reporting (1 item).
Decision-making Subscale
The decision-making subscale is indicated by four items (α = .87; e.g., "In a fair test, the applications or uses of the test scores are clearly determined to the extent possible"). This subscale measures score interpretation (2 items) and score uses (2 items).
Consequences Subscale
The consequences subscale consists of eight items (α = .95; e.g., "A fair test does not impose unreasonable harm to the future of the test takers"). It involves two variables indexing washback (4 items) and impact (4 items).
Security Subscale
Eight items signify the security subscale (α = .91; e.g., "Cheating is strictly controlled in a fair test"). This subscale is composed of six variables indexing item security (1 item), administration security (2 items), scoring security (1 item), reporting security (2 items), decision-making security (1 item), and consequences security (1 item).
Explicitness Subscale
The explicitness subscale involves 12 items (α = .97; e.g., "A fair test is administered based on clear criteria"). This subscale comprises two variables that measure standard-setting (8 items) and prenotification of the standards (4 items).
Accountability Subscale
The accountability subscale is indicated by seven items (α = .95; e.g., "A fair test provides clarification in case of any objection to the results"), which measure responsibility (4 items), answerability (2 items), and involvement (1 item).
Equality Subscale
The equality subscale, which consists of the two variables of uniformity and absence of bias, is indicated by 20 items (α = .97; e.g., "All test takers should have equal access to the test sources"). The uniformity variable measures equal educational access (3 items), equal teaching (1 item), and equal testing (11 items). The absence of bias variable measures unbiased educational access (1 item), unbiased teaching (1 item), and unbiased testing (3 items).
Rights Subscale
Fifteen items denote the rights subscale (α = .96; e.g., "There must be legal provisions to help disadvantaged test takers meet their rights"). This subscale covers the four variables of rightfulness, laws, monitoring, and remedies. The rightfulness variable measures registration opportunity (1 item), financial access (3 items), preparation opportunity (1 item), absence of offense [inoffensive content or language (2 items) and inoffensive test administration (1 item)], and satisfaction of rights (1 item). Laws comprises one item; monitoring covers two items; and remedies includes three items.
Procedures
To protect against violations of ethical rights, ethical clearance was obtained, informed consent was secured, participants were notified that participation was voluntary, and the confidentiality of the data and its use solely for research purposes were assured at the outset of the study.
The present study, which is part of a broader mixed methods research project (Ahmadi Safa & Beheshti, 2022; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023), inspected the nature of test fairness as measured by a researcher-made questionnaire. The RTFM (Beheshti, 2023; Beheshti & Ahmadi Safa, 2023) was applied to develop the revised test fairness questionnaire. To ensure the content validity of the questionnaire, a content evaluation panel of 15 well-informed experts in the realm of test fairness was asked to comment on the appropriateness of the items, with the CVR and CVI values being identified for each item and for the total questionnaire. The questionnaire was then pilot-tested on 200 participants. On the basis of the obtained values and the feedback from the subject matter experts and the pilot study, some items were removed or adjusted. The reliabilities of the final versions of the subscales and the total scale were estimated using Cronbach's alpha (α) measure of internal consistency, yielding α = .98 for the total scale, indicating excellent reliability.
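For transparency, Cronbach's alpha can be reproduced directly from item-level responses. The sketch below assumes, purely for illustration, that the responses are stored in a pandas DataFrame with one column per item; it is not the authors' actual analysis script:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances
    / variance of the total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Usage (hypothetical): alpha for the 13-item validity subscale,
# assuming columns are named validity_1 ... validity_13.
# df = pd.read_csv("responses.csv")
# print(cronbach_alpha(df[[f"validity_{i}" for i in range(1, 14)]]))
```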
The final version of the questionnaire was administered to 600 participants by sending online Google Forms links to different email addresses, LinkedIn accounts, and Telegram or WhatsApp groups whose members were testing experts, English language teachers, and higher education students in language testing programs. The participants were requested to rate the extent to which they agreed with the items of the questionnaire, with no time limit being set for the completion of the questionnaire.
The responses were screened and then checked; the negatively worded items were reverse-coded; the statistical assumptions were assessed; and a set of exploratory and confirmatory factor analyses was conducted to advance a valid measurement and structural model of test fairness. To this end, the means (M), standard deviations (SD), skewness, and kurtosis values were scrutinized to ensure the normal distribution of the responses (Tabachnick & Fidell, 2019). To provide evidence of the factorability of the data, Bartlett's Test of Sphericity and the Kaiser–Meyer–Olkin (KMO) index were examined, and to decide on the number of factors, criterion values from the parallel analysis were scrutinized. Items that loaded strongly on more than one factor and variables that did not load on any factor were removed, and the analysis was repeated until the most optimal solution was attained (Tabachnick & Fidell, 2019). A conceptual model of structural interrelations among the factors was hypothesized and assessed in order to estimate the strength of the hypothesized relations among the observed factors. The average variance extracted (AVE), average shared variance (ASV), and maximum shared variance (MSV) values were evaluated to provide evidence of convergent and discriminant validity. Finally, goodness-of-fit tests including the Goodness of Fit Index (GFI), Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), Adjusted Goodness of Fit Index (AGFI), Incremental Fit Index (IFI), Tucker and Lewis Index (TLI), Parsimonious Normed Fit Index (PNFI), and Parsimonious Comparative Fit Index (PCFI) were checked as a basis for providing evidence of the construct validity of the questionnaire and the model (Cretu & Brodie, 2009).
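Among these steps, the parallel-analysis criterion is the least standard to reproduce. The following minimal sketch of Horn's parallel analysis, assuming the reverse-coded responses are held in a NumPy array, illustrates the retention rule; it is an illustration, not the authors' actual software:

```python
import numpy as np

def parallel_analysis(data: np.ndarray, n_sims: int = 100,
                      percentile: int = 95, seed: int = 0) -> int:
    """Horn's parallel analysis: retain factors whose observed eigenvalues
    exceed the chosen percentile of eigenvalues from random data of the
    same shape."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    random_eigs = np.empty((n_sims, p))
    for i in range(n_sims):
        random_data = rng.standard_normal((n, p))
        random_eigs[i] = np.sort(
            np.linalg.eigvalsh(np.corrcoef(random_data, rowvar=False)))[::-1]
    thresholds = np.percentile(random_eigs, percentile, axis=0)
    return int(np.sum(observed > thresholds))

# Usage (hypothetical): responses is a 600 x 118 array of Likert ratings.
# n_factors = parallel_analysis(responses)
```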
Design of the study
The present study employed a mixed-methods design in data collection, analysis, and interpretation, integrating qualitative and quantitative approaches to comprehensively investigate the concept of test fairness within the Iranian educational context, thereby combining the strengths of both methodologies (Dornyei, 2007). In the analysis, qualitative data obtained from the experts and participants were used to evaluate the validity of the questionnaire and to enhance the interpretation of the quantitative findings, while quantitative data were analyzed using statistical methods, including EFA and CFA, the results of which are described below.
Results
This study aimed at a valid measurement and structural model of test fairness. To this end, a researcher-made questionnaire was designed and applied as the main data collection instrument of the study. Concerning the reliability of the total scale and its componential subscales, the obtained data were subjected to Cronbach's alpha measure of internal consistency, the results of which are displayed in Table 2.
Table 2. Reliability assessment of the revised test fairness questionnaire
| Variable | No. of items | Cronbach's α |
|---|---|---|
| Validity | 13 | .96 |
| Construction | 11 | .93 |
| Administration | 11 | .96 |
| Scoring | 5 | .92 |
| Reporting | 4 | .87 |
| Decision-making | 4 | .87 |
| Consequences | 8 | .95 |
| Security | 8 | .91 |
| Explicitness | 12 | .97 |
| Accountability | 7 | .95 |
| Equality | 20 | .97 |
| Rights | 15 | .96 |
| Total | 118 | .98 |
As is evident in Table 2, the Cronbach's alpha (α) index of internal consistency of all subscales was satisfactory and exceeded the acceptable value of .70 (Pallant, 2020).
Bartlett's Test of Sphericity (Bartlett, 1954) and the KMO measure of sampling adequacy (Kaiser, 1974) were applied to assess the factorability of the data. Table 3 presents a summary of the results.
Table 3. KMO and Bartlett’s test of the revised test fairness questionnaire data
| KMO | Approximate chi-square | df | p |
|---|---|---|---|
| .97 | 69796.034 | 6903 | .000 |

*p < .05
As is displayed in Table 3, the KMO value was .97, exceeding the recommended value of .60 (Kaiser, 1974), and Bartlett's (1954) Test of Sphericity reached statistical significance, supporting the factorability of the correlation matrix.
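Both diagnostics can be verified with standard tooling. The snippet below sketches one way of doing so with the Python factor_analyzer package, assuming a hypothetical CSV holding the 600 × 118 response matrix; the package choice and file name are illustrative rather than the authors' reported software:

```python
import pandas as pd
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

# Hypothetical data layout: 600 rows x 118 item columns.
df = pd.read_csv("fairness_responses.csv")

chi_square, p_value = calculate_bartlett_sphericity(df)
kmo_per_item, kmo_total = calculate_kmo(df)

# Factorability heuristics: overall KMO > .60 (Kaiser, 1974) and a
# significant Bartlett's test support proceeding with factor analysis.
print(f"Bartlett chi2 = {chi_square:.2f}, p = {p_value:.4f}")
print(f"Overall KMO = {kmo_total:.2f}")
```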
Furthermore, an exploratory factor analysis was carried out to provide validity evidence for the questionnaire and to explore the interrelationships among the set of variables (Pallant, 2020). Principal components analysis with Varimax rotation and Kaiser normalization revealed the presence of 12 components with eigenvalues exceeding 1 and factor loadings above .4. The 12-component solution explained a total of 73.3% of the variance, with equality accounting for 13.2%, rights 9.9%, validity 8.4%, explicitness 8.0%, construction and structure 6.9%, administration 6.9%, consequences 4.8%, security 4.7%, accountability 3.9%, scoring 2.6%, reporting 2.0%, and decision-making 2.0% of the total variance.
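A procedure of this kind can be sketched with the factor_analyzer package; the file name and the use of this particular package are assumptions for illustration, approximating the reported rotation-and-cutoff routine rather than reproducing the authors' exact software:

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical layout: one column per questionnaire item.
df = pd.read_csv("fairness_responses.csv")

# Principal extraction with Varimax rotation; 12 factors, matching the
# number of components with eigenvalues > 1 reported in the text.
fa = FactorAnalyzer(n_factors=12, rotation="varimax", method="principal")
fa.fit(df)

eigenvalues, _ = fa.get_eigenvalues()
print("Eigenvalues > 1:", int(np.sum(eigenvalues > 1)))

# Suppress loadings below .40, mirroring the reported cutoff.
loadings = pd.DataFrame(fa.loadings_, index=df.columns)
print(loadings.where(loadings.abs() >= 0.40))

# Proportion of variance explained by the rotated factors.
_, proportions, cumulative = fa.get_factor_variance()
print("Total variance explained:", f"{cumulative[-1]:.1%}")
```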
To answer the first research question, i.e., to present a valid measurement model of test fairness in the Iranian context, a hypothesized measurement model of test fairness was proposed building on the RTFM and relevant literature, with some relations being postulated between the latent variables and their observed measures (Byrne, 2016). Validity as a latent factor was assumed to be defined in terms of 13 items, construction and structure 11 items, administration 11 items, scoring 5 items, reporting 4 items, decision-making 4 items, consequences 8 items, security 8 items, explicitness 12 items, accountability 7 items, equality 20 items, and rights 15 items.
To examine the verifiability of the postulated measurement part of the model, the convergent and discriminant validity measures were tested (Table 4).
Table 4. Assessment of discriminant validity of the test fairness model
| Variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | MSV | ASV | AVE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Validity | .82** | | | | | | | | | | | | .24 | .22 | .68 |
| 2. Construction | .46 | .83** | | | | | | | | | | | .28 | .24 | .69 |
| 3. Administration | .42 | .50 | .83** | | | | | | | | | | .28 | .25 | .69 |
| 4. Scoring | .49 | .50 | .53 | .83** | | | | | | | | | .32 | .27 | .69 |
| 5. Reporting | .47 | .45 | .51 | .53 | .82** | | | | | | | | .28 | .25 | .67 |
| 6. Decision-making | .48 | .53 | .51 | .55 | .52 | .82** | | | | | | | .30 | .27 | .68 |
| 7. Consequences | .49 | .53 | .51 | .56 | .53 | .52 | .83** | | | | | | .32 | .27 | .69 |
| 8. Security | .46 | .48 | .50 | .54 | .50 | .50 | .54 | .82** | | | | | .30 | .25 | .68 |
| 9. Explicitness | .45 | .49 | .49 | .50 | .51 | .50 | .49 | .5 | .84** | | | | .28 | .24 | .70 |
| 10. Accountability | .49 | .52 | .49 | .53 | .51 | .52 | .54 | .5 | .53 | .84** | | | .29 | .26 | .71 |
| 11. Equality | .46 | .48 | .50 | .48 | .49 | .51 | .48 | .5 | .45 | .51 | .84** | | .26 | .23 | .70 |
| 12. Rights | .49 | .50 | .52 | .50 | .47 | .53 | .46 | .5 | .45 | .51 | .46 | .84** | .28 | .24 | .70 |
Note. The square roots of the AVE values of the latent variables are presented on the main diagonal of the correlation matrix.
MSV = maximum shared variance; ASV = average shared variance; AVE = average variance extracted
*p < .05
According to Table 4, the convergent validity of the model was verified, as all AVE values were equal to or greater than .50 (Henseler et al., 2009). The discriminant validity of the model was also verified, as the square roots of the AVE values of the latent variables, observable on the main diagonal of the correlation matrix, were greater than the correlations among the latent variables. Moreover, all MSV and ASV values were less than the AVE values (Henseler et al., 2009). Therefore, according to the findings, validity as a latent factor justified the links among its 13 observed variables; construction and structure among 11; administration among 11; scoring among 5; reporting among 4; decision-making among 4; consequences among 8; security among 8; explicitness among 12; accountability among 7; equality among 20; and rights among 15.
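These convergent and discriminant checks (AVE ≥ .50, √AVE greater than the inter-factor correlations, and MSV/ASV below AVE) can be computed directly from standardized loadings and the factor correlation matrix. The sketch below is a generic Fornell–Larcker-style check with hypothetical inputs, not the study's actual estimates:

```python
import numpy as np

def convergent_discriminant_checks(loadings_by_factor, factor_corr):
    """Fornell-Larcker style checks: AVE per factor from standardized
    loadings, plus MSV/ASV from squared inter-factor correlations."""
    results = {}
    sq_corr = np.square(np.asarray(factor_corr))
    for j, loadings in enumerate(loadings_by_factor):
        loadings = np.asarray(loadings)
        ave = np.mean(loadings ** 2)       # average variance extracted
        others = np.delete(sq_corr[j], j)  # squared r with the other factors
        msv, asv = others.max(), others.mean()
        results[j] = {
            "AVE": ave,
            "MSV": msv,
            "ASV": asv,
            "convergent": ave >= 0.50,     # Henseler et al. (2009)
            "discriminant": np.sqrt(ave) > np.sqrt(others).max() and msv < ave,
        }
    return results

# Usage (hypothetical, two factors for brevity):
# checks = convergent_discriminant_checks(
#     loadings_by_factor=[[.80, .85, .82], [.78, .84, .86]],
#     factor_corr=[[1.0, .46], [.46, 1.0]],
# )
```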
To answer the second research question, i.e., to introduce a valid structural model of test fairness, a hypothesized structural model of test fairness was proposed building on the RTFM, with relations being postulated among the latent variables (Byrne, 2016). It was postulated that there would be positive correlations among validity, construction and structure, administration, scoring, reporting, decision-making, consequences, security, explicitness, accountability, equality, rights, and test fairness. A confirmatory factor analysis (CFA) was then performed on the data, the results of which are demonstrated in Fig. 1.
Fig. 1. Standardized estimates of the first- and second-order CFA of test fairness. Note. VA = validity; CO = construction and structure; AD = administration; SC = scoring; RE = reporting; DM = decision-making; CONS = consequences; SE = security; EX = explicitness; AC = accountability; EQ = equality; RI = rights
As is displayed in Fig. 1, a model with a general factor and 12 distinct trait factors provides the best fit to the data. All 12 subscales and their relevant indicators loaded most heavily on the general factor of fairness. As predicted, the subscales of validity, construction and structure, administration, scoring, reporting, decision-making, consequences, security, explicitness, accountability, equality, and rights had strong positive correlations with test fairness.
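For readers wishing to replicate a higher-order structure of this kind, the sketch below specifies a second-order CFA in lavaan-style syntax using the Python semopy package; only three of the twelve first-order factors are shown, the item and file names are hypothetical placeholders, and the package choice is ours rather than the authors':

```python
import pandas as pd
from semopy import Model, calc_stats

# Higher-order CFA sketch in lavaan-style syntax. Only three of the twelve
# first-order factors are shown, each with three placeholder items; the
# full model would list all 118 items under their twelve subscales.
model_desc = """
VA =~ va1 + va2 + va3
CO =~ co1 + co2 + co3
SC =~ sc1 + sc2 + sc3
Fairness =~ VA + CO + SC
"""

df = pd.read_csv("fairness_responses.csv")  # hypothetical data layout
model = Model(model_desc)
model.fit(df)

# calc_stats returns a table of fit measures (chi-square, CFI, TLI,
# GFI, AGFI, RMSEA, ...), comparable to those reported in Table 5.
print(calc_stats(model).T)
```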
Several absolute, incremental, and parsimonious fit measures (Cretu & Brodie, 2009) are reported for the proposed model in Table 5.
Table 5. The Goodness of fit indices for the test fairness conceptual model
| Index type | Fit index | Acceptable range | Estimated value | Judgment |
|---|---|---|---|---|
| Absolute fit measures | Normed χ² (CMIN/df) | < 3 | 1.39 | Accept |
| | Root mean square error of approximation (RMSEA) | < .08 | .03 | Accept |
| | Goodness of fit index (GFI) | > .90 | .03 | Reject |
| | Adjusted goodness of fit index (AGFI) | > .90 | .79 | Reject |
| Incremental fit measures | Comparative fit index (CFI) | > .90 | .96 | Accept |
| | Incremental fit index (IFI) | > .90 | .96 | Accept |
| | Tucker and Lewis index (TLI) | > .90 | .96 | Accept |
| Parsimonious fit measures | Parsimonious normed fit index (PNFI) | > .50 | .86 | Accept |
| | Parsimonious comparative fit index (PCFI) | > .50 | .94 | Accept |
Note. The sign < indicates values below the nominated value; the sign > indicates values above the nominated value.
As is displayed in Table 5, two of the nine fit indices were weak, while the remaining indices were satisfactory. Indeed, although GFI and AGFI values greater than .90 indicate a good fit, values around .80 or greater also indicate an acceptable fit (Cretu & Brodie, 2009). Thus, it could be concluded that the postulated model and the gathered data were consistent and that the model was acceptable (Cretu & Brodie, 2009).
Discussion
The study aimed at valid measurement and structural models of test fairness, with a hypothesized measurement model of test fairness being proposed building on the Revised Test Fairness Model (RTFM) and relations being postulated between the latent variables and their observed measures or indicators. The links among the latent variables of interest were evaluated after establishing a feasible measurement model. Convergent and discriminant validity evidence was ensured, and the AVE and MSV values were examined in a heuristic sense. The measurement component of the model showed the validity factor to have 13 indicator measures, construction and structure 11, administration 11, scoring five, reporting four, decision-making four, consequences eight, security eight, explicitness 12, accountability seven, equality 20, and rights 15.
With respect to the second research question, originally posed to introduce a valid structural model of test fairness, a hypothetical model of relations among the identified factors was assumed and assessed. A 12-factor model of test fairness was validated by the results of the CFA, with paths from the second-order common factor of test fairness to its 12 respective first-order factors, and with all latent variables showing statistically significant and positive effects on test fairness.
As a major finding of this study, test fairness was found to be a broad concept consisting of various test aspects, one of which might be test use validity. This might indicate that embedding test fairness under the general validity concept undermines both concepts and leads to misperceptions about how exactly validity and fairness are related (Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Kunnan, 2008). A possible justification for this specific finding might be found in the idea that the nullification of construct-irrelevant variance or nuisance variables would preclude unfair variation in the test results (Field, 2013; Haertel & Ho, 2016; Harding, 2012; Zieky, 2016).
Adding to the classic perplexity that characterizes the literature concerning the relation between test fairness and test use validity, however, the findings of this study in this regard stand in contrast to a number of previous studies (e.g., AERA et al., 1985, 1999, 2014; Davies, 2010; Educational Testing Service, 2016, 2022; Joint Committee on Testing Practices, 2004; Kane, 2010; Messick, 1989; Willingham, 1999; Xi, 2010; Zieky, 2016).
A well-formed test construction or structure was found to contribute to test fairness. It can thus be suggested, based on the data, that novel response formats, new or small fonts, poor figures, slanted or vertical printing, test formats unsuitable for test takers with disabilities, ambiguous or circumlocutory test directions, dissimilar content, and unsuitable item difficulty should be avoided at the level of test design. A possible explanation for this, in keeping with the earlier findings and assertions of scholars (e.g., American Psychological Association, 2020; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Educational Testing Service, 2022; Zieky, 2016), might be that such nuisance sources of construct-irrelevant variance could bring about undesired effects on the test scores. This finding, as also reported by Zieky (2016), suggests that item writers might need to receive advanced systematic training on how to actualize fairness principles.
The study found that test administration is likely to contribute to test fairness in general; therefore, the administration procedure demands considerably more attention, as appropriate, equivalent, and standardized testing conditions could plausibly influence test performance (e.g., AERA et al., 2014; Beheshti & Ahmadi Safa, 2023; Kunnan, 2018).
Turning to the next factor, it was found that scoring might contribute to test fairness; scorers thus perhaps need to ensure that some groups or individuals are not inadvertently advantaged over others, a notion consistent with the attitudes expressed in previous research into issues of bias, prejudice, and scoring malpractice. This highlights the significance of due consideration of the role of rating indeterminacy and prejudice in thinking about the fairness of language testing, suggesting unbiased evaluation of test takers' performance irrespective of their background, race, religion, or gender (AERA et al., 2014; Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Ferbežar & Stanovnik, 2021; Leitner & Kremmel, 2021).
As another important finding, the reporting factor was found to contribute to test fairness, with the findings suggesting that test reporters might need to ensure the interpretability of test score information and control the quality, accuracy, and standardization of score reporting procedures for all test takers, a result in agreement with the ideas of Beheshti and Ahmadi Safa (2023), the International Test Commission (2014), and Spolsky (1990), which similarly emphasize quality assurance, accuracy control, and score reporting standardization as the moral obligations of test reporters.
One further interesting finding was that appropriate decision-making could contribute to test fairness, with the data providing further support for the hypothesis that a fair test needs to take into account the social consequences of decision-making (Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Eckes, 2017; Karatas & Okan, 2021; Kunnan, 2018; McNamara, 2008; McNamara & Roever, 2006; Shohamy, 2017; Sinharay, 2016). This result is in keeping with the previous theories of Kunnan (2008, 2010), which call for consideration of the consequences of testing, with heavy reliance on producing beneficial outcomes emanating from proper test use and on preventing the harmful effects an inaccurate test score can inflict on the public.
In addition, based on the data, it seems possible to hypothesize that the insecurity of the testing process might pose real challenges to fairness, a finding which provides some support for the conceptual premise that issues related to pre-, during-, and post-test security might threaten the fairness of test score interpretations (e.g., AERA et al., 2014; Beheshti & Ahmadi Safa, 2023; Wollack & Case, 2016).
Further, the results provided support for the hypothesis that appropriate and clear standard-setting could support accurate and fair decisions regarding test takers and could probably make the test-taking experience fairer, a finding also supported by a number of publications in the previous literature (e.g., Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Brooks, 2017; Kunnan, 2018; McNamara et al., 2019). In general, therefore, it seems that adequate information about the general testing procedures and rules needs to be made available to all test takers prior to testing, to help them prepare appropriately for a test and reflect their actual standing on the construct being measured.
According to the data, it could be inferred that test-related accountability systems which hold groups or individuals accountable for particular performances might resolve part of the fairness concerns through growing joint responsibilities on the part of the stakeholders (e.g., Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Inbar-Lourie, 2017; Kremmel & Harding, 2020). The increasing need for stakeholders to get involved in various phases of the testing process and to fulfill the relevant responsibilities, as found in this study, supports many earlier works that provide guidelines on accountability systems and cover issues of accountability in the social context of testing and assessment (e.g., AERA et al., 2014; Hamilton & Koretz, 2002).
Equality of the testing experience for all examinees was found to be another essential aspect of fairness that could significantly predict this higher-order factor, a finding which broadly supports the work of other studies in this area, such as Kunnan's (2018) Principle of Justice, according to which a test has to be fair to all test takers and treat everyone equally, and those of Beheshti and Ahmadi Safa (2023) and the Educational Testing Service (2014), which called for standard testing practice and equal test administration, scoring, and security protocols for all test takers.
Last but not least, the rights factor was found to be meaningfully associated with test fairness. Hence, it could conceivably be hypothesized that there is a need to ensure test takers' access to their rights and to democratic testing by making adequate provisions to protect the personal rights of all test takers. This might entail that test-related authorities devise procedures for monitoring all test-taking procedures and safeguarding test takers' right to test security (Beheshti, 2023; Beheshti & Ahmadi Safa, 2023; Zieky, 2016). Furthermore, in line with Beheshti and Ahmadi Safa (2023), Kunnan (2018), and Phillips (2016), it seems reasonable to expect the relevant authorities to offer appropriate remediation opportunities to compensate for deprivations of fair opportunities as well as violations of test takers' legal rights.
Conclusion
A wide range of foreign language proficiency tests appears to have become high-stakes owing to their decisive and vital consequences for applicants' job prospects, life opportunities, and immigration needs. On this basis, it is important to ensure valid and fair uses of such tests for each and every test taker (Kormos & Taylor, 2021). Enhancing the use of fair testing procedures and strategies might protect principals, teachers, parents, and test takers by maximizing ethicality, justice, fairness, equality, and learning, and by minimizing marginalization and discrimination (Shohamy, 2017). Owing to this significance, the current research focused on how tests might be made more inclusive, equal, less biased, fair, just, and democratic. A model is presented which may well have a bearing on the testing of today and tomorrow by providing a detailed illustration of what different stakeholders are required to do to meet the fairness needs of a given group of test takers. According to the model, test fairness encompasses the overall characteristic of a well-constructed, well-administered, well-scored, well-reported, well-interpreted, and well-used test that ideally minimizes bias against certain test takers and observes the principles of validity, security, explicitness, accountability, equality, and rights. While different factors such as scoring, reporting, and decision-making are interconnected, each plays a distinct role in ensuring test fairness, and it should be kept in mind that test validity is viewed as a sub-component of a broader test fairness concept in this study. This brief illustration of the interplay between test fairness and the previously mentioned factors could have important implications for the testing practices of today and tomorrow. According to the study findings, a fair test could be conceptualized as one that:
- ensures the reliability of the scores, the validity of the interpretations, and the appropriateness of its construction and structure;
- ensures the uniformity and suitability of the administration procedures;
- is scored by trained and experienced scorers adopting valid scoring rubrics;
- is reported flawlessly and in detail at a predetermined time through an accessible medium;
- ensures the appropriateness of test-based interpretations and uses and does not bring about adverse effects on the future of the test takers;
- follows fair standard-setting procedures, sets explicit standards, and prenotifies the standards to the test takers;
- ensures the security of the pre-, while-, and post-test administration procedures and observes accountability by being responsible and answerable to the test takers and the public;
- ensures the equality and uniformity of teaching and testing for all the test takers and is not biased against particular examinees;
- monitors the fulfillment of the test takers' rights and the accessibility and affordability of the test;
- is not offensive to particular groups of test takers and offers appropriate remedies, such as re-scoring, re-evaluation, and legal remedies, to reverse the detrimental consequences of unfair tests.
Despite the many contributions of the model discussed so far, it is important to acknowledge certain limitations of the present study. The reliance on self-reported data, for instance, could introduce biases, as participants might provide socially desirable responses rather than report their true experiences. Moreover, while the sample was nationally representative, the sample size and the specific cultural, sociopolitical, and educational dynamics of the study context might limit the applicability of the findings across different contexts. However, these limitations should not be considered a substantial barrier to the contributions of the study, as the model proposed here is widely informed by an exhaustive literature review and firmly grounded in established theoretical frameworks, which reinforces its potential relevance beyond the Iranian context. Indeed, while the results might be specifically contextualized, the test fairness principles described, including reliability, validity, accountability, equality, etc., are universally applicable and resonate with global standards in educational testing and assessment, providing grounded knowledge that can inform cross-cultural comparisons and adaptations and serve as a stepping stone for future research.
Future research is invited to explore the proposed model in various cultural and educational settings, including larger and more diverse samples, to refine and validate the model further and to promote its applicability and robustness across diverse educational contexts.
In addition, it is essential to point out that although socio-economic factors were integrated into the equality dimension of the model, further exploration of the nuanced ways these factors interact with other dimensions of test fairness remains an important avenue for future research, one that could provide deeper insights into how socio-economic disparities affect test outcomes.
Additionally, future studies on the efficacy of existing policies pertaining to test fairness and their execution in educational systems are recommended, as such studies could disclose how well these policies enhance fairness and detect areas for improvement. Finally, there remains abundant room for further longitudinal research to assess how stakeholders' perceptions of test fairness evolve over time, providing insights into the long-term effects of fairness practices on societal perceptions and on educational, vocational, and psychological outcomes.
Clinical trial number
Not applicable.
Approval committee members
Hassan Soodmand Afshar (Professor, Bu-Ali Sina University)
Ayatollah Fazeli Manie (Assistant Prof., Bu-Ali Sina University)
Reza Taherkhani (Assistant Prof., Bu-Ali Sina University)
Mohammad Hadi Mahmoodi (Assistant Prof., Bu-Ali Sina University)
Authors’ contributions
M. A. S. contributed to the study through topic nomination, study design, supervision, data analysis and interpretation, and revision of drafts. Sh. B. contributed to data collection, data curation, and the amelioration of drafts. The study from which this paper is derived was the Ph.D. dissertation project of the second author, conducted under the supervision of Prof. Mohammad Ahmadi Safa.
Funding
Bu-Ali Sina University provided funding for this project.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
All participants expressed their consent to participate in the study.
Competing interests
The authors declare no competing interests.
Abbreviations
AVE: Average Variance Extracted
ASV: Average Shared Variance
CVI: Content Validity Index
CVR: Content Validity Ratio
KMO: Kaiser–Meyer–Olkin
MSV: Maximum Shared Variance
RTFF: Revised Test Fairness Framework
RTFM: Revised Test Fairness Model
RTFQ: Revised Test Fairness Questionnaire
TFF: Test Fairness Framework
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
Ahmadi Safa, M., & Beheshti, S. (2022). Reconceptualization of test fairness model: A grounded theory approach [Paper presentation]. The 10th European Conference on Language Learning (ECLL), University College London (UCL), London, UK. https://vimeo.com/iafor/64117.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing.
American Psychological Association. (2020). Publication manual of the American Psychological Association (7th ed.). https://doi.org/10.1037/0000165-000
Bartlett, M. S. (1954). A note on the multiplying factors for various chi square approximations. Journal of the Royal Statistical Society, 16, 296–298. https://doi.org/10.1111/j.2517-6161.1954.tb00174.x
Beheshti, Sh. (2023). Reconceptualization of test fairness model: A questionnaire-based validation of the test fairness construct. [Unpublished doctoral dissertation]. Bu-Ali Sina University.
Beheshti, S., & Ahmadi Safa, M. (2023). Reconceptualization of Test Fairness Model: A Grounded Theory Approach. Iranian Journal of Language Teaching Research, 11(2), 119-146. https://doi.org/10.30466/ijltr.2023.121333.
Brooks, R. L. (2017). Language assessment in the US government. In E. Shohamy, I. G. Or, & S. May (Eds.), Language testing and assessment (3rd ed., pp. 64–76). Springer International Publishing. https://doi.org/10.1007/978-3-319-02261-1_4
Byrne, B. M. (2016). Structural equation modeling with Amos: Basic concepts, applications, and programming (3rd ed.). Routledge. https://doi.org/10.4324/9781315757421
Cretu, A. E., & Brodie, R. J. (2009). Brand image, corporate reputation, and customer value. Advances in Business Marketing and Purchasing, 15, 263–387. https://doi.org/10.1108/S1069-0964(2009)0000015011
Darlington, R. B. (1971). Another look at cultural fairness. Journal of Educational Measurement, 8, 71–82. https://doi.org/10.1111/j.1745-3984.1971.tb00908.x
Davies, A. (2010). Test fairness: A response. Language Testing, 27.
Davies, A. (2017). Ethics, professionalism, rights, and codes. In E. Shohamy, I. G. Or, & S. May (Eds.), Language testing and assessment (3rd ed., pp. 397–415). Springer International Publishing. https://doi.org/10.1007/978-3-319-02261-1_27
Dorans, N. J., & Cook, L. L. (2016). Introduction. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 1–6). Routledge. https://doi.org/10.4324/9781315774527
Dornyei, Z. (2007). Research methods in applied linguistics. Oxford University Press.
Eckes, T. (2017). Rater effects: Advances in item response modeling for human ratings – Part I. Psychological Test and Assessment Modeling, 59.
Educational Testing Service. (2014). ETS standards for quality and fairness. https://www.ets.org/s/about/pdf/standards.pdf
Educational Testing Service. (2016). ETS international principles for the fairness of assessments: A manual for developing locally appropriate fairness guidelines for various countries. https://www.ets.org/s/about/pdf/fairness_review_international.pdf
Educational Testing Service. (2022). ETS guidelines for developing fair tests and communications. https://www.ets.org/s/about/pdf/ets_guidelines_for_fair_tests_and_communications.pdf
Ferbezar, I., & Stanovnik, P. L. (2021). How to challenge prejudice in assessing the productive skills of speakers of closely related languages (the case of Slovenia). In B. Lanteigne, C. Coombe, & J. D. Brown (Eds.), Challenges in language testing around the world: Insights for language test users (pp. 201–220). Springer. https://doi.org/10.1007/978-981-33-4232-3_15
Field, J. (2013). Cognitive validity. In A. Geranpayeh & L. Taylor (Eds.), Examining listening: Research and practice in assessing second language listening (pp. 77–151). Cambridge University Press.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. Routledge.
Haertel, E., & Ho, A. (2016). Fairness using derived scores. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 239–266). Routledge. https://doi.org/10.4324/9781315774527-15
Hamilton, L. S., & Koretz, D. M. (2002). Tests and their use in test-based accountability systems. In L. S. Hamilton, B. M. Stecher, & S. P. Klein (Eds.), Making sense of test-based accountability in education (pp. 13–49). RAND Corporation. https://www.rand.org/content/dam/rand/pubs/monograph_reports/2002/MR1554.pdf
Harding, L. (2012). Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective. Language Testing, 29(2), 163–180.
Henseler, J., Ringle, C. M., & Sinkovics, R. R. (2009). The use of partial least squares path modeling in international marketing. Advances in International Marketing, 20, 277–319.
Inbar-Lourie, O. (2017). Language assessment literacy. In E. Shohamy, I. G. Or, & S. May (Eds.), Language testing and assessment (3rd ed., pp. 257–270). Springer International Publishing. https://doi.org/10.1007/978-3-319-02261-1_19
International Test Commission. (2014). ITC guidelines on quality control in scoring, test analysis, and reporting of test scores. International Journal of Testing, 14(3), 195–217.
Joint Committee on Testing Practices. (2004). Code of fair testing practices in education. https://www.apa.org/science/programs/testing/fair-testing.pdf
Kaiser, H. F. (1974). An index of factorial simplicity. Psychometrika, 39, 31–36. https://doi.org/10.1007/BF02291575
Kane, M. (2010). Validity and fairness. Language Testing, 27(2), 177–182.
Karatas, T. O., & Okan, Z. (2021). A conceptual framework on the power of language tests as social practice. In B. Lanteigne, C. Coombe, & J. D. Brown (Eds.), Challenges in language testing around the world: Insights for language test users (pp. 47–56). Springer.
Kormos, J., & Taylor, L. B. (2021). Testing the L2 of learners with specific learning difficulties. In P. Winke & T. Brunfaut (Eds.), The Routledge handbook of second language acquisition and language testing (pp. 413–421). Taylor & Francis. https://doi.org/10.4324/9781351034784-3
Kremmel, B., & Harding, L. (2020). Towards a comprehensive, empirical model of language assessment literacy across stakeholder groups: Developing the language assessment literacy survey. Language Assessment Quarterly, 17(1), 100–120.
Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European language testing in a global context: Proceedings of the ALTE Barcelona conference (pp. 27–48). Cambridge University Press.
Kunnan, A. J. (2008). Towards a model of test evaluation: Using the test fairness and wider context frameworks. In L. Taylor & C. Weir (Eds.), Multilingualism and assessment: Achieving transparency, assuring quality, sustaining diversity (pp. 229–251). Cambridge University Press. https://www.cambridgeenglish.org/images/329231-studies-in-language-testing-volume-27.pdf
Kunnan, A. J. (2010). Test fairness and Toulmin's argument structure. Language Testing, 27(2), 183–189.
Kunnan, A. J. (2018). Evaluating language assessments. Routledge.
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28, 563–575. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x
Leitner, K., & Kremmel, B. (2021). Avoiding scoring malpractice: Supporting reliable scoring of constructed-response items in high-stakes exams. In B. Lanteigne, C. Coombe, & J. D. Brown (Eds.), Challenges in language testing around the world: Insights for language test users (pp. 127–146). Springer. https://doi.org/10.1007/978-981-33-4232-3_10
McNamara, T. (2008). The socio-political and power dimensions of tests. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education: Vol. 7. Language testing and assessment (2nd ed., pp. 415–428). Springer.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Wiley-Blackwell.
McNamara, T., Knoch, U., & Fan, J. (2019). Fairness, justice, and language assessment: The role of measurement. Oxford University Press.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan.
Phillips, S. E. (2016). Legal aspects of test fairness. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 239–266). Routledge. https://doi.org/10.4324/9781315774527-16
Shohamy, E. (2017). Critical language testing. In E. Shohamy, I. G. Or, & S. May (Eds.), Language testing and assessment (3rd ed., pp. 441–454). Springer International Publishing. https://doi.org/10.1007/978-3-319-02261-1_26
Sinharay, S. (2016). Commentary on ensuring fairness in test design, construction, administration, and scoring. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 97–108). Routledge. https://doi.org/10.4324/9781315774527-7
Sireci, S. G., Rios, J. A., & Powers, S. (2016). Comparing scores from tests administered in different languages. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 181–202). Routledge. https://doi.org/10.4324/9781315774527-12
Smarter Balanced Assessment Consortium. (2012). Bias and sensitivity guidelines. https://portal.smarterbalanced.org/library/en/v1.0/bias-and-sensitivity-guidelines.pdf
Spolsky, B. (1990). Social aspects of individual assessment. In J. H. A. L. de Jong & D. K. Stevenson (Eds.), Individualizing the assessment of language abilities (pp. 3–15). Multilingual Matters.
Stone, E. A., & Cook, L. (2016). Testing individuals in special populations. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 157–181). Routledge. https://doi.org/10.4324/9781315774527-11
Tabachnick, B. G., & Fidell, L. S. (2019). Using multivariate statistics (7th ed.). Pearson Education.
Tsagari, D., & Cheng, L. (2017). Washback, impact, and consequences revisited. In E. Shohamy, I. G. Or, & S. May (Eds.), Language testing and assessment (3rd ed., pp. 359–372). Springer International Publishing. https://doi.org/10.1007/978-3-319-02261-1_24
Willingham, W. W. (1999). A systemic view of test fairness. In S. J. Messick (Ed.), Assessment in higher education: Issues in access, quality, student development, and public policy (pp. 213–242). Lawrence Erlbaum Associates.
Wollack, J. A., & Case, S. M. (2016). Maintaining fairness through test administration. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 33–53). Routledge. https://doi.org/10.4324/9781315774527-4
Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27(2), 147–170.
Zieky, M. J. (2016). Fairness in test design and development. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 9–32). Routledge. https://doi.org/10.4324/9781315774527-3
© The Author(s) 2025. This work is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-nc-nd/4.0/).