Abstract
Agreement between raters on a categorical scale is not only a subject of scientific research but also a problem frequently encountered in practice. Whenever a new scale is developed to assess individuals or items in a given context, inter-rater agreement is a prerequisite for the scale to be implemented in routine use. Cohen's kappa coefficient is a landmark in the development of rater agreement theory. This coefficient, which marked a radical departure from previously proposed indexes, opened a new field of research in the domain.
In the first part of this work, after a brief review of agreement on a quantitative scale, the kappa-like family of agreement indexes is described in various instances: two raters, several raters, an isolated rater and a group of raters, and two groups of raters. To quantify the agreement between two individual raters, Cohen's kappa coefficient (Cohen, 1960) and the intraclass kappa coefficient (Kraemer, 1979) are widely used for binary and nominal scales, while the weighted kappa coefficient (Cohen, 1968) is recommended for ordinal scales. An interpretation of the quadratic (Schuster, 2004) and the linear (Vanbelle and Albert, 2009c) weighting schemes is given. Cohen's kappa (Fleiss, 1971) and intraclass kappa (Landis and Koch, 1977c) coefficients were extended to the case where agreement is sought among several raters. Next, the kappa-like family of agreement coefficients is extended to the case of an isolated rater and a group of raters (Vanbelle and Albert, 2009a) and to the case of two groups of raters (Vanbelle and Albert, 2009b). These agreement coefficients are derived from a population-based model and reduce to the well-known Cohen's kappa coefficient in the case of two single raters. The proposed agreement indexes are also compared to existing methods, namely the consensus method and Schouten's agreement index (Schouten, 1982), and the superiority of the new approach over the latter is shown.
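For reference, the standard definitions underlying these coefficients can be written as follows; the notation (joint proportions $p_{kl}$, marginals $p_{k\cdot}$ and $p_{\cdot l}$, $K$ categories) is introduced here for illustration and is not taken from the abstract itself. Cohen's kappa contrasts observed agreement with agreement expected by chance, and the weighted kappa generalizes it through agreement weights $w_{kl}$, with linear and quadratic weights as particular choices:

\[
\hat{\kappa} = \frac{p_o - p_e}{1 - p_e}, \qquad
p_o = \sum_{k=1}^{K} p_{kk}, \qquad
p_e = \sum_{k=1}^{K} p_{k\cdot}\, p_{\cdot k},
\]
\[
\hat{\kappa}_w = \frac{\sum_{k,l} w_{kl}\, p_{kl} - \sum_{k,l} w_{kl}\, p_{k\cdot}\, p_{\cdot l}}
                     {1 - \sum_{k,l} w_{kl}\, p_{k\cdot}\, p_{\cdot l}},
\qquad
w_{kl}^{\mathrm{lin}} = 1 - \frac{|k-l|}{K-1}, \qquad
w_{kl}^{\mathrm{quad}} = 1 - \frac{(k-l)^2}{(K-1)^2}.
\]

With identity weights ($w_{kk} = 1$ and $w_{kl} = 0$ for $k \neq l$), the weighted kappa reduces to Cohen's kappa.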
In the second part of the work, methods for hypothesis testing and data modeling are discussed. First, the method proposed by Fleiss (1981) for comparing several independent agreement indexes is presented. Then, a bootstrap method, initially developed by McKenzie et al. (1996) to compare two dependent agreement indexes, is extended to several dependent agreement indexes (Vanbelle and Albert, 2008). All these methods apply equally to the kappa coefficients introduced in the first part of the work. Next, regression methods for testing the effect of continuous and categorical covariates on the agreement between two or several raters are reviewed. These include the weighted least-squares method, which accommodates only categorical covariates (Barnhart and Williamson, 2002), and a regression method based on two sets of generalized estimating equations. The latter method was developed for the intraclass kappa coefficient (Klar et al., 2000), Cohen's kappa coefficient (Williamson et al., 2000) and the weighted kappa coefficient (Gonin et al., 2000). Finally, a heuristic method restricted to the case of independent observations (Lipsitz et al., 2001, 2003) is presented, which turns out to be equivalent to the generalized estimating equations approach. These regression methods are compared to the bootstrap method extended by Vanbelle and Albert (2008), but they have not been generalized to agreement between a single rater and a group of raters, nor between two groups of raters.
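To make the resampling idea concrete, the following Python sketch compares two dependent Cohen's kappa coefficients (for example, two pairs of raters scoring the same subjects) with a percentile bootstrap. It illustrates the general principle only, not the exact procedures of McKenzie et al. (1996) or Vanbelle and Albert (2008); the function names and arguments are chosen here purely for illustration.

    import numpy as np

    def cohen_kappa(r1, r2, n_cat):
        """Cohen's kappa for two vectors of ratings coded 0..n_cat-1."""
        r1, r2 = np.asarray(r1), np.asarray(r2)
        p = np.zeros((n_cat, n_cat))
        for a, b in zip(r1, r2):
            p[a, b] += 1
        p /= p.sum()
        po = np.trace(p)                      # observed agreement
        pe = p.sum(axis=1) @ p.sum(axis=0)    # chance agreement from the marginals
        return (po - pe) / (1 - pe)

    def bootstrap_diff_kappa(rA1, rA2, rB1, rB2, n_cat, n_boot=2000, seed=0):
        """Percentile bootstrap interval for the difference of two dependent kappas.

        The two kappa statistics are dependent because they are computed on the
        same subjects, so each bootstrap replicate resamples whole subjects and
        recomputes both coefficients on the same resampled set.
        """
        rA1, rA2 = np.asarray(rA1), np.asarray(rA2)
        rB1, rB2 = np.asarray(rB1), np.asarray(rB2)
        rng = np.random.default_rng(seed)
        n = len(rA1)
        diffs = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)  # resample subjects with replacement
            kA = cohen_kappa(rA1[idx], rA2[idx], n_cat)
            kB = cohen_kappa(rB1[idx], rB2[idx], n_cat)
            diffs[b] = kA - kB
        return np.percentile(diffs, [2.5, 97.5])

If the resulting 95% percentile interval for the difference excludes zero, the two dependent agreement indexes are declared significantly different at the 5% level.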