Abstract
Objective: To examine the reliability of scores obtained from a proposed critical appraisal tool (CAT).
Study Design and Setting: Based on a random sample of 24 health-related research papers, the scores from the proposed CAT were examined using intraclass correlation coefficients (ICCs), generalizability theory (G theory), and participants' feedback.
Results: The ICC for all research papers was 0.83 (consistency) and 0.74 (absolute agreement) for four participants. For individual research designs, the highest ICC (consistency) was for qualitative research (0.91) and the lowest was for descriptive, exploratory, and observational research (0.64). The G study showed a moderate research design effect (32%) for scores averaged across all papers. The research design effect was mainly in the Sampling, Results, and Discussion categories (44%, 46%, and 34%, respectively). Within each research design, the paper effect accounted for the majority of score variance (53-70%), with small to moderate rater or paper × rater interaction effects (0-27%).
Conclusion: Possible reasons for the research design effect were that the participants were unfamiliar with some of the research designs and that papers were not matched to participants' expertise. Even so, the proposed CAT showed great promise as a tool that can be used across a wide range of research designs.
Category | TE C | TE A | QE C | QE A | SS C | SS A | DEO C | DEO A | QL C | QL A | SR C | SR A | All C | All A |
Preamble | 0.80 | 0.80 | 0.49 | 0.28 | 0 | 0 | 0.26 | 0.24 | 0.68 | 0.65 | 0.89 | 0.78 | 0.58 | 0.50 |
Introduction | 0 | 0 | 0 | 0 | 0.91 | 0.57 | 0.56 | 0.38 | 0.78 | 0.71 | 0.33 | 0.08 | 0.70 | 0.53 |
Design | 0.81 | 0.72 | 0.81 | 0.76 | 0 | 0 | 0 | 0 | 0.81 | 0.57 | 0.92 | 0.75 | 0.73 | 0.65 |
Sampling | 0.69 | 0.73 | 0 | 0 | 0 | 0 | 0.67 | 0.63 | 0.84 | 0.46 | 0.59 | 0.21 | 0.66 | 0.62 |
Data collection | 0 | 0 | 0.80 | 0.70 | 0.65 | 0.48 | 0 | 0 | 0.95 | 0.85 | 0.56 | 0.57 | 0.54 | 0.52 |
Ethical matters | 0.72 | 0.73 | 0.76 | 0.62 | 0.93 | 0.87 | 0.81 | 0.79 | 0.79 | 0.62 | 0.24* | 0.20* | 0.84 | 0.78 |
Results | 0 | 0 | 0.55 | 0.40 | 0.30 | 0.13 | 0 | 0 | 0.14 | 0.06 | 0.75 | 0.62 | 0.75 | 0.60 |
Discussion | 0.53 | 0.37 | 0.81 | 0.58 | 0.59 | 0.39 | 0 | 0 | 0.84 | 0.75 | 0 | 0 | 0.68 | 0.62 |
Total score (%) | 0.75 | 0.73 | 0.72 | 0.60 | 0.85 | 0.57 | 0.64 | 0.65 | 0.91 | 0.67 | 0.89 | 0.67 | 0.83 | 0.74 |
Table 1 - Summary of ICCs (k = 4, excludes rater III; n = 4 papers per research design, N = 24 papers in total)
Abbreviations: ICC, intraclass correlation coefficient; TE, True experimental; QE, Quasi-experimental; SS, Single system; DEO, Descriptive, exploratory, observational; QL, Qualitative; SR, Systematic review; C, Consistency; A, Absolute agreement; k, No. of raters; *k = 2 (missing data).
Effect | % |
p:d | 38 |
d | 32 |
r | 8 |
dr | 5 |
dc | 0 |
pr:d | 7 |
pc:d | 4 |
rc | 1 |
drc | 0 |
prc:d | 3 |
Table 2 - Percentage mean variance components (k = 4, excludes rater III); average for all papers
Abbreviations: p, Paper; d, Research design; r, Rater (random); c, Category (fixed); Object of measure, p:d.
Effect | TE | QE | SS | DEO | QL | SR |
p | 60 | 70 | 59 | 53 | 60 | 65 |
r | 0 | 0 | 15 | 0 | 27 | 22 |
pr | 13 | 10 | 8 | 27 | 0 | 3 |
pc | 7 | 5 | 9 | 9 | 6 | 1 |
rc | 10 | 7 | 6 | 2 | 3 | 5 |
prc | 10 | 9 | 4 | 8 | 4 | 5 |
Table 3 - Percentage mean variance components (k = 4, excludes rater III); average for each research design
Abbreviations: p, Paper; r, Rater (random); c, Category (fixed); Object of measure, p; TE, True experimental; QE, Quasi-experimental; SS, Single system; DEO, Descriptive, exploratory, observational; QL, Qualitative; SR, Systematic review.
Effect | Preamble | Introduction | Design | Sampling | Data collection | Ethical matters | Results | Discussion | Total score |
p:d | 34 | 28 | 43 | 21 | 52 | 72 | 17 | 29 | 44 |
d | 17 | 25 | 23 | 44 | 1 | 4 | 46 | 34 | 31 |
r | 12 | 24 | 10 | 3 | 3 | 7 | 18 | 8 | 9 |
dr | 9 | 0 | 9 | 12 | 5 | 2 | 2 | 8 | 5 |
pr:d | 28 | 23 | 16 | 21 | 40 | 15 | 17 | 21 | 10 |
Table 4 - Percentage mean variance components (k = 4, excludes rater III); average for each category
Abbreviations: p, Paper; d, Research design; r, Rater (random); Object of measure, p:d.
Category | Effect | TE | QE | SS | DEO | QL | SR |
Preamble | p | 80 | 28 | 0 | 24 | 65 | 78 |
r | 0 | 44 | 63 | 6 | 5 | 12 | |
pr | 20 | 28 | 37 | 70 | 30 | 10 | |
Introduction | p | 0 | 0 | 57 | 38 | 71 | 8 |
r | 42 | 5 | 37 | 32 | 9 | 76 | |
pr | 58 | 95 | 6 | 30 | 20 | 16 | |
Design | p | 72 | 76 | 0 | 0 | 57 | 75 |
r | 12 | 6 | 67 | 38 | 29 | 19 | |
pr | 17 | 17 | 33 | 62 | 13 | 6 | |
Sampling | p | 69 | 0 | 0 | 63 | 46 | 21 |
r | 0 | 0 | 38 | 5 | 46 | 64 | |
pr | 31 | 100 | 62 | 32 | 9 | 15 | |
Data collection | p | 0 | 70 | 48 | 0 | 85 | 56 |
r | 0 | 12 | 26 | 0 | 11 | 0 | |
pr | 100 | 18 | 26 | 100 | 5 | 44 | |
Ethical matters | p | 72 | 62 | 87 | 78 | 62 | 38* |
r | 0 | 19 | 7 | 3 | 22 | 13* | |
pr | 28 | 19 | 6 | 18 | 16 | 49* | |
Results | p | 0 | 40 | 13 | 0 | 6 | 62 |
r | 71 | 27 | 56 | 9 | 60 | 18 | |
pr | 29 | 33 | 31 | 91 | 34 | 21 | |
Discussion | p | 37 | 58 | 39 | 0 | 75 | 0 |
r | 30 | 29 | 33 | 0 | 10 | 59 | |
pr | 33 | 13 | 27 | 100 | 15 | 41 |
Table 5 - Percentage mean variance components (k = 4, excludes rater III); average for research design by category
Abbreviations: p, Paper; r, Rater (random); Object of measure, p; TE, True experimental; QE, Quasi-experimental; SS, Single system; DEO, Descriptive, exploratory, observational; QL, Qualitative; SR, Systematic review; k, No. of raters; *k = 2.
No. of raters | TE Eρ2 | TE Φ | QE Eρ2 | QE Φ | SS Eρ2 | SS Φ | DEO Eρ2 | DEO Φ | QL Eρ2 | QL Φ | SR Eρ2 | SR Φ | All Eρ2 | All Φ |
1 | 0.43 | 0.40 | 0.40 | 0.27 | 0.59 | 0.25 | 0.30 | 0.30 | 0.73 | 0.34 | 0.67 | 0.34 | 0.52 | 0.25 |
2 | 0.60 | 0.58 | 0.57 | 0.43 | 0.74 | 0.40 | 0.47 | 0.47 | 0.84 | 0.51 | 0.81 | 0.50 | 0.68 | 0.35 |
3 | 0.69 | 0.67 | 0.66 | 0.53 | 0.81 | 0.50 | 0.57 | 0.57 | 0.89 | 0.61 | 0.86 | 0.60 | 0.76 | 0.41 |
4 | 0.75 | 0.73 | 0.72 | 0.60 | 0.85 | 0.57 | 0.64 | 0.64 | 0.91 | 0.67 | 0.89 | 0.67 | 0.81 | 0.44 |
5 | 0.79 | 0.77 | 0.77 | 0.65 | 0.88 | 0.62 | 0.69 | 0.69 | 0.93 | 0.72 | 0.91 | 0.72 | 0.84 | 0.47 |
⋮ (values for 6-9 raters omitted)
10 | 0.88 | 0.87 | 0.87 | 0.79 | 0.94 | 0.77 | 0.81 | 0.81 | 0.96 | 0.84 | 0.95 | 0.84 | 0.92 | 0.52 |
Table 6 - D study (excludes rater III; n = 4 papers per research design, N = 24 papers in total)
Abbreviations: TE, True experimental; QE, Quasi-experimental; SS, Single system; DEO, Descriptive, exploratory, observational; QL, Qualitative; SR, Systematic review; Eρ2, G coefficient for relative error; Φ, G coefficient for absolute error.
1 Background
What is new?
* The proposed critical appraisal tool (CAT) can be used across a variety of research designs.
* The reliability of scores from the proposed CAT was tested using the intraclass correlation coefficient (ICC) and generalizability theory (G theory).
* The proposed CAT obtained consistency ICCs ranging from 0.64 to 0.91, depending on the research design of the paper being appraised.
* G theory showed where raters ran into difficulty based on research design and experience with the subject material in the papers.
* Raters were positive toward the proposed CAT and thought it could also be used as a template for writing and for peer-reviewing research papers.
Critical appraisal, a core technique in evidence-based practice and systematic reviews, is a standardized way of assessing research so that decisions can be made based on the best evidence available [1,2]. To make critical appraisal more efficient, a large number of critical appraisal tools (CATs) have been developed [1,3]. Unfortunately, many of these CATs have fundamental flaws that prevent them from being truly useful for appraising research. These problems include tools that are limited in the research designs they can assess, tools that lack comprehensiveness in their appraisal approach, and tools that use inappropriate scoring methods that can hide poor research [1,3-6]. Of greatest concern is that CATs are used to assess the validity and reliability of research, yet many CATs have been published with little or no evidence of their own validity or reliability [1,7,8]. A review of 44 papers reporting how CATs were designed found that 37 (84%) had little or no evidence for validity and 33 (75%) had no evidence for reliability [1].
Based on the review and the evidence available for designing CATs, a new structure for a CAT was proposed [1]. The proposed CAT attempts to overcome the shortfalls of previous CATs by having a structure that can be used across all research types, comprehensively assessing research, and using an appropriate scoring system [1,9]. The structure was initially based on research validity; however, that approach was abandoned because:
1. Assessment of the research was often limited to internal research validity, ignoring external and conclusion validity;
2. Issues such as clear objectives or reasons on which certain decisions were made do not readily fit research validity but are still considered within critical appraisal;
3. Using research validity criteria to assess different types of research was difficult and time consuming [1].
Therefore, based on seven reporting guidelines and research methods theory, which incorporate internal, external, and conclusion validity, a structure for the proposed CAT was developed [1]. The proposed CAT consists of eight categories, constructed so that the items within a category are similar to one another and the categories themselves are dissimilar. The categories are Preamble, Introduction, Design, Sampling, Data collection, Ethical matters, Results, and Discussion. Within each category, there are a number of items to be examined, each of which is further described, as can be seen in Fig. 1.
The next step was to determine whether the proposed CAT could validly appraise different research designs [9]. The validation process closely followed the guidelines outlined in the Standards for educational and psychological testing, which require a combination of theory, empirical evidence, and a context for validity testing [10]. The validity study had two major aims: to develop a scoring system for the proposed CAT and to determine whether each of the categories was necessary to appraise research [9]. The scoring system does not require each item to be scored individually. Instead, items are marked as present, absent but should be present, or not applicable based on the research design used in the paper being appraised. However, this is not a simple checklist, which can lead to inflexibility and inaccuracy. The appraiser decides what score a category should receive based on the marked items plus their overall assessment of that category. Each category is scored on a scale from zero (no evidence) to five (highest evidence), using whole numbers (integers) only. Furthermore, evidence must be stated in the paper and cannot be assumed, which is in keeping with other CATs, reporting guidelines, and procedures for conducting systematic reviews [3,11,12]. The reasoning is that, as when appraising a student's work, scores are based on what was written and not on what was meant to be written.
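The article does not spell out the arithmetic for combining the eight category scores into a Total score (%), so the Python sketch below simply assumes the total is the sum of the 0-5 category scores expressed as a percentage of the maximum of 40; the function name and example values are illustrative, not part of the proposed CAT.

```python
from typing import Mapping

# The eight categories of the proposed CAT.
CATEGORIES = [
    "Preamble", "Introduction", "Design", "Sampling",
    "Data collection", "Ethical matters", "Results", "Discussion",
]

def total_score_percent(category_scores: Mapping[str, int]) -> float:
    """Combine eight 0-5 category scores into a Total score (%).

    Assumes the total is the sum of category scores as a percentage of the
    maximum possible (8 categories x 5 points = 40).
    """
    for name in CATEGORIES:
        score = category_scores[name]
        if not (isinstance(score, int) and 0 <= score <= 5):
            raise ValueError(f"{name}: scores must be whole numbers from 0 to 5")
    return 100 * sum(category_scores[name] for name in CATEGORIES) / 40

# Hypothetical appraisal of a single paper.
example = {
    "Preamble": 3, "Introduction": 4, "Design": 4, "Sampling": 3,
    "Data collection": 4, "Ethical matters": 5, "Results": 3, "Discussion": 4,
}
print(total_score_percent(example))  # 75.0
```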
A full user guide was developed for the proposed CAT to assist with scoring and as a necessary component of validity [9]. The validity process also tested the proposed CAT against five CATs previously tested for validity and reliability [8,13-16]. This showed that all the categories used in the proposed CAT, except Preamble , could be considered suitable for critical appraisal. There was insufficient evidence to make a decision on the Preamble category because it could only be compared against one of the five alternative CATs. Therefore, it was decided to include the Preamble category in the proposed CAT until more evidence could be gathered [9].
The third step, after design and validity, and the overall objective of this study was to examine the reliability of scores obtained from the proposed CAT. Reliability in this instance refers to how closely a number of different raters agree on the score that should be given to a particular piece of research (i.e., inter-rater reliability). The intraclass correlation coefficient (ICC) is a common method of measuring reliability [8,16,17]. However, the ICC is based on classical test theory and splits a score into true score and error, which does not allow further analysis of exactly where the error in the score occurred. Generalizability theory (G theory) breaks scores down into the universe (or true) score plus the error attributable to, for example, the tool, raters, environmental conditions, or any other factor that may have influenced the score; this analysis is called a G study [18-20]. This ability to locate where errors occur was seen as vital to understanding the proposed CAT. However, it should be noted that, except in specific circumstances, ICCs and G coefficients cannot be directly compared [20].
Given the previous use of ICCs and the flexibility of G theory, the aims of this reliability study were to determine:
1. Whether the scores obtained by the proposed CAT were reliable as determined by the ICC;
2. Which factors (or facets) contributed most to the mean variances in scores (G study);
3. How many raters may be required per paper to obtain optimal reliability of the scores (Decision or D study); and,
4. Participants' feedback on the proposed CAT and user guide.
2 Methods
2.1 Design
The study design was exploratory because the purpose was to discover whether the scores obtained by the proposed CAT were reliable. Each participant was given a random selection of papers to appraise. The papers selected were based on six research designs:
1. True experimental;
2. Quasi-experimental;
3. Single system;
4. Descriptive, exploratory, observational (DEO);
5. Qualitative; and,
6. Systematic review.
The reasons these research designs were chosen and how the individual papers were selected were fully outlined in a previous study [9]. Briefly, the research designs were based on broad groupings from the literature [21-23]. A predetermined search strategy was used to find papers based on the research designs and to limit results to substantial research papers. The papers were randomly selected from the full-text journals subscribed to by James Cook University (JCU) through OvidSP (Ovid, New York) in September 2009, using the random sequence generator available from random.org [24]. Each paper was read by the main author (M.C.) to confirm the research design. This ensured that the papers selected belonged to the research designs required. A full list of the papers used in this study is available in Appendix A on the journal's Web site at www.jclinepi.com.
Ethical approval for this study was obtained from JCU Human Ethics Committee (H3415). Informed consent was obtained from each participant before they voluntarily took part in the study. Participants could withdraw at any stage without explanation or prejudice. The authors have no conflicts of interest to declare and received no funding to complete this study.
A sample size calculation showed that a minimum of five participants appraising four papers each was required to obtain an ICC (r) of 0.90 (α = 95%, 1 - β = 0.85, r_min = 0.40) [25]. With six research designs being tested, this meant a total of 24 papers were required per participant. Participants were recruited from a convenience sample of staff from the JCU Schools of Public Health, Tropical Medicine & Rehabilitation Science, and Medicine & Dentistry. Staff were e-mailed regarding the study, and a total of six participants voluntarily agreed to enroll.
Before any data were collected, it was decided that missing data from a participant would be scored based on the median score of the same category, from the same research design, for that participant, rounded to the nearest integer. For example, if the missing value for a participant was from a true experimental design paper, in the Sampling category, then the median of the Sampling category from the remaining true experimental design papers for that participant would constitute the missing value. This strategy was used because it had the least effect on rankings, given that the statistical analysis used was based on ranking data.
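A minimal sketch of this imputation rule in Python, assuming the appraisal scores are held in a long-format pandas DataFrame; the column names (participant, design, category, score) are illustrative and not taken from the study:

```python
import pandas as pd

def impute_missing_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Fill a participant's missing category score with their own median score
    for the same category and research design, rounded to the nearest integer."""
    def fill(scores: pd.Series) -> pd.Series:
        median = scores.median(skipna=True)
        if pd.isna(median):
            return scores                          # nothing to impute from
        return scores.fillna(int(median + 0.5))    # scores are 0-5, so round halves up

    out = df.copy()
    out["score"] = out.groupby(["participant", "design", "category"])["score"].transform(fill)
    return out
```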
It has been stated in the literature that reliability scores for nonclinical tools should be at least 0.70 and for clinical tools at least 0.90, although the method used to calculate these scores is generally not declared [20]. However, for this study there were three reasons why the reliability scores might be lower than expected or show some inconsistency. First, the papers were a random selection, which meant that there was no coherence between them, which may in turn have made appraisal more difficult. Second, the papers were taken from the broad field of health research, whereas most appraisal of research is confined to one field or a limited number of related fields. Third, some or all of the papers may have been outside a participant's expertise, making it more difficult for them to rate papers accurately. Even given these issues, the design, sample, and data collection were conducted in a manner in which a high reliability coefficient was expected.
2.2 Data collection
All participants were supplied with a guide to using the proposed CAT, appraisal forms, and papers. The form (Fig. 1) and guide were the same as those used in a previous study, which showed that the form was a valid method of obtaining scores [9]. Each paper and form had an identification label so that participants' scores and the subsequent analysis could be attributed to individual papers. Furthermore, the research design for each paper was printed on each paper and critical appraisal form. This was done to eliminate the possibility that participants might mistake the research design used in a paper. Identifying the research design controlled this variable without affecting the overall purpose of the study and allowed participants to concentrate on critically appraising the papers.
Participants were instructed to read the user guide before appraising any papers, as it contained information on how to use the proposed CAT. Participants were given a 6-week period between March and April 2010 to read and appraise the papers and return the forms. Participants were e-mailed every 2 weeks to remind them of the completion date and to check whether there were any problems. Participants were also asked to contact the main author (M.C.) if they had any problems with the tool or with appraising the papers; none of the participants requested assistance during this time.
After appraising the papers, participants were questioned, using a semistructured questionnaire, about their research experience and their perceptions of the proposed CAT and user guide. The purpose was to determine whether there was any difference in how participants appraised a paper based on their research experience and to gain feedback on the proposed CAT and guide so that both could be improved for future use.
3 Results
Of the six participants who volunteered, five returned appraisal data. Two of the participants self-rated themselves as being very experienced researchers. The remaining three participants self-rated themselves as moderately experienced researchers.
There were five incidents of accidental missing data, which were scored based on the method outlined above. However, three of the five participants had purposely marked the Ethical matters category as "not applicable" for all systematic review papers, although the user guide for the proposed CAT clearly stated that all categories should be scored for all research designs. When questioned about this, all three participants stated that they thought Ethical matters was irrelevant for systematic reviews because approval from an ethics committee was not required to complete this type of research. When asked whether they thought sources of funding or conflicts of interest were ethical issues that should be stated in a systematic review paper, the three participants agreed that this was true and that they should have included an Ethical matters score, but they had limited their thinking to participant ethics rather than also including researcher ethics.
Because of this unforeseen circumstance, the missing data strategy for Ethical matters was altered:
1. Where the Ethical matters category was calculated for systematic reviews, the ICC only used two participant scores.
2. The G study software [26] automatically replaced missing data with the grand mean for the category being calculated.
3. Where the Total scores (%) for systematic reviews were calculated, the missing Ethical matters scores were replaced with the median value from the two participants who had scored the Ethical matters category for systematic reviews. This prevented the Total scores (%) for the participants who had not scored the Ethical matters category for systematic reviews from being much lower than expected, thereby avoiding a negative bias on the results, while at the same time not positively biasing them.
In calculating the ICCs, the G study, and the D study, it was assumed that the paper and rater effects were random (i.e., the papers and raters were not the only possible papers or raters) and the category effects were fixed (i.e., there were no additional categories) [27,28]. Also, two types of coefficients can be calculated for ICCs and G studies. In ICCs, these are called the coefficients for consistency (C) and absolute agreement (A), whereas the G coefficients are called relative error (Eρ2) and absolute error (Φ). For consistency and relative error, the coefficients are based on whether the raters rank the entity being measured in the same order, regardless of the actual scores given. For absolute agreement and absolute error, the coefficients are based on whether the raters gave the same actual scores to the entity being measured. As a result, the consistency/relative error coefficient is normally higher than the absolute agreement/absolute error coefficient [20,27,29].
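As a standard point of reference (a textbook sketch, not quoted from the article), for a simple design in which papers (p) are crossed with raters (r) and scores are averaged over n_r raters, the two G coefficients can be written as:

```latex
E\rho^{2} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \sigma^{2}_{pr,e}/n_{r}},
\qquad
\Phi = \frac{\sigma^{2}_{p}}{\sigma^{2}_{p} + \left(\sigma^{2}_{r} + \sigma^{2}_{pr,e}\right)/n_{r}}
```

The relative error term ignores the rater variance because a uniformly harsh or lenient rater does not change the ranking of papers, whereas the absolute error term includes it; this is why Φ is never larger than Eρ2.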
In general, the ICC and G coefficients are lower when there are fewer raters. This was the case when each of four of the raters was removed individually from the analysis, leaving four raters each time. However, when rater III was removed from the analysis, the ICC and G coefficients increased, particularly for the true experimental (by 23%), DEO (43%), and qualitative (28%) research designs, and for the Sampling (8%) and Ethical matters (15%) categories. In other words, rater III scored papers much differently from the other raters. Based on conversations with rater III, it became evident that they had not read the user guide and had not scored papers in keeping with the proposed CAT. It was, therefore, decided to exclude rater III's scores from the analysis of the data.
3.1 Intraclass correlation coefficient
Each ICC was calculated with the reliability command in SPSS Statistics version 18.0.2 (SPSS, Chicago, IL). Because the assumption was that the paper effects were random and the category effects were fixed, the model subcommand used the mixed value, the type subcommand was calculated for both consistency and absolute agreement, and the other subcommands used the defaults [30].
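For illustration only (the article itself used SPSS), the Python sketch below computes average-measures ICCs for consistency and absolute agreement from a papers-by-raters score matrix using the standard two-way mean-squares formulas; the score matrix is hypothetical.

```python
import numpy as np

def icc_consistency_agreement(x: np.ndarray) -> tuple[float, float]:
    """Average-measures ICCs from an n_papers x k_raters score matrix.

    Uses the two-way mean-squares formulas: ICC(C,k) = (MSR - MSE) / MSR and
    ICC(A,k) = (MSR - MSE) / (MSR + (MSC - MSE) / n).
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape                                  # n papers, k raters
    grand = x.mean()
    row_means = x.mean(axis=1)                      # per-paper means
    col_means = x.mean(axis=0)                      # per-rater means

    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)        # papers (rows)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)        # raters (columns)
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))              # residual

    icc_c = (msr - mse) / msr                                   # consistency
    icc_a = (msr - mse) / (msr + (msc - mse) / n)               # absolute agreement
    return icc_c, icc_a

# Hypothetical Total scores (%) for six papers scored by four raters.
scores = np.array([
    [62, 60, 58, 64],
    [45, 50, 44, 47],
    [78, 75, 80, 76],
    [55, 58, 52, 57],
    [70, 66, 72, 69],
    [38, 42, 36, 40],
])
print(icc_consistency_agreement(scores))
```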
The Total score (%) for all research designs taken together had an ICC for consistency of 0.83 and absolute agreement of 0.74 (Table 1). The Total score (%) for each research design had ICCs for consistency of (highest to lowest): qualitative 0.91; systematic review 0.89; single system 0.85; true experimental 0.75; quasi-experimental 0.72; and DEO 0.64. The Total score (%) ICCs for absolute agreement were (highest to lowest): true experimental 0.73; qualitative 0.67; systematic review 0.67; DEO 0.65; quasi-experimental 0.60; and single system 0.57.
For each category, the ICCs for consistency were (highest to lowest): Ethical matters 0.84; Results 0.75; Design 0.73; Introduction 0.70; Discussion 0.68; Sampling 0.66; Preamble 0.58; and Data collection 0.54. The ICCs for absolute agreement for the categories were (highest to lowest): Ethical matters 0.78; Design 0.65; Sampling 0.62; Discussion 0.62; Results 0.60; Introduction 0.53; Data collection 0.52; and Preamble 0.50.
3.2 G and D study
G and D study results were calculated using a combination of SPSS Statistics version 18.0.2 (SPSS, Chicago, IL) and G-String_III [26]. In SPSS, the command used was varcomp and the method subcommand used MINQUE(1) [28]. In this G study, the object of measure was the paper (p) or the paper nested within a research design (p:d). Most of the mean variance should be accounted for by the object of measure. Main effects were research design (d), rater (r), and category (c), each of which may influence the object of measure. Interaction effects were interactions between the object of measure and other facets, or interactions among other facets, which may influence the object of measure (e.g., paper crossed with rater (pr) or paper crossed with category nested within research design (pc:d)) [27].
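As background (a standard G-theory decomposition, sketched here rather than quoted from the article), the score that rater r gives to category c of paper p nested within research design d can be written as the sum of the effects listed in Table 2:

```latex
X_{prc} = \mu + \nu_{p:d} + \nu_{d} + \nu_{r} + \nu_{c}
        + \nu_{dr} + \nu_{dc} + \nu_{rc}
        + \nu_{pr:d} + \nu_{pc:d} + \nu_{drc} + \nu_{prc:d,e}
```

Each random effect contributes a variance component σ²; Table 2 reports each component as a percentage of their sum, so a well-behaved tool should concentrate most of the variance in the object of measure (p:d). Because category is fixed, ν_c is a constant offset rather than a variance component, which is why it does not appear in Table 2.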
To obtain a sense of where variances were occurring, the percentage mean variance components for all papers were analyzed first (Table 2). This showed that 38% of the variance was because of the paper effect (p:d) and 32% was because of the research design effect (d).
To explore how research design affected variances, the percentage mean variance components for average research design scores were analyzed (Table 3). This showed that most of the variance was accounted for by the object of measure, the paper (p) (53-70%). For DEO, 27% of the variance was because of the interaction effect of paper crossed with rater (pr). For qualitative research and systematic reviews, the rater effect (r) accounted for 27% and 22% of the variance, respectively. Interaction effects for paper crossed with category (pc), rater crossed with category (rc), and paper crossed with rater crossed with category (prc) were minimal in each research design.
How the variance in average category scores was distributed is shown in Table 4. The Data collection and Ethical matters categories had the majority of their variance from paper nested within research design (p:d), at 52% and 72%, respectively. The Design, Sampling, Results, and Discussion categories and the Total score (%) had high variance contributed by the research design effect (d) (23%, 44%, 46%, 34%, and 31%, respectively). The Introduction category had a combined 49% of variance attributable to the research design (25%) and rater (24%) effects, whereas variance in the Preamble (28%) and Data collection (40%) categories was because of the interaction effect of paper crossed with rater nested within research design (pr:d).
Examination of individual categories within each research design (Table 5) showed that 15 of the 48 possible combinations had a 0-10% paper effect (p). Of these 15, six had a 90-100% paper crossed with rater (pr) interaction effect. Qualitative research showed the best results, with six categories having majority paper effects (57-85%). Next were true experimental, quasi-experimental, and systematic review research, with four categories each having majority paper effects (56-80%), followed by DEO and single-system research with two categories each. The Ethical matters category had the best results across research designs, with five out of six designs showing majority paper effects (62-87%). The next best results were for Design, with four research designs having majority paper effects. This was followed by Preamble and Data collection with three each.
Finally, a D study was undertaken to determine the Total score (%) coefficients for different numbers of raters per paper, with all other variables kept equal (Table 6). The numbers of raters calculated were 1, 2, 3, 4, 5, and 10. The greatest change in G coefficients was between 1 and 2 raters, with progressively smaller changes from 2 to 3 raters and beyond.
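To illustrate how a D study projects coefficients for different numbers of raters, the Python sketch below applies the standard formulas from the G-theory note above to made-up variance components; the numbers are not the study's estimates.

```python
def d_study(var_p: float, var_r: float, var_pr: float, n_raters: list[int]) -> dict[int, tuple[float, float]]:
    """Project relative (Erho^2) and absolute (Phi) coefficients for a
    papers x raters design when scores are averaged over k raters."""
    results = {}
    for k in n_raters:
        rel_error = var_pr / k                    # sigma^2(delta): paper x rater residual
        abs_error = (var_r + var_pr) / k          # sigma^2(Delta): adds the rater main effect
        e_rho2 = var_p / (var_p + rel_error)
        phi = var_p / (var_p + abs_error)
        results[k] = (round(e_rho2, 2), round(phi, 2))
    return results

# Hypothetical variance components, scaled so one rater gives Erho^2 = 0.60.
print(d_study(var_p=0.60, var_r=0.15, var_pr=0.40, n_raters=[1, 2, 3, 4, 5, 10]))
```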
3.3 Participant reactions
Participants were asked questions about the strengths and weaknesses of the proposed CAT and user guide, and were invited to make any other comments. With regard to the proposed CAT, participants thought that the strengths of the tool were that it covered all areas of research methods; that it separated research methods into individual categories, so that the appraiser gained an impression of which parts of a paper were good or bad as well as an overall impression of the paper; and that it included areas such as Ethical matters and Sampling, which were not covered in other CATs. One weakness of the proposed CAT was that participants sometimes caught themselves giving a low rating to a category because a number of items were missing, when those items should instead have been marked as not applicable. Other weaknesses were that it was easier to use at the ends of the research continuum (e.g., true experimental and qualitative research) than in the middle, that some items were confusing (such as outlying data and subgroup analysis), and that many items were not applicable, especially in qualitative research.
Strengths of the user guide included that it clarified how to use the tool in each research design and category, it was useful to refer back to when a participant was unsure how to score something, and that it had the right amount of information on research. Weaknesses were that it was both too short and slightly too long, it needed more information on research designs and methods, it was difficult to decide if something was not applicable or absent, and that there were not enough examples to help guide appraisal of the papers.
Finally, participants thought that other uses for the proposed tool could be: as a template for writing a research paper; as a tool for peer-reviewing articles; for teaching how to critically appraise research; and for appreciating the complexity of research. Participants also commented that they were more comfortable appraising research methods they were familiar with and found it difficult to appraise papers that were outside their field of expertise.
4 Discussion
The most unexpected result was that three out of five participants did not appraise the Ethical matters category for systematic reviews. These participants stated that they had seen Ethical matters for systematic reviews in research participation terms rather than in participant and researcher terms, as was stated in the user guide. Why conducting a systematic review should be any different from other types of research is a question that could not be answered here, and the issue requires further study. Whether these three participants also tended to score other research designs from a participant viewpoint, rather than in participant and researcher terms, could not be determined and would also require further study.
The decision to remove rater III's scores could be considered dubious. However, it must be remembered that the instructions in a user guide are an integral part of validity testing [9,10]. The aim of this study was to measure reliability and including rater III's data, just because they were collected, would negate the validity and consequently the reliability of the scores. Therefore, removing the invalid scores from the study was believed to be the most appropriate action.
When the Total score (%) given to each research design was examined, the ICCs for consistency were between 0.72 and 0.91 for all research designs except DEO (0.64). Therefore, although participants were cognizant of the difficulty of appraising papers outside their experience of research methods, they still ranked papers reasonably consistently, that is, above the 0.70 level for nonclinical tools indicated earlier [20]. This was also evident from the G study, which showed that the majority of the mean variance was because of the paper effect (p) in each of the research designs. Only in DEO research was there a noteworthy interaction effect, paper crossed with rater (pr); DEO also had the lowest ICC for consistency. Furthermore, ICCs for absolute agreement were reasonably high in each research design (0.57-0.73), which showed that the actual scores raters gave to each paper were similar and not just the rankings of the scores.
A core tenet of the proposed CAT was that the Total score (%) should not be the sole indicator of how a paper was appraised and that each category score should also be an indicator of a paper's standard. However, although the ICCs for consistency were reasonably high for each category (0.54-0.84), they were still lower than those for the research designs. The reason for this became apparent from the G study, which showed that the Introduction, Design, Sampling, Results, and Discussion categories had a substantial research design effect (d, 23-46%); that is, the scores in these categories were affected by the research design of the paper being appraised. Of the three remaining categories, Preamble and Data collection had a substantial interaction effect (pr:d, 28% and 40%, respectively); that is, a combination of paper and rater nested within research design influenced score variance. The exception was the Ethical matters category, which had no large main or interaction effects.
A possible explanation for the variation in the results could be that the categories were not appropriate for each research design, leading to fluctuations in scoring. However, when the percentage mean variance components for the category facet were extracted (pc, rc, and prc in Table 3), they were very low (0-10%). Therefore, the most likely causes of these variations were twofold. First, participants stated that they had greater experience with some research designs than others and that they found it more difficult to appraise papers that used research designs they were less familiar with. Second, participants' expertise was not matched to the papers in the study because the papers were randomly selected. As a result, papers that were outside a participant's expertise were more difficult to rate and more likely to be rated inconsistently than papers whose subject matter was more familiar to the participant.
As an example of these two issues influencing scores, all participants stated that they were most uncomfortable with the single-system papers because they were unfamiliar with the research design and also lacked knowledge of the topics covered in those papers. The ICCs for single-system designs reflected these issues: the difference between the consistency (0.85) and absolute agreement (0.57) coefficients was the highest (0.28) for any research design, meaning that although participants ranked the single-system papers similarly, they did not agree on the actual scores the papers should receive. The percentage mean variance components for the single-system design across the categories were also unimpressive, with only two majority paper effects, for Ethical matters (87%), which was the most consistent category across the research designs, and Introduction (57%), and three 0% paper effects, for Preamble, Design, and Sampling, which reflects the participants' lack of familiarity with the research design (Table 5). Similar statements of unfamiliarity with DEO research designs and the subject matter of those papers were reflected in the ICC for consistency of 0.64, which was the lowest of all the research designs. In the G study for DEO research, there were only two majority paper effects, Sampling (63%) and Ethical matters (78%), and four 0% paper effects (Design, Data collection, Results, and Discussion).
Finally, the D study showed that to achieve consistently high relative error (Eρ2 ≈ 0.70) or absolute error (Φ ≈ 0.50) coefficients, a minimum of three raters should be used. This differs from conventional thinking on systematic reviews, which states that a minimum of two raters is required and that a third or subsequent rater is only necessary when the other raters cannot reach a consensus. Based on these results, further investigation should be undertaken to empirically determine the optimum number of raters required to appraise research papers.
5 Conclusion
When interpreting the scores obtained by this research, three things need to be kept in mind: (1) a random selection of 24 papers was used across six research designs; (2) participants' expertise did not necessarily match the subject matter in the papers, although the papers and participants were from health-related disciplines; and, (3) participants' knowledge of research designs was self-described as being limited to those they had most experience using.
Even given these limitations, the proposed CAT shows great promise as a viable tool that can be used across a wide range of research designs and appraisal situations. There were few or no category effects across the research designs, meaning that the categories are appropriate for different types of research design. Much of the variation in scores can be explained by the diverse subject matter of the papers and participants' unfamiliarity with some research designs. To overcome this variability, the user guide can be improved by including examples for each research design in each category. However, the problems with the proposed CAT should not be overstated or taken out of context, as they are less likely to feature in situations where raters are familiar with the subject matter and with the research designs used to gather data on that subject matter.
Supplementary information
Supplementary data
Supplementary data associated with this article can be found, in the online version, at doi:.
Crowe M, Sheppard L: A review of critical appraisal tools show they lack rigour: alternative tool structure is proposed. J Clin Epidemiol 64: 79-89, 2011.
Glasziou P, Irwig L, Bain C, Colditz G: Systematic reviews in health care: a practical guide . Cambridge, MA: Cambridge University Press, 2001.
Deeks JJ, Dinnes J, D'Amico R, Sowden AJ, Sakarovitch C, Petticrew M: Evaluating non-randomised intervention studies. Health Technol Assess 7 (27): 2003.
Armijo Olivo S, Macedo LG, Gadotti IC, Fuentes J, Stanton T, Magee DJ: Scales to assess the quality of randomized controlled trials: a systematic review . Phys Ther 88 (2): 156-175, 2008.
Jüni P, Witschi A, Bloch R, Egger M: The hazards of scoring the quality of clinical trials for meta-analysis . JAMA 282 (11): 1054-1060, 1999.
Moyer A, Finney JW: Rating methodological quality: toward improved assessment and investigation . Account Res 12 (4): 299-313, 2005.
Bialocerkowski AE, Grimmer KA, Milanese SF, Kumar S: Application of current research evidence to clinical physiotherapy practice . J Allied Health 33 (4): 230-237, 2004.
Maher CG, Sherrington C, Herbert RD, Moseley AM, Elkins M: Reliability of the PEDro scale for rating quality of randomized controlled trials. Phys Ther 83 (8): 713-721, 2003.
Standards for educational and psychological testing . 2nd ed; Washington, DC: American Educational Research Association, 1999.
Khan KS, ter Riet G, Glanville J, Sowden AJ, Kleijnen J: Undertaking systematic reviews of research on effectiveness: CRD's guidance for those carrying out or commissioning reviews (CRD Report 4). York, England: University of York, 2001.
Moher D, Jones A, Lepage L: Use of the CONSORT statement and quality of reports of randomized trials: a comparative before-and-after evaluation . JAMA 285 1992-1995, 2001.
Cho MK, Bero LA: Instruments for assessing the quality of drug studies published in the medical literature . JAMA 272 (2): 101-104, 1994.
Reis S, Hermoni D, Van-Raalte R, Dahan R, Borkan JM: Aggregation of qualitative studies--from theory to practice: patient priorities and family medicine/general practice evaluations . Patient Educ Couns 65 214-222, 2007.
Shea B, Grimshaw J, Wells G, Boers M, Andersson N, Hamel C: Development of AMSTAR: a measurement tool to assess the methodological quality of systematic reviews . BMC Med Res Methodol 7 (10): 2007.
Tate RL, McDonald S, Perdices M, Togher L, Schultz R, Savage S: Rating the methodological quality of single-subject designs and n-of-1 trials: introducing the single-case experimental design (SCED) scale . Neuropsychol Rehabil 18 (4): 385-401, 2008.
Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ: Assessing the quality of reports of randomized clinical trials: is blinding necessary? . Control Clin Trials 17 (1): 1-12, 1996.
Cronbach LJ, Shavelson RJ: My current thoughts on coefficient alpha and successor procedures . Educ Psychol Meas 64 391-418, 2004.
Marcoulides GA: Generalizability theory . Tinsley HEA, Brown SD: Handbook of applied multivariate statistics and mathematical modeling San Diego, CA Academic Press. 527-551, 2000.
Streiner DL, Norman GR: Health measurement scales: a practical guide to their development and use . 4th ed; Oxford, UK: Oxford University Press, 2008.
Creswell JW: Research design: qualitative, quantitative, and mixed methods approaches . 2nd ed; Thousand Oaks, CA: Sage, 2008.
Neuman WL: Social research methods: qualitative and quantitative approaches . Boston, MA: Pearson, 2006.
Portney LG, Watkins MP: Foundations of clinical research: applications to practice . 3rd ed; Upper Saddle River, NJ: Prentice Hall, 2008.
Walter SD, Eliasziw M, Donner A: Sample size and optimal designs for reliability studies . Stat Med 17 101-110, 1998.
Bloch R: G-String_III. Version 5.4.2 . Hamilton, ON: Programme for Educational Research and Development, 2010.
Brennan RL: Generalizability theory . New York: Springer, 2001.
Mushquash C, O'Connor BP: SPSS and SAS programs for generalizability theory analyses . Behav Res Method 38 542-547, 2006.
PASW Statistics 18 command syntax reference . Chicago, IL: IBM SPSS Inc, 2009.
Michael Crowe, Lorraine Sheppard. Discipline of Physiotherapy, James Cook University, Townsville, Qld 4810, Australia; Corresponding author. Tel.: +61 7 4781 4085; fax: +61 7 4781 6868.
Lorraine Sheppard. School of Health Sciences, University of South Australia, Adelaide, SA, Australia
Alistair Campbell. Discipline of Psychology, James Cook University, Townsville, Qld, Australia
© 2012 Elsevier Inc.