Introduction
Breast cancer is a biologically heterogeneous disease comprising multiple distinct subtypes that are distinguishable by immunohistochemistry (IHC) or by molecular analysis such as transcriptomic profiling. Clinically, IHC staining for oestrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2) is routinely performed in most diagnostic laboratories to help select adjuvant treatment and to assess prognosis. Research studies demonstrate that expanding this IHC panel to include markers of basal breast cancers, such as cytokeratin 5/6 (CK5/6) and epidermal growth factor receptor 1 (EGFR or HER1), can enable more detailed molecular subtyping, approximating taxonomies based on molecular profiling.
Evaluating differences across breast cancer subtypes is central to etiological and clinical research. However, such studies require large sample sizes in order to include sufficient numbers of the less common subtypes, many of which are clinically important. Tissue microarrays (TMAs) can be used to assess IHC results for multiple cases in one tissue section, enabling standardized IHC staining and facilitating scoring. Given that visual scoring is labour intensive and suffers from imperfect inter‐rater agreement, automated quantitative image analysis has been proposed as an alternative that may offer logistical advantages with good reliability.
Automated analysis of pathology images has been in use for more than 20 years and has been applied extensively in recent years in the study of breast cancer, with increasingly complex algorithms and improved concordance with visual scores. However, most comparisons are based on TMAs of a few hundred to a few thousand tumours constructed and stained in a single pathology laboratory. Although centralized construction and staining of TMAs is desirable to obtain comparable data, this is not always practical in large collaborative investigations that aggregate pathology samples from multiple studies.
This article details the application of fully automated image analysis to 8267 breast cancers collated from nine studies within the Breast Cancer Association Consortium (BCAC). Automated image analysis was applied to score nuclear (ER, PR), membranous (HER2, EGFR) and cytoplasmic (CK5/6) markers to determine the usefulness and pitfalls of this approach and to identify limitations that might be addressed with methodological research.
Materials and methods
Study populations
This report includes nine BCAC studies with formalin‐fixed, paraffin‐embedded tumour blocks that had been previously prepared as TMAs (supplementary material Table 1). Relevant research ethics committees approved all studies; samples were anonymized before being sent to two coordinating centres at Strangeways Research Laboratory (University of Cambridge, Cambridge, UK) and Breakthrough Pathology Core Facility (Institute of Cancer Research, London, UK) for analysis. A total of 8267 cases with information on clinico‐pathological characteristics of the tumour, obtained from clinical records or centralized review of cases, were included in the analyses (supplementary material Table 2).
TMA immunohistochemistry
Three studies (ABCS, PBCS and SEARCH) provided previously stained TMA slides of ER and PR, four studies (ABCS, HEBCS, PBCS and SEARCH) of HER2, three studies (ABCS, KBCP, PBCS) of CK5/6 and three studies (HEBCS, KBCP, PBCS) of EGFR. Studies lacking pre‐existing stained TMAs for specific stains provided unstained TMA slides for centralized staining. Staining centres and protocols are detailed in supplementary material Table 3.
Automated Ariol scanning and scoring of TMAs
All TMA slides were scanned and analysed on the Leica Ariol system (Leica Biosystems, Newcastle upon Tyne, UK) using standard procedures and predefined algorithms tuned by an image analysis expert (see details in supplementary material). A single tuned algorithm was then applied to all TMAs. For ER and PR nuclear staining, we obtained automated measures of average stain intensity and percentage of cells stained. For HER2, the system calculated the HercepTest score (0, 1+, 2+, 3+). For CK5/6 and EGFR, we obtained a continuous automated score (0–300) based on a weighted sum of the percentage of positive cells in three bins of weakly, intermediate and strongly positive cells. Quality control procedures are described in the supplementary material.
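The continuous 0–300 score described above for CK5/6 and EGFR can be illustrated with a minimal sketch. The weights of 1, 2 and 3 for the weak, intermediate and strong bins are an assumption (the conventional H-score weighting); the text states only that the score is a weighted sum of the three bin percentages, and the function name is hypothetical.

```python
def weighted_score(pct_weak: float, pct_intermediate: float, pct_strong: float) -> float:
    """Weighted sum of the percentages of positive cells in three intensity bins.

    Assumption: weights 1, 2 and 3 (conventional H-score), which gives a 0-300 range.
    The Ariol system's actual weights are not stated in the text.
    """
    total = pct_weak + pct_intermediate + pct_strong
    if min(pct_weak, pct_intermediate, pct_strong) < 0 or total > 100:
        raise ValueError("bin percentages must be non-negative and sum to at most 100")
    return 1 * pct_weak + 2 * pct_intermediate + 3 * pct_strong


# Hypothetical core: 20% weak, 10% intermediate, 5% strong, 65% negative cells.
print(weighted_score(20, 10, 5))
```

Under this weighting a core with all cells strongly positive scores 300 and a core with no positive cells scores 0.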
Visual scoring of TMAs
Randomly selected cores from each study were re‐arrayed in ‘virtual TMAs’ for visual scoring (see supplementary material). This resulted in a total of 942, 952 and 998 core images being visually scored in duplicate by two pathologists (M.E.S. and E.P.) for ER, PR and HER2, respectively. The Allred scoring system, together with the intensity score, was used for ER and PR. Stains for ER and PR were considered positive if the Allred score was ≥3. For HER2, the HercepTest scoring system was used for visual scoring. Two definitions of HER2 positivity were evaluated: an intensity score of 2 or 3 (HER2 2+), or an intensity score of 3 only (HER2 3+).
TMA slides of CK5/6 from four studies (CNIO‐BCS, MCBCS, ORIGO, SBCS) and slides of EGFR from six studies (ABCS, CNIO‐BCS, MCBCS, KBCP, ORIGO, SBCS) that had been centrally stained at CRUK‐CI were visually scored using the SlidePath system (see supplementary material). Ten scorers scored a total of 5771 cores for CK5/6 and 8259 for EGFR. M.E.S. served as the reference pathologist and scored a random sample of up to 100 cores per study/centre assigned to each of the other scorers to evaluate inter‐scorer agreement. A positive visual score for CK5/6 and EGFR was defined as >10% positive cells.
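The visual positivity rules above can be written out as a short sketch. The proportion-score categories used here are the commonly published Allred categories and are an assumption, since the text gives only the positivity threshold (Allred score ≥3); all function names are hypothetical.

```python
def allred_proportion_score(pct_positive: float) -> int:
    """Proportion score 0-5 (assumed standard Allred categories, not restated in the text)."""
    if pct_positive <= 0:
        return 0
    if pct_positive < 1:
        return 1
    if pct_positive <= 10:
        return 2
    if pct_positive <= 100 / 3:
        return 3
    if pct_positive <= 200 / 3:
        return 4
    return 5


def allred_total(pct_positive: float, intensity: int) -> int:
    """Allred score = proportion score + intensity score (0-3); 0 if no cells stain."""
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity score must be 0, 1, 2 or 3")
    proportion = allred_proportion_score(pct_positive)
    return 0 if proportion == 0 else proportion + intensity


def er_pr_positive(pct_positive: float, intensity: int) -> bool:
    """ER/PR rule from the text: positive if the Allred score is >= 3."""
    return allred_total(pct_positive, intensity) >= 3


def her2_status(score: int) -> tuple[bool, bool]:
    """Return (HER2 2+ positive, HER2 3+ positive) for a 0-3 visual score."""
    return score >= 2, score == 3


print(er_pr_positive(5, 1), her2_status(2))
```

A core with 5% weakly stained nuclei reaches an Allred score of 3 and is called positive, whereas a HER2 score of 2 is positive only under the HER2 2+ definition.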
Scorers assigned each core the following quality control categories: 1) satisfactory core (invasive tumour), 2) DCIS only, 3) no tumour/few tumour cells, 4) no core and 5) unsatisfactory for other reasons.
Statistical methods
The correlation between automated continuous scores and visual ordinal scores was evaluated by Spearman's correlation coefficient, using data from the virtual TMA. The area under the curve (AUC) of receiver operating characteristic (ROC) graphs was used to evaluate the discriminatory accuracy of the combined automated scores for ER and PR (intensity × percentage) to distinguish between visual positive and negative scores. The automated score that optimized the sensitivity and specificity in the ROC graph was applied as the cut‐off point to define marker status for all analysed cores (not just the ones in the virtual TMA). We also evaluated an alternative method to define the cut‐off for positive and negative scores, as described by Ali et al. Briefly, the cut‐off under this method is determined by the distribution of automated percentage and intensity scores for all cores, ie, it does not use information on visual scores from a subset of tumours in the virtual TMAs to define a cut‐off point.
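The ROC-based choice of cut-off can be sketched as follows. The criterion used here, maximizing Youden's J (sensitivity + specificity − 1), is an assumption about what "optimized the sensitivity and specificity" means; the data and the function name are hypothetical.

```python
def youden_cutoff(scores, visual_positive):
    """Return (cutoff, sensitivity, specificity) maximizing Se + Sp - 1.

    A core is called automated-positive when its score is >= cutoff.
    Ties are resolved in favour of the lowest cutoff.
    """
    n_pos = sum(1 for v in visual_positive if v)
    n_neg = len(visual_positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one visual-positive and one visual-negative core")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, v in zip(scores, visual_positive) if v and s >= cutoff)
        tn = sum(1 for s, v in zip(scores, visual_positive) if not v and s < cutoff)
        se, sp = tp / n_pos, tn / n_neg
        j = se + sp - 1
        if best is None or j > best[0]:
            best = (j, cutoff, se, sp)
    return best[1], best[2], best[3]


# Hypothetical combined scores (intensity x percentage) and visual calls for eight cores.
scores = [0, 3, 10, 25, 30, 45, 60, 80]
visual = [False, False, False, True, False, True, True, True]
cutoff, se, sp = youden_cutoff(scores, visual)
print(cutoff, se, sp)
```

The cut-off found on the visually scored subset would then be applied to the automated scores of all cores, as described in the text.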
The kappa statistic was used as a measure of agreement between dichotomous or semi‐quantitative scores. Sensitivity and specificity were calculated as measures of validity using the visual score as the reference; positive predictive value (PPV) and negative predictive value (NPV) were calculated as a measure of the value of automated dichotomous scores to predict visual dichotomous scores.
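For a dichotomous comparison, all of these measures follow from the four cells of a 2 × 2 cross-tabulation with the visual score as the reference. The sketch below uses hypothetical counts and a hypothetical function name; it computes the unweighted Cohen's kappa, which is an assumption for the semi-quantitative HER2 comparison, where a weighted kappa may have been used.

```python
def agreement_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Agreement and validity of automated calls against visual calls (the reference).

    tp: automated+ / visual+    fp: automated+ / visual-
    fn: automated- / visual+    tn: automated- / visual-
    """
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal proportions of positives.
    p_auto = (tp + fp) / n
    p_visual = (tp + fn) / n
    expected = p_auto * p_visual + (1 - p_auto) * (1 - p_visual)
    return {
        "observed_agreement": observed,
        "kappa": (observed - expected) / (1 - expected),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical cross-tabulation of 100 cores.
result = agreement_measures(tp=40, fp=5, fn=10, tn=45)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```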
Comparisons between automated scores and visual scores were performed at the core level for cores in the virtual TMAs. Subject‐level scores for ER, PR, HER2 were derived by selecting the maximum score of all available cores for a given subject, after having excluded cores identified as having few or no tumour cells or no cores by the pathologist. These were compared to positive/negative status in the BCAC database, based primarily on medical records, or centralized reviews by study centres.
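The derivation of subject-level scores can be sketched as below, assuming core records of the form (subject, quality control category, score) with the category numbers listed under visual scoring (3 = no tumour/few tumour cells, 4 = no core). The record layout and function name are hypothetical.

```python
EXCLUDED_QC = {3, 4}  # 3 = no tumour/few tumour cells, 4 = no core


def subject_level_scores(cores):
    """Maximum score per subject over cores that pass the QC exclusions.

    cores: iterable of (subject_id, qc_category, score) tuples.
    Subjects whose cores are all excluded do not appear in the result.
    """
    best = {}
    for subject_id, qc_category, score in cores:
        if qc_category in EXCLUDED_QC:
            continue
        if subject_id not in best or score > best[subject_id]:
            best[subject_id] = score
    return best


# Hypothetical cores for three subjects.
cores = [
    ("A", 1, 120.0), ("A", 1, 210.0), ("A", 3, 290.0),  # third core excluded
    ("B", 2, 15.0), ("B", 4, 0.0),
    ("C", 3, 50.0),                                      # no usable core
]
print(subject_level_scores(cores))
```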
Kaplan–Meier survival plots were used to plot survival functions by subject‐level IHC scores. Associations with 10‐year breast cancer‐specific survival were assessed using a Cox proportional‐hazards model, providing estimates of hazard ratio (HR) and 95% confidence interval (95% CI). Violations of the proportional‐hazards assumption were accounted for by the T coefficient that varied as a function of log time. We used penalized‐likelihood criteria, ie, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), to compare model parsimony and fit of alternative non‐nested Cox regression models including visual versus automated scores. Models with lower values for AIC or BIC have a better balance between model parsimony and fit. All statistical analyses were conducted in Stata/MP version 12.1 (StataCorp, College Station, TX, USA).
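The penalized-likelihood comparison can be illustrated with the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of estimated parameters. The log-likelihoods below are hypothetical, and the choice of n (number of observations rather than number of events) is an assumption that depends on software convention.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical Cox models fitted to the same 6000 subjects:
# a dichotomous visual score (1 parameter) versus quintiles of an automated score (4 parameters).
models = {"visual, dichotomous": (-1000.0, 1), "automated, quintiles": (-998.0, 4)}
for name, (loglik, k) in models.items():
    print(f"{name}: AIC={aic(loglik, k):.1f}, BIC={bic(loglik, k, 6000):.1f}")
```

In this made-up case the quintile model has the higher log-likelihood, but its three extra parameters give it the higher (worse) AIC and BIC.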
Results
Differences in TMAs and clinico‐pathological characteristics of cases across studies
The nine studies used different TMA designs including a total of 20 263 tissue cores in 104 TMA blocks from 8267 BCAC breast cancer cases (Table and supplementary material Table 1). The average age at diagnosis was 53 years. There were substantial differences in the distribution of age and clinico‐pathological characteristics across studies (supplementary material Table 2). A range of 75–77% of cores across virtual TMAs for ER, PR, HER2 were satisfactory for scoring (5–8% of which had only DCIS component), 10–13% had no tumour or few tumour cells, 3–5% had missing cores and 7–10% had unsatisfactory cores for other reasons (eg, blurred image, folded cores; see Table ).
Description of study populations and TMA designs used by participating studies
Study acronym | Country | Cases | Age at diagnosis, mean (range) | TMA blocks | Cores per case | Cores per TMA | Core size (mm) | Total cores per study |
ABCS | Netherlands | 1000 | 43 (23–50) | 26 | 1–6 | 12–241 | 0.6 | 3 314 |
CNIO‐BCS | Spain | 171 | 60 (35–81) | 3 | 2–2 | 86–148 | 1.0 | 342 |
HEBCS | Finland | 1154 | 56 (22–95) | 17 | 2–8 | 56–400 | 0.6 | 4 880 |
KBCP | Finland | 392 | 59 (23–92) | 12 | 3–3 | 96–99 | 1.0 | 1 176 |
MCBCS | USA | 348 | 58 (26–87) | 4 | 4–4 | 280–400 | 0.6 | 1 392 |
ORIGO | Netherlands | 233 | 56 (27–88) | 3 | 3–9 | 237–310 | 0.6 | 841 |
PBCS | Poland | 1406 | 56 (27–75) | 9 | 2–7 | 363–474 | 0.6 | 3 790 |
SBCS | UK | 358 | 60 (30–92) | 11 | 3–8 | 90–156 | 0.6 | 1 320 |
SEARCH | UK | 3205 | 52 (24–70) | 19 | 1–2 | 152–172 | 0.6 | 3 208 |
Totals | | 8267 | 53 (22–95) | 104 | 1–9 | 12–474 | 0.6–1.0 | 20 263 |
Quality control category | ER, N | ER, % | PR, N | PR, % | HER2, N | HER2, % |
Satisfactory core (invasive tumour) | 649 | 69 | 679 | 71 | 672 | 67 |
DCIS only | 61 | 6 | 52 | 5 | 82 | 8 |
No tumour, few tumour cells | 123 | 13 | 98 | 10 | 126 | 13 |
No core | 38 | 4 | 32 | 3 | 48 | 5 |
Unsatisfactory core for other reasons | 71 | 8 | 91 | 10 | 70 | 7 |
Total | 942 | | 952 | | 998 | |
Core‐level comparison between ER, PR, HER2 automated and visual scores in virtual TMAs
The distributions of continuous automated scores and ordinal Allred visual scores for ER and PR are shown in Figures and , respectively. The automated and ordinal visual scores were highly correlated and there was a clear separation of the distribution of automated scores by the visual positive/negative scores (Figures D, E and D, E). There were differences in distributions of automated scores across studies that could reflect different clinico‐pathological characteristics of the tumours or staining quality (supplementary material Figures 3 and 4).
The AUC for ER and PR showed excellent discrimination (Table ). For dichotomous scores, there was excellent inter‐rater agreement for ER and PR and substantial agreement between automated and visual scores, which was better for ER than for PR (Table ; see supplementary material Table 4 for cross‐tabulations). The automated system had good sensitivity and specificity. The NPV was substantially lower for the automated‐to‐rater comparisons than for the inter‐rater comparison (∼70% versus 95%). Use of study‐specific cut‐off points for negative versus positive scores did not substantially improve the measures of agreement (data not shown). Measures of relative performance of automated versus visual scoring were similar when we used the method of Ali et al to select a cut‐off point for positive and negative automated scores (data not shown).
Inter‐rater agreement and agreement between each rater and Ariol automated quantitative ER, PR scores for cores in the virtual TMA
Marker | Comparison | N | % Pos. | Continuous automated score: AUC (95% CI) | Dichotomous automated score: observed agreement (%) | Kappa (95% CI) | Se (%) | Sp (%) | PPV (%) | NPV (%) |
ER | Rater 1 vs rater 2 | 615 | 76.3 | n/a | 96.7 | 0.91 (0.83, 0.99) | 98.3 | 91.8 | 97.5 | 94.4 |
ER | Ariol vs rater 1 | 587 | 75.0 | 0.97 (0.95, 0.98) | 90.1 | 0.76 (0.68, 0.84) | 89.5 | 91.8 | 97.0 | 74.6 |
ER | Ariol vs rater 2 | 636 | 76.4 | 0.96 (0.95, 0.98) | 90.1 | 0.75 (0.67, 0.83) | 88.9 | 94.0 | 98.0 | 72.3 |
PR | Rater 1 vs rater 2 | 655 | 67.0 | n/a | 96.8 | 0.93 (0.85, 1.00) | 97.5 | 95.4 | 97.7 | 94.9 |
PR | Ariol vs rater 1 | 624 | 67.3 | 0.93 (0.91, 0.95) | 83.8 | 0.65 (0.57, 0.73) | 82.9 | 85.8 | 92.3 | 70.9 |
PR | Ariol vs rater 2 | 634 | 66.6 | 0.93 (0.91, 0.95) | 84.4 | 0.66 (0.59, 0.74) | 83.6 | 85.8 | 92.2 | 72.5 |
Raters' scores are dichotomous (positive/negative), and Ariol automated scores are considered both as continuous and as dichotomous.
% Pos., % positive cores for reference rater; Se, sensitivity; Sp, specificity; PPV, positive predictive value; NPV, negative predictive value.
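The gap in NPV between the automated‐to‐rater and inter‐rater comparisons can be related to sensitivity and prevalence through Bayes' rule. The sketch below is an illustration rather than part of the original analysis: it recomputes PPV and NPV for ER from the rounded sensitivity, specificity and % positive values reported in the table, so small rounding differences from the tabulated values are expected. The function name is hypothetical.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence, all as fractions."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# ER values as reported in the table: Ariol vs rater 1, then rater 1 vs rater 2.
ppv_auto, npv_auto = predictive_values(0.895, 0.918, 0.750)
ppv_rater, npv_rater = predictive_values(0.983, 0.918, 0.763)
print(f"Ariol vs rater 1:   PPV={ppv_auto:.3f}, NPV={npv_auto:.3f}")
print(f"Rater 1 vs rater 2: PPV={ppv_rater:.3f}, NPV={npv_rater:.3f}")
```

Because about three‐quarters of cores are ER positive, a drop in sensitivity from roughly 98% to 90% adds many false negatives relative to the small pool of true negatives, which lowers the NPV even though specificity is unchanged.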
The kappa statistics for the HER2 HercepTest score showed substantial agreement for both inter‐rater and automated‐to‐visual comparisons (kappa = 0.62–0.71; Table ). Although the agreement for the HER2 2+ dichotomous classification was substantial for both inter‐rater and rater‐automated comparisons, the agreement for HER2 3+ was only moderate for one of the raters. Sensitivity to identify HER2 3+ cores was low, both in inter‐rater and rater‐automated comparisons (Table ). When we examined cross‐tabulations to evaluate the sources of disagreement (supplementary material Table 4), extreme discrepancies, ie, Ariol scores of 0 where pathologist scores were 3, were very infrequent. Of the 13 discrepant cores, five were determined to be pathologist errors and were re‐evaluated; four were due to poor tissue or staining quality (folds, high level of background staining, edge artifact or small tumour fragment) and four were due to Ariol error. The kappa statistics for rater‐automated agreement changed little when pathology errors and staining errors were removed from the analysis (data not shown).
Inter‐rater agreement and agreement between each rater and Ariol automated quantitative HER2 scores for cores in the virtual TMA
HER2 semi‐quantitative score (0/1, 2, 3): comparison | N | Observed agreement (%) | Kappa (95% CI) |
Rater 1 vs rater 2 | 660 | 92.7 | 0.71 (0.65, 0.78) |
Ariol vs rater 1 | 693 | 90.7 | 0.62 (0.56, 0.68) |
Ariol vs rater 2 | 716 | 93.7 | 0.71 (0.65, 0.77) |
HER2 dichotomous score | Comparison | N | % Pos. | Observed agreement (%) | Kappa (95% CI) | Se (%) | Sp (%) | PPV (%) | NPV (%) |
HER2 2+ (0/1 vs 2/3) | Rater 1 vs rater 2 | 660 | 20.9 | 91.4 | 0.73 (0.65, 0.81) | 74.6 | 95.8 | 82.4 | 93.5 |
HER2 2+ (0/1 vs 2/3) | Ariol vs rater 1 | 693 | 21.2 | 90.0 | 0.69 (0.62, 0.77) | 72.1 | 94.9 | 79.1 | 92.7 |
HER2 2+ (0/1 vs 2/3) | Ariol vs rater 2 | 716 | 19.6 | 91.6 | 0.73 (0.66, 0.81) | 77.9 | 95.0 | 79.0 | 94.6 |
HER2 3+ (0/2 vs 3) | Rater 1 vs rater 2 | 660 | 12.6 | 94.1 | 0.68 (0.61, 0.76) | 59.0 | 99.1 | 90.7 | 94.4 |
HER2 3+ (0/2 vs 3) | Ariol vs rater 1 | 693 | 12.4 | 91.3 | 0.46 (0.40, 0.53) | 34.9 | 99.3 | 88.2 | 91.5 |
HER2 3+ (0/2 vs 3) | Ariol vs rater 2 | 716 | 8.4 | 95.8 | 0.67 (0.60, 0.74) | 55.0 | 99.5 | 91.7 | 96.0 |
% Pos., % positive cores for reference rater; Se, sensitivity; Sp, specificity; PPV, positive predictive value; NPV, negative predictive value.
Subject‐level comparison for ER, PR, HER2 automated scores to positive/negative scores in BCAC database
Figure shows scatter plots and distributions of automated scores for all cases (6424 cases for ER and 6385 cases for PR) by positive/negative status previously assigned by each individual study. The agreement between subject‐level automated scores and marker status was substantial to moderate, generally lower than the core‐level comparisons in the virtual TMAs (Table ; see supplementary material Table 5 for cross‐tabulations). There were substantial differences in the measures of agreement by study (supplementary material Table 6).
Marker | N | % Pos. | Continuous automated score: AUC (95% CI) | Dichotomous automated score: observed agreement (%) | Kappa (95% CI) | Se (%) | Sp (%) | PPV (%) | NPV (%) |
ER | 6424 | 74.5 | 0.89 (0.89, 0.90) | 84.1 | 0.62 (0.59, 0.64) | 84.6 | 82.7 | 93.4 | 64.7 |
PR | 6385 | 63.6 | 0.87 (0.86, 0.88) | 80.0 | 0.57 (0.55, 0.60) | 82.5 | 75.7 | 85.6 | 71.2 |
HER2 2+ | 6322 | 15.5 | – | 88.9 | 0.62 (0.59, 0.64) | 77.2 | 91.0 | 61.3 | 95.6 |
HER2 3+ | 6322 | 15.5 | – | 89.2 | 0.43 (0.41, 0.44) | 31.8 | 99.7 | 95.4 | 88.8 |
Clinical/study scores are dichotomous (positive/negative), and ER, PR Ariol scores are considered both as continuous and as dichotomous.
% Pos., % positive cores for reference rater; Se, sensitivity; Sp, specificity; PPV, positive predictive value; NPV, negative predictive value.
To evaluate the impact of core quality on measures of agreement, we used automated estimates of the number of tumour nuclei to identify cores with no or few tumour cells. Measures of agreement improved only slightly after these exclusions; however, this resulted in a substantial reduction in the number of subjects with valid scores (data not shown). We, therefore, decided not to make these exclusions in the remaining analyses.
Survival analysis for ER, PR, HER2 automated scores compared to positive/negative scores from individual studies
Kaplan–Meier survival curves drawn from the full subject‐level dataset demonstrated that the automated analysis generated the expected survival associations for ER, PR and HER2 (Figures ). While estimates of HR for automated data showed weaker associations with survival for dichotomous scores, automated scores allowed classification of cases into meaningful quantitative levels of ER and PR expression. Quintiles of the automated scores resulted in a refinement of the associations with survival (Figures and ). However, models with automated scores had a worse fit than models with dichotomous visual scores (see AIC/BIC values in Figures and ).
The HRs for women in the lowest quintiles for ER and PR were similar to those for receptor‐negative cases according to the BCAC database (representing 25.3% of the cases for ER and 36.2% of the cases for PR). The percentage of cores classified as negative in the BCAC database included in each of the quintiles for ER and PR is shown in supplementary material Table 8. Automated scores for HER2 also allowed estimation of HR by HER2 semi‐quantitative scores, showing increasing hazard for increasing scores (Figure ). However, as for ER and PR, the model fit was worse for automated than for visual scores (see AIC/BIC values in Figure ).
Comparison of automated and visual scores for CK5/6 and EGFR
An initial analysis for CK5/6 and EGFR in the entire TMA dataset resulted in very poor performance of automated scoring compared with visual scores by rater 1 or rater 2 (data not shown). A subsequent re‐analysis was performed only in the SEARCH study to determine whether limiting the tuning and analysis to a single study helped. Although this resulted in a marked improvement, the PPV was still poor (49.2% for CK5/6 and 30.0% for EGFR), reflecting a large number of false positives (Table ). Performance was better for ER‐negative than for ER‐positive tumours, the former including a higher percentage of CK5/6‐ and EGFR‐positive tumours. Examination of discordant cores showed that the disagreements were primarily related to false positives due to scoring of normal cells by Ariol. We, therefore, visually scored all cores that had not been previously scored by individual studies (ie, 5771 cores stained for CK5/6 and 8259 cores stained for EGFR). The distribution of quality control scores for these TMAs was similar to that seen for ER, PR and HER2 (supplementary material Table 9). Examination of inter‐rater agreement on a subset of 357 CK5/6 cores and 760 EGFR cores scored visually by a reference pathologist for QC showed better agreement than the automated versus visual agreement seen in the SEARCH study; however, the PPV was also relatively low (Table ). Evaluation of discordant pairs revealed that disagreements between visual scores were primarily due to disagreements between pathologists about whether immunostained cells were normal cells or cancer cells.
Inter‐rater agreement and agreement between Ariol automated CK5/6 and EGFR scores (dichotomized using the ROC method) for cores in TMAs from participating studies
Marker | Comparison | N | % Pos. | Observed agreement (%) | Kappa (95% CI) | Se (%) | Sp (%) | PPV (%) | NPV (%) |
CK5/6 | Inter‐rater agreement | 357 | 11.9 | 91.6 | 0.74 (0.66, 0.83) | 96.6 | 90.6 | 67.1 | 99.3 |
CK5/6 | Ariol vs rater – all | 1897 | 10.4 | 89.4 | 0.49 (0.44, 0.53) | 61.6 | 92.6 | 49.2 | 95.4 |
CK5/6 | Ariol vs rater – ER+ | 1107 | 6.4 | 89.1 | 0.41 (0.35, 0.46) | 71.8 | 90.3 | 33.6 | 97.9 |
CK5/6 | Ariol vs rater – ER− | 360 | 21.1 | 86.9 | 0.57 (0.47, 0.67) | 56.6 | 95.1 | 75.4 | 89.1 |
EGFR | Inter‐rater agreement | 760 | 10.5 | 94.5 | 0.73 (0.66, 0.81) | 90.7 | 94.9 | 66.0 | 98.9 |
EGFR | Ariol vs rater | 1914 | 9.8 | 84.1 | 0.44 (0.40, 0.48) | 87.7 | 83.7 | 36.9 | 98.4 |
EGFR | Ariol vs rater – ER+ | 1041 | 1.3 | 86.2 | 0.14 (0.11, 0.17) | 100.0 | 86.0 | 8.9 | 100.0 |
EGFR | Ariol vs rater – ER− | 342 | 39.5 | 82.2 | 0.63 (0.53, 0.74) | 83.7 | 81.2 | 74.3 | 88.4 |
% Pos., % positive cores for reference rater; Se, sensitivity; Sp, specificity; PPV, positive predictive value; NPV, negative predictive value.
10 Includes data from CNIO‐BCS, MCBCS, ORIGO, SBCS.
11 Includes data from SEARCH re‐analysis.
12 Includes data from ABCS, CNIO‐BCS, KBCP, MCBCS, ORIGO, SBCS.
Discussion
Automated image analysis of TMAs using many different systems has been shown to perform well for multiple markers . However, most studies have been based on relatively small comparisons of TMAs from one or few centres. Our report is a large‐scale evaluation of the performance of automated image analysis in the scoring of TMAs from different source institutions across different countries in a consortium of breast cancer studies.
The core‐level measures of agreement between automated and visual scores for the virtual TMAs in our report are most comparable to those in previous reports, as they were based on comparisons of exactly the same images. For ER, PR and HER2, they were lower than previously reported by our group using the Ariol system, or an automated scoring algorithm adapted from astronomy, possibly reflecting the greater variability in tissue preparation related to multiple specimen sources. Bolton et al used TMAs stained for ER, PR and HER2 from PBCS, and Ali et al used ER‐ and HER2‐stained TMAs from SEARCH. While these TMAs were also included in our study, the automated image analysis of the TMAs was done independently using different methods. Patient characteristics, modes of tumour detection, pathologic features and tissue handling in this report were likely highly variable because of the inclusion of multiple studies, but representative of ‘real world’ population‐based samples collected over many years in international collaborations.
As expected, the agreement for subject‐level comparisons was lower than for core‐level comparisons, since the former compare scores based on different pieces of the tumour tissue, and the visual scores came from multiple sources (mainly clinical records and central review of cases by individual studies). Arguably, however, these comparisons are most relevant for answering scientific questions. A key advantage of automated image analysis is that it does not use pathologists' time and can be run continuously, including overnight. The analysis time is dependent on the type of stain, size of cores and number of cores per TMA. For instance, the time to score a TMA with 183 cores of 0.6 mm diameter can range from 25 min for a simple nuclear analysis (ER) to 70 min for a cytoplasmic analysis (CK5/6). The entire dataset was analysed over the course of a week using four batch processors. This is in comparison to approximately 35–40 min for a simple manual ER score of a similar TMA by a skilled pathologist using computer‐assisted scoring methods. A limitation of the automated approach is that 20–25% of cores in TMAs are unsatisfactory for scoring, and imaging systems do not perform well in triaging such cores. QC assessment of each core by visual inspection to identify unsatisfactory cores would improve the performance of the automated scoring. Similarly, study‐specific training of algorithms could also improve performance. Although tissue cores are targeted to tumour areas during TMA production, contamination by normal elements is unavoidable. Identification of tumour cells by semi‐automated systems, including manual demarcation of tumour areas prior to automated scoring, could improve the performance of automated systems. However, these additional procedures are time consuming, and the added efforts to improve scoring diminish the relative value of automation.
Automated pattern‐recognition software to identify tumour areas, such as Definiens, is promising, but it is still difficult to obtain accurate identification of breast tumour cells, particularly in heterogeneous sets of tissue samples such as those derived from international consortia.
The performance of automated scoring was worse for PR than for ER, possibly explained in part by a higher regional heterogeneity in positive staining for PR than for ER. HER2 scoring performed using FDA‐approved commercial algorithms in brightfield and in fluorescence has demonstrated substantial agreement with visual assessment in studies of varying size and design. We observed substantial agreement for HER2 semi‐quantitative scores in both inter‐rater and automated‐rater comparisons, which was similar to that demonstrated previously on the Ariol system in our hands.
Dichotomous classification of automated scores for ER, PR and HER2 achieved less separation of prognostic groups by marker expression than the clinical/study scores. Because these three markers are routinely determined in most clinical settings, the main advantage of the automated scores was providing quantitative measures of expression that allowed refinement in the groups of patients with different prognosis. Although semi‐quantitative scores can also be obtained from clinical records, the reporting is not homogeneous and this information is not available in many epidemiological studies.
The performance of automated analysis of the cytoplasmic CK5/6 and membranous EGFR stains was much worse than for the time‐tested nuclear ER/PR and membranous HER2 antibodies, resulting in many false positive results. Automated scoring of cytoplasmic stains such as CK5/6 is particularly challenging since most systems use colour de‐convolution to remove the nuclear counterstain from the brown stain in order to identify nuclei and determine the cell type. This method reduces resolution, so the accuracy of identification decreases. The poor performance for CK5/6 and EGFR was also an issue for the inter‐rater comparison, although to a lesser extent. Examination of discordant scores revealed that both the inter‐rater and automated‐visual discordances were often due to scoring immunopositive normal cells. Automated image analyses for CK5/6 and EGFR may provide useful triage; negative results may be considered final, whereas positive results require visual confirmation. This would potentially reduce scoring workloads by about 75%, and could be further refined by limiting visual review to ER‐negative or triple negative (ER−/PR−/HER2−) cancers expressing basal markers. However, image management could present challenges for targeted visual reviews.
In conclusion, automated image analysis of TMAs stained for ER, PR and HER2 can be a useful tool to obtain quantitative scores for these markers in large collaborative studies including heterogeneous TMAs. However, automated scoring does not result in an improved performance of survival models, compared to visual scores. Automated scoring of CK5/6 and EGFR may permit triage of negative cores, but positive results require visual review. Efforts to improve the performance of automated analysis should focus on standardization of specimen handling, TMA construction and use of centralized optimized IHC‐staining protocols. Improved standardization and optimization of key steps in these procedures, combined with technical advances in automated analysis of IHC stains, would facilitate large population‐based studies of breast cancer.
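The triage idea can be made concrete with a rough calculation of the share of cores that an automated system flags as positive and that would therefore still need visual confirmation. The inputs below are the rounded EGFR ‘Ariol vs rater’ values reported in the results table; the calculation is an illustration under those assumptions and is not necessarily how the figure of about 75% was derived. The function name is hypothetical.

```python
def fraction_flagged_positive(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Share of all cores called positive by the automated system (true + false positives)."""
    return sensitivity * prevalence + (1 - specificity) * (1 - prevalence)


# EGFR, Ariol vs rater: Se 87.7%, Sp 83.7%, 9.8% of cores visually positive.
flagged = fraction_flagged_positive(0.877, 0.837, 0.098)
print(f"Cores needing visual confirmation: {flagged:.1%}")
print(f"Cores accepted as negative without review: {1 - flagged:.1%}")
```

With these inputs roughly a quarter of cores are flagged for review, so visual scoring would be limited to that subset while automated negatives are accepted as final.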
Acknowledgements
We would like to thank Rob Sykes from Leica Biosystems who helped to retrain the SEARCH image analysis for EGFR and CK5/6. We would also like to thank Mike Irwin at the Institute of Astronomy, Cambridge, for assistance with methods in the automated analysis. ABCS was supported by the Dutch Cancer Society [grants NKI 2007‐3839; 2009 4363]; BBMRI‐NL, which is a Research Infrastructure financed by the Dutch government (NWO 184.021.007); and the Dutch National Genomics Initiative. CNIO‐BCS was supported by the Genome Spain Foundation, the Red Temática de Investigación Cooperativa en Cáncer and grants from the Asociación Española Contra el Cáncer and the Fondo de Investigación Sanitario (PI11/00923 and PI081120). The Human Genotyping‐CEGEN Unit (CNIO) is supported by the Instituto de Salud Carlos III. HEBCS was financially supported by the Helsinki University Central Hospital Research Fund, Academy of Finland (132473), the Finnish Cancer Society, The Nordic Cancer Union and the Sigrid Juselius Foundation. The KBCP was financially supported by the special Government Funding (EVO) of Kuopio University Hospital grants, Cancer Fund of North Savo, the Finnish Cancer Organizations, the Academy of Finland and by the strategic funding of the University of Eastern Finland. The MCBCS was supported by an NIH Specialized Program of Research Excellence (SPORE) in Breast Cancer [CA116201], the Breast Cancer Research Foundation, the Mayo Clinic Breast Cancer Registry and a generous gift from the David F. and Margaret T. Grohne Family Foundation and the Ting Tsung and Wei Fong Chao Foundation. ORIGO authors thank E. Krol‐Warmerdam, and J. Blom; The contributing studies were funded by grants from the Dutch Cancer Society (UL1997‐1505) and the Biobanking and Biomolecular Resources Research Infrastructure (BBMRI‐NL CP16). PBCS was funded by Intramural Research Funds of the National Cancer Institute, Department of Health and Human Services, USA. 
SBCS was supported by Yorkshire Cancer Research S295, S299, S305PA. SEARCH is funded by programme grant from Cancer Research UK [C490/A10124. C490/A16561] and supported by the UK National Institute for Health Research Biomedical Research Centre at the University of Cambridge. Part of this work was supported by the European Community's Seventh Framework Programme under grant agreement number 223175 (grant number HEALTH‐F2‐2009‐223175) (COGS). We acknowledge funds from Breakthrough Breast Cancer, UK, in support of MGC and MB.
Author contributions
WJH, FMB, EP, PDP, MES, MG‐C conceived and carried out the study; WJH, LM, PG, NJ, L‐AMcD, JM, FD carried out the centralized laboratory work; EP, MES, EJS, SP, CHMvanD, LJ, RS, DV performed visual scoring; PC performed data management; MG‐C, MNB analysed data; CC, AB, JS, JW, HN, RF, CB, PH, HRA, S‐JD, JF, JL, LB, AM, VK, V‐MK, AC, IWB, SSC, MWR, FJC, JEO, PD, WEM, CMS, AH, JB, JIAP, PM, MKB, DFE, MKS contributed to data collection and/or data management. All authors were involved in writing the paper and gave final approval of the submitted and published versions.
© 2015. This work is published under http://creativecommons.org/licenses/by-nc/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
Breast cancer risk factors and clinical outcomes vary by tumour marker expression. However, individual studies often lack the power required to assess these relationships, and large‐scale analyses are limited by the need for high throughput, standardized scoring methods. To address these limitations, we assessed whether automated image analysis of immunohistochemically stained tissue microarrays can permit rapid, standardized scoring of tumour markers from multiple studies. Tissue microarray sections prepared in nine studies containing 20 263 cores from 8267 breast cancers stained for two nuclear (oestrogen receptor, progesterone receptor), two membranous (human epidermal growth factor receptor 2 and epidermal growth factor receptor) and one cytoplasmic (cytokeratin 5/6) marker were scanned as digital images. Automated algorithms were used to score markers in tumour cells using the Ariol system. We compared automated scores against visual reads, and their associations with breast cancer survival. Approximately 65–70% of tissue microarray cores were satisfactory for scoring. Among satisfactory cores, agreement between dichotomous automated and visual scores was highest for oestrogen receptor (Kappa = 0.76), followed by human epidermal growth factor receptor 2 (Kappa = 0.69) and progesterone receptor (Kappa = 0.67). Automated quantitative scores for these markers were associated with hazard ratios for breast cancer mortality in a dose‐response manner. Considering visual scores of epidermal growth factor receptor or cytokeratin 5/6 as the reference, automated scoring achieved excellent negative predictive value (96–98%), but yielded many false positives (positive predictive value = 30–32%). For all markers, we observed substantial heterogeneity in automated scoring performance across tissue microarrays. Automated analysis is a potentially useful tool for large‐scale, quantitative scoring of immunohistochemically stained tissue microarrays available in consortia. 
However, continued optimization, rigorous marker‐specific quality control measures and standardization of tissue microarray designs, staining and scoring protocols are needed to enhance results.
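The agreement measures reported above (Kappa, positive predictive value and negative predictive value) can all be derived from a 2 × 2 table of dichotomous automated scores against visual scores. The following is a minimal sketch in Python and is not the analysis code used in the study. The function name `agreement_stats` and the cell counts are hypothetical, chosen only to show how a high negative predictive value can coexist with a low positive predictive value when a marker is rarely positive.

```python
def agreement_stats(tp, fp, fn, tn):
    """Agreement of automated (test) against visual (reference) dichotomous scores.

    tp: automated positive, visual positive
    fp: automated positive, visual negative
    fn: automated negative, visual positive
    tn: automated negative, visual negative
    """
    n = tp + fp + fn + tn
    # Observed proportion of cores on which the two methods agree
    observed = (tp + tn) / n
    # Agreement expected by chance, computed from the marginal totals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    ppv = tp / (tp + fp)  # positive predictive value
    npv = tn / (tn + fn)  # negative predictive value
    return {"kappa": kappa, "ppv": ppv, "npv": npv}


# Hypothetical counts for a rarely positive marker (illustration only, not study data)
stats = agreement_stats(tp=30, fp=70, fn=20, tn=880)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, most automated negatives are confirmed visually, whereas only a minority of automated positives are, which is the pattern described for epidermal growth factor receptor and cytokeratin 5/6.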
Details
1 Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
2 Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK
3 Breast Pathology, Addenbrookes Hospital, Cambridge, UK
4 Division of Genetics and Epidemiology, The Institute of Cancer Research, London, UK
5 Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK; Department of Oncology, University of Cambridge, Cambridge, UK
6 Breakthrough Breast Cancer Research Unit, Division of Cancer Studies, King's College London, Guy's Hospital, London, UK
7 Division of Cancer Studies, NIHR Comprehensive Biomedical Research Centre, Guy's & St. Thomas' NHS Foundation Trust in partnership with King's College London, London, UK
8 Research Oncology, Division of Cancer Studies, King's College London, Guy's Hospital, London, UK
9 Department of Pathology, Erasmus University Medical Center, Rotterdam, The Netherlands
10 Centre for Tumour Biology, Barts Institute of Cancer, Barts, UK; The London School of Medicine and Dentistry, London, UK
11 School of Medicine, Institute of Clinical Medicine, Pathology and Forensic Medicine, Cancer Center of Eastern Finland, University of Eastern Finland, Kuopio, Finland; Imaging Center, Department of Clinical Pathology, Kuopio University Hospital, Kuopio, Finland
12 Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
13 Breakthrough Breast Cancer Research Centre, Division of Breast Cancer Research, The Institute of Cancer Research, London, UK
14 Core Facility for Molecular Pathology and Biobanking, Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands
15 Department of Pathology, Division of Diagnostic Oncology, Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands
16 Department of Obstetrics and Gynecology, University of Helsinki and Helsinki University Central Hospital, Helsinki, Finland
17 Department of Oncology, Helsinki University Central Hospital, Helsinki, Finland
18 Department of Pathology, Helsinki University Central Hospital, Helsinki, Finland
19 Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, Maryland, USA
20 Department of Cancer Epidemiology and Prevention, M. Sklodowska‐Curie Memorial Cancer Center & Institute of Oncology, Warsaw, Poland
21 Kuopio University Hospital, Cancer Center, Kuopio, Finland; School of Medicine, Institute of Clinical Medicine, University of Eastern Finland, Oncology and Central Hospital of Central Finland, Central Finland Hospital District, Kuopio, Finland
22 CRUK/YCR Sheffield Cancer Research Centre, Department of Oncology, University of Sheffield, Sheffield, UK
23 Academic Unit of Pathology, Department of Neuroscience, University of Sheffield, Sheffield, UK
24 Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
25 Department of Human Genetics & Department of Pathology, Leiden University Medical Center, The Netherlands
26 Department of Surgical Oncology, Leiden University Medical Center, The Netherlands
27 Family Cancer Clinic, Department of Medical Oncology, Erasmus MC Cancer Institute, Rotterdam, The Netherlands
28 Human Genetics Group, Human Cancer Genetics Program, Spanish National Cancer Research Centre (CNIO), Madrid, Spain; Centro de Investigación en Red de Enfermedades Raras (CIBERER), Valencia, Spain
29 Servicio de Cirugía General y Especialidades, Hospital Monte Naranco, Oviedo, Spain
30 Servicio de Anatomía Patológica, Hospital Monte Naranco, Oviedo, Spain
31 Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
32 Centre for Cancer Genetic Epidemiology, Department of Oncology, University of Cambridge, Cambridge, UK; Centre for Cancer Genetic Epidemiology, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
33 Division of Molecular Pathology, Netherlands Cancer Institute, Antoni van Leeuwenhoek Hospital, Amsterdam, The Netherlands
34 Division of Genetics and Epidemiology, The Institute of Cancer Research, London, UK; Breakthrough Breast Cancer Research Centre, Division of Breast Cancer Research, The Institute of Cancer Research, London, UK