Context.—Precise subtype diagnosis of non–small cell lung carcinoma is increasingly relevant, based on the availability of subtype-specific therapies, such as bevacizumab and pemetrexed, and based on the subtype-specific prevalence of activating epidermal growth factor receptor mutations.
Objectives.—To establish a baseline measure of interobserver reproducibility for non–small cell lung carcinoma diagnoses with hematoxylin-eosin for the current 2004 World Health Organization classification, to estimate interobserver reproducibility for the therapeutically relevant squamous/nonsquamous subsets, and to examine characteristics that improve interobserver reproducibility.
Design.—Primary, resected lung cancer specimens were converted to digital (virtual) slides. Based on a single hematoxylin-eosin virtual slide, pathologists were asked to assign a diagnosis using the 2004 World Health Organization classification. Kappa statistics were calculated for each pathologist-pair for each slide and were summarized by classification scheme, pulmonary pathology expertise, diagnostic confidence, and neoplastic grade.
Results.—The 12 pulmonary pathology experts and the 12 community pathologists each independently diagnosed 48 to 96 single hematoxylin-eosin digital slides derived from 96 cases of non–small cell lung carcinoma resection. Overall agreement improved with simplification from the comprehensive 44 World Health Organization diagnoses (κ = 0.25) to their 10 major header subtypes (κ = 0.48) and improved again with simplification into the therapeutically relevant squamous/nonsquamous dichotomy (κ = 0.55). Multivariate analysis showed that higher diagnostic agreement was associated with better differentiation, better slide quality, higher diagnostic confidence, similar years of pathology experience, and pulmonary pathology expertise.
Conclusions.—These data define the baseline diagnostic agreement for hematoxylin-eosin diagnosis of non–small cell lung carcinoma, allowing future studies to test for improved diagnostic agreement with reflex ancillary tests.
The diagnosis of non–small cell lung carcinoma (NSCLC) histologic subtype is the current gold standard for appropriate selection of chemotherapy, affecting the safety of bevacizumab1 and the efficacy of pemetrexed.2 The efficacy of epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors3 is higher in patients with activating EGFR gene mutations,4 present in 10% to 20% of lung adenocarcinoma (AD),5 but few or no lung squamous carcinoma (SC).6 Here, we estimate pathologists' diagnostic agreement by measuring interobserver reproducibility (IOR) for hematoxylin-eosin (H&E) diagnosis of NSCLC subtypes in resected specimens using the 2004 World Health Organization classification (2004-WHO).7
Four WHO lung cancer classifications have been published: 1967,8 1982,9 1999,10 and 2004.7 These classifications are based primarily on light microscopic evaluation of H&E-stained sections from resected neoplasms. Incremental refinements between editions have included reclassification for some disease entities (eg, solid AD with mucin production), recognition of new disease entities (eg, large cell neuroendocrine carcinoma), fine-tuning of diagnostic criteria, and correlation with clinical, radiographic, immunohistochemical, and molecular variables.
Diagnostic agreement can be estimated by measuring percentage agreement or by calculating a κ statistic, which accounts for chance agreement. The κ statistic ranges from complete disagreement (κ = −1.0) to complete agreement (κ = 1.0), with a target minimum for clinical testing of 0.7.11 Although the WHO classification system is complex, studies typically simplify categories.12–24 Four studies assessed H&E-only IOR for NSCLC. Using the 1967 WHO classification, Feinstein et al17 found 95% to 98% agreement for epidermoid and AD when well differentiated (WD), but only 58% to 60% agreement when poorly differentiated (PD). With the 1982 WHO classification, Hanai et al20 and Yamamoto et al24 reported 77% to 100% and 97% to 98% agreement, respectively. Burnett et al12,13 reported κ = 0.28 to 0.30 for SC and AD with modest improvement when mucin stains were provided. Other IOR studies14,15,18,19,22 are not directly comparable to the current study because they mix H&E-only diagnoses with diagnoses using both H&E and mucin stains. Employing the 1999 WHO classification, Colby et al16 found dominant cell–type agreement in 74% to 82% of NSCLC/small cell lung cancer cases, with an overall κ of 0.65 to 0.74. No published IOR studies were found with the 2004-WHO classifications.
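For reference, the standard two-rater definition of the κ statistic can be written as follows. The symbols are notation introduced here for illustration and do not appear in the original text: p_o is the observed proportion of agreement, p_e is the agreement expected by chance, and p_{1k} and p_{2k} are the proportions of cases that raters 1 and 2 assign to category k.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{k} p_{1k}\, p_{2k}
```

Under this definition, κ = 1 when the two raters always agree and κ = 0 when observed agreement equals chance agreement.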
We designed this baseline study to measure the IOR (agreement) for diagnosis of resected NSCLC. Using the current 2004-WHO, we evaluated the IOR for the H&E diagnosis of NSCLC by comparing 24 pathologists' diagnoses of representative, digital H&E slides from 96 resected lung cancers. We report IORs for the complete 2004-WHO classification of 44 diagnoses (44DC) and estimate IORs for the classification's 10 major diagnostic categories (10DC) and the clinically relevant squamous/nonsquamous (SC/non-SC) classes (Table 1). We also report the effect of pathologists' practice settings, expertise in lung pathology, years of experience, confidence in the H&E diagnosis, slide quality, and carcinoma grade on IOR. This study is the first, to our knowledge, to measure the agreement of NSCLC H&E diagnoses for the entire current 2004-WHO, to estimate IOR for the therapeutically relevant SC/non-SC classes, and to demonstrate the utility of digital slide review.
Sample Selection and Study Population
Sequential, surgically resected, primary NSCLCs (n = 96) collected at the University of North Carolina (Chapel Hill) from 1997 to 2007 were identified. Single diagnostic blocks used in the original pathologic diagnosis were recut, stained with H&E, and scanned using an Aperio ScanScope slide scanner (Aperio Technologies, Vista, California) into virtual slides viewable at magnifications equivalent to ×2 to ×20 objectives (×40 magnifier). Snapshot jpeg images (×2 and ×20) were created from unselected, central areas of the virtual slides. Grades were based on the original pathologic diagnosis. Small cell lung cancer, metastases, and normal specimens were excluded.
Increasing the number of pathologists increases the generalizability of the conclusions, and increasing the number of reviewed slides decreases the standard error around the κ estimate of IOR.25 To balance these considerations, we recruited 12 expert lung pathologists from the Pulmonary Pathology Society and 12 community pathologists. Each pathologist reviewed two random sets of 24 slides of the total 96 slides. Some pathologists elected to review all 96 slides.
Using DVDs containing virtual slides and Internet-based snapshots, pathologists recorded their 2004-WHO diagnoses onto an Internet-based survey. Pathologists were free to base their diagnoses on the virtual slide and/or the snapshot images. For each slide, pathologists reported diagnosis, quality of slide image, diagnostic confidence, and any additional comments. Pathologists' personal identifiers were removed by a designated data manager, but linked demographic information was retained, including years in practice and surgical pathology fellowship (yes/no), as well as whether the participant was an expert lung pathologist or a community pathologist. The study was approved by the University of North Carolina Institutional Review Board.
The Cohen26 simple κ statistic was used to measure agreement among the 222 pathologist-pairs, from combinations of 24 pathologists. Pathologists' 44DC were collapsed into 10 DC and then into SC/non-SC categories (Table 1).27
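The collapsing step and the pairwise κ calculation can be sketched as below. This is a minimal illustration, not the study's code: the mapping `TO_MAJOR` is a hypothetical, partial stand-in for Table 1 (which is not reproduced here), the two raters' diagnoses are invented toy data, and `cohen_kappa` is an unweighted two-rater κ written for this example.

```python
from collections import Counter

# Hypothetical, partial mapping for illustration only; the study's full
# 44-diagnosis to 10-category mapping is its Table 1, not reproduced here.
TO_MAJOR = {
    "papillary squamous cell carcinoma": "squamous cell carcinoma",
    "basaloid squamous cell carcinoma": "squamous cell carcinoma",
    "acinar adenocarcinoma": "adenocarcinoma",
    "papillary adenocarcinoma": "adenocarcinoma",
    "large cell neuroendocrine carcinoma": "large cell carcinoma",
}


def to_major(diagnosis):
    """Collapse a detailed diagnosis into its major header category."""
    return TO_MAJOR[diagnosis]


def to_sc_class(diagnosis):
    """Collapse a detailed diagnosis into the squamous/nonsquamous dichotomy."""
    return "SC" if TO_MAJOR[diagnosis] == "squamous cell carcinoma" else "non-SC"


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters labelling the same slides."""
    n = len(ratings_a)
    # Observed proportion of slides on which the two raters agree.
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_exp = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)


# Invented toy diagnoses for one pathologist-pair on six shared slides.
rater_1 = [
    "papillary squamous cell carcinoma",
    "basaloid squamous cell carcinoma",
    "acinar adenocarcinoma",
    "papillary adenocarcinoma",
    "large cell neuroendocrine carcinoma",
    "acinar adenocarcinoma",
]
rater_2 = [
    "basaloid squamous cell carcinoma",
    "basaloid squamous cell carcinoma",
    "papillary adenocarcinoma",
    "papillary adenocarcinoma",
    "large cell neuroendocrine carcinoma",
    "large cell neuroendocrine carcinoma",
]

kappa_detailed = cohen_kappa(rater_1, rater_2)
kappa_major = cohen_kappa([to_major(d) for d in rater_1],
                          [to_major(d) for d in rater_2])
kappa_sc = cohen_kappa([to_sc_class(d) for d in rater_1],
                       [to_sc_class(d) for d in rater_2])
print(kappa_detailed, kappa_major, kappa_sc)
```

In this toy data, disagreements between variants of the same parent category disappear once the diagnoses are collapsed, so κ rises at each level of simplification. Real data need not behave monotonically.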
Bootstrap methods28,29 (including block bootstrapping) were used to calculate standard errors (standard deviations of the bootstrapped means), from which 95% confidence intervals (CIs) for the (weighted) mean κ statistics were calculated. Subgroup κ statistics were calculated along with their (bootstrap) 95% CIs.
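A minimal sketch of a bootstrap standard error and confidence interval for a mean κ follows. It rests on several assumptions that are not stated in the text: the mean is unweighted, resampling is simple rather than block-based, the interval uses a normal approximation (mean ± 1.96 standard errors), and the input κ values are invented. The function name `bootstrap_mean_ci` is hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, seed=0):
    """Return (mean, bootstrap standard error, 95% CI) for the mean of values.

    Simple (non-block) bootstrap with a normal-approximation interval;
    both are simplifying assumptions made for this sketch.
    """
    rng = random.Random(seed)
    n = len(values)
    # Mean of each resample drawn with replacement from the observed values.
    boot_means = [
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    ]
    # Standard error = standard deviation of the bootstrapped means.
    std_error = statistics.stdev(boot_means)
    center = statistics.fmean(values)
    return center, std_error, (center - 1.96 * std_error, center + 1.96 * std_error)


# Invented pairwise kappa values for a handful of pathologist-pairs.
pair_kappas = [0.41, 0.55, 0.62, 0.48, 0.70, 0.35, 0.58, 0.52]
mean_kappa, se, (ci_low, ci_high) = bootstrap_mean_ci(pair_kappas)
print(mean_kappa, se, ci_low, ci_high)
```

A block bootstrap would instead resample whole groups of dependent observations (for example, all slide-pairs belonging to one slide or one pathologist) so that the dependence within each group is preserved.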
Exploratory analyses were performed using logistic regression modeling to examine possible associations of pathologist, slide, and tumor factors with the probability of agreement. The dependent variable, agreement on a diagnosis for a particular slide by a pathologist-pair, was scored as agreement or disagreement. A c-index30 was used to measure and compare the levels of association for both univariable and multivariable models. The covariates that were evaluated relating to the pathologists included expertise, practice setting, and years of diagnostic experience (both the sum of their combined experience and the absolute value of the difference in their years of experience). We distinguished between tumor factors (inherent to the entire case as diagnosed by the original pathologist) and slide factors (inherent to the image being considered). Tumor factor covariates included pathologic diagnosis and original neoplastic grade. Slide factor covariates included confidence in diagnosis and image quality. In our logistic regression analyses, we categorized diagnosis as SC versus non-SC, grade as WD versus moderately differentiated (MD) versus PD, and confidence as high or not high. Because of the exploratory nature of the analysis, we did not adjust for the dependencies among slides and pathologists. Odds ratios with 95% CIs are given for these covariates of interest (Table 4).
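The two summary measures named above can be illustrated with a short sketch. It is not the study's SAS or R code. The c-index is computed here as the proportion of (agreement, disagreement) pairs in which the agreeing slide-pair received the higher predicted probability, with ties counted as one half. The odds ratio is computed from a 2 × 2 table with a Wald interval, which matches a logistic regression odds ratio only for a single binary covariate. All counts and scores are invented, and both function names are hypothetical.

```python
import math


def c_index(outcomes, scores):
    """Concordance index for binary outcomes (1 = agreement, 0 = disagreement)."""
    events = [s for y, s in zip(outcomes, scores) if y == 1]
    non_events = [s for y, s in zip(outcomes, scores) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one agreement and one disagreement")
    concordant = 0.0
    for e in events:
        for ne in non_events:
            if e > ne:
                concordant += 1.0   # model ranks the agreeing pair higher
            elif e == ne:
                concordant += 0.5   # ties count as half
    return concordant / (len(events) * len(non_events))


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2 x 2 table.

    a, b: agreements and disagreements when the factor is present
    c, d: agreements and disagreements when the factor is absent
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


# Invented example: four slide-pairs with model-predicted agreement probabilities.
c_stat = c_index([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7])

# Invented counts: high-confidence pairs (30 agree, 10 disagree) versus
# other pairs (20 agree, 20 disagree).
or_estimate, or_low, or_high = odds_ratio_wald_ci(30, 10, 20, 20)
print(c_stat, or_estimate, or_low, or_high)
```

A c-index of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination between agreeing and disagreeing slide-pairs.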
Analyses were performed using both SAS (Version 9.2; SAS Institute, Inc, Cary, North Carolina) and R statistical software (R Development Core Team 2008).31
Twelve of 13 expert lung pathologists (92%) and 12 of 13 community pathologists (92%) agreed to participate in the study. A surgical pathology fellowship had been completed by 16 of 24 pathologists (67%). A median of 17 years (range, 1–36 years) of posttraining experience was reported (Table 2). Based on the 24 study pathologists reviewing random allocations of 48 to 96 slides, a comprehensive 1∶1 matching of pathologists' pairwise agreements resulted in a total of 222 unique “pathologist-pairs” and 7130 unique slide viewings (“slide-pairs”) reviewed by the pathologist-pairs. Slide-pairs (2 pathologists' diagnoses of a single slide) formed the fundamental unit by which we measured agreement.
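The pairing scheme can be sketched as follows. With 24 pathologists there are 276 possible pairings (24 choose 2); the sketch assumes, as an interpretation not stated explicitly in the text, that a pathologist-pair is counted only when the two pathologists reviewed at least one slide in common, which would explain a total below 276. The reader names, slide assignments, and the function `shared_slide_pairs` are all hypothetical.

```python
from itertools import combinations
from math import comb


def shared_slide_pairs(slides_by_reader):
    """Map each pair of readers to the set of slides that both reviewed.

    Pairs with no slide in common are omitted, because they contribute
    no slide-pairs on which agreement could be measured.
    """
    pairs = {}
    for reader_1, reader_2 in combinations(sorted(slides_by_reader), 2):
        common = slides_by_reader[reader_1] & slides_by_reader[reader_2]
        if common:
            pairs[(reader_1, reader_2)] = common
    return pairs


# Invented allocation: four readers, eight slides.
slides_by_reader = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5, 6},
    "C": {5, 6, 7, 8},
    "D": {1, 2, 3, 4, 5, 6, 7, 8},
}

pairs = shared_slide_pairs(slides_by_reader)
n_pathologist_pairs = len(pairs)
n_slide_pairs = sum(len(common) for common in pairs.values())
print(comb(24, 2), n_pathologist_pairs, n_slide_pairs)
```

In this toy allocation, readers A and C share no slides, so only 5 of the 6 possible pairings contribute, yielding 16 slide-pairs in total.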
All virtual slides contained cancer. All (96 of 96; 100%) low-power and 94% (90 of 96) of high-power jpeg snapshot images contained representative fields of the same neoplasm. Six percent (6 of 96) of the high-power jpeg snapshot images did not contain representative fields of the neoplasm seen in the low-power jpeg snapshot image. The IORs for pathologists who used primarily jpegs or both were similar with or without elimination of the 6 cases with nonrepresentative high-power jpeg images.
Four out of 24 pathologists (17%) experienced technical challenges in use of the large DVD virtual slide files and retrospectively reported using jpegs exclusively or a mixture of jpegs and DVDs. The IORs for pathologists who primarily used DVDs were similar to those who used primarily jpegs or both versions (data not shown).
On average, pathologists rated 91% of the diagnostic images as being of sufficient quality for diagnosis, with little agreement on which slides were of low quality. Quality was uniformly scored as acceptable for 37 of 96 slides (39%), with an additional 32 slides (33%) receiving only one unacceptable quality rating. Pathologists assigned confidence in their diagnoses as follows: high, 52%; moderate, 40%; and poor, 8% (Table 2).
The distributions of the original and study diagnoses were AD, 35% and 36%; SC, 35% and 31%; adenosquamous, 13% and 3%; large cell, 9% and 17%; miscellaneous, 6% and 7%; sarcomatoid carcinoma, 1% and 4%; and carcinoid, 1% and 2%, respectively. Based on the original pathologic grade, slides were 3% WD, 54% MD, and 43% PD (Table 2).
Overall, the IOR for H&E diagnoses for the entire 2004-WHO classification system (44DC) was κ = 0.25 (95% CI, 0.23–0.26) (Figure 1; Table 3). The 44DC κ statistics improved with simplification into 10DC (overall κ = 0.48), and again into the SC/non-SC classes (overall κ = 0.55; 95% CI, 0.53–0.58) and into the AD/non-AD classes (overall κ = 0.59; 95% CI, 0.57–0.61). Table 3 shows the variability of IOR as a function of diagnostic confidence, pulmonary pathology expertise, and neoplastic grade. The IOR varied most widely as a function of the pathologist's confidence in his or her H&E diagnosis. For each classification and level of expertise, IOR was higher when diagnostic confidence was higher. Overall, IOR improved by simplifying 44DC (high confidence κ = 0.38, moderate confidence κ = 0.15) into 10DC (high confidence κ = 0.69, moderate confidence κ = 0.31) and again into SC/non-SC classes (high confidence κ = 0.78, moderate confidence κ = 0.28).
For each classification (44DC, 10DC, dichotomous), IOR was higher when pulmonary pathology expertise was higher (Table 3). The IOR improved by simplifying the classification from 44DC (expert κ = 0.30, community κ = 0.19) into 10DC (expert κ = 0.55, community κ = 0.36), and again into SC/non-SC classes (expert κ = 0.64, community κ = 0.41) and AD/non-AD classes (expert κ = 0.69, community κ = 0.46).
For each classification (44DC, 10DC, dichotomous), IOR was higher when carcinomas were better differentiated (Table 3). The IOR improved by simplifying the 44DC (WD/MD κ = 0.27; PD κ = 0.22) into 10DC (WD/MD κ = 0.52; PD κ = 0.41) and again into the SC/non-SC (WD/MD κ = 0.60; PD κ = 0.46) and AD/non-AD (WD/MD κ = 0.64; PD κ = 0.48) classes. When considering only the 3 WD slides (all non-SC), pathologists were in 100% diagnostic agreement.
Mean agreement of each study pathologist's diagnosis with the original pathologist's diagnosis (κ = 0.52) was comparable to the overall IOR for 10DC of κ = 0.48. To assess the effect of potential outliers, study pathologist-pairs were stratified by pairwise agreement quartiles. The top quartile approached the goal of κ = 0.70 for good clinical agreement, whereas the bottom quartile had fair agreement. We identified both expert and community pathologists in all agreement quartiles (data not shown).
Tumor, slide, and pathologist variables were evaluated for univariable and multivariable effect on SC/non-SC IOR (Table 4). All univariable and all but one multivariable predictor (cumulative pathologist experience) were statistically significant. Predictors for higher IOR included better-differentiated carcinomas, better slide quality, and higher diagnostic confidence. Pathologist diagnostic confidence was statistically associated with neoplastic grade, slide quality, experience, and expertise. Because confidence was highly associated with the perception of slide quality (P < .001), any effect of slide quality on interpretation is probably reflected in the data regarding diagnostic confidence.
An increasing difference in years of pathologist practice experience predicted decreased IOR. Roughly, a 10% decrease in agreement was found for every 5-year difference in practice experience. Increased cumulative pathologist practice experience predicted increased IOR, statistically significant by univariable analysis only, with a 3% increase in agreement for every 5 years of cumulative practice experience. Pulmonary pathology expertise in both pathologists of a pair predicted an increased IOR: expert pathologist-pairs had a 38% increase in the odds of agreement compared with community pathologist-pairs. Pulmonary pathology expertise was highly correlated with confidence, such that the odds of agreement for expert pathologist-pairs showed a 21% increase after controlling for confidence, quality, and grade in multivariable analysis (Table 4). Figure 2 graphically summarizes many of the results. Some cases, particularly WD cases of SC and AD, were readily identified with high IOR by H&E alone.
Strengths of the Study
Non–small cell lung carcinoma subtyping has refined and improved survival of patients with advanced NSCLC.2,32 We designed a comprehensive prospective study of H&E diagnostic agreement for NSCLC. Using the 2004-WHO, our data measure IOR for the entire 44DC and provide estimates for the parent 10DC and the therapeutically relevant SC/non-SC classes (Table 1). These data evaluate factors that might predict IOR, including sums and differences in years of practice experience, expertise in lung pathology, slide quality, diagnostic confidence, and carcinoma grade.
We hypothesized that IOR for the H&E diagnosis of NSCLC subtypes according to the 2004-WHO would show a κ of 0.7, an agreed-upon, albeit arbitrary, target for minimal clinical test reproducibility. We found that overall IOR among study pathologists was fair (κ = 0.25) when using all 44DC, with improvement following collapse into the 10DC (κ = 0.48) or the therapeutically relevant SC/non-SC classes (κ = 0.55) (Table 3). The low κ for 44DC is not surprising because many of these diagnoses would not be made in practice without ancillary stains. Our 10DC IOR results appear similar in magnitude to studies of prior versions of the classification,12,13,15,16,18–22,24 but direct comparison to historic studies is limited because the most methodologically similar study12,13 used bronchial biopsies rather than resection specimens. Additionally, other studies used glass slides and simplified the classification system into major diagnostic categories rather than using the comprehensive diagnostic listings.
Our multivariate analysis shows that grade, slide quality, diagnostic confidence, difference in experience, and pulmonary pathology expertise are independent predictors of NSCLC H&E diagnostic agreement, although those methods do not account for the dependencies among the slide reviews. Controllable factors that may improve agreement include optimizing H&E slide quality and increasing lung pathology expertise.
Our data suggest an upper limit for IOR by H&E alone, mainly because of PD NSCLC lacking morphologic features of SC or AD.19 Pathologist confidence in his or her H&E slide diagnosis, the most predictive factor for increased IOR, likely reflects a qualitative amalgamation of grade, slide quality, and expertise. Diagnostic agreement may improve with systematic definition and application of reflex stain panels for PD NSCLC. Providing histochemical (eg, mucin) and immunohistochemical (eg, thyroid transcription factor 1, p63, cytokeratin 5/6, and napsin A) phenotypes, as well as cytogenetic tests (echinoderm microtubule-associated proteinlike 4 [EML4]–anaplastic lymphoma kinase [ALK] translocation) and molecular tests (eg, EGFR/KRAS/BRAF mutations) to define molecular targets for therapy likely would have improved diagnostic agreement; this is an important question for follow-up studies.
The 2004-WHO continues to reward the lung cancer community with meaningful associations, such as EGFR mutations with AD,33 and the EML4-ALK fusion oncogene with signet-ring histology.34 The goal remains to incrementally improve diagnostic classifications, criteria, and reflex ancillary tests to optimize agreement, as well as to report associated prognostic and predictive data to guide patient management.
Although detailed classification likely reflects underlying biology, κ statistics increase with a reduced number of classes; therefore, simplifying the morphologic classification should improve agreement. Pathology reports that include both the specific (44DC) diagnosis and the parent (10DC) category may reduce confusion among treating clinicians regarding management of uncommon WHO diagnoses.
Although our data include 7130 slide-pairs drawn from an incident patient series of 96 cases, we recognize that the sample size was insufficient to represent all diagnostic entities in the 2004-WHO. Diagnoses were based on single H&E images, rather than complete cases (glass slides with ancillary stains), with a goal of establishing baseline κ statistics for the H&E diagnosis of NSCLC. Based on feedback from several pathologists at the time the study was designed, we determined that reviewing 48 to 96 entire cases would deter participation. Study pathologists' agreement with each other was similar to their agreement with the original pathology diagnosis, arguing that our study design reflects what would have been observed if the entire case had been reviewed. We intentionally provided only H&E sections, without pertinent clinical, radiographic, or ancillary stain data, other than the knowledge that the patient carried a diagnosis of NSCLC, to estimate IOR of 3 relevant NSCLC classifications (44DC, 10DC, SC/non-SC) under conditions in which each pathologist had exactly the same information: an H&E image only.
Several pathologists lacked familiarity with digital images or had concerns regarding image resolution, which may have compromised their diagnostic abilities. However, digital images control for any variation in the circulated images, a major advantage over the morphologic variation inevitable in 24 recut sections through a paraffin block. Although not yet widely employed in clinical practice, digital slides are commonly used in teaching and research, including for The Cancer Genome Atlas.35 Wider use of digital slides could facilitate timely accrual to trials requiring central pathology review and expedite expert review of challenging cases.
The IOR was similar among pathologists who primarily used DVDs versus jpegs or both (regardless of the 6 cases with nonrepresentative, high-power jpeg images). These data argue that IOR estimates were not affected by a pathologist's decision to use snapshots versus DVD images, or by the 6% of cases with nonrepresentative ×20 snapshots.
Our resected specimen results may be extrapolated to, but may not fully represent, small biopsies and fine-needle aspirates from patients with advanced NSCLC. Recently, the International Association for the Study of Lung Cancer/American Thoracic Society/European Respiratory Society36 published major changes in the lung AD subclassification, including guidelines for small-biopsy diagnosis, although those changes do not alter the distinction among the major 10DC subtypes, such as SC and AD.
The SC/non-SC categorization is not a feature of the WHO classification but, rather, was based on clinical and regulatory practice: pemetrexed has no proven efficacy in SC in any of 3 pivotal studies2,37,38 contributing to the drug's approval in non-SC histology NSCLC, and bevacizumab is contraindicated in SC because of potential life-threatening hemorrhage.1 Our study was executed in 2008, before the publication of pivotal studies related to pemetrexed and bevacizumab in journals not directed at pathologists. Nevertheless, we demonstrate that even a simple classification, such as SC/non-SC, is imperfect by H&E alone (SC/non-SC, experts, maximum κ = 0.84).
Management of advanced NSCLC is now critically dependent on precise histologic diagnoses. This study provides baseline estimates of the IOR for H&E diagnosis of NSCLC and shows that agreement is a function of pathologist experience, pulmonary pathology expertise, pathologist diagnostic confidence, slide quality, and carcinoma grade. Strict definition and application of diagnostic criteria may incrementally improve IOR for H&E diagnosis of NSCLC, but major improvements in NSCLC IOR will likely depend on systematic integration of validated histochemical, immunohistochemical, and molecular methods. We recommend reporting the major (10DC) diagnostic category along with the specific (44DC) WHO diagnosis, thereby providing the groundwork for further therapeutic advances while reducing the potential for clinical confusion in how to manage unusual NSCLC cases. Our findings define a baseline measure for NSCLC H&E diagnostic agreement, to which future studies determining incremental benefits of reflex ancillary tests at the protein, cytogenetic, and molecular levels may be compared.
Research was supported by a grant from the Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill.
These authors contributed equally to this manuscript.
The authors have no relevant financial interest in the products or companies described in this article.
Presented in part at the Metastatic Lung Session of the 45th Annual Meeting of the American Society of Clinical Oncology; May 29, 2009, to June 2, 2009; Orlando, Florida. Presented in part as a poster at the Pathology Session of the 13th World Conference on Lung Cancer; July 31, 2009, to August 4, 2009; San Francisco, California. Presented in part at the Pulmonary Pathology Society Meeting; June 24–26, 2009; Portland, Oregon.