Context.—High-resolution scanning technology provides an opportunity for pathologists to make diagnoses directly from whole slide images (WSIs), but few studies have attempted to validate the diagnoses so obtained.
Objective.—To compare WSI versus microscope slide diagnoses of previously interpreted cases after a 1-year delayed re-review (“wash-out”) period.
Design.—An a priori power study estimated that 450 cases might be needed to demonstrate noninferiority, based on a null hypothesis: “The true difference in major discrepancies between WSI and microscope slide review is greater than 4%.” Slides of consecutive cases interpreted by 2 pathologists 1 year prior were retrieved from files, and alternate cases were scanned at an original magnification of ×20. Each pathologist reviewed his or her cases using either a microscope or an imaging application. Independent pathologists identified and classified discrepancies; an independent statistician calculated major and minor discrepancy rates for both WSI and microscope slide review of the previously interpreted cases.
Results.—The 607 cases reviewed reflected the subspecialty interests of the 2 pathologists. Study limitations include the lack of cytopathology, hematopathology, or lymphoid cases; the case mix was not enriched with difficult cases; and both pathologists had interpreted several hundred WSI cases before the study to minimize the learning curve. The major and minor discrepancy rates for WSI were 1.65% and 2.31%, whereas rates for microscope slide reviews were 0.99% and 4.93%.
Conclusions.—Based on our assumptions and study design, diagnostic review by WSI was not inferior to microscope slide review (P < .001).
Although radiologists have been using digital imaging technology for diagnostic purposes for years, technology has only recently advanced to the point that interpretation of scanned images of whole microscope slides for primary diagnosis is feasible. This technology is evolving rapidly, and several manufacturers now offer instruments that can scan slides relatively rapidly and create high-resolution digital images for education, research, archiving, quality assurance, image analysis, and diagnosis. Several previous studies have compared the interpretation of microscope slides with whole slide imaging (WSI) (also known as virtual microscopy or digital microscopy; Table 1) for primary diagnosis, consultation, or interpretation of frozen sections,1–14 but many of those studies have been limited in scope, and few have measured the baseline intraobserver variability for glass microscope slide interpretation to which WSI intraobserver variability should be compared. A committee of the College of American Pathologists temporarily posted for comment 13 draft guidelines for validation of digital imaging for clinical use. Although not readily available and not yet accepted by the College of American Pathologists, those guidelines generally recommended testing intraobserver variability by comparing WSI with glass microscope slide diagnosis after an appropriate “wash-out” interval between interpretations. The purpose of this article is to describe a prospective, intraobserver variability study, approved by the Institutional Review Board, in which we compared whole slide imaging review of previously diagnosed cases with microscope slide review after a minimum 1-year delayed re-review (“wash-out”) period from the original diagnosis.
The Cleveland Clinic Health System is composed of a large, main hospital (Cleveland, Ohio) with a group of affiliated regional and distant hospitals. There are approximately 42 pathologists at the main hospital, where in 2010 more than 109 000 surgical pathology cases were interpreted, amounting to more than 650 000 microscope slides. Pathologists at the main hospital subspecialize, and most pathologists have one or two subspecialty interests. Pathologists at the affiliated regional hospitals do not subspecialize as much and have practices similar to community pathologists elsewhere. Our validation study was designed to support our intention to potentially use WSI for primary diagnosis in both subspecialty and general surgical pathology practices by testing the hypothesis that interpreting with WSI is not inferior to interpreting with microscope slides.
MATERIALS AND METHODS
A Priori Power Study
Before starting the study, we sought to estimate the number of cases we would need to review to test the hypothesis that interpreting with WSI is not inferior to interpreting with microscope slides. That power study required an estimate of the approximate intraobserver variability we would expect based on previous experience or the literature. Few published studies document true intraobserver variability in surgical pathology. Some studies have reported “error rates,” or discrepancies between frozen and permanent sections, but many of those studies primarily reflect interobserver variability. For example, Raab and coworkers15 reviewed self-reported discrepancies from 75 laboratories participating in the College of American Pathologists Q-Probe program. The overall discrepancy rate was about 7%, and the major discrepancy rate was 5.3%. Similarly, a retrospective review16 of 715 second-opinion slide reviews obtained for patients being treated at the Sun Yat-Sen Cancer Center yielded a major discrepancy rate of 6%. Another review cited a wide range of “error rates” but concluded that “[c]urrently, pathology appears to be operating at about a 2.0% error rate.”17(p624)
Although this gives us some information about existing variability in surgical pathology, it is of marginal relevance to validating WSI. Several previous studies, however, have described experiences comparing the use of telepathology and/or WSI with microscope slides. For example, Evans and coworkers2 have described the use of telepathology, and subsequently WSI (“virtual slides”) to interpret frozen sections throughout a health network system in Toronto, Canada. With an experience of more than 350 interpretations by robotic microscope and 630 by WSI, the deferral rate was 7.7%, and the accuracy rate was 98% for both modalities. Molnar9 compared microscope slide and WSI diagnoses of 61 routine gastric biopsies and 42 routine colon biopsies, and reported a final concordance rate of 95.1%. Jukic et al8 evaluated intraobserver discrepancies among 3 pathologists after each reviewed glass slides and digital images of 101 cases. After excluding highly unusual or difficult cases and those that needed high magnification, the rate of significant discrepancies was 4.4%.
Each of the above studies has limitations but still provides some guidance as to the approximate number of discrepancies that might be expected in a prospective validation study. We shared those and other studies with a professional statistician, who then performed a sample-size calculation. Based in part on the literature noted above, we assumed the major discrepancy rate between the original diagnosis and the microscope slide review diagnosis would be 3%. We set the noninferiority margin for WSI review at 4%. A 1-sided binomial test was used for comparison, with 80% power and a significance level of .05. Based on those assumptions, we calculated that 450 cases (225 per group) would need to be reviewed to establish noninferiority.
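The sample-size arithmetic above can be reproduced with a standard 2-group noninferiority formula for proportions (normal approximation). The sketch below is our reconstruction, not the statistician's actual code; the function name is ours, and we assume equal expected discrepancy rates of 3% in both arms against the 4% margin.

```python
from math import ceil
from statistics import NormalDist

def noninferiority_n(p_glass, p_wsi, margin, alpha=0.05, power=0.80):
    """Per-group sample size for a 1-sided noninferiority comparison of
    two proportions. H0: p_wsi - p_glass >= margin (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # 1-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p_glass * (1 - p_glass) + p_wsi * (1 - p_wsi)
    effect = margin - (p_wsi - p_glass)
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

n = noninferiority_n(p_glass=0.03, p_wsi=0.03, margin=0.04)
print(n, 2 * n)  # 225 per group, 450 total
```

With the study's assumptions (3% expected rate in both groups, 4% margin, α = .05 one-sided, 80% power), this formula reproduces the 225-per-group (450 total) figure reported in the text.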
We hypothesized that WSI review was noninferior to microscope slide review. It would be considered noninferior if major discrepancies with WSI review were not more than 4% when compared with microscope slide review. Put another way, the null hypothesis that we hoped to reject stated that the true difference in major discrepancies between WSI and microscope slide reviews was greater than 4%.
Study Design

There were 2 primary pathologists. One pathologist reviewed cases typical of a subspecialty practice at our main hospital (mostly orthopedic and gastrointestinal cases), whereas the second pathologist reviewed a broader spectrum of cases similar to a community practice. Both pathologists performed several hundred side-by-side reviews of WSI and matched microscope slides to gain experience before the study started. During that training interval, both pathologists easily recognized digital images of previously reviewed glass microscope slides, so for the study we chose to use a delayed re-review (“wash-out”) period of at least 1 year after glass slide primary diagnosis.
Our overall study design is shown in Figure 1. The study coordinator first identified consecutive cases that each of the 2 pathologists had interpreted beginning on November 1, 2009. Outside cases for which microscope slides had been returned and cases with gross-only diagnoses were excluded. The original microscope slides were retrieved from file, and sequentially alternating cases were scanned at an original magnification of ×20 using a whole slide imaging system (ScanScope XT, Aperio, Vista, California). At the same time, working drafts of the original surgical pathology reports were reproduced and censored with respect to the original 2009 diagnosis. The working drafts were not censored for other clinical information; the intention of using working drafts was to provide each pathologist with the same amount of background information at the time of review as was available at the time of primary diagnosis (eg, specimen site, gross description, among others). The cases (either microscope slides and paperwork, or paperwork only and notification of a scanned image in queue) were distributed to each pathologist. Each pathologist reviewed the cases, using either a microscope or using whole slide image viewer software (ImageScope, Aperio) on a 24-inch monitor (Dell, Round Rock, Texas), and diagnoses were entered into a spreadsheet. At the completion of the review process, the study coordinator distributed the spreadsheets to an independent, referee pathologist, who determined which cases were concordant and which appeared to be discordant. Once the potentially discordant cases were identified, the original microscope slides and all the data for each case were grouped according to subspecialty and were then passed on to the directors of the appropriate subspecialty section. 
Each director then reviewed all the available material to establish an independent, gold standard diagnosis and, thereby, determine whether the case was concordant, a minor discrepancy, or a major discrepancy. In the case of discrepant diagnoses, the subspecialty director identified which of the 2 diagnoses (ie, the 2009 primary diagnosis or the review diagnosis—either glass slide or WSI) was most correct. Finally, the entire spreadsheet was passed on to an independent, professional statistician for analysis.
As described above, the study involved a number of critical participants: 2 primary study pathologists; a study coordinator, who was responsible for case selection, deidentification, and maintenance of the database; imaging technicians, who scanned slides and uploaded files; and an information technology specialist, who coordinated and maintained informatics. In addition, an independent, referee pathologist screened the diagnoses for possible discrepancies; subspecialty directors reviewed each potentially discrepant case; and an independent, professional statistician performed the initial a priori power study as well as the final analysis of the results.
Cases Versus Parts
The distinction between “cases” and “parts” is another consideration in calculating the number of diagnoses necessary to establish noninferiority, as well as the number of discrepant diagnoses. For example, a prostate specimen (case) might contain 6 or more individual biopsies (parts). The interpretation of any one of those could influence patient care, depending on the findings in the remaining biopsies. On the other hand, patient care decisions are usually based on interpreting the combination of all of the parts together. For purposes of our validation study, we chose to set a low threshold for defining discrepancies, and we chose to review a large number of diagnoses so that either cases or parts could be used in subsequent statistical analyses.
We used a low threshold for defining discrepancies. For example, our subspecialty director determined that the difference between a total Gleason score of 6 and a total Gleason score of 7 would be a major discrepancy. Interpreting the case illustrated in Table 2 by parts, there are 2 major discrepancies and 1 minor discrepancy. However, the subspecialty director reviewed the microscope slides from the case and determined that one of the 2009 diagnoses was better than the subsequent WSI diagnosis (part A), whereas the WSI review diagnosis of part C was better than the original microscope slide diagnosis. Therefore, if we adjust the discrepancy rate by subtracting the part for which the WSI diagnosis was better, we derive an “adjusted major discrepancy rate” of 1 (part) for this case. The results described below are reported with respect to both parts and cases, as well as with respect to raw and adjusted major and minor discrepancy rates.
Statistical Methods

The a priori power study to determine the minimum sample size is described above. At the completion of the study, a noninferiority test (1-sided binomial test) was performed to compare the major discrepancy rate and the minor discrepancy rate between the WSI re-review and the microscope slide re-review. The noninferiority margin was 4%. Unless otherwise specified, all tests were performed at a significance level of .05. All analyses were performed using SAS software (version 9.2, SAS Institute, Cary, North Carolina).
RESULTS

As noted above, our a priori power study indicated that, based on the assumptions described, we needed at least 450 diagnoses (225 in each group) to demonstrate noninferiority of WSI compared with microscope slide review. We actually reviewed 607 cases, composed of 1025 parts. The results are summarized in Tables 3–7. For the cases, there was complete concordance in 554 of 607 cases (91.3%); 5 WSI review cases (1.7%) and 3 microscope slide review cases (0.99%) were deferred (see below). Of the 304 cases in the microscope slide review group, there were 3 major discrepancies (0.99%). There were a total of 9 major discrepancies in the WSI group (3.0%), but the WSI review diagnosis was better than the original microscope slide diagnosis in 4 of those cases (1.3% of all WSI cases; 44% of the major discrepancies), whereas the original microscope slide interpretation was better than the WSI review in 5 cases (56% of the major discrepancies), for an adjusted major discrepancy rate for WSI review of 1.65%.
The baseline minor discrepancy rate for the microscope slide review group was 4.93% (15 of 304 cases). There were 18 cases (5.9%) with minor discrepancies among the WSI cases, but the WSI review diagnosis was better than the original in 11 of those cases (61% of the minor discrepancies) for an adjusted minor discrepancy rate for WSI review of 2.31% (7 of 303 cases).
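The adjustment described under “Cases Versus Parts” reduces a raw discrepancy count by the cases in which the review diagnosis was judged better than the original. As a minimal sketch (the helper name is ours, and the counts are taken from the text above), the case-level adjusted rates can be recomputed as follows:

```python
def adjusted_rate(discrepancies, review_better, n_reviewed):
    """Discrepancy rate after subtracting the cases in which the
    review (WSI) diagnosis was judged better than the original."""
    return (discrepancies - review_better) / n_reviewed

# 303 WSI review cases: 9 major discrepancies (WSI better in 4),
# 18 minor discrepancies (WSI better in 11)
major = adjusted_rate(9, 4, 303)    # 5 of 303
minor = adjusted_rate(18, 11, 303)  # 7 of 303
print(f"{major:.2%}, {minor:.2%}")  # 1.65%, 2.31%
```

This reproduces the adjusted WSI major (1.65%) and minor (2.31%) discrepancy rates reported above.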
With respect to individual pathologists, the major case discrepancy rate for pathologist 1 was 0.3% (1 of 334 cases); that one case was a WSI review of a previous microscope slide diagnosis. Pathologist 1 had 2 digital review cases (1.2% of 167 cases) with minor discrepancies in which the original microscope slide interpretation was better and 6 minor discrepancy cases (3.59% of 167 cases) in which the WSI review diagnosis was better than the original microscope slide interpretation. Pathologist 1 also had 5 minor discrepancies (3% of 167 cases) in the microscope slide re-review group. Pathologist 2 had 8 of 136 cases (5.9%) in which there was a major discrepancy between the WSI and the original microscope slide diagnosis. Expert review determined that in 4 of those cases (50% of discrepant cases; 2.9% of all cases), the original microscope slide diagnosis was best, and in the other 4 cases, the WSI review diagnosis was best, for an adjusted major discrepancy rate of 2.94% (4 of 136 cases). Pathologist 2 had 5 digital review cases (3.68% of 136 cases) with minor discrepancies in which the original microscope slide interpretation was better and 5 minor discrepancy cases (3.68% of 136 cases) in which the WSI review diagnosis was better than the original microscope slide interpretation. Pathologist 2 had 3 major discrepancies (2.20%) and 10 minor discrepancies (7.3%) in the microscope slide re-review group.
The 607 cases were composed of 1025 parts (Table 4). There was complete concordance in 92% of those parts; 10 were deferred (5 in the WSI group and 5 in the microscope slide review group). The baseline minor discrepancy rate based on microscope slide review was 4.58% (24 of 524 parts). There were 25 minor discrepancies among the 501 parts (4.99%) in the WSI review group. Of those minor discrepancies, the WSI review diagnosis was better than the original interpretation in 14 parts (2.79% of WSI parts; 56% of minor discrepancies), whereas the original diagnosis was better than the WSI review diagnosis in 11 parts, for an adjusted minor discrepancy rate of 2.20%.
As noted above, 8 cases in the study were deferred: 3 were glass microscope slide reviews, and 5 were WSI reviews. Two of the 3 microscope slide reviews were deferred because they were difficult diagnoses: one was dysplasia in Barrett esophagus, and the other was dysplasia in a patient with ulcerative colitis; each of those cases had also been deferred to a subspecialty consensus group at the time of original diagnosis in 2009. The third deferred microscope slide review required correlation with radiographic imaging to rule out melorheostosis. Of the 5 deferred digital image reviews, 3 (60%) had also been deferred to a consensus group at the time of original diagnosis: (1) dysplasia in a patient with inflammatory bowel disease, (2) an inflammatory cloacogenic polyp with prolapse changes, and (3) a difficult uterine stromal proliferation. The fourth deferred WSI case required polarization to exclude uric acid crystals. The only WSI case in the study deferred because of unsatisfactory imaging was a retinal membrane in an ocular pathology slide; for that case, focus points could not be satisfactorily identified because the hematoxylin-eosin stain had faded over time. Although no other cases were deferred because of image quality, both pathologists requested that some cases be rescanned at an original magnification of ×40, instead of the default ×20, usually to better identify inflammatory cells or microorganisms.
The major discrepancies in the WSI group are listed in Table 5. In retrospect, the case of reflux esophagitis that was interpreted as eosinophilic esophagitis might have benefited from review at an original magnification of ×40 instead of ×20. A gastric biopsy was originally correctly interpreted as intramucosal adenocarcinoma but interpreted as chronic gastritis with foveolar hyperplasia and reactive changes by WSI review of an image scanned at an original magnification of ×20 (Figure 2, A). Expert review found the case to be difficult using either slides or digital images, but concluded that the original interpretation was correct. Recent rescanning at an original magnification of ×40 improved the resolution, but the case was still difficult (Figure 2, B).
All 11 parts with minor discrepancies in which the original microscope slide diagnosis was better than the WSI review diagnosis are listed in Table 6. Finally, the discrepancy rates for microscope slide and WSI review are summarized in Table 7. Whether based on cases or the number of individual diagnostic parts, review by WSI was not inferior to microscope slide review (P < .001).
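The final comparison is described only as a 1-sided binomial test, so the sketch below is one plausible reconstruction rather than the statistician's actual SAS analysis: the adjusted WSI major discrepancy count is tested against a null proportion equal to the microscope slide baseline rate plus the 4% noninferiority margin. The function name and the exact null specification are our assumptions.

```python
from math import comb

def binom_cdf(k, n, p):
    """Exact lower-tail probability P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Null proportion: microscope slide baseline (3 of 304) plus the 4% margin
p0 = 3 / 304 + 0.04
# Adjusted WSI major discrepancies: 5 of 303 cases
p_value = binom_cdf(5, 303, p0)
print(p_value)  # < .05, consistent with rejecting the null hypothesis
```

Under this reconstruction, observing only 5 adjusted major discrepancies when roughly 15 would be expected at the null boundary yields a small P value, in keeping with the reported rejection of the null hypothesis.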
COMMENT

Whether or not a test is cleared by the US Food and Drug Administration, it is still appropriate for a laboratory to validate the use of a new diagnostic test, especially one directly related to patient safety. For WSI, the process of validation consists of 2 major parts. The first is the scanner (including a built-in microscope), computer, application software, and monitor, whereas the second is the actual interpretation of the scanned images for diagnosis. Many of the WSI systems are closed, such that verification of the process whereby the scanned image accurately reproduces the content of the microscope slide is controlled and performed by the manufacturer.
The second part of validating the interpretation of WSI systems is to demonstrate with a reasonable degree of confidence that patient safety would not be compromised by using the system for patient care. That process necessarily involves comparing diagnoses rendered using the WSI system with diagnoses achieved using microscopes. Several previous studies have reported on the “accuracy” of telepathology and/or WSI technology for diagnosis. For example, with the goal of validating the use of WSI to interpret frozen sections, Fallon and coworkers3 prepared digital slides of 52 consecutive frozen sections that had been obtained to diagnose ovarian lesions. Two pathologists reviewed the images, and the correlation between WSI and frozen section diagnoses was 96% with respect to benign versus malignant, borderline, or uncertain. A recent study by Chargari et al1 actually suggested possible superiority of WSI over conventional glass microscope slides with respect to Gleason grading. In a study of 96 cases of skin tumors and tumorlike changes examined by 4 pathologists, Nielsen and colleagues10 reported an accuracy of 89.2% for WSI and 92.7% for conventional microscopy when diagnoses were compared with an expert diagnosis. They concluded that there was no significant difference in accuracy between WSI and glass slide review, although it is not clear whether this was a statistical determination. A recent study12 that examined 67 frozen section cases using WSI and a mobile viewing device found 89% accuracy compared with glass slide review. The authors concluded that use of mobile viewing devices to interpret WSI images is feasible for frozen sections, based, in part, on their finding that this accuracy rate compares favorably with those of other studies using WSI for frozen section interpretation.
The level of “concordance” or “agreement” or “accuracy” among diagnoses based on WSI versus glass slides that should be considered acceptable for use of WSI for routine diagnoses remains an open question. For instance, in a study looking at use of WSI for subspecialty consultation, Wilbur and colleagues14 at Massachusetts General Hospital reported an overall concordance rate between WSI and glass slide interpretations of 91%, based on review of 53 consultation-level cases. The authors concluded that WSI-based interpretation is feasible but also cautioned that there is room for improvement because the WSI interpretation was incorrect in 9% of cases. In a study9 of 103 gastrointestinal biopsies, concordances of WSI and glass slides compared with consensus diagnosis were both high, although glass slide concordance was higher (98%) than was WSI concordance (95%).
The results of the aforementioned studies have generally demonstrated a high level of accuracy or concordance and generally concluded that the use of WSI is feasible for the use contemplated in the study; however, no clear basis for deciding what level of concordance is acceptable has emerged. In other words, how is it determined that, say, a 90% concordance rate is acceptable for clinical practice? Our study aims to answer the key question implicit in such a determination by testing whether or not WSI is inferior to glass slide review when the baseline intraobserver variability of glass slide review itself is taken into account.
Although the results of the studies noted above are encouraging, to our knowledge, no previous study performed an a priori power study to estimate the sample size needed to determine noninferiority, and based on our own power study, most previous studies have not included enough cases to statistically demonstrate noninferiority of the WSI diagnosis. For example, one validation study evaluated only 12 cases.5 Other studies have evaluated between 24 and 52 cases,3,6,7 although no specific rationale was provided for selecting that sample size.
A critical component of documenting noninferiority is recognizing baseline intraobserver variability for reviewing microscope slides. Without defining that baseline variability, the measured intraobserver variability of WSI review is of limited value. Our intraobserver major discrepancy rate for glass microscope slide review was 0.99% for cases and 1.72% for the many individual parts that composed our cases. The corresponding minor discrepancy rates were 4.93% and 4.58%. These data should be of help to future investigators who may want to perform their own validation studies of similar design.
A committee of the College of American Pathologists briefly posted for comment draft recommendations for laboratories to consider when validating WSI for primary diagnosis. Among those draft recommendations was a recommended wash-out period of approximately 2 to 3 weeks between microscope slide and WSI diagnosis. Few data are available to support that time interval, although the rationale for limiting the wash-out to several weeks rather than 1 year is that, even among experienced pathologists, diagnostic criteria may evolve over the course of a year. We initiated our study before the draft recommendations were available. One of our major discrepancies was a case in which a hyperplastic colonic polyp was rediagnosed as a sessile serrated polyp on WSI review. Undoubtedly, our criteria for diagnosing sessile serrated polyp have evolved since the first diagnosis, supporting the premise behind the College of American Pathologists recommendation.
One of our major WSI discrepancies was a case in which reflux esophagitis, correctly diagnosed in 2009, was interpreted as “eosinophilic esophagitis” from the digital image. The color balance on the digital image may have rendered the granules in the neutrophils a slightly different color than is customarily seen in routine microscope slides. On the other hand, this study, as well as other experiences with WSI, has suggested to us that certain types of cases may be better reviewed from scans obtained at an original magnification of ×40 rather than at ×20. Those cases often involve detecting inflammatory cells, especially neutrophils and/or plasma cells, or microorganisms. We agree with previous investigators13 that correctly classifying some inflammatory conditions may be relatively difficult by WSI and may require higher-magnification scans than are needed to diagnose most malignancies. On the other hand, we interpreted many inflammatory conditions in this study (eg, many cases of “rule out infection” during revision total joint arthroplasty and many cases of gastrointestinal biopsies to evaluate inflammatory bowel disease) at original magnifications of ×20 without discrepancies.
Our study has several limitations. First, we did not include cytopathology, hematopathology, or lymphoid lesions because those cases often require high magnifications and may need numerous immunohistochemical stains. Second, we did not intentionally enrich our study set with difficult cases. Recognizing that some inflammatory conditions in particular may benefit from higher-magnification scans, we anticipate that enriching the study set with difficult cases would increase the intraobserver variability for both microscope slide review and WSI review; our experience to date, however, suggests that a case that is difficult on imaging review is also difficult on microscope slide review. Third, we did not attempt to quantify other aspects of workflow (eg, the time it takes to interpret WSI cases) in this study. Finally, both study pathologists had interpreted several hundred WSI cases with the matched microscope slides before the validation study. That experience helped them recognize, for example, the importance of requesting scans at an original magnification of ×40 in selected cases and helped minimize the potential effect of a learning curve on WSI validation.
Our study had several attributes, including the a priori power calculation to determine sample size and the a priori focus on intended use and patient population. We also established a baseline intraobserver variability rate to which the WSI variability could be compared. Finally, we used an independent referee pathologist to screen diagnoses and identify potential discrepancies, independent subspecialty experts to determine the ultimate correct diagnoses of all potentially discrepant cases, and a professional statistician to compile the results and calculate the statistics. Ours is perhaps the most rigorous WSI validation study published to date, and our results add to the growing body of literature that helps support the safety and efficacy of WSI for primary diagnosis.
We greatly appreciate the support of Imaging Systems Analyst Scott Mackie, CET, and Imaging and Histology Technologists Sol Crisostomo, HTL (ASCP), MT (AMT), and Pamela Suydam, MT/HTL (ASCP); and Subspecialty Expert Pathologists Ana Bennett, MD, Steven Billings, MD, Charles Biscotti, MD, Jennifer Brainard, MD, John Goldblum, MD, Christina Magi-Galluzzi, MD, PhD, and Carmela Tan, MD, without whose help this study would not have been possible.
The authors have no relevant financial interest in the products or companies described in this article.
Presented in part at the Digital Pathology Association meeting; October 31, 2011; San Diego, California.