Context.—Light microscopy (LM) is considered the reference standard for diagnosis in pathology. Whole slide imaging (WSI) generates digital images of cellular and tissue samples and offers multiple advantages compared with LM. Currently, WSI is not widely used for primary diagnosis. The lack of evidence regarding concordance between diagnoses rendered by WSI and LM is a significant barrier to both regulatory approval and uptake.

Objective.—To examine the published literature on the concordance of pathologic diagnoses rendered by WSI compared with those rendered by LM.

Data Sources.—We conducted a systematic review of studies assessing the concordance of pathologic diagnoses rendered by WSI and LM. Studies were identified following a systematic search of Medline (Medline Industries, Mundelein, Illinois), Medline in progress (Medline Industries), EMBASE (Elsevier, Amsterdam, the Netherlands), and the Cochrane Library (Wiley, London, England), between 1999 and March 2015.

Conclusions.—Thirty-eight studies were included in the review. The mean diagnostic concordance of WSI and LM, weighted by the number of cases per study, was 92.4%. The weighted mean κ coefficient between WSI and LM was 0.75, signifying substantial agreement. Of the 30 studies quoting percentage concordance, 18 (60%) showed a concordance of 90% or greater, of which 10 (33%) showed a concordance of 95% or greater. This review found evidence to support a high level of diagnostic concordance. However, there were few studies, many were small, and they varied in quality, suggesting that further validation studies are still needed.

Traditionally, light microscopy (LM) has been the reference method used in anatomic pathology in the diagnosis of many diseases. However, advances in digital imaging hardware and software have led to the development of whole slide imaging (WSI) devices.1 These devices allow the digital capture, analysis, storage, sharing, and viewing of whole slide pathology images. Digital pathology emerged as a practical technology in the 1980s with the development of telepathology.2 Initially, telepathology existed in 2 forms: static-image telepathology and dynamic robotic telepathology. Static-image telepathology involved the transmission of preselected regions of microscopy images. Dynamic robotic telepathology enabled real-time alteration of the viewed image by granting a remote user control of the microscope. Since these 2 initial telepathology technologies, more than 12 types of telepathology systems have evolved.3 Whole slide imaging is the latest technology used in digital pathology.

Whole slide imaging was first developed in the 1990s,2,4,5  generating fully colored digital images of entire glass slides at resolutions of less than 0.5 μm/pixel, comparable to a light microscope.6  Whole slide imaging uses a high-resolution camera, coupled with 1 or more high-quality microscope objectives to capture images of adjacent areas from glass slides, either as tiles or stripes. Specialized software then combines these individual images to generate a single whole slide image. Whole slide images can be viewed and analyzed digitally on a computer screen.7 
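
The tiling step described above can be illustrated concretely. Below is a minimal Python sketch of how equally sized tiles captured on a regular grid might be assembled into a single mosaic; it assumes no tile overlap, whereas production scanners must also align tiles and blend seams, and all names here are illustrative rather than drawn from any specific WSI system.

```python
import numpy as np

def stitch_tiles(tiles, grid_rows, grid_cols):
    """Assemble a whole slide image from a row-major list of equally
    sized RGB tiles captured on a regular grid (no overlap assumed)."""
    tile_h, tile_w, channels = tiles[0].shape
    mosaic = np.zeros((grid_rows * tile_h, grid_cols * tile_w, channels),
                      dtype=tiles[0].dtype)
    for idx, tile in enumerate(tiles):
        r, c = divmod(idx, grid_cols)  # row-major position on the grid
        mosaic[r * tile_h:(r + 1) * tile_h,
               c * tile_w:(c + 1) * tile_w] = tile
    return mosaic

# Example: a 2 x 3 grid of 512 x 512 RGB tiles -> one 1024 x 1536 image.
tiles = [np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)
         for _ in range(6)]
slide = stitch_tiles(tiles, grid_rows=2, grid_cols=3)
print(slide.shape)  # (1024, 1536, 3)
```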

In comparison to LM, WSI offers numerous advantages. Several pathologists in different locations are able to independently view and assess a slide simultaneously, and individual pathologists can examine multiple slides, allowing side-by-side comparisons of different magnifications of the same case.8–10 Slides can be annotated and subjected to standardized image-analysis software.7 The whole slide images generated are stored and shared virtually, decreasing the time taken to render second opinions and preventing slide degradation and physical damage.11

At present, WSI is routinely used in both undergraduate and postgraduate education and research.11–13 Despite increasing use of WSI in Europe and North America in secondary diagnosis, its use in primary clinical diagnosis remains limited. The authors are aware of a handful of projects worldwide in which WSI is used routinely in primary diagnosis (Sweden; Toronto, Ontario, Canada). Barriers to its implementation include low acceptability of digital pathology among pathologists and the costs associated with implementation.11,14,15 Another current, significant barrier, however, is the lack of evidence validating the diagnostic concordance between WSI and LM. Although certain WSI devices have been permitted for use in primary diagnosis in the European Union and Canada,16 approval from regulatory bodies, such as the US Food and Drug Administration (FDA), for the use of WSI in diagnosis has not yet been granted. Whole slide imaging systems are currently categorized as class III (highest risk) medical devices by the FDA, meaning they need premarket approval before the FDA permits their sale for clinical use17,18; currently, no devices have been granted that approval. However, the Digital Pathology Association (Indianapolis, Indiana) has recently started to encourage vendors to submit de novo applications for WSI devices to be considered as class II devices.19 Furthermore, pathologists cite the lack of evidence regarding the equivalence of WSI and LM as a reason for nonadoption.

To date, few studies have compared the diagnostic concordance of WSI and traditional LM. In 2012, Lindsköld et al20 undertook a systematic review of these studies for a health technology assessment. They reported intraobserver diagnostic agreement (variation in diagnoses within the same individual) ranging from 61% to 100%, with a Cohen κ coefficient range of 0.55 to 0.81. Interobserver diagnostic agreement (variation in diagnoses among different individuals) ranged from 70% to 100%, with a Cohen κ coefficient ranging from 0.28 to 0.42. The study concluded that diagnostic disagreements were associated with differences of minor clinical importance but that the quality of the evidence was low. The wide eligibility criteria used focused only on study design, language, and date of publication. No consideration was given to study factors, such as case type, slide type, and study participants. The resulting heterogeneity among the included studies may have contributed to the large range of diagnostic agreement found.

Addressing the need to validate WSI, the College of American Pathologists (CAP) produced 12 guideline statements in 2013 for studies wishing to validate the diagnostic concordance of WSI and LM.21  Their guidelines included specific recommendations for validation studies: at least 60 routine cases per application, training in WSI for participants, and, for intraobserver studies, a washout period of at least 2 weeks between viewing the slide sets in each condition. Development of the CAP guidelines; advances in WSI scanners, software, and hardware; and an increased push toward digital technology in health care indicate a need for an updated, systematic review of the diagnostic concordance of WSI and LM.

The primary aim of this systematic review was to examine the published literature on the concordance of pathologic diagnoses rendered by WSI compared with those rendered by LM. Secondary outcome measures, including time to diagnosis and diagnostic confidence, were examined where possible.

The review was registered with the PROSPERO database (registration number: CRD42015017859; Centre for Reviews and Dissemination, University of York, Heslington, York, England), the international prospective register of systematic reviews.22 The review protocol can be accessed online.23

Search Strategy

An electronic search was carried out on the following databases: Medline (Medline Industries, Mundelein, Illinois), Medline in progress (Medline Industries), EMBASE (Elsevier, Amsterdam, the Netherlands), and the Cochrane Library (Wiley, London, England), covering 1999 to March 2015. A search of clinicaltrials.gov (US National Institutes of Health, Bethesda, Maryland) was performed to identify any ongoing studies. The ProQuest (ProQuest, Ann Arbor, Michigan) and the Health Management Information Consortium (London School of Hygiene & Tropical Medicine, London, England) databases were also searched to identify any relevant grey literature in an attempt to minimize study bias. Included studies underwent manual reference searching and citation tracking through PubMed (National Center for Biotechnology Information, US National Library of Medicine, Bethesda, Maryland) and Google Scholar (Google Inc, Mountain View, California). Corresponding authors of included studies (n = 33; 92%) were contacted, where possible, to identify any subsequent or ongoing research. A detailed breakdown of the search strategy is provided in the supplemental digital content available at www.archivesofpathology.org in the January 2017 table of contents.

Article Screening

Two reviewers (E.G. and D.T.) independently subjected the abstracts of articles to the screening algorithm shown in Figure 1. In cases of disagreement, a third independent reviewer was consulted. Full texts of all articles that fulfilled the initial screening algorithm were retrieved and reviewed.

Figure 1.

Article-screening algorithm. Two reviewers independently screened 1155 abstracts using this screening algorithm. The number of studies rejected at each stage is shown parenthetically. Of the 56 abstracts that met the screening algorithm, 10 were presentation abstracts only. Full texts of the remaining 46 studies were retrieved and reviewed. Abbreviation: H&E, hematoxylin–eosin.


Data Extraction

A standardized data-extraction protocol was applied to all included studies. The protocol was developed from the Cochrane Effective Practice and Organisation of Care template. Data were extracted from studies by the primary researcher (E.G.), and the extracted data were reviewed independently by a second reviewer (D.T.). For studies reporting results as κ statistics, the Landis and Koch classification24 was used to interpret κ values: no agreement to slight agreement (≤0.20), fair agreement (0.21–0.40), moderate agreement (0.41–0.60), substantial agreement (0.61–0.80), and excellent agreement (0.81–1.00). Studies were classified according to organ system and study design (determined as shown in Figure 2).
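
To make the agreement statistics concrete: Cohen's κ corrects the observed agreement (p_o) for the agreement expected by chance (p_e) from each rater's marginal frequencies, so that κ = (p_o − p_e)/(1 − p_e). The Python sketch below computes κ for two paired lists of categorical diagnoses and maps the result to the Landis and Koch categories used in this review; the function names and example data are illustrative only.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two paired lists of categorical diagnoses:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from marginal rates."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    return (p_o - p_e) / (1 - p_e)

def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch agreement category."""
    if kappa <= 0.20: return "none to slight"
    if kappa <= 0.40: return "fair"
    if kappa <= 0.60: return "moderate"
    if kappa <= 0.80: return "substantial"
    return "excellent"

wsi = ["benign", "malignant", "benign", "atypia", "malignant"]
lm  = ["benign", "malignant", "benign", "benign", "malignant"]
k = cohens_kappa(wsi, lm)
print(f"kappa = {k:.2f} ({landis_koch(k)} agreement)")
```

In this toy example the observed agreement is 80% but chance agreement is 40%, giving κ ≈ 0.67 (substantial agreement), which illustrates why κ is reported alongside raw percentage concordance.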

Figure 2.

Study design classification based on diagnostic comparisons. A, Retrospective retrieval and review using whole slide imaging (WSI). B, Retrospective retrieval and review using light microscopy (LM). C, Prospective, comparative review of WSI and LM. D, Prospective, comparative review using WSI. E, Prospective, comparative review using LM. Studies that performed more than one of the above comparisons were classified as crossover studies. Comparisons C and D were combined for this review and termed a prospective comparative review.


Quality Assessment

The methodological quality of all included studies was assessed by 2 independent reviewers (E.G. and D.T.) using the updated Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2; University of Bristol, Bristol, England) tool, as recommended by the Cochrane Collaboration.25,26 Two signaling questions were omitted because they were not relevant to WSI, and 3 additional signaling questions were added to the tool. The modified QUADAS-2 tool used is shown in Table 1—"Modified Version of the QUADAS-2 Quality Assessment Tool"—with the additional signaling questions marked.25 Whole slide imaging was considered the index test, and LM was the reference standard. Studies that did not provide participants with the corresponding clinical information for cases, that involved users not trained in WSI, or that used alternate hardware, such as iPads (Apple, Cupertino, California), were considered to have both a high risk of bias and a high applicability concern for both the index test and the reference standard. The CAP guidelines recommend a 2-week minimum washout period between slide views.21 Therefore, a minimum 2-week interval between the index test and the reference standard was considered appropriate for the flow and timing domain.

Table 1.

Modified QUADAS-2 Quality Assessment Tool


Quantitative Synthesis

The studies identified in this review demonstrated high levels of heterogeneity in organ systems; study designs; WSI scanners, hardware, and software; index test conditions; and outcome measures. Therefore, statistical meta-analysis was not justified.27  A narrative review of the studies is provided.

In total, 1155 studies were identified. Of those, 1127 (98%) were sourced from electronic databases, 12 (1%) were from a grey literature search, 5 (<1%) were from citation tracking, and 11 (<1%) were from manual reference searching. Two additional studies were obtained from contacted authors. Of the 33 authors contacted, 11 (33%) responded. Of the 1155 studies, 56 (5%) were identified as potentially relevant after an initial abstract screen, and the full text of those articles was sought. The number of studies excluded at each stage is shown in Figure 3. Ten (18%) of the included 56 studies were presentation abstracts only and were subsequently excluded. Of the remaining 46 studies, 10 (22%) were excluded after review of the full text. Two included articles (6%) were each felt to incorporate 2 distinctly different studies.28,29 Outcomes for each of those studies were recorded separately. In total, 38 studies were included in the review.6,7,18,28–60 The study selection process is shown in Figure 3. Interreviewer agreement for article screening was excellent, with a Cohen κ coefficient of 0.90 and a 95% CI of 0.84–0.95.

Figure 3.

Flow diagram of the study-selection process. Following abstract screening, 56 studies were identified, 10 of which (18%) were presentation abstracts only and were subsequently excluded. Full-text PDFs of the remaining 46 studies were reviewed against the eligibility criteria, at which stage 10 (22%) were excluded. A qualitative synthesis was performed on 38 separate studies.* Because of the high degree of heterogeneity among studies, no quantitative synthesis was conducted. Flow diagram reprinted from Moher D et al,62 Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097. PLoS Med is an open-access journal distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium. *In total, 36 published studies were included; however, 2 studies were felt to incorporate multiple studies and were subsequently each split into 2 separate studies for the purposes of this review.


Study Characteristics

The 38 included studies consisted of 6 crossover studies (16%), 19 prospective comparative reviews (50%), and 13 retrospective retrieval and review studies (34%). The mean (SD) number of cases within the included studies was 140 (140). Sixteen studies (42%) used participants trained in using WSI systems. Washout periods between comparisons ranged from none to more than 12 months. Eight WSI scanner manufacturers were represented in the studies, with Aperio (Aperio, Vista, California) scanners used in the majority of studies (n = 23; 61%). Interobserver agreement was measured in 6 studies (16%), whereas 32 studies (84%) measured intraobserver agreement. The most commonly studied individual organ system was the gastrointestinal system (n = 7; 18%). Ten studies (26%) examined a mix of 2 or more distinct organ systems. A detailed breakdown of individual study characteristics can be found in the supplemental digital content.

Quality Assessment

A tabulated display of the quality-assessment results for individual studies is shown in Table 2. Graphic depictions of the quality-assessment results by assessment domain for risk of bias and applicability concerns are shown in Figure 4, A and B.

Table 2.

QUADAS-2 Assessment of Individual Studies

Figure 4.

Graphic display of QUADAS-2 quality assessment. A, Percentage of the reviewed studies in each risk-of-bias category, by domain. B, Percentage of the reviewed studies in each applicability-concern category, by domain. Graphs adapted from the QUADAS-2 resource page.69


Risk of Bias

Across the 4 domains (patient selection, index test, reference standard, and flow and timing), the percentage of studies with a high risk of bias ranged from 11% (n = 4) to 16% (n = 6) (Figure 4, A). The percentage of studies with a low risk of bias ranged from 32% (n = 12) to 74% (n = 28). The index test domain showed the highest risk of bias, with 16% of studies (n = 6) having a high risk. For the same domain, an unclear risk of bias was found in 53% of the studies (n = 20). The flow and timing domain had the lowest risk of bias (74% [n = 28] in the low-risk category).

Applicability

The patient-selection domain caused the least concern regarding applicability, with 100% of studies (n = 38) classified as low concern (Figure 4, B). The greatest concern was for applicability of the index test domain, with 18% of studies (n = 7) classified as high concern and only 32% (n = 12) classified as low concern. Applicability of studies in the reference standard domain was reasonable (61% [n = 23] low concern).

Diagnostic Concordance

Diagnostic concordance was reported as percentage concordance (n = 25; 66%), κ agreement (n = 8; 21%), or both (n = 5; 13%). Reported intraobserver diagnostic concordance ranged from 63% to 100% (κ coefficient range, 0.48–0.87). Reported interobserver diagnostic concordance ranged from 84% to 100%; a single interobserver κ coefficient of 0.91 was reported. To estimate overall concordance corrected for study size, the reported concordance values were weighted by the number of cases per study. Across all studies, the weighted mean percentage diagnostic concordance was 92.4%, and the weighted mean κ agreement was 0.75 (substantial agreement). Concordance across retrospective retrieval and review studies, prospective comparative review studies, and crossover studies was 92.9%, 92.4%, and 91.2%, respectively. Crossover studies and retrospective retrieval and review studies showed excellent agreement for calculated mean (SD) κ coefficients (0.87 [0] and 0.81 [0], respectively). Prospective comparative review studies showed substantial agreement (0.68 [0.16]).
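
The weighting used here is a straightforward case-weighted mean: each study's concordance contributes in proportion to its number of cases, that is, the sum of (cases × concordance) divided by the total number of cases. A minimal Python sketch with hypothetical study values follows; the numbers are illustrative, not taken from the included studies.

```python
def weighted_mean(values, weights):
    """Mean of per-study values weighted by the number of cases per study."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical example: three studies with differing case loads.
concordance = [96.0, 88.0, 93.0]   # percentage concordance per study
cases       = [300, 60, 100]       # number of cases per study
print(f"{weighted_mean(concordance, cases):.1f}%")  # 94.3%
```

Weighting in this way keeps a 60-case study from pulling the summary estimate as strongly as a 300-case study with the same reported concordance.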

Of the 30 studies (79%) that provided percentage of diagnostic concordance measurements, 18 (60%) reported a concordance of 90% or greater, and 10 of these (56%) showed a concordance rate of 95% or greater. Six studies (20%) reported a concordance of less than 85%. The weighted mean percentage diagnostic concordance of study LM diagnosis with original LM diagnosis was 93.4% across the 10 studies that measured it. Whole slide imaging and LM concordance across the same 10 studies was 90.9%. Graphic representations of the percentage of diagnostic concordance against study-design factors are shown in Figure 5, A through F.

Figure 5.

Graphs showing the relationship between reported concordance and study-design factors. A, A scatter plot of the total number of cases per study compared with the percentage of diagnostic concordance. B, A distribution dot plot of study design against the percentage of diagnostic concordance. C, A distribution dot plot showing whether study participants were trained in using whole slide imaging (WSI) compared with the percentage of diagnostic concordance. D, A distribution dot plot of organ system assessed compared with the percentage of diagnostic concordance. E, A distribution dot plot comparing study date to the percentage of diagnostic concordance. F, A distribution dot plot comparing the length of the washout period between views of slides to the percentage of diagnostic concordance. All graphs were created using StataIC 13 (Stata Statistical Software, Release 13; StataCorp, College Station, Texas, 2013). Abbreviation: NA, not applicable.


The percentage of concordance range (PCR) among the 10 studies with a mixed case load was 75% to 97%.* The PCR for studies of the gastrointestinal system was 70.0% to 99%.29,31,43,53,55,57,59 Table 3 displays the PCRs for each organ system included.

Table 3.

Percentage of Concordance Range (PCR) by Organ System


Time to Diagnosis

Of the 4 studies that reported time to diagnosis, the 3 (75%) that compared WSI and LM times to diagnosis all found a longer time to diagnosis using WSI.38,43,45,58 Gui et al43 found an average WSI review time of 91.9 seconds (95% CI, 65.3–118.5 seconds) compared with an average LM review time of 57.1 seconds (95% CI, 55.9–72.3 seconds). Velez et al58 compared the viewing time per slide of 2 WSI viewing systems with LM, reporting mean times to diagnosis of 38 seconds and 34 seconds for the 2 WSI modalities and 23 seconds for LM. Jen et al45 reported the average time spent examining digital slides to be 1.4 times that spent on glass slides (P < .03).

Diagnostic Confidence

Of the 2 studies that reported diagnostic confidence (5%),41,46 only Jukic et al46 compared WSI diagnostic confidence with LM diagnostic confidence. Mean diagnostic confidence in WSI and LM was calculated at 4.0 of 5.0 (95% CI, 2.9–5.1) and 4.1 of 5.0 (95% CI, 4.0–4.2), respectively.

Systematic reviews form the cornerstone of evidence-based medicine, sitting at the top of the study-design hierarchy along with meta-analyses.61 Although many studies have been published on the diagnostic concordance of WSI and LM, there has been no systematic assimilation of those studies apart from the Lindsköld20 health technology assessment. This review was structured according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines.62 The review was conducted by a team experienced in systematic reviews, pathology, and digital-pathology research. The primary researcher (E.G.) undertook a formal systematic-review training course before conducting the review.

This systematic review identified 1155 studies, 36 of which (3%) were included in the review. Two of the included studies seemed to incorporate multiple studies and were each viewed as 2 separate studies for the purposes of this review, resulting in a total of 38 included studies. For those 38 studies, the mean diagnostic concordance between diagnoses rendered by WSI and those rendered by LM was 92.4%. The mean κ agreement was 0.75 (substantial agreement). A mean diagnostic concordance of 93.4% was found among the 10 studies (26%) that compared prospective and retrospective diagnoses using LM. Few studies reported time to diagnosis or diagnostic confidence (n = 4 [11%] and n = 2 [5%], respectively). Where measured, time to diagnosis was longer, and diagnostic confidence was lower, using WSI compared with LM.

These results complement the 2012 Lindsköld health technology assessment review, particularly for the intraobserver agreement ranges found.20 The interobserver PCR found in this review was narrower than the corresponding range reported in the Lindsköld review (84%–100% versus 70%–100%), which may be due to the stricter eligibility criteria used in this review, reducing the level of heterogeneity among included studies. The Lindsköld review used the quantifiable GRADE (Grading of Recommendations Assessment, Development and Evaluation) system to assess the quality of studies.63 Unlike the GRADE system, the QUADAS-2 tool does not provide quantitative measures; therefore, we cannot determine whether the studies included in this review were of higher, equal, or lower quality. The QUADAS-2 assessment is, however, a more in-depth assessment of study quality and has subsequently been recommended by the Cochrane Collaboration.64

For WSI to be validated for use in routine diagnostic work, diagnostic concordance does not need to be 100%. It does, however, need to be established as noninferior to LM. Unfortunately, few studies have investigated the intraobserver concordance between LM diagnoses.36 This suggests that the most appropriate study design for the validation of WSI is a crossover study, which facilitates the direct comparison of the intraobserver concordance for LM with the intraobserver diagnostic concordance of WSI and LM. However, a nonsignificant difference between the concordance means does not necessarily imply equivalence. An adequate study size, determined by a noninferiority power calculation, is required to demonstrate that diagnoses rendered by WSI are not inferior in concordance to those rendered by LM.36

The secondary outcomes measured in this review complement the findings in the existing literature. A slower time to diagnosis with WSI has been reported previously.65,66 The few studies that reported time to diagnosis all showed an increased time to diagnosis when using WSI compared with LM. An increased time to diagnosis highlights the inefficiency currently associated with WSI, which is a particular concern in financially pressured health care systems. Increased time to diagnosis and reduced diagnostic confidence are likely to reduce the acceptability of WSI among pathologists, already a barrier to WSI implementation. However, the development of image-analysis software and improvements in workflow with digital systems may decrease time to diagnosis in the future. Increased pathologist experience with WSI devices is also likely to decrease time to diagnosis and increase diagnostic confidence.

The quality and size of studies appeared to increase over time. This could be related to the increasing guidance published by authorities, such as the CAP and the FDA, and to the increasing adoption and understanding of WSI. Because of the strict eligibility criteria used in this review, which included only studies that used hematoxylin-eosin–stained human tissue, patient selection was classified as low concern for applicability in all studies. The risk of bias and applicability concerns of the index test were affected by studies that did not train participants in using WSI and by studies that did not provide participants with the corresponding clinical information. Figure 5, C, shows the effect of training in WSI on diagnostic concordance. In some cases, less-applicable technology was also used for the index test. For example, Brunelli et al38 used an iPad to view digital slides. This resulted in greater concern about applicability because, in routine diagnosis in a hospital setting, a workstation with a specialized high-resolution monitor would most likely be used.

Seventy-four percent of studies (n = 28) had a low risk of bias in the flow and timing domain. The main variable affecting this domain was the washout period used between slide views. This review considered a washout period longer than 2 weeks to be appropriate, based on the recommendations in the 2013 CAP guidelines.21  The 2-week washout period is intended to minimize recall bias between slide views. In 2013, there was a shortage of evidence in this area.21  However, Campbell et al67  have since found recall bias between slide views to occur in washout periods of up to 4 weeks.

Because of the limited timescale of this review, its scope included published articles only. The vendors of WSI systems have conducted their own validation studies for submission to regulatory authorities, such as the FDA or Conformité Européenne. To date, vendors have neither published nor publicly released these internal data. However, follow-up work is planned that will include vendor data, where available. Twenty-nine percent of studies (n = 11) included in this review used fewer than 60 cases. This review endeavored to minimize the effect of such small studies by calculating means weighted according to the number of cases per study. Future reviews may wish to include a minimum number of cases as part of the eligibility criteria. Alternatively, the number of cases could be used in the quality assessment of included studies. In addition, future reviews may also wish to examine whether concordance was affected by differing definitions of concordance and differing case complexity among the included studies. By including only published journal articles, this review did not take into account the potential for publication bias. However, the grey literature search performed was intended to minimize the effect of such publication bias.

This review provides limited evidence of diagnostic concordance between WSI and LM. However, this finding is predominantly based on small studies with heterogeneous study designs, participants, and case types. Larger studies would provide greater confidence in the measured outcomes. Bauer et al36 conducted an a priori power calculation to estimate the number of cases needed to test the hypothesis that diagnoses rendered with WSI are not inferior to those rendered with LM. They found that 450 cases (225 glass slides and 225 whole slide images) were needed to establish WSI noninferiority within a 4% margin at a significance level of .05. This number of cases is noticeably different from the minimum of 60 cases recommended by the 2013 CAP guidelines.21 Interestingly, only 1 of the 11 studies (9%) that had fewer than 60 cases showed a concordance percentage of 90% or greater.60 However, at present, it is not clear whether such sample-size calculations should determine the number of cases, the number of slides, or the number of pathologists.
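
For orientation, the Python sketch below shows a generic noninferiority sample-size calculation for two independent proportions. It is an assumption-laden simplification (one-sided α of .05, 80% power, an expected concordance of 95% in both arms, and independent groups), not a reconstruction of the Bauer et al36 calculation, whose paired design and other parameters yield a different figure; it nonetheless illustrates why such designs require hundreds, rather than dozens, of cases.

```python
from math import ceil
from statistics import NormalDist

def noninferiority_n(p, margin, alpha=0.05, power=0.80):
    """Approximate per-arm sample size to show that one proportion is
    no more than `margin` below another, assuming both arms share the
    expected proportion p (independent-groups normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided test
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * p * (1 - p) * (z_alpha + z_beta) ** 2 / margin ** 2)

# Expected concordance of 95%, noninferiority margin of 4 percentage points.
print(noninferiority_n(p=0.95, margin=0.04))  # several hundred cases per arm
```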

There is a need for further work on the effect of the duration of the washout period on recall bias. In a 2015 study, Campbell et al67 found that, after 2 weeks—the CAP-recommended washout period—pathologists were capable of recalling 40% of cases previously seen. Even after 4 weeks, pathologists were able to recall up to 31% of cases previously seen. The CAP guidelines acknowledged, at the time of publication, the lack of studies comparing the effect of the duration of the washout period on outcomes. The FDA recommends that sufficient time be allowed between intraoperator reviews of the same imaging to reduce recall bias, but it does not provide a quantifiable definition of "sufficient."68

Although this review provides evidence to support the diagnostic concordance of WSI and LM in routine diagnoses, it also highlights the significant heterogeneity among validation study designs. This is unfortunate because, if the included studies had been of sufficient size, quality, and homogeneity, they would have enabled us to perform a meta-analysis on more than 5000 cases, significantly enhancing the quality of evidence available to pathologists and regulators about the validation of WSI. One method of reducing heterogeneity among future validation studies is for such studies to consult and adhere to the 2013 CAP guidelines.21 

At present, however, there is a lack of available evidence to validate the use of WSI in routine primary diagnosis. Regulators, industry, health care providers, and the academic community are all interested in the digitization of pathology, so future validation studies are inevitable, and many are currently ongoing. By demonstrating the types of study designs available, this review may help in the design of future validation studies. In addition, this review highlights weaknesses present in previous validation studies, which future studies could avoid.

Advice on modifying the QUADAS-2 tool was provided by Penny Whiting. We thank the following individuals for their responses: W. Scott Campbell; Liron Pantanowitz; László Fónyad; Peter Furness; Anna R. Graham; Joseph P. Houghton; Bela Molnar; Ellen Mooney; Emily Shaw; Thomas P. Buck; and Carolina Reyes. Karen Lee assisted in the organization of the project.

1. Ghaznavi F, Evans A, Madabhushi A, Feldman M. Digital imaging in pathology: whole-slide imaging and beyond. Annu Rev Pathol. 2013;8:331–359.
2. Weinstein RS, Graham AR, Lian F, et al. Reconciliation of diverse telepathology system designs. Historic issues and implications for emerging markets and new applications. APMIS. 2012;120(4):256–275.
3. Weinstein RS, Descour MR, Liang C, et al. Telepathology overview: from concept to implementation. Hum Pathol. 2001;32(12):1283–1299.
4. Al-Janabi S, Huisman A, Van Diest PJ. Digital pathology: current status and future perspectives. Histopathology. 2012;61(1):1–9.
5. Gilbertson J, Yagi Y. Histology, imaging and new diagnostic work-flows in pathology. Diagn Pathol. 2008;3(suppl 1):S14.
6. Ho J, Parwani AV, Jukic DM, Yagi Y, Anthony L, Gilbertson JR. Use of whole slide imaging in surgical pathology quality assurance: design and pilot validation studies. Hum Pathol. 2006;37(3):322–331.
7. Gilbertson JR, Ho J, Anthony L, Jukic DM, Yagi Y, Parwani AV. Primary histologic diagnosis using automated whole slide imaging: a validation study. BMC Clin Pathol. 2006;6:4.
8. Rojo MG, Garcia GB, Mateos CP, Garcia JG, Vicente MC. Critical comparison of 31 commercially available digital slide systems in pathology. Int J Surg Pathol. 2006;14(4):285–305.
9. Hedvat CV. Digital microscopy: past, present, and future. Arch Pathol Lab Med. 2010;134(11):1666–1670.
10. Taylor CR. From microscopy to whole slide digital images: a century and a half of image analysis. Appl Immunohistochem Mol Morphol. 2011;19(6):491–493.
11. Huisman A, Looijen A, van den Brink SM, van Diest PJ. Creation of a fully digital pathology slide archive by high-volume tissue slide scanning. Hum Pathol. 2010;41(5):751–757.
12. Ayad E, Yagi Y. Virtual microscopy beyond the pyramids, applications of WSI in Cairo University for e-education & telepathology. Anal Cell Pathol (Amst). 2011;35(2):93–95.
13. Pantanowitz L, Szymas J, Yagi Y, Wilbur D. Whole slide imaging for educational purposes. J Pathol Inform. 2012;3:46.
14. Romero Lauro G, Cable W, Lesniak A, et al. Digital pathology consultations-a new era in digital imaging, challenges and practical applications. J Digit Imaging. 2013;26(4):668–677.
15. Brick KE, Sluzevich JC, Cappel MA, DiCaudo DJ, Comfere NI, Wieland CN. Comparison of virtual microscopy and glass slide microscopy among dermatology residents during a simulated in-training examination. J Cutan Pathol. 2013;40(9):807–811.
16. Hanna MG, Pantanowitz L, Evans AJ. Overview of contemporary guidelines in digital pathology: what is available in 2015 and what still needs to be addressed? J Clin Pathol. 2015;68(7):499–505.
17. Faison T. FDA regulation of whole slide imaging (WSI) devices: current thoughts. 2015.
18. Al-Janabi S, Huisman A, Jonges GN, ten Kate FJW, Goldschmeding R, van Diest PJ. Whole slide images for primary diagnostics of urinary system pathology: a feasibility study. J Renal Inj Prev. 2014;3(4):91–96.
19. Digital Pathology Association. DPA recommends whole slide imaging manufacturers submit de novo applications to the FDA for primary diagnosis in the United States. 2016.
20. Lindsköld L, Samuelsson B, Carlberg I, et al. Diagnostic agreement of digital whole slide imaging and routine light microscopy. Göteborg, Sweden: Regional Health Technology Assessment Centre (HTA-centrum); 2012. HTA-rapport 2012:54.
21. Pantanowitz L, Sinard JH, Henricks WH, et al. Validating whole slide imaging for diagnostic purposes in pathology: guideline from the College of American Pathologists Pathology and Laboratory Quality Center. Arch Pathol Lab Med. 2013;137(12):1710–1722.
22. University of York, Centre for Reviews and Dissemination. About PROSPERO. 2015.
23. Goacher E, Randell R, Treanor D. The diagnostic accuracy of digital microscopy: a systematic review protocol. PROSPERO: international prospective register of systematic reviews. 2015: CRD42015017859.
24. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–174.
25. Whiting PF, Rutjes AW, Westwood ME, et al; QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–536.
26. Reitsma JB, Rutjes AWS, Whiting P, Vlassov VV, Leeflang MMG, Deeks JJ. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy. Version 1.0.0. London, England: Cochrane Collaboration; 2009.
27. Ryan R; Cochrane Consumers and Communication Review Group. Cochrane Consumers and Communication Review Group reviews: meta-analysis. 2015.
28. Campbell WS, Hinrichs SH, Lele SM, et al. Whole slide imaging diagnostic concordance with light microscopy for breast needle biopsies. Hum Pathol. 2014;45(8):1713–1721.
29. Buck TP, Dilorio R, Havrilla L, O'Neill DG. Validation of a whole slide imaging system for primary diagnosis in surgical pathology: a community hospital experience. J Pathol Inform. 2014;5(1):43.
30. Al Habeeb A, Evans A, Ghazarian D. Virtual microscopy using whole-slide imaging as an enabler for teledermatopathology: a paired consultant validation study. J Pathol Inform. 2012;3(1):2.
31. Al-Janabi S, Huisman A, Vink A, et al. Whole slide images for primary diagnostics of gastrointestinal tract pathology: a feasibility study. Hum Pathol. 2012;43(5):702–707.
32. Al-Janabi S, Huisman A, Vink A, et al. Whole slide images for primary diagnostics in dermatopathology: a feasibility study. J Clin Pathol. 2012;65(2):152–158.
33. Al-Janabi S, Huisman A, Willems SM, Van Diest PJ. Digital slide images for primary diagnostics in breast pathology: a feasibility study. Hum Pathol. 2012;43(12):2318–2325.
34. Al-Janabi S, Huisman A, Nikkels PG, ten Kate FJ, van Diest PJ. Whole slide images for primary diagnostics of paediatric pathology specimens: a feasibility study. J Clin Pathol. 2013;66(3):218–223.
35. Arnold MA, Chenever E, Baker PB, et al. The College of American Pathologists guidelines for whole slide imaging validation are feasible for pediatric pathology: a pediatric pathology practice experience. Pediatr Dev Pathol. 2015;18(2):109–116.
36. Bauer TW, Schoenfield L, Slaw RJ, Yerian L, Sun Z, Henricks WH. Validation of whole slide imaging for primary diagnosis in surgical pathology. Arch Pathol Lab Med. 2013;137(4):518–524.
37. Bauer TW, Slaw RJ. Validating whole-slide imaging for consultation diagnoses in surgical pathology. Arch Pathol Lab Med. 2014;138(11):1459–1465.
38. Brunelli M, Beccari S, Colombari R, et al. iPathology cockpit diagnostic station: validation according to College of American Pathologists Pathology and Laboratory Quality Center recommendation at the Hospital Trust and University of Verona. Diagn Pathol. 2014;9(suppl 1):S12.
39. Campbell WS, Lele SM, West WW, Lazenby AJ, Smith LM, Hinrichs SH. Concordance between whole-slide imaging and light microscopy for routine surgical pathology. Hum Pathol. 2012;43(10):1739–1744.
40. Chargari C, Comperat E, Magné N, et al. Prostate needle biopsy examination by means of virtual microscopy. Pathol Res Pract. 2011;207(6):366–369.
41. Fónyad L, Krenács T, Nagy P, et al. Validation of diagnostic accuracy using digital slides in routine histopathology. Diagn Pathol. 2012;7:35.
42. Gage JC, Joste N, Ronnett BM, et al. A comparison of cervical histopathology variability using whole slide digitized images versus glass slides: experience with a statewide registry. Hum Pathol. 2013;44(11):2542–2548.
43. Gui D, Cortina G, Naini B, et al. Diagnosis of dysplasia in upper gastro-intestinal tract biopsies through digital microscopy. J Pathol Inform. 2012;3:27.
44. Houghton JP, Ervine AJ, Kenny SL, et al. Concordance between digital pathology and light microscopy in general surgical pathology: a pilot study of 100 cases. J Clin Pathol. 2014;67(12):1052–1055.
45. Jen KY, Olson JL, Brodsky S, Zhou XJ, Nadasdy T, Laszik ZG. Reliability of whole slide images as a diagnostic modality for renal allograft biopsies. Hum Pathol. 2013;44(5):888–894.
46. Jukic DM, Drogowski LM, Martina J, Parwani AV. Clinical examination and validation of primary diagnosis in anatomic pathology using whole slide digital images. Arch Pathol Lab Med. 2011;135(3):372–378.
47. Krishnamurthy S, Mathews K, McClure S, et al. Multi-institutional comparison of whole slide digital imaging and optical microscopy for interpretation of hematoxylin-eosin–stained breast tissue sections. Arch Pathol Lab Med. 2013;137(12):1733–1739.
48. Mooney E, Hood AF, Lampros J, Kempf W, Jemec GB. Comparative diagnostic accuracy in virtual dermatopathology. Skin Res Technol. 2011;17(7):251–255.
49. Nielsen PS, Lindebjerg J, Rasmussen J, Starklint H, Waldstrom M, Nielsen B. Virtual microscopy: an evaluation of its validity and diagnostic performance in routine histologic diagnosis of skin tumors. Hum Pathol. 2010;41(12):1770–1776.
50. Ordi J, Castillo P, Saco A, et al. Validation of whole slide imaging in the primary diagnosis of gynaecological pathology in a university hospital. J Clin Pathol. 2015;68(1):33–39.
51. Ozluk Y, Blanco PL, Mengel M, Solez K, Halloran PF, Sis B. Superiority of virtual microscopy versus light microscopy in transplantation pathology. Clin Transplant. 2012;26(2):336–344.
52. Reyes C, Ikpatt OF, Nadji M, Cote RJ. Intra-observer reproducibility of whole slide imaging for the primary diagnosis of breast needle biopsies. J Pathol Inform. 2014;5(1):5.
53. Risio M, Bussolati G, Senore C, et al. Virtual microscopy for histology quality assurance of screen-detected polyps. J Clin Pathol. 2010;63(10):916–920.
54. Rodriguez-Urrego PA, Cronin AM, Al-Ahmadie HA, et al. Interobserver and intraobserver reproducibility in digital and routine microscopic assessment of prostate needle biopsies. Hum Pathol. 2011;42(1):68–74.
55. Sanders DS, Grabsch H, Harrison R, et al; AspECT Trial Management Group and Trial Principal Investigators. Comparing virtual with conventional microscopy for the consensus diagnosis of Barrett's neoplasia in the AspECT Barrett's chemoprevention trial pathology audit. Histopathology. 2012;61(5):795–800.
56. Shaw EC, Hanby AM, Wheeler K, et al. Observer agreement comparing the use of virtual slides with glass slides in the pathology review component of the POSH breast cancer cohort study. J Clin Pathol. 2012;65(5):403–408.
57. van der Post RS, van der Laak JAWM, Sturm B, et al. The evaluation of colon biopsies using virtual microscopy is reliable. Histopathology. 2013;63(1):114–121.
58. Velez N, Jukic D, Ho J. Evaluation of 2 whole-slide imaging applications in dermatopathology. Hum Pathol. 2008;39(9):1341–1349.
59. Wendum D, Lacombe K, Chevallier M, et al. Histological scoring of fibrosis and activity in HIV-chronic hepatitis B related liver disease: performance of the METAVIR score assessed on virtual slides. J Clin Pathol. 2009;62(4):361–363.
60. Wilbur DC, Madi K, Colvin RB, et al. Whole-slide imaging digital pathology as a platform for teleconsultation: a pilot study using paired subspecialist correlations. Arch Pathol Lab Med. 2009;133(12):1949–1953.
61. Whiting P, Rutjes A, Reitsma J, Bossuyt P, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol. 2003;3(1):1–13.
62. Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.
63. BMJ Clinical Evidence. What is GRADE? April 10, 2015.
64. Higgins JPT, Green S. Cochrane Handbook for Systematic Reviews of Interventions. Version 5.1.0. London, England: Cochrane Collaboration; 2013.
65. Randell R, Ruddle RA, Thomas RG, Mello-Thoms C, Treanor D. Diagnosis of major cancer resection specimens with virtual slides: impact of a novel digital pathology workstation. Hum Pathol. 2014;45(10):2101–2106.
66. Treanor D, Quirke P. The virtual slide and conventional microscope—a direct comparison of their diagnostic efficiency. J Pathol. 2007;213(suppl 1):7a.
67. Campbell WS, Talmon GA, Foster KW, Baker JJ, Smith LM, Hinrichs SH. Visual memory effects on intraoperator study design: determining a minimum time gap between case reviews to reduce recall bias. Am J Clin Pathol. 2015;143(3):412–418.
68. US Food and Drug Administration. Guidance for Industry: Developing Medical Imaging Drug and Biological Products. Rockville, MD: US Department of Health and Human Services; 2004.
69. Whiting P. Resources. University of Bristol QUADAS Web site. 2015.

* References 7, 29, 36–39, 41, 44, 46, 60.

Author notes

Supplemental digital content is available at www.archivesofpathology.org in the January 2017 table of contents.

Dr Treanor is on the advisory board of Sectra AB (Linköping, Sweden), and Leica Biosystems (Vista, California), as part of Leica Biosystems Imaging, Inc. He has a collaborative research project with FFEI Ltd (Hemel Hempstead, Hertfordshire, England), and is a coinventor on a digital pathology patent that is being licensed to a digital pathology vendor; he received no personal remuneration for any of these positions. The other authors have no relevant financial interest in the products or companies described in this article.

Competing Interests

Presented at the Joint Pathology Informatics and World Congress on Pathology Informatics; May 6, 2015; Pittsburgh, Pennsylvania.
