Context.—Relatively little is known about the significance and potential impact of glass-digital discordances, yet these discordances are likely to be important when considering digital pathology adoption.
Objective.—To apply an evidence-based approach to collecting and analyzing reported instances of glass-digital discordance from the whole slide imaging validation literature.
Design.—We used our prior systematic review protocol to identify studies assessing the concordance of light microscopy and whole slide imaging between 1999 and 2015. Data were extracted and analyzed by a team of histopathologists to classify the type, significance, and potential root cause of discordances.
Results.—Twenty-three studies were included, yielding 8069 instances of a glass diagnosis being compared with a digital diagnosis. From these 8069 comparisons, 335 instances of discordance (4%) were reported, in which glass was the preferred diagnostic medium in 286 (85%), and digital in 44 (13%), with no consensus in 5 (2%). Twenty-eight discordances had the potential to cause moderate/severe patient harm. Of these, glass was the preferred diagnostic medium for 26 (93%). Of the 335 discordances, 109 (32%) involved the diagnosis or grading of dysplasia. For these cases, glass was the preferred diagnostic medium in 101 cases (93%), suggesting that diagnosis and grading of dysplasia may be a potential pitfall of digital diagnosis. In 32 of 335 cases (10%), discordance on digital was attributed to the inability to find a small diagnostic/prognostic object.
Conclusions.—Systematic analysis of concordance studies reveals specific areas that may be problematic on whole slide imaging. It is important that pathologists are aware of these areas to ensure patient safety.
The capacity to digitally capture, view, analyze, store, and share whole slide pathology images has led to widespread adoption of digital pathology for education and research in the health care sector and institutes of higher education.1–3 Whole slide imaging (WSI) systems are used increasingly in Europe and North America for secondary diagnosis (eg, for second opinions or frozen section cases), but their use for large-scale primary diagnosis remains limited to a few centers internationally, including projects in Sweden, the Netherlands, and Canada.4–6 Implementation barriers include the cost-benefit of digital pathology, current lack of regulatory approval from the US Food and Drug Administration, a paucity of evidence validating the diagnostic concordance of WSI and light microscopy, and low acceptability of digital pathology among pathologists.1,7–11 If digital pathology is to be implemented for primary diagnosis on a large scale, regulatory bodies, both international and national; departmental heads; and individual pathologists will have to be confident that a diagnosis made on a digital microscope is equivalent to a diagnosis made by the same pathologist on a light microscope.
A limited number of studies have compared the diagnostic concordance of WSI and traditional light microscopy. In 2016, Goacher et al12 undertook a systematic review of these studies in which study quality was assessed. They identified 38 studies, and reported a mean diagnostic concordance of WSI and light microscopy, weighted by the number of cases per study, of 92.4%. Of the 30 studies quoting concordance as a percentage, 60% showed a concordance of 90% or greater, of which 10 showed a concordance of 95% or greater. There was a trend for increasing concordance in the more recent studies. The review found evidence to support a high level of diagnostic concordance for WSI overall.
The conclusions of the systematic review can be interpreted as encouraging for a diagnostic department that is considering a primary diagnostic digital adoption; however, if we invert the statistic, 92.4% concordance equates to 7.6% discordance. It can be argued that discordances are more valuable than concordances in analyzing the potential patient safety impact of digital diagnosis, and an evaluation of the type, severity, unexpectedness, and root cause of these discordant diagnoses can allow us to explore safety aspects of digital pathology adoption and identify potential pitfalls—areas of digital diagnostic interpretation that may require more attention or practice before full digital adoption for primary diagnosis. Experience in analyzing error rates and types in telepathology, including that published by Dunn et al,13 has contributed greatly to our understanding of the limitations and strengths of this diagnostic medium, and gathering and analyzing similar data from WSI studies is likely to prove equally beneficial.
The primary aim of this review and analysis was to systematically examine the published literature on discordant pathologic diagnoses rendered by WSI compared with those rendered by light microscopy, and to identify areas that may be problematic to diagnose using digital microscopy.
MATERIALS AND METHODS
A systematic review protocol that had been used in a previous systematic review of WSI concordance was used. The review protocol is registered with the PROSPERO database (registration number CRD42015017859), and can be accessed online.14
An electronic search was conducted on the databases Medline, Medline in Progress, EMBASE, and the Cochrane Library between 1999 and December 2015, using the previously published systematic review methodology.12,14 A search of clinicaltrials.gov (Bethesda, Maryland) was performed to identify any ongoing studies. Included studies underwent manual reference searching and citation tracking through PubMed and Google Scholar. Corresponding authors were contacted, where possible, to identify subsequent or ongoing research.
Two reviewers independently subjected the abstracts of papers to the previously used systematic review screening algorithm.12 In cases of disagreement, a third independent reviewer was consulted. Full texts of all papers that fulfilled the initial screening algorithm were retrieved and reviewed. Only published journal articles were included in the review.
Data Extraction and Analysis
A standardized data extraction protocol was applied to all included studies. The lead researcher extracted pairs of discordant diagnoses (preferred diagnosis with discordant diagnosis) that were stored in a spreadsheet in which the source study and the method of diagnosis (glass or digital) used to render each diagnosis were concealed from the reviewers. A team of 3 discordance reviewers was assembled, all of whom were professional diagnostic pathologists, with 6, 18, and 34 years of pathology experience. The 3 discordance reviewers evaluated each diagnostic pair and assigned it a category based on the Royal College of Pathologists System of Categorization for Discrepancies.38 In this system, discordances are assigned a letter code depending on the type of error (ie, errors in macroscopy, microscopy, clinical correlation, failing to seek a second opinion, misidentification). For this study, the B category, discrepancies in microscopy, was the most relevant. The B category errors were then stratified depending on how unexpected or understandable the error was (Table 1). Next, each reviewer assigned each discordant diagnostic pair a category corresponding to the potential for patient harm to be caused, from the Royal College of Pathologists guide to duty of care reviews (Table 2).39 The spectrum of harm ranges from no clinical impact, no harm, which is categorized as 1, to severe harm, categorized as 5. Dimensions such as delay in diagnosis, unnecessary further diagnostic efforts, delays in therapy, unnecessary therapy, and resultant levels of morbidity or mortality were considered.
All discordances were reviewed independently by the 3 discordance reviewers. For the potential for harm categorizations, the Royal College categories 2 and 3 (minimal harm, no morbidity, and minor harm, minor morbidity) were merged into a single category of minimal/minor harm, and categories 4 and 5 (moderate harm, moderate morbidity, and major harm, major morbidity) were merged into a single category of moderate/major harm. Where reviewers disagreed on the categorization of a diagnostic discordance, cases were discussed and a consensus reached. Expert opinion was sought on the renal transplant biopsy data, as the review team did not feel they had sufficient subspecialty expertise in this area (Dr Carole Angel, MB, ChB, Leeds Teaching Hospitals NHS Trust, written communication, March 2016).
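The harm-band merging described above can be expressed programmatically. The following is a hypothetical sketch of that mapping (the function and dictionary names are ours for illustration, not the authors' code); the category labels follow the Royal College of Pathologists scale quoted in the text.

```python
# RCPath duty-of-care harm categories as quoted in the text (1 = no harm ... 5 = major harm).
RCPATH_HARM = {
    1: "no harm",
    2: "minimal harm, no morbidity",
    3: "minor harm, minor morbidity",
    4: "moderate harm, moderate morbidity",
    5: "major harm, major morbidity",
}

def merged_harm_band(category: int) -> str:
    """Collapse the 5-point RCPath harm scale into the 3 bands used in this analysis."""
    if category == 1:
        return "no harm"
    if category in (2, 3):          # categories 2 and 3 merged
        return "minimal/minor harm"
    if category in (4, 5):          # categories 4 and 5 merged
        return "moderate/major harm"
    raise ValueError(f"unknown RCPath harm category: {category}")
```

For example, `merged_harm_band(3)` and `merged_harm_band(2)` both yield "minimal/minor harm", reflecting the merged reporting used in the Results.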
After all cases had been categorized, the lead researcher reunited the discordant pairs with source study data. For each individual discordance, the source paper was examined to extract data on the preferred diagnostic modality for that particular case; the type of diagnosis required; the type of discordance; any specific diagnostic tasks, objects, or features that would have enabled the pathologist to make the true diagnosis; and detail from the paper of any particular difficulties/observations encountered by the study pathologists. See Table 3 for an example of a discordance analysis.
RESULTS

One thousand three hundred abstracts were checked and 39 full-text papers retrieved. Of these, 23 contained detailed, extractable discordant diagnostic pair data.15–37 Publication dates ranged from 2006 through 2015, with the majority of studies published after 2010. These 23 papers included 8069 instances of a glass diagnosis and a digital diagnosis being compared. Of these 8069 glass-digital read pairs, 335 instances of discordance were recorded (approximately 4%).
The included studies used a range of scanners from 7 different vendors. Viewing hardware varied greatly both within and among studies. Many studies provided little information on viewing hardware/scanners or failed to standardize viewing hardware. The majority of studies scanned slides at a routine magnification of ×20, with more recent publications tending to use ×40, and some varying the scanning magnification depending on the type of case, for example, diagnostic specimens at ×40 and therapeutic specimens at ×20.
The majority of included studies scanned a mixture of cases from a number of histopathology subspecialties. Ten studies included cases from 2 or more distinct organ systems, which we termed a case mix; gastrointestinal and skin were the most common single pathology specialties examined. The majority of recorded discordances occurred in gastrointestinal, skin, genitourinary, and gynecologic cases. Unfortunately, many of the source studies lacked a sufficiently detailed breakdown of the case types included, but cases from these organ systems are likely overrepresented in the source studies; they are certainly all high-throughput specialties.
Severity and Implications of Discordance
Of the 335 reported discordances, glass was the preferred diagnostic modality in 286 cases (85%). Interestingly, the digital diagnosis was preferred in 44 cases (13%), with an equivocal response in the remaining 5 (2%).
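For readers who want to verify the arithmetic, the overall discordance rate and the modality-preference shares follow directly from the counts reported above. This is our own sanity-check sketch, with rounding matching the figures quoted in the text:

```python
# Counts reported in the text (our own sanity-check arithmetic, not the authors' code).
total_comparisons = 8069            # glass-digital read pairs
discordances = 335
glass_preferred, digital_preferred, equivocal = 286, 44, 5

# The three outcomes should account for every discordance.
assert glass_preferred + digital_preferred + equivocal == discordances

print(f"discordance rate:  {discordances / total_comparisons:.1%}")   # "approximately 4%"
print(f"glass preferred:   {glass_preferred / discordances:.0%}")     # 85%
print(f"digital preferred: {digital_preferred / discordances:.0%}")   # 13%
```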
The largest specific category of discordance was missed diagnosis of malignant/dysplastic/atypical conditions, where malignant tissue was given a benign diagnosis. In these cases, glass was the preferred diagnostic modality in 66 of 77 cases (86%). There were also 25 cases where benign tissue was erroneously diagnosed as malignant/atypical. Here glass was the preferred diagnostic modality in 23 cases (92%) (Table 4). The second greatest discordance type (70 cases) was where a case was recognized as malignant/atypical but incorrectly typed or graded. Here again, glass was the preferred diagnostic modality in 67 cases (96%). Discrepancies in the diagnosis of inflammation were also common.
Most discordances (169 cases) fell into the category of B3, areas of appreciable diagnostic difficulty and recognized interobserver variation, such as the difference between 2 adjacent grades of a malignant condition.
In total, 21 B1 discordances were recorded. These would be regarded as surprising errors under the Royal College of Pathologists' System of Categorization for Discrepancies.38 One type E discordance was recorded—a misidentification error, where digital was the preferred diagnostic modality (Table 5).
The majority (242; 72%) of the 335 discordances reported had the potential to cause minimal or minor harm to patients. This represents 3.00% of all glass-digital comparisons (242 of 8069). Only 28 of 335 (8%) had the potential to cause moderate or major harm to patients. This represents 0.35% (28 of 8069) of all glass-digital comparisons. For these, glass was the preferred diagnosis in 26 (93%). Digital was preferred in 2 of 28 cases (7%) with the potential for moderate/major harm (Table 6). Table 7 shows specific instances of major/moderate harm recorded on diagnoses made using digital and conventional glass slides. Instances where the glass slide diagnosis was preferred included benign breast tissue erroneously reported as invasive carcinoma on the digital read and a benign lung biopsy erroneously diagnosed as non–small cell carcinoma. Examples where malignant diagnoses were missed on the digital read of a case included gastric adenocarcinoma called acute gastritis, metastatic melanoma missed in a lymph node, and chronic lymphocytic leukemia missed in a skin biopsy. Digital was the preferred diagnostic modality for 2 cases with the potential for moderate/major harm: a carcinoid tumor that was missed in the glass examination of an appendix, and benign breast tissue erroneously diagnosed as ductal carcinoma in situ on glass.
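The harm-band proportions can be checked in the same way; again, this is our own arithmetic against the counts quoted above, with rounding matching the text:

```python
# Harm-band counts reported in the text (our own sanity-check arithmetic).
total_comparisons = 8069
discordances = 335
minimal_minor = 242      # potential for minimal or minor harm
moderate_major = 28      # potential for moderate or major harm
glass_preferred_mm = 26  # glass preferred among the moderate/major cases

print(f"{minimal_minor / discordances:.0%}")         # 72% of discordances
print(f"{minimal_minor / total_comparisons:.2%}")    # 3.00% of all comparisons
print(f"{moderate_major / total_comparisons:.2%}")   # 0.35% of all comparisons
print(f"{glass_preferred_mm / moderate_major:.0%}")  # glass preferred in 93%
```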
Types of Discordance
The included studies reported 108 of 335 discordances concerning the diagnosis of dysplasia, representing 32% of all reported discordances. These were predominantly cases from the upper gastrointestinal tract and the cervix. Dysplasia is an area of appreciable interobserver and intraobserver variation, but nonetheless, dysplasia discordances seemed to be particularly prevalent. The majority of discordances were instances of missed diagnosis, where dysplastic tissue was diagnosed as benign or reactive tissue. Fifty-one cases of this type were reported, representing 47% (51 of 108) of all dysplasia-related discordances.
Interestingly, where there were differences in grading, dysplastic lesions tended to be undercalled on the digital microscope (33 cases undercalled, 8 cases overcalled). There were also errors in the other direction, with erroneous dysplasia diagnosed in benign tissue (14 cases) and a smaller number of overcalls in grading (10 cases). Of all the discordant dysplasia diagnoses, glass diagnosis was preferred in 101 of 108 cases (94%). This indicates that diagnosis and grading of dysplasia may be a pitfall of digital diagnosis (Table 8).
Locating Small Diagnostic Objects/Features.—
Another common diagnostic feature implicated in discordance is the ability to find or not find a small diagnostic/prognostic object. Thirty-nine discordances of this type were recorded, with glass the preferred diagnostic medium in 32 of 39 (82%). The majority of these discordances would be classified as B2 errors in microscopy, which one expects to see in a small proportion of cases as a matter of course. Three small object location discordances were classified as surprising errors based on the context. In total, 5 of the small object location discordances could have resulted in moderate/major patient harm. The types of small object missed included a range of malignant and benign features. Perhaps the most concerning of these are small tumors, metastases, and microsatellites. The most common small objects missed were foci of inflammation, more specifically cryptitis in colon biopsies. The detection of microorganisms was also a theme raised in the literature (Table 9).
Specific Problematic Entities Reported in the Literature.—
Granulocytes were mentioned in 27 of 335 instances of discordance (11%). Disparities in detection of granulocytes, particularly eosinophils, may relate to differences in color of the cells on digital and their refractile textures. Similar issues with detection of other eosinophilic, refractile objects (nucleated red blood cells and eosinophilic granular bodies) were reported.
Difficulties were also described identifying 2 entities commonly recognized on the grounds of subtle textural and tinctorial qualities: blue mucin and amyloid. In the 2 reported cases of difficulty with amyloid,22 study participants were unable to detect the textural quality of amyloid on digital slides, which would have alerted them to examine the original glass slides with a polarizer (Table 10).
There was one confirmed misidentification error, where the reader providing a glass diagnosis viewed the wrong slides, and the digital read rendered the correct diagnosis.
COMMENT

The use of digital pathology in the clinic is increasing, with many departments piloting digital pathology in primary diagnostic settings. In light of this, guidance is needed regarding potential safety implications for patients. A number of validation studies have been reported in the literature, but as we discovered in our systematic review,12 these vary greatly in terms of the number and types of participants and cases, the methodology, and the technologies examined. In the absence of a multicenter clinical trial, a systematic review remains the highest level of digital pathology concordance evidence available for those engaged in regulatory efforts.
Goacher et al12 found a mean diagnostic concordance of WSI and light microscopy, weighted by the number of cases per study, of 92.4%. In this study, we aimed to complement this work with a systematic analysis of the discordant diagnoses reported in the validation literature, in the hope that this analysis would allow a more precise evaluation of primary digital diagnosis.
The diagnosis and grading of dysplasia is implicated as a possible pitfall of digital diagnosis. Most papers emphasize blurring of nuclear detail on digital scans, and implicate poor focus, exacerbated by compression artifact and limited dynamic range. These explanations focus on high-power diagnosis, but we should also consider low power. In cervical biopsies, dysplasia is often focal, and if focal abnormality is not picked up on the low-power assessment of the epithelium, confirmatory nuclear detail cannot be appreciated at high power. We may also need to consider the effect of scanning magnification and viewing hardware quality.
What potential strategies do we have to mitigate the risks of diagnosing dysplasia digitally? The first thing we could do is ensure pathologists are aware that dysplasia is a potential pitfall. Ordi et al34 describe an increase in glass-digital concordance for cervical dysplasia as their study progressed, suggesting that there is a significant learning curve effect for digital dysplasia diagnosis. This is an area of diagnosis that may need a longer settling-in period before pathologists can confidently and safely sign out digital cases.
Pathologists working in relevant specialties might want to consider a self-validation procedure, with digital-glass reconciliation of dysplasia diagnoses while they establish satisfactory glass-digital concordance. Alternatively, there might be a role for optional or mandatory checks on glass following a digital assessment in particular scenarios—for example, diagnosing dysplasia in Barrett esophagus, a practice used in some digital pathology deployments (Anna Boden, MD, Linköping; David Snead, MB, BS, Coventry; verbal communications, January 2016). Some authors describe limited improvement in digital dysplasia diagnosis with slides scanned at ×40, so there may be justification for mandatory scanning of selected specimens at ×40 (eg, cervical biopsies, upper gastrointestinal biopsies). Unfortunately, there are insufficient data at present to judge whether scanning at ×20 versus ×40 has a significant impact on overall discordance rates, and this is an area that deserves more attention in future studies.
Locating Small Diagnostic Objects/Focal Diagnostic Features
Locating small diagnostic objects is highlighted as a potential problem on digital slide reads. Navigation is certainly implicated, both within and among slides, and the effects of display resolution and scanning magnification also warrant consideration. In many studies, authors explicitly state that pathologists found navigating cases cumbersome. Specific training in safe and efficient navigation strategies using digital software should be available to pathologists who are expected to use digital images clinically.
Appropriate use of whole slide and whole case thumbnails can aid navigation, and safety features such as indicator lights to warn pathologists of missed slides/regions could help. There may be a case for modifying workflows to incorporate a mandatory glass check, at least in the initial phases of digital deployment, for specimens such as sentinel lymph nodes, where detection of micrometastases should be optimal.
There is little evidence in the literature regarding minimum specifications for viewing hardware or standardization of viewing hardware. Many authors found that diagnostic biopsies, particularly those in which detection of inflammatory disease is important, were best scanned at ×40, with Snead et al35 recommending ×60 in cases where detection of microorganisms is a priority.
Specific Objects/Features Causing Diagnostic Difficulty
Examination of the literature highlights a number of specific entities, including granulocytes, nucleated red blood cells, and amyloid, that were reported as having a different appearance on glass and digital slides. The importance or relevance of identifying these entities will vary among different subspecialties, and possibly among different pathologists, but it is important to mention areas where investigators have noticed an appreciable difference in appearance.
Specialty pathologists need to be aware of specialty specific diagnostic pitfalls and decide how important these features are to their own practice. Bauer and Slaw23 report that scanning gastrointestinal biopsies at ×40 improved the ability of their pathologists to detect and correctly categorize granulocytes. Color calibration may potentially play a role.
A single case of misidentification error was reported in the review source literature. In this case, the correct diagnosis was rendered on the digital slide, and the glass slide reviewer viewed the incorrect glass slide. This reminds us of the potential of digital technology to prevent a type of pathology error that should never occur: the misidentification of specimens, slides, or reports. The study designs used in the source material for this analysis are unlikely to expose the full extent of misidentification errors, which should be considered when evaluating the total impact of digital versus glass technology on diagnostic error rates.
Given the increasing trend toward using digital pathology for clinical diagnosis, including primary diagnosis, the need for evidence-based digital pathology guidelines and a systematic evaluation of the available evidence is paramount. In this analysis of 8069 comparisons of glass and digital diagnoses, we found 335 discordances. Of these, only 28 had the potential to cause moderate or major patient harm.
We identified a number of problem areas in digital diagnosis that warrant further exploration and explanation, namely the identification and grading of dysplasia, the location of small diagnostic objects and features, and the identification of certain specialty-specific diagnostic features.
This information can be used to inform safe departmental or institutional adoption of digital pathology and to help design systems and processes to address these areas in the future. We believe that although digital deployment for primary clinical diagnosis is in its infancy, it is important to collect and share data on pitfalls and problem cases. To this end, it might be helpful to create a centralized database of such cases, recorded in a standardized format.
Education and continuing professional development of pathologists on an individual level is vital to ensure a safe and responsible rollout of digital microscopy. Pathologists should be encouraged to gain confidence in risk-free or risk-mitigated diagnostic environments before adopting a 100% digital workflow.
The perceived success or failure of digital pathology in a specific laboratory will rest on the competency and confidence of individual pathologists, so it is important that pathologists understand the strengths and limitations of WSI systems. The studies included in this analysis used a wide variety of scanners, with different characteristics that could affect diagnostic interpretation of slides. In light of this, we believe it is important that diagnostic departments perform their own whole-system validations for WSI, to evaluate the strengths and weaknesses of the combination of hardware and software components they propose to use for primary diagnosis.
From the Departments of Histopathology (Dr Williams) and Cellular Pathology (Dr Treanor), Leeds Teaching Hospitals NHS Trust UK, University of Leeds, Leeds, United Kingdom; the Department of Histopathology, Airedale NHS Foundation Trust, Keighley, United Kingdom (Dr DaCosta); the Faculty of Medicine and Health, University of Leeds, Leeds, United Kingdom (Mr Goacher); and the Department of Cellular Pathology, Linköping University, Linköping, Sweden (Dr Treanor).
Dr Treanor is on the advisory board of Sectra (Sectra AB, Linköping, Sweden) and Leica Biosystems/Aperio (Vista, California) as part of Leica Biosystems Imaging, Inc, and had a collaborative research project with FFEI in 2014–2015 (FFEI Ltd, Hemel Hempstead, United Kingdom), in which technical staff were funded by FFEI. Dr Treanor is a coinventor on a digital pathology patent that was assigned to Roche-Ventana on behalf of his employer in 2015. Dr Treanor receives (or will receive) no personal remuneration for any of these activities. The other authors have no relevant financial interest in the products or companies described in this article.
The content of this paper was presented as a platform talk at Pathology Informatics 2016; May 25, 2016; Pittsburgh, Pennsylvania.