Context.—

Novel therapeutics often target complex cellular mechanisms. Increasingly, quantitative methods like digital tissue image analysis (tIA) are required to evaluate correspondingly complex biomarkers to elucidate subtle phenotypes that can inform treatment decisions with these targeted therapies. These tIA systems need a gold standard, or reference method, to establish analytical validity. Conventional, subjective histopathologic scores assigned by an experienced pathologist are the gold standard in anatomic pathology and are an attractive reference method. The pathologist's score can establish the ground truth to assess a tIA solution's analytical performance. The paradox of this validation strategy, however, is that tIA is often used to assist pathologists to score complex biomarkers because it is more objective and reproducible than manual evaluation alone by overcoming known biases in a human's visual evaluation of tissue, and because it can generate endpoints that cannot be generated by a human observer.

Objective.—

To discuss common visual and cognitive traps known in traditional pathology-based scoring paradigms that may impact characterization of tIA-assisted scoring accuracy, sensitivity, and specificity.

Data Sources.—

This manuscript reviews the current literature from the past decades available for traditional subjective pathology scoring paradigms and known cognitive and visual traps relevant to these scoring paradigms.

Conclusions.—

Awareness of the gold standard paradox is necessary when using traditional pathologist scores to analytically validate a tIA tool because image analysis is used specifically to overcome known sources of bias in visual assessment of tissue sections.

Anatomic pathology, as a discipline, seeks to render a precise and complete diagnosis of lesions produced by a particular disease, to the correct patient, in a timely fashion, and in a way that is understandable and useful to the physician treating the patient.1  In a research setting, this definition is easily adaptable as an accurate and comprehensive assessment (diagnosis) provided to the correct specimen (patient) in a timely fashion and in a way that generates biologically meaningful data to benefit scientists (physicians) investigating biomedical research questions. Whether applied to a patient or a specimen, such assessments usually include not just the diagnostic opinion but also a lesion score defining the severity (or grade) of the finding, as collective experience over time has linked such scores to clinical outcome.

In recent years, anatomic pathology diagnoses have evolved from a reliance on traditional manually acquired lesion scores to an increasing emphasis on automated tissue image analysis (tIA). The rationale for this shift is that manual scores are qualitative or semiquantitative, and subjective even when assigned by a seasoned observer, whereas tIA data are quantitative, reproducible, and objective. Analytical validation of such novel tIA methods requires their comparison with an accepted reference standard, otherwise known as the gold standard (capturing ground truth). For tIA applications, the reference standard can be a similar modality that has been validated through extensive use (eg, a manual histopathology score for the same tissue section) or an orthogonal (ie, independent) methodology (eg, quantitative polymerase chain reaction or Western blot analysis of tissue from the same specimen). However, appropriate reference standards to validate novel tIA methods seldom exist in anatomic pathology practice. In nearly all cases, manual pathology scoring by a trained and competent (ie, board-certified) pathologist serves as the reference standard for anatomic pathology endpoints.2–4 Manual scores assigned by different practitioners may exhibit the same trend but are seldom identical; moreover, human fallibility is a well-known source of bias in manual scoring.1,5 These imperfections have been addressed by devising innovative digital pathology techniques, especially tIA, to overcome known pitfalls with manual scoring of biomarker expression by using the strengths of anatomic pathologists (eg, biomedical knowledge, cognitive abilities) to provide more reproducible data for making clinical decisions and for testing hypotheses in the translational environment. The rise of such innovative tIA practices, however, has engendered a critical paradox: what is the best gold standard for biomarker analysis in modern anatomic pathology practice? In other words, what defines or captures the ground truth for lesion scoring?

The source of this paradox is that the conventional practice of using manual scores generated by a trained and experienced pathologist provides the most straightforward reference standard for novel tIA-based methods. The same stained tissues may be scored manually and then by tIA, after which the results may be compared directly to demonstrate equivalency between the manual and tIA procedures. Alternatively, an orthogonal modality may be used as the benchmark, but the need for different material degrades the contextual information (eg, cellular localization) that is crucial for assessing biomarker expression in tissues. In either scenario, therefore, the nature of ground truth represents an informed choice.

The following sections review several common manual scoring paradigms that may be used to generate reference standard data to evaluate the analytical performance of novel immunohistochemistry (IHC)–based tIA assays, explore important cognitive and visual traps that can profoundly affect reference standard data, and address methods for maximizing manual score consistency and data reproducibility. These sources of bias are placed in a tIA context to highlight how properly established computer algorithms that leverage the strengths of both the anatomic pathologist and the algorithm can be used to avoid these traps and produce data of higher quality. Understanding these concepts and their correct application is critical given the increasing value of tIA as a tool in clinical and translational biomarker assessments to guide novel therapeutic development and treatment applications.

Histopathologic scoring entails the pathologist's ability to obtain qualitative or semiquantitative data from tissue.6  In the clinical setting, manual scoring supports treatment decisions, predicts patient prognosis, and determines clinical trial design, enrollment, and outcome measures.7  A manual scoring system should be definable, be reproducible, and produce meaningful results.6,8  Scoring strategies may be devised for specific studies, starting with an adequate experimental design and appropriate choice of tissue harvesting techniques,6  or may implement a predefined scoring paradigm that is part of an in vitro diagnostic kit. In both cases, the scoring scheme should be used only after the definitions of the different scoring categories and the boundaries that define each one have been reviewed by all individuals who will be engaged in lesion and biomarker scoring.9,10 

The most basic scoring paradigm for IHC staining assessment categorizes entire tissue sections as positive or negative, based on the presence (ie, expression) of a particular biomarker. In this scenario, any staining observed that is above background staining level will classify the sample as positive. Obviously, the amount of information collected for such samples will be very limited. To address this paucity, scoring paradigms can be devised that provide somewhat more information, such as the estimation of the number or percentage of cells that exhibit positive (or negative) staining. The denominator of this assessment can either be the entire cell population present in the tissue section or a subcompartment (ie, tumor cells within the tissue). In their simplest incarnation, such scoring paradigms are deployed without considering variations in staining intensity when assigning scores.

Adding a semiquantitative evaluation of staining intensity increases the complexity of the scoring paradigm. A tiered approach (eg, grades of 0, 1+, 2+, 3+; none, little, some, strong; or low, high) is often used; close attention to the specific defining criteria for each tier when assigning scores reduces the number of occasions in which borderline lesions are misclassified. This approach, which combines the number of labeled cells with labeling intensity, underlies the widely used DAKO (Carpinteria, California) HercepTest in vitro diagnostic kit for breast cancer.11 A breast biopsy is scored as 0 (ie, negative) when membrane staining of any intensity is completely absent or seen in less than 10% of tumor cells. Faint staining in more than 10% of tumor cells is classified as a score of 1+, weak staining in more than 10% of tumor cells is designated as 2+, and strong membrane staining in more than 10% of tumor cells is called 3+. The difficulty with assigning scores arises because the exquisite images of representative labeling for each score in the product insert do not clearly define the category boundaries. Uncertainty regarding the appearance of a “high 1+” versus a “low 2+” score, and whether this difference correlates with a specific clinical outcome, could produce considerable variability in the manual scores awarded by different pathologists.
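The tier definitions above can be written as a small decision rule, which also exposes where the category boundaries are left undefined. The following Python sketch is illustrative only: the function name, the intensity labels, and the choice to treat exactly 10% as below the cutoff are assumptions that do not come from the text or the product insert, which remains the authority for clinical use.

```python
def tiered_membrane_score(intensity: str, percent_stained: float) -> str:
    """Map a membrane staining call to a 0/1+/2+/3+ tier as described in the text.

    intensity: 'none', 'faint', 'weak', or 'strong' (hypothetical labels).
    percent_stained: percentage of tumor cells with membrane staining (0-100).
    """
    if not 0 <= percent_stained <= 100:
        raise ValueError("percent_stained must be between 0 and 100")
    tiers = {"faint": "1+", "weak": "2+", "strong": "3+"}
    if intensity != "none" and intensity not in tiers:
        raise ValueError(f"unknown intensity: {intensity!r}")
    # Assumption: exactly 10% is treated as below the cutoff, because the text
    # defines only "less than 10%" and "more than 10%".
    if intensity == "none" or percent_stained <= 10:
        return "0"
    return tiers[intensity]


print(tiered_membrane_score("weak", 35))    # 2+
print(tiered_membrane_score("strong", 5))   # 0
```

Note that the rule is only as reproducible as its categorical inputs: the distinction between "faint" and "weak" is itself a subjective call, which is where a "high 1+" and a "low 2+" can diverge between observers.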

A variant of tiered scoring designed to capture even more information and account for tumor heterogeneity is the histochemical score (H-score), a weighted approach used widely in the clinical setting to evaluate breast cancer samples that also can be adapted to many other settings. The H-score combines an assessment of staining intensity with an estimation of the percentage of stained cells at each intensity. Staining intensity is evaluated as described above in a 0 to 3+ fashion, and for each category the percentage of stained target cells is recorded. The final H-score is the sum of each intensity rating multiplied by its corresponding percentage, generating a score on a continuous scale of 0 to 300.12–14
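As a worked illustration of the H-score arithmetic described above, the following Python sketch sums intensity multiplied by percentage across the four intensity categories. The function name, the input layout (a mapping from intensity to percentage), and the example percentages are hypothetical and chosen only for demonstration.

```python
def h_score(percent_by_intensity: dict) -> float:
    """Return the H-score: sum over intensities 0-3 of intensity x percentage of cells.

    percent_by_intensity maps an intensity rating (0, 1, 2, or 3) to the
    percentage (0-100) of target cells scored at that intensity.
    """
    for intensity, percent in percent_by_intensity.items():
        if intensity not in (0, 1, 2, 3):
            raise ValueError(f"intensity must be 0-3, got {intensity!r}")
        if percent < 0:
            raise ValueError("percentages must be non-negative")
    if sum(percent_by_intensity.values()) > 100 + 1e-9:
        raise ValueError("percentages must not sum to more than 100")
    return float(sum(i * p for i, p in percent_by_intensity.items()))


# Hypothetical specimen: 40% unstained, 30% at 1+, 20% at 2+, 10% at 3+.
example = {0: 40.0, 1: 30.0, 2: 20.0, 3: 10.0}
print(h_score(example))  # 0*40 + 1*30 + 2*20 + 3*10 = 100.0
```

The two extremes follow directly: a specimen with no staining scores 0, and one with all cells at 3+ scores 300.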

Like the H-score, the Allred score also combines the assessment of staining intensity (0–3+) and proportion of cells staining (categories of percentage predefined). However, the formula for the Allred score (also termed the quick score) equals the intensity score + the proportional score, and thus follows a noncontinuous scale including 0 and 2 through 8.15  This scoring paradigm is in clinical use in combination with the DAKO ER/PR pharmDx in vitro diagnostic kit.16 
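The gap at 1 in the Allred scale can be made explicit with a short sketch. The text states only that the proportion categories are predefined; the cutoffs used below (less than 1%, 1% to 10%, 11% to 33%, 34% to 66%, and 67% or more, scored 1 through 5) are an assumption based on the commonly cited form of the scheme and should be checked against the kit documentation. The function name is hypothetical.

```python
def allred_score(percent_positive: float, intensity: int) -> int:
    """Return intensity score (0-3) plus proportion score (0-5)."""
    if not 0 <= percent_positive <= 100:
        raise ValueError("percent_positive must be between 0 and 100")
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2, or 3")
    # No stained cells implies no intensity, and vice versa; mixed input is inconsistent.
    if (percent_positive == 0) != (intensity == 0):
        raise ValueError("percent_positive and intensity must both be zero or both be nonzero")
    if percent_positive == 0:
        return 0
    # Assumed proportion cutoffs (not given in the text).
    if percent_positive < 1:
        proportion = 1
    elif percent_positive <= 10:
        proportion = 2
    elif percent_positive <= 33:
        proportion = 3
    elif percent_positive <= 66:
        proportion = 4
    else:
        proportion = 5
    return proportion + intensity


# Enumerate every reachable total: the smallest nonzero sum is 1 + 1 = 2.
reachable = {allred_score(p, i) for p in (0.5, 5, 20, 50, 80) for i in (1, 2, 3)} | {allred_score(0, 0)}
print(sorted(reachable))  # [0, 2, 3, 4, 5, 6, 7, 8]
```

Because any stained cell contributes at least 1 to both components, a total of 1 cannot occur, which is why the scale is noncontinuous.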

It is beyond the scope of this manuscript to discuss the comparative clinical value of these scoring paradigms. Properly deployed, all manual scoring methods have potential value. The key to their effective use is the recognition that all these approaches generate semiquantitative scores as ordinal numbers or derivatives thereof, and as such should not be used as quantitative measurements of ground truth.6,7 

Bias (systematic error) can affect scientific investigations and may even destroy a study's validity depending upon its degree.17,18  It can occur at any phase of research, and to some degree it is nearly always present in research studies. Bias is independent of the sample size and statistical significance.19 

In the following sections, we review a number of visual and cognitive traps that can affect manual pathology scoring and that can be avoided when using digital image analysis (Table). Visual traps (also called optical illusions) are phenomena in which the perceived image differs from objective reality. In contrast, cognitive traps represent tendencies to think in a biased way that can lead to systematic errors or deviations from rational thinking. For the purpose of this review, particular focus is given to the impact of these traps in the manual scoring of IHC staining.

Overview of Visual and Cognitive Traps


Visual Traps as Sources of Bias

The Illusion of Size

The extensively studied Ebbinghaus illusion of size illustrates that our perception of size is greatly influenced by the context in which an object is presented. For example, in Figure 1, A, the inner circles are the same size but appear to be different because of the distances between objects and the divergent sizes of the adjacent circles.5,20–22 This illusion applies to settings in which intermingled cells of divergent sizes must be discerned from each other.

Figure 1.

Examples of visual traps as sources of diagnostic bias. A, Ebbinghaus illusion of size. Although both center circles are equivalent in diameter, the circle on the left appears larger because of the context (Reprinted with permission from Elsevier from reference 22: Plodowski A, Jackson SR. Vision: getting to grips with the Ebbinghaus illusion. Curr Biol. 2001;11[8]:R304–R306.). B, Delboeuf illusion of size whereby equivalent circles appear different because of the context.83 C, Example of how the Delboeuf illusion could appear with membranous immunohistochemistry (IHC) staining adjacent to a nucleus (image digitally modified to place 2 cells side by side). D, Craik-O'Brien-Cornsweet illusion, in which fields to the right and left of the center “edge” look lighter and darker, respectively, when in fact the brightness of both areas is the same; this paradox can be seen when the interface is covered.84 E, Checker shadow illusion, in which the fields marked at A and B actually are the same color but are perceived as different based on preconceived knowledge.85 F, Watercolor effect (Reprinted with permission from Elsevier from reference 38: Devinck F, Spillmann L. The watercolor effect: spacing constraints. Vision Res. 2009;49[24]:2911–2917.). G through J, Example of flawed color intensity perception in IHC (Reprinted with permission of Springer from reference 5: Conway C, Dobson L, O'Grady A, Kay E, Costello S, O'Shea D. Virtual microscopy as an enabler of automated/quantitative assessment of protein expression in TMAs. Histochem Cell Biol. 2008;130:447–463.). G and H, Both images have equivalent membrane staining intensity by image analysis, but visually these intensities are perceived to be different because of differences in cytoplasmic staining. I and J, Digital annotation (green markup) represents membrane staining as assessed quantitatively via tissue image analysis. K and L, An IHC example of subtle color hue change. Primary antibody target: CD8. K, Tissue stained with IHC monoplex protocol using alkaline phosphatase (AP) red as chromogen. L, Same tissue as in K, stained with duplex protocol for 2 biomarkers using 2 chromogens (AP red and diaminobenzidine [DAB]); the subtle change in the shade of the red chromogen (added brown hue from DAB) used for the first biomarker is a sign that the second biomarker's secondary antibody cross-reacted with the primary antibody or the linker used for the first marker. M, Hermann grid illusion, in which small gray squares are perceived in the intersections of white lines between 4 adjacent black squares86 (original magnification ×20 [K and L]).


The Delboeuf illusion of size acknowledges that dissimilar objects can impact perception as well. In Figure 1, B, a circle around 1 of 2 identical black circles makes the ringed disk appear larger. This visual illusion demonstrates why a nucleus surrounded by a labeled membrane may be perceived as having a different size from a nucleus with an unlabeled membrane (Figure 1, C). It is important to note that the proximity of the concentric ring is important: a larger distance between the ring and inner disk can actually make the central object appear smaller.23 

The illusion of size is influenced by the observer's preconceived notions. For example, the perception of size is impacted by the pathologist's memory regarding prior samples with analogous features. The effect of prior knowledge with respect to estimates of size also extends to the words describing an object. This facet of the illusion is shown most readily by contemplating our conceptions of various animals (eg, ant versus elephant).20 

Craik-O'Brien-Cornsweet Illusion

The Craik-O'Brien-Cornsweet illusion is the perception of an illusory effect to objects or surfaces based on the optical characteristics of their edges. Adjacent objects are perceived as having different brightness based on variations in staining intensity, when in fact they are the same (Figure 1, D).24–26 This illusion may influence a pathologist's assessment of the staining of adjacent cells, where staining must be assessed for both the cytoplasm (middle) and membrane (edge); dissimilarities of staining intensity at the membrane may lead to misinterpretation of cytoplasmic staining intensity.

Checker Shadow Illusion

Similar to the Craik-O'Brien-Cornsweet illusion, this illusion demonstrates flawed perception arising from differences in brightness (Figure 1, E). Our previous knowledge of the checkerboard's arrangement influences our perception of brightness for the squares, some of which we know to be lighter even though covered by shadow; we perceive these squares as having a different brightness compared with other light and dark squares not covered by a shadow. This effect is combined with the visual system's tendency to ignore the soft edges of shadows and see only sharp edges of the squares.27,28 

Distinguishing Gradients of Colors and Hues

Although the number of colors that the human eye can distinguish remains uncertain, published estimates based on graded color strips or wheels range from 30 000 to 10 million.29 Tremendous person-to-person variation exists in the ability to differentiate colors and color hues.5,30 However, when people are shown different colors and color hues in isolation, recognition of differences drops sharply.31 In addition, age, certain disease states, and lifestyle choices can affect the ability to see and distinguish colors.32–34 In comparison, a 12-bit camera can accurately and independently distinguish 68 billion colors. Because of this superiority of the digital system in assessing color hues, specific attention needs to be given to setting a correct threshold within the digital analysis to ensure that background staining, however faint, does not result in false-positive identification.35
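The figure of 68 billion colors is consistent with simple bit-depth arithmetic, assuming that "12-bit" refers to 12 bits for each of 3 color channels (an interpretation that the text does not state explicitly). A minimal Python check, with a hypothetical function name:

```python
def distinguishable_colors(bits_per_channel: int, channels: int = 3) -> int:
    """Number of distinct digital color values for a given bit depth per channel."""
    if bits_per_channel < 1 or channels < 1:
        raise ValueError("bits_per_channel and channels must be positive")
    return (2 ** bits_per_channel) ** channels


print(f"{distinguishable_colors(12):,}")  # 68,719,476,736 (about 68 billion)
print(f"{distinguishable_colors(8):,}")   # 16,777,216 for a typical 8-bit display
```

This counts encodable values rather than colors a sensor can reliably separate in practice, which also depends on noise and calibration.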

The perceived color is known to depend not only on the spectral distribution of the light reflected by that particular color, but also on the light nearby. This phenomenon is called chromatic or color induction, and is composed of 2 components: color contrast and color assimilation. Color contrast presents itself as follows: yellow light will appear yellow (580-nm wavelength) when viewed independently of other colors, but will appear greenish when it is surrounded by red (670-nm wavelength).36 Color contrast occurs when the color appearance shifts away from the appearance of the nearby light, whereas color assimilation occurs when the appearance shifts toward the color of the inducing light.37 Color assimilation has been demonstrated in several situations, including the watercolor effect (Figure 1, F).5 In the watercolor effect, 2 uneven colored contours, one darker than the other, surrounding an uncolored space result in the uncolored area appearing to take on the lighter color.38 For example, when a white space is surrounded by an orange inner and purple outer border, the enclosed area appears to be orange tinged.39

In practical terms, these color phenomena may impact efforts to score the level of cytoplasmic staining in the context of adjacent membrane staining of different hues, or vice versa (Figure 1, G through J). Variability in color hue recognition may also impact the evaluation of multicolor bright-field staining (multiplex IHC) where one must differentiate colors and color hues that are in close proximity or colocalized (Figure 1, K and L).

Lateral Inhibition

Lateral inhibition is an optical effect resulting in altered color perception arising from neuronal interference in the retina and optical tracts. For example, the Hermann grid illusion (Figure 1, M) results in perception of false points at inflection points between interfaces of light and dark lines and squares forming a grid. This occurs when excited neurons reduce the activity of their neighbors to prevent the spread of action potentials in a lateral direction. When a person looks at a light edge next to a dark edge, lateral inhibition makes the light edge appear even lighter, because the inhibition from neighboring neurons activated by the darker edge has a smaller inhibitory effect than that experienced by a light region surrounded by a light border. Similarly, the dark border next to the light border is perceived as even darker. This results in an increase in the perceived contrast and sharpness of images at the cost of viewing certain things as lighter or darker than they actually are.40,41

Inattentional Blindness

When we are engaged in a demanding task, our focus may blind us to obvious and salient events.42 This phenomenon is called inattentional blindness. Some reports suggest that expert observers may partly but not completely mitigate this effect.43 Indeed, even experts remain susceptible: 83% of radiologists asked to find lung cancer nodules in stacks of patient computed tomography images missed the faint outline of a gorilla present in 5 consecutive images even though the gorilla was 48 times larger than the cancer nodules and eye tracking revealed that most of the radiologists had looked right at the gorilla.44 Similar studies have shown that this phenomenon also occurs if a stimulus is removed instead of added.45 For the practicing pathologist, a similar result may occur when manually evaluating tissue sections or digitized images.

Cognitive Traps as Sources of Bias

Confirmation Bias

Confirmation bias, in its psychologic definition, refers to the unwitting selectivity in the acquisition and use of evidence.46,47  In the context of research, this bias acknowledges people's tendency to seek information that is considered supportive of a favored hypothesis or existing belief, and to interpret information in ways that are partial to that hypothesis or belief. Although this bias can be deliberate or unintentional, the effect on a study is most commonly subconscious.46  In fact, the tendency to deliberately resist confirmation bias is one key way that the scientific method differs from ordinary thinking.46  Applying the principles of psychology to the work of pathologists, it appears clear that the confirmation bias of inadequate search (ie, confirming a preconceived diagnosis and not investigating further) can have an effect on data quality. Bias can be introduced at any stage of an experiment: design, analysis, or interpretation.17,19 

One method used for decreasing the impact of confirmation bias is masking (also known as blinding or coding) of the observer to treatment group identities.6  The tradeoff is that masking methods may hamper a pathologist's ability to fully evaluate tissue samples/lesions. In the context of improving quality, it has been shown that blinded evaluations do not significantly improve study data.48  More recently, published best practices only recommend masking or blinding of pathologists (ie, not providing information on clinical or treatment parameters) in limited scenarios and instead generally advise that pathologists be unblinded to treatment groups and other study data when performing their analysis in a research setting.49  Within the context of clinical evaluation, blinding pathologists to patient data is neither feasible nor helpful, as total patient data evaluation (ie, histology slide set and patient file/history) is paramount for adequate assessment.

Diagnostic Drift

Diagnostic drift describes the situation in which scoring values vary slightly and in a consistent fashion throughout a study.7,8,50  This effect is not a cognitive trap per se, but the impact of subtle variations over time—especially for longer studies with multiple endpoints—nonetheless substantially affects data quality. Even proficient pathologists with extensive experience may fall prey to this trap when manually scoring lesions. The easiest way to avoid this situation is to periodically review the scoring criteria during the course of a study.

Anchoring

Anchoring, or tunnel vision, describes the phenomenon in which exclusive attention is given to only one aspect of a problem while failing to fully assess and understand the entire situation.51,52  Oftentimes, the first piece of information given or communicated on a topic (the anchor) causes fixation on the anchor; its validity or appropriateness is overestimated, whereas new and supplemental information is ignored.47  In diagnostic medicine, anchoring results in premature closure and acceptance of a diagnosis prior to its full validation.53 

For example, pathologists diagnosing prostate cancer often report similar values for the Gleason score (based on tissue architecture) and the now-obsolete World Health Organization grading system (based on nuclear morphology). This concordance of independent scoring paradigms based upon very distinct morphologic features has been shown to arise from cognitive bias, which represents anchoring.54  In this example, nuclear images were scored by pathologists independent of their tissue context and compared with tIA data for the same nuclei. Grading the nuclei in isolation from the tissue context significantly improved manual scoring relative to tIA, although prognostic power was lost when tissue features were not assessed.

Search Satisfaction

Search satisfaction describes the phenomenon of a false-negative error, in which a specific target or event is more likely to be missed when it occurs in concordance with one or more additional anomalies. In other words, a visual search is abandoned once the searcher finds a target and becomes satisfied with the meaning of the image after reaching a certain information threshold.55–57 This effect has been widely studied in radiology55,57–59 as well as various scenarios outside the medical field, including luggage screening at airports.60 In general, the rarer an event, the more likely it is to be missed.60 To our knowledge, this phenomenon has not been studied in the diagnostic pathology setting, but it is reasonable to expect that pathologists would exhibit a cognitive bias resembling that of radiologists. A possible search satisfaction dilemma in anatomic pathology might be the inability to detect immune cell subsets occurring at low frequencies in a tissue, thereby leading to missed disease- or treatment-relevant phenotypes. Although using tIA may assist in the detection of rare events, the pathologist must confirm that only the appropriate tissue features are captured and measured so as to avoid the opposite problem (false-positive data points) and overinterpretation.35

Context Bias

Observers are more likely to consider a sample as abnormal when it is reviewed in a specifically assembled sample set with high disease prevalence than when the sample is interpreted either as part of a group with lower disease prevalence or in isolation.17  This effect also has been studied in radiology.61  A recent study in which pathologists interpreted 60 biopsy specimens blindly twice with a 6-month washout period between sessions demonstrated that practitioners' diagnoses were influenced by the severity of previously interpreted cases. The extent of context bias was similar when pathologists' responses were evaluated with respect to 1 or 5 preceding cases.62 

Avoidance of Extreme Ranges

The tendency to avoid extremes in scoring and grading systems is well recognized.7  Studies have shown that a majority of tumors are classified subjectively as being moderately differentiated, thus avoiding the outlying options of well or poorly differentiated.63  Similarly, in some scoring systems, normal biopsies tend to be classed as abnormal because practitioners seek to avoid the lowest end of the range—normal.64  Within IHC evaluation, reports have confirmed that the middle scoring category is overused,65  leading some researchers to suggest that the number of categories be minimized (ie, low and high grade only, instead of multiple tiers) to improve diagnostic reproducibility.7,66 

Number Preference

Number preference (also known as digit preference, terminal/end digit bias, and heaping) recognizes the fact that not all numbers are created equal.67,68 Humans have been shown to prefer rounding data points to numbers ending in 0 or 5.69–71 Within the medical field, this effect has been extensively studied in the recording of blood pressure measurements and body mass index as well as in surveys that require self-reporting of numbers. In these studies, data sets also show significant heaping of data points ending in 0 and 5.72–76 For manual pathology scoring, when percentage estimates are assigned, a higher chance of recording scores that end in 0 and 5 has been described, which demonstrates that scoring is biased away from the full continuum of values between 0 and 100. This bias in scoring impacts the types of statistical analyses appropriate for manual scores relative to quantitative scores that can be generated by tIA.
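One simple way to look for heaping in a set of percentage scores is to compare the fraction of values ending in 0 or 5 with the fraction expected if every integer from 0 to 100 were equally likely. The Python sketch below is illustrative only: the function name is hypothetical, both score lists are invented for demonstration, and a formal analysis would require an appropriate statistical test on real data.

```python
def terminal_digit_fraction(scores, digits=(0, 5)) -> float:
    """Fraction of integer percentage scores whose last digit is in `digits`."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    hits = sum(1 for s in scores if int(s) % 10 in digits)
    return hits / len(scores)


# Baseline: 21 of the 101 integers from 0 to 100 end in 0 or 5.
baseline = terminal_digit_fraction(range(101))

# Hypothetical data, invented for illustration only.
manual_estimates = [10, 25, 30, 40, 50, 50, 75, 80, 90, 33]
tia_measurements = [12, 27, 31, 38, 46, 53, 74, 79, 88, 35]

print(round(baseline, 3))                         # 0.208
print(terminal_digit_fraction(manual_estimates))  # 0.9
print(terminal_digit_fraction(tia_measurements))  # 0.1
```

A fraction far above the baseline suggests that the scores behave more like a coarse ordinal scale than a continuous measurement, which bears on the choice of statistical analysis.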

Aside from the specific effect of number preference, a pathologist's ability to accurately estimate numbers needs to be critically evaluated. In a study conducted by the College of American Pathologists, 197 participating labs were provided with 10 images of hematoxylin-eosin–stained colonic adenocarcinomas and asked to report the percentage of neoplastic cells present; this interlaboratory comparison was done in the context of molecular testing of genetic alterations, where overestimating the percentage of neoplastic cells specifically results in false-negative test results and significant consequences for patients. The survey demonstrated low interlaboratory precision among pathologists' scores, although mean estimates were somewhat accurate. For example, in some cases estimates only varied by 1% among practitioners, but in other cases the difference from the criterion standard exceeded 24%. More importantly, 50% of the evaluated cases had estimates that were different by more than 10%.77  A previous study, specifically evaluating non–small cell lung cancer biopsies, has shown that visual assessment of estimated neoplastic cell concentration overestimates the tumor burden by 10% to 20%.78 
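The distinction between accuracy of the mean estimate and precision among observers can be made concrete with a small Python sketch. The function name, the criterion value, the 10-point tolerance, and the list of estimates below are all hypothetical and chosen only to show how a group of observers can be accurate on average while individual estimates scatter widely; they do not reproduce the survey data described above.

```python
import statistics


def accuracy_and_precision(estimates, criterion, tolerance=10):
    """Summarize a set of percentage estimates against a criterion standard.

    Returns (bias of the mean, sample standard deviation,
    share of estimates differing from the criterion by more than `tolerance`).
    """
    bias = statistics.mean(estimates) - criterion
    spread = statistics.stdev(estimates)
    share_off = sum(abs(e - criterion) > tolerance for e in estimates) / len(estimates)
    return bias, spread, share_off


# Hypothetical estimates of percent neoplastic cells for one image
# whose criterion standard is assumed to be 40%.
estimates = [20, 30, 40, 50, 60]
bias, spread, share_off = accuracy_and_precision(estimates, criterion=40)
print(f"bias of mean = {bias}, SD = {spread:.1f}, share off by >10 = {share_off:.0%}")
```

In this invented example the mean estimate matches the criterion exactly, yet two of the five observers miss it by more than 10 percentage points, which is the pattern of acceptable accuracy with low precision.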

Gambler's Fallacy

The gambler's fallacy acknowledges the struggle of humans to interpret independent events as being independent. It arises in 2 forms. First, it is fallacious to assume that a certain random event is less likely after a series of the same event occurring; conversely, it is fallacious to assume that a certain event is more likely after a series of different events occurring.79,80  The most common example of this bias is predicting the outcome of a coin toss. After a string of heads, a subject will expect that tails are “due.” However, the result of previous coin tosses is totally independent of the outcome of the next toss.81  Humans consistently make suboptimal decisions regarding independent and identically distributed events.82  Interestingly, this fallacy has been reported more frequently in highly educated individuals who are faced with probability judgment situations.80  Within the medical field, research has shown that many residents are prone to the gambler's fallacy, indicating that their biomedical education did not provide sufficient, or even any, training to avoid this bias.81 
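The independence of successive coin tosses can be checked with a short simulation. The Python sketch below is a minimal illustration, with a hypothetical function name and arbitrarily chosen toss count, streak length, and random seed; it compares the overall rate of heads with the rate of heads on tosses that immediately follow a run of heads.

```python
import random


def heads_rate_after_streak(n_tosses, streak, seed=0):
    """Simulate fair coin tosses and return (overall heads rate,
    heads rate on tosses that directly follow `streak` heads in a row)."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    tosses = [rng.random() < 0.5 for _ in range(n_tosses)]
    # Collect every toss whose `streak` immediate predecessors were all heads.
    followers = [
        tosses[i]
        for i in range(streak, n_tosses)
        if all(tosses[i - streak:i])
    ]
    if not followers:
        raise ValueError("no streaks of the requested length occurred")
    overall = sum(tosses) / n_tosses
    conditional = sum(followers) / len(followers)
    return overall, conditional


overall, after_streak = heads_rate_after_streak(n_tosses=200_000, streak=3)
print(f"heads overall: {overall:.3f}; heads after 3 heads in a row: {after_streak:.3f}")
```

Both rates should lie close to 0.5, because a preceding run of heads makes tails neither more nor less likely on the next toss.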

In manual pathology scoring or object counting, the gambler's fallacy predicts that pathologists will struggle with seeing samples within a study as independent events to be evaluated. A frequent presentation of this scenario is that a pathologist may be reluctant to record several identical scores or counts in a row, and may subconsciously attempt to correct the observed rate by scoring differently in subsequent evaluations. The logical solution to this dilemma is that practitioners should regularly remind themselves to assign scores strictly according to the diagnostic criteria observed in the sample, without reference to any other unrelated specimens.

In conventional anatomic pathology practice, manually assigned lesion and biomarker scores are seen as the gold standard. Such scores generally are assigned by trained and experienced pathologists, and as such are viewed as the best estimate of ground truth (or “reality”) despite their subjective and qualitative to semiquantitative nature. Nonetheless, the variability among pathologists coupled with the inherent heterogeneity among disease phenotypes indicates that a more objective and truly quantitative strategy is the true gold standard toward which the biomedical community should strive. Achievement of this ideal would diminish or even eradicate the gold standard paradox.

A movement toward replacing manual scoring with automated tIA would address many issues associated with anatomic pathology assessment, although these issues often operate subconsciously and are not generally perceived as problems. For example, many aspects of manual pathology scoring can be impacted by visual traps, cognitive traps, and other biases. In contrast, an automated quantitative tIA method can be developed in a manner that minimizes or eliminates many of these traps and biases. This potential superiority of tIA arises because a consistent and objective rule set (ie, cell classification, staining quantification) can be applied on a cell-by-cell basis in whole tissue sections across an entire study or patient population. Scores assigned by computer algorithms are not prone to biases such as number preference or avoidance of extreme ranges. Color, hue, intensity, and contrast are captured by digital cameras and evaluated by computational algorithms that quantitatively perform at levels that far exceed human visual perception. Importantly, analysis via tIA algorithms ensures that each section and each scoring event is viewed as an independent event, based on predefined metrics, and unaltered by the sections evaluated before or by adjacent cells. In addition, tIA algorithms overcome challenges created by interobserver variability, avoid the possibility of diagnostic drift, and provide data sets of quantitative pathology scores on a continuous scale that are readily amenable to rigorous statistical analysis.
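The idea of a fixed rule set applied cell by cell can be sketched in a few lines of Python. This is a simplified illustration and not the method of any particular tIA platform: the intensity cut-offs, the function names, and the example measurements are hypothetical, and the summary statistic shown is the conventional H-score (the staining bin, 0 to 3, weighted by the percentage of cells in that bin, giving a range of 0 to 300).

```python
import random

# Hypothetical intensity cut-offs separating negative, 1+, 2+, and 3+ cells.
# A real assay would need cut-offs tuned and validated for the biomarker.
THRESHOLDS = (0.2, 0.5, 0.8)


def classify_cell(intensity, thresholds=THRESHOLDS):
    """Return the staining bin (0 = negative ... 3 = strong) for one cell."""
    return sum(intensity >= t for t in thresholds)


def h_score(intensities):
    """H-score: sum over bins of (bin x percent of cells in that bin)."""
    if not intensities:
        raise ValueError("at least one cell is required")
    bins = [classify_cell(x) for x in intensities]
    # Equivalent to 100 times the mean bin value across all cells.
    return 100.0 * sum(bins) / len(bins)


# Hypothetical per-cell staining intensities from one tissue section.
cells = [0.1, 0.3, 0.6, 0.9]
shuffled = cells[:]
random.Random(1).shuffle(shuffled)  # evaluate the same cells in another order

print(h_score(cells), h_score(shuffled))
```

Because every cell is classified by the same predefined cut-offs, the score is identical regardless of the order in which cells or sections are evaluated, and it can take any value on the continuous scale rather than clustering on preferred numbers.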

However, the transition toward automated analysis as a future gold standard in many situations must acknowledge the fact that computer-based algorithms have not been able to model the extreme complexity of the critically thinking human brain. Specifically, within the space of histology interpretation by a pathologist, the human mind contributes the sum not only of the formal training, which can be emulated by artificial intelligence programs, but also of the full constellation of learning experiences that transpired prior to the evaluation of that one sample. Cumulative experience over time coupled with the ability to interpret visual cues within a larger tissue context and also the context of patient clinical history and other demographic data are unique aspects of a human practitioner that computer algorithms have been unable to mimic or replace. Despite the example described above, where radiologists unconsciously ignored a gorilla image superimposed on lung radiographs as irrelevant, the complexity of the human mind remains necessary to process the often-complex anatomy present in a given tissue section—to understand both which features should be included and which excluded in the final analysis; this ability is especially true for very subtle gradations in phenotype. In this manner, the combination of a human observer, the pathologist, and the tIA software is greatly advantageous for modern diagnostic pathology because it complements the intrinsic weaknesses of the human observer with the strengths of computer analysis and also overcomes the innate weaknesses of the computer-based algorithm with the strengths of the human observer (Figure 2).35 

Figure 2. 

Schematic diagram outlining the synergy between manual pathology scoring assigned by a pathologist and tissue image analysis (tIA) data obtained via algorithms. Incorporating human oversight when building tIA solutions is better able to yield robust, reproducible, and quantitative data.


The choice of a reference standard for validating an IHC-tIA assay must consider the strengths and weaknesses of both the computer and the human observer. When manual pathology scoring is deemed appropriate for validating a novel tIA method, it is particularly important to understand potential visual and cognitive traps as well as biases that may be encountered when scoring a specific tissue type and biomarker so that performance assessments of tIA approaches are not skewed by error that is integral to a manually defined reference standard. In instances where manual scoring is severely hampered by the above-mentioned traps and biases, an alternative reference standard (eg, biomarker levels in a homogenized tissue sample) may be considered in assessing IHC-tIA assay performance. Regardless of the choice of gold standard, the rationale for its selection must be communicated completely and justified rationally when reporting pathology scores.

The authors would like to thank Matthew Steaffens for his support in creating figures.

1. Sirota RL. Defining error in anatomic pathology. Arch Pathol Lab Med. 2006;130(5):604–606.
2. Fleming MG. Pigmented lesion pathology: what you should expect from your pathologist, and what your pathologist should expect from you. Clin Plast Surg. 2010;37(1):1–20.
3. Daunoravicius D, Besusparis J, Zurauskas E, et al. Quantification of myocardial fibrosis by digital image analysis and interactive stereology. Diagn Pathol. 2014;9:114.
4. Raab SS. Improving patient safety by examining pathology errors. Clin Lab Med. 2004;24(4):849–863.
5. Conway C, Dobson L, O'Grady A, Kay E, Costello S, O'Shea D. Virtual microscopy as an enabler of automated/quantitative assessment of protein expression in TMAs. Histochem Cell Biol. 2008;130(3):447–463.
6. Gibson-Corley KN, Olivier AK, Meyerholz DK. Principles for valid histopathologic scoring in research. Vet Pathol. 2013;50(6):1007–1015.
7. Cross SS. Grading and scoring in histopathology. Histopathology. 1998;33(2):99–106.
8. Crissman JW, Goodman DG, Hildebrandt PK, et al. Best practices guideline: toxicologic histopathology. Toxicol Pathol. 2004;32(1):126–131.
9. Thoolen B, Maronpot RR, Harada T, et al. Proliferative and nonproliferative lesions of the rat and mouse hepatobiliary system. Toxicol Pathol. 2010;38(7)(suppl):5S–81S.
10. Shackelford C, Long G, Wolf J, Okerberg C, Herbert R. Qualitative and quantitative analysis of nonneoplastic lesions in toxicology studies. Toxicol Pathol. 2002;30(1):93–96.
11. DAKO. HercepTest interpretation manual: breast cancer. 2016.
12. Putti TC, El-Rehim DM, Rakha EA, et al. Estrogen receptor-negative breast carcinomas: a review of morphology and immunophenotypical analysis. Mod Pathol. 2005;18(1):26–35.
13. Harbeck N, Dettmar P, Thomssen C, et al. Prognostic significance of the S-phase and MIB1 (Ki-67) proliferation parameters in node-negative breast carcinoma [in German]. Gynakol Geburtshilfliche Rundsch. 1995;35(suppl 1):142–147.
14. McCarty KS Jr, Szabo E, Flowers JL, et al. Use of a monoclonal anti-estrogen receptor antibody in the immunohistochemical evaluation of human tumors. Cancer Res. 1986;46(8)(suppl):4244s–4248s.
15. Allred DC, Harvey JM, Berardo M, Clark GM. Prognostic and predictive factors in breast cancer by immunohistochemical analysis. Mod Pathol. 1998;11(2):155–168.
16. DAKO. ER/PR pharmDx interpretation manual. 2016.
17. Sica GT. Bias in research studies. Radiology. 2006;238(3):780–789.
18. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med. 2004;140(3):189–202.
19. Pannucci CJ, Wilkins EG. Identifying and avoiding bias in research. Plast Reconstr Surg. 2010;126(2):619–625.
20. Rey AE, Vallet GT, Riou B, Lesourd M, Versace R. Memory plays tricks on me: perceptual bias induced by memory reactivated size in Ebbinghaus illusion. Acta Psychol (Amst). 2015;161:104–109.
21. Coren S, Enns JT. Size contrast as a function of conceptual similarity between test and inducers. Percept Psychophys. 1993;54(5):579–588.
22. Plodowski A, Jackson SR. Vision: getting to grips with the Ebbinghaus illusion. Curr Biol. 2001;11(8):R304–R306.
23. McClain AD, van den Bos W, Matheson D, Desai M, McClure SM, Robinson TN. Visual illusions and plate design: the effects of plate rim widths and rim coloring on perceived food portion size. Int J Obes (Lond). 2014;38(5):657–662.
24. Kurki I, Peromaa T, Hyvarinen A, Saarinen J. Visual features underlying perceived brightness as revealed by classification images. PLoS One. 2009;4(10):e7432.
25. Purves D, Shimpi A, Lotto RB. An empirical explanation of the Cornsweet effect. J Neurosci. 1999;19(19):8542–8551.
26. Masuda A, Watanabe J, Terao M, Yagi A, Maruya K. A temporal window for estimating surface brightness in the Craik-O'Brien-Cornsweet effect. Front Hum Neurosci. 2014;8:855.
27. Albert MK. Occlusion, transparency, and lightness. Vision Res. 2007;47(24):3061–3069.
28. Adelson E. Lightness perception and lightness illusions. In: Gazzaniga M, ed. The New Cognitive Neurosciences. 2nd ed. Cambridge, MA: MIT Press; 2001:339–351.
29. Masaoka K, Berns RS, Fairchild MD, Moghareh Abed F. Number of discernible object colors is a conundrum. J Opt Soc Am A Opt Image Sci Vis. 2013;30(2):264–277.
30. Perales E, Martinez-Verdu FM, Linhares JM, Nascimento SM. Number of discernible colors for color-deficient observers estimated from the MacAdam limits. J Opt Soc Am A Opt Image Sci Vis. 2010;27(10):2106–2114.
31. Bae GY, Olkkonen M, Allred SR, Flombaum JI. Why some colors appear more memorable than others: a model combining categories and particulars in color working memory. J Exp Psychol Gen. 2015;144(4):744–763.
32. Hardy JL, Delahunt PB, Okajima K, Werner JS. Senescence of spatial chromatic contrast sensitivity, I: detection under conditions controlling for optical factors. J Opt Soc Am A Opt Image Sci Vis. 2005;22(1):49–59.
33. Brasil A, Castro AJ, Martins IC, et al. Colour vision impairment in young alcohol consumers. PLoS One. 2015;10(10):e0140169.
34. Arda H, Mirza GE, Polat OA, Karakucuk S, Oner A, Gumus K. Effects of chronic smoking on color vision in young subjects. Int J Ophthalmol. 2015;8(1):77–80.
35. Aeffner F, Wilson K, Bolon B, et al. Commentary: roles for pathologists in a high-throughput image analysis team. Toxicol Pathol. 2016;44(6):825–834.
36. Shevell SK, Wei J. Chromatic induction: border contrast or adaptation to surrounding light? Vision Res. 1998;38(11):1561–1566.
37. Cao D, Shevell SK. Chromatic assimilation: spread light or neural mechanism? Vision Res. 2005;45(8):1031–1045.
38. Devinck F, Spillmann L. The watercolor effect: spacing constraints. Vision Res. 2009;49(24):2911–2917.
39. Devinck F, Spillmann L, Werner JS. Spatial profile of contours inducing long-range color assimilation. Vis Neurosci. 2006;23(3–4):573–577.
40. Spillmann L. The Hermann grid illusion: a tool for studying human perceptive field organization. Perception. 1994;23(6):691–708.
41. Kingdom FA. Mach bands explained by response normalization. Front Hum Neurosci. 2014;8:843.
42. Raffone A, Srinivasan N, van Leeuwen C. The interplay of attention and consciousness in visual search, attentional blink and working memory consolidation. Philos Trans R Soc Lond B Biol Sci. 2014;369(1641):20130215.
43. Memmert D. The effects of eye movements, age, and expertise on inattentional blindness. Conscious Cogn. 2006;15(3):620–627.
44. Drew T, Vo ML, Wolfe JM. The invisible gorilla strikes again: sustained inattentional blindness in expert observers. Psychol Sci. 2013;24(9):1848–1853.
45. Potchen EJ. Measuring observer performance in chest radiology: some experiences. J Am Coll Radiol. 2006;3(6):423–432.
46. Nickerson RS. Confirmation bias: a ubiquitous phenomenon in many guises. Rev Gen Psychol. 1998;2(2):175–220.
47. Ditrich H. Cognitive fallacies and criminal investigations. Sci Justice. 2015;55(2):155–159.
48. Rouse R, Min M, Francke S, et al. Impact of pathologists and evaluation methods on performance assessment of the kidney injury biomarker, Kim-1. Toxicol Pathol. 2015;43(5):662–674.
49. Burkhardt JE, Pandher K, Solter PF, et al. Recommendations for the evaluation of pathology data in nonclinical safety biomarker qualification studies. Toxicol Pathol. 2011;39(7):1129–1137.
50. McInnes EF, Scudamore CL. Review of approaches to the recording of background lesions in toxicologic pathology studies in rats. Toxicol Lett. 2014;229(1):134–143.
51. Stiegler MP, Neelankavil JP, Canales C, Dhillon A. Cognitive errors detected in anaesthesiology: a literature review and pilot study. Br J Anaesth. 2012;108(2):229–235.
52. Zhao Q. Effects of accuracy motivation and anchoring on metacomprehension judgment and accuracy. J Gen Psychol. 2012;139(3):155–174.
53. Ogdie AR, Reilly JB, Pang WG, et al. Seen through their eyes: residents' reflections on the cognitive and contextual components of diagnostic errors in medicine. Acad Med. 2012;87(10):1361–1367.
54. Fandel TM, Pfnur M, Schafer SC, et al. Do we truly see what we think we see?: the role of cognitive bias in pathological interpretation. J Pathol. 2008;216(2):193–200.
55. Fleck MS, Samei E, Mitroff SR. Generalized “satisfaction of search”: adverse influences on dual-target search accuracy. J Exp Psychol Appl. 2010;16(1):60–71.
56. Craig AB, Phillips ME, Zaldivar A, Bhattacharyya R, Krichmar JL. Investigation of biases and compensatory strategies using a probabilistic variant of the Wisconsin Card Sorting Test. Front Psychol. 2016;7:17.
57. Tuddenham WJ. Visual search, image organization, and reader error in roentgen diagnosis: studies of the psycho-physiology of roentgen image perception. Radiology. 1962;78:694–704.
58. Berbaum KS, Schartz KM, Caldwell RT, et al. Satisfaction of search from detection of pulmonary nodules in computed tomography of the chest. Acad Radiol. 2013;20(2):194–201.
59. Berbaum KS, Krupinski EA, Schartz KM, et al. Satisfaction of search in chest radiography 2015. Acad Radiol. 2015;22(11):1457–1465.
60. Wolfe JM, Horowitz TS, Kenner NM. Cognitive psychology: rare items often missed in visual searches. Nature. 2005;435(7041):439–440.
61. Egglin TK, Feinstein AR. Context bias: a problem in diagnostic radiology. JAMA. 1996;276(21):1752–1755.
62. Frederick PD, Nelson HD, Carney PA, et al. The influence of disease severity of preceding clinical cases on pathologists' medical decision making. Med Decis Making. 2017;37(1):91–100.
63. Thomas GD, Dixon MF, Smeeton NC, Williams NS. Observer variation in the histological grading of rectal carcinoma. J Clin Pathol. 1983;36(4):385–391.
64. Kay EW, O'Dowd J, Thomas R, et al. Mild abnormalities in liver histology associated with chronic hepatitis: distinction from normal liver histology. J Clin Pathol. 1997;50(11):929–931.
65. Kay EW, Walsh CJ, Cassidy M, Curran B, Leader M. C-erbB-2 immunostaining: problems with interpretation. J Clin Pathol. 1994;47(9):816–822.
66. Morris JA. Information and observer disagreement in histopathology. Histopathology. 1994;25(2):123–128.
67. Towse JN, Loetscher T, Brugger P. Not all numbers are equal: preferences and biases among children and adults when generating random sequences. Front Psychol. 2014;5:19.
68. Cai YC, Li SX. Small number preference in guiding attention. Exp Brain Res. 2015;233(2):539–550.
69. Huttenlocher J, Hedges LV, Bradburn NM. Reports of elapsed time: bounding and rounding processes in estimation. J Exp Psychol Learn Mem Cogn. 1990;16(2):196–213.
70. Pickering RM. Digit preference in estimated gestational age. Stat Med. 1992;11(9):1225–1238.
71. Wen SW, Kramer MS, Hoey J, Hanley JA, Usher RH. Terminal digit preference, random error, and bias in routine clinical measurement of blood pressure. J Clin Epidemiol. 1993;46(10):1187–1193.
72. Dibao-Dina C, Lebeau JP, Huas D, Boutitie F, Pouchain D; French National College of Teachers in General Practice. ESCAPE ancillary blood pressure measurement study 2: changes in end-digit preference after 2 years of a cluster randomized trial. Blood Press Monit. 2015;20(6):346–350.
73. Townsend N, Rutter H, Foster C. Improvements in the data quality of a national BMI measuring programme. Int J Obes (Lond). 2015;39(9):1429–1431.
74. Thavarajah S, White WB, Mansoor GA. Terminal digit bias in a specialty hypertension faculty practice. J Hum Hypertens. 2003;17(12):819–822.
75. Wang Y, Wang Y, Qain Y, et al. Longitudinal change in end-digit preference in blood pressure recordings of patients with hypertension in primary care clinics: Minhang study. Blood Press Monit. 2015;20(2):74–78.
76. Crawford FW, Weiss RE, Suchard MA. Sex, lies and self-reported counts: Bayesian mixture models for heaping in longitudinal count data via birth-death processes. Ann Appl Stat. 2015;9(2):572–596.
77. Viray H, Li K, Long TA, et al. A prospective, multi-institutional diagnostic trial to determine pathologist accuracy in estimation of percentage of malignant cells. Arch Pathol Lab Med. 2013;137(11):1545–1549.
78. Warth A, Penzel R, Brandt R, et al. Optimized algorithm for Sanger sequencing-based EGFR mutation analyses in NSCLC biopsies. Virchows Arch. 2012;460(4):407–414.
79. Tversky A, Kahneman D. Judgment under uncertainty: heuristics and biases. Science. 1974;185(4157):1124–1131.
80. Xue G, He Q, Lei X, et al. The gambler's fallacy is associated with weak affective decision making but strong cognitive ability. PLoS One. 2012;7(10):e47019.
81. Msaouel P, Kappos T, Tasoulis A, et al. Assessment of cognitive biases and biostatistics knowledge of medical residents: a multicenter, cross-sectional questionnaire study. Med Educ Online. 2014;19:23646.
82. Xue G, Juan CH, Chang CF, Lu ZL, Dong Q. Lateral prefrontal cortex contributes to maladaptive decisions. Proc Natl Acad Sci U S A. 2012;109(12):4401–4406.
83. Aaen-Stockdale C. Delboeuf illusion. 2016.
84. Fibonacci. Cornsweet illusion. 2016.
85. Adelson E. Checker shadow illusion. 2016.
86. Aaen-Stockdale C. Hermann grid illusion. http://en.wikipedia.org/wiki/File:HermannGrid.gif. Published 2007. Accessed May 2, 2016.

Author notes

From Flagship Biosciences Inc, Westminster, Colorado. Dr Bolon is now with GEMpath Inc, Longmont, Colorado. Ms Koegler is now with Portland Gastroenterology Center, Portland, Maine.

This article is loosely based on a presentation given at the 2016 Annual Meeting of the American Society of Investigative Pathology at Experimental Biology; April 4, 2016; San Diego, California.

Competing Interests

All authors were full-time or part-time employees at Flagship Biosciences Inc at the time of drafting of this manuscript. The authors have no other relevant financial interest in the products or companies described in this article.