Context.—Novel therapeutics often target complex cellular mechanisms. Increasingly, quantitative methods such as digital tissue image analysis (tIA) are required to evaluate correspondingly complex biomarkers and to elucidate subtle phenotypes that can inform treatment decisions with these targeted therapies. These tIA systems need a gold standard, or reference method, to establish analytical validity. Conventional, subjective histopathologic scores assigned by an experienced pathologist are the gold standard in anatomic pathology and are therefore an attractive reference method: the pathologist's score can establish the ground truth against which a tIA solution's analytical performance is assessed. The paradox of this validation strategy, however, is that tIA is often used to assist pathologists in scoring complex biomarkers precisely because it is more objective and reproducible than manual evaluation alone, both by overcoming known biases in human visual evaluation of tissue and by generating endpoints that a human observer cannot.
Objective.—To discuss common visual and cognitive traps known in traditional pathology-based scoring paradigms that may affect the characterization of tIA-assisted scoring accuracy, sensitivity, and specificity.
Data Sources.—This manuscript reviews literature from recent decades on traditional subjective pathology scoring paradigms and on the cognitive and visual traps relevant to those paradigms.
Conclusions.—Awareness of the gold standard paradox is necessary when using traditional pathologist scores to analytically validate a tIA tool, because image analysis is used specifically to overcome known sources of bias in the visual assessment of tissue sections.
Anatomic pathology, as a discipline, seeks to render a precise and complete diagnosis of lesions produced by a particular disease, to the correct patient, in a timely fashion, and in a way that is understandable and useful to the physician treating the patient.1 In a research setting, this definition is easily adaptable as an accurate and comprehensive assessment (diagnosis) provided to the correct specimen (patient) in a timely fashion and in a way that generates biologically meaningful data to benefit scientists (physicians) investigating biomedical research questions. Whether applied to a patient or a specimen, such assessments usually include not just the diagnostic opinion but also a lesion score defining the severity (or grade) of the finding, as collective experience over time has linked such scores to clinical outcome.
In recent years, anatomic pathology diagnoses have evolved from a reliance on traditional manually acquired lesion scores to an increasing emphasis on automated tissue image analysis (tIA). The rationale for this shift is that manual scores are qualitative or semiquantitative, and subjective even when assigned by a seasoned observer, whereas tIA data are quantitative, reproducible, and objective. Analytical validation of such novel tIA methods requires their comparison with an accepted reference standard, otherwise known as the gold standard (capturing ground truth). For tIA applications, the reference standard can be a similar modality that has been validated through extensive use (eg, a manual histopathology score for the same tissue section) or an orthogonal (ie, independent) methodology (eg, quantitative polymerase chain reaction or Western blot analysis of tissue from the same specimen). However, appropriate reference standards to validate novel tIA methods seldom exist in anatomic pathology practice. In nearly all cases, manual pathology scoring by a trained and competent (ie, board-certified) pathologist serves as the reference standard for anatomic pathology endpoints.2–4 Manual scores assigned by different practitioners may exhibit the same trend but are seldom identical; moreover, human fallibility is a well-known source of bias in manual scoring.1,5 These imperfections have driven the development of innovative digital pathology techniques, especially tIA, which combine the strengths of anatomic pathologists (eg, biomedical knowledge, cognitive abilities) with automated measurement to overcome known pitfalls of manual biomarker scoring and to provide more reproducible data for making clinical decisions and for testing hypotheses in the translational environment. The rise of such innovative tIA practices, however, has engendered a critical paradox: what is the best gold standard for biomarker analysis in modern anatomic pathology practice?
In other words, what defines or captures the ground truth for lesion scoring?
The source of this paradox is that the conventional practice of using manual scores generated by a trained and experienced pathologist provides the most straightforward reference standard for novel tIA-based methods. The same stained tissues may be scored manually and then by tIA, after which the results may be compared directly to demonstrate equivalency between the manual and tIA procedures. Alternatively, an orthogonal modality may be used as the benchmark, but the need for different material degrades the contextual information (eg, cellular localization) that is crucial for assessing biomarker expression in tissues. In either scenario, therefore, the nature of ground truth represents an informed choice.
The following sections review several common manual scoring paradigms that may be used to generate reference standard data for evaluating the analytical performance of novel immunohistochemistry (IHC)–based tIA assays, explore important cognitive and visual traps that can profoundly affect reference standard data, and address methods for maximizing manual score consistency and data reproducibility. These sources of bias are placed in a tIA context to highlight how properly established computer algorithms, leveraging the strengths of both the anatomic pathologist and the algorithm, can avoid these traps and produce data of higher quality. Understanding these concepts and their correct application is critical given the increasing value of tIA as a tool in clinical and translational biomarker assessments to guide novel therapeutic development and treatment applications.
MANUAL PATHOLOGY SCORING
Histopathologic scoring entails the pathologist's ability to obtain qualitative or semiquantitative data from tissue.6 In the clinical setting, manual scoring supports treatment decisions, predicts patient prognosis, and determines clinical trial design, enrollment, and outcome measures.7 A manual scoring system should be definable, be reproducible, and produce meaningful results.6,8 Scoring strategies may be devised for specific studies, starting with an adequate experimental design and appropriate choice of tissue harvesting techniques,6 or may implement a predefined scoring paradigm that is part of an in vitro diagnostic kit. In both cases, the scoring scheme should be used only after the definitions of the different scoring categories and the boundaries that define each one have been reviewed by all individuals who will be engaged in lesion and biomarker scoring.9,10
The most basic scoring paradigm for IHC staining assessment categorizes entire tissue sections as positive or negative, based on the presence (ie, expression) of a particular biomarker. In this scenario, any staining observed that is above background staining level will classify the sample as positive. Obviously, the amount of information collected for such samples will be very limited. To address this paucity, scoring paradigms can be devised that provide somewhat more information, such as the estimation of the number or percentage of cells that exhibit positive (or negative) staining. The denominator of this assessment can either be the entire cell population present in the tissue section or a subcompartment (ie, tumor cells within the tissue). In their simplest incarnation, such scoring paradigms are deployed without considering variations in staining intensity when assigning scores.
Adding a semiquantitative evaluation of staining intensity increases the complexity of the scoring paradigm. A tiered approach (eg, grades of 0, 1+, 2+, 3+; none, little, some, strong; or low, high) is often used; close attention to the specific defining criteria for each tier when assigning scores reduces the number of occasions in which borderline lesions are misclassified. This approach, combining the number of labeled cells with labeling intensity, is employed by the widely used DAKO (Carpinteria, California) HercepTest in vitro diagnostic kit for breast cancer.11 A breast biopsy is scored as 0 (ie, negative) when membrane staining of any intensity is completely absent or seen in less than 10% of tumor cells. Faint staining in more than 10% of tumor cells is classified as a score of 1+, weak staining in more than 10% of tumor cells is designated 2+, and strong membrane staining in more than 10% of tumor cells is called 3+. Difficulty arises in assigning scores because the exquisite images of representative labeling for each score in the product insert do not clearly define the category boundaries. Uncertainty regarding the appearance of a "high 1+" versus a "low 2+" score, and whether this difference correlates with a specific clinical outcome, could produce considerable variability in the manual scores awarded by different pathologists.
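The tier logic described above can be made explicit in a few lines. The sketch below is purely illustrative (the function name and intensity labels are assumptions, and the kit's own product insert governs real scoring); it shows how a hard 10% cutoff and three intensity labels collapse the continuum of staining into discrete tiers.

```python
def tiered_ihc_score(pct_stained, intensity):
    """Illustrative HercepTest-like tier assignment from the percentage of
    tumor cells with membrane staining and a qualitative intensity label.
    Not the kit's validated criteria; for demonstration only."""
    # Negative: staining absent or present in <10% of tumor cells
    if pct_stained < 10 or intensity == "none":
        return "0"
    # Above 10%, the tier is decided entirely by qualitative intensity
    return {"faint": "1+", "weak": "2+", "strong": "3+"}[intensity]
```

Note how the boundary ambiguity discussed above disappears inside the lookup: a "high 1+" and a "low 2+" both reduce to a single categorical label, so the observer's subjective intensity call determines the score.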
A variant of tiered scoring designed to capture even more information and account for tumor heterogeneity is the histochemical score (H-score), a weighted approach used widely in the clinical setting to evaluate breast cancer samples and adaptable to many other settings. The H-score combines an assessment of staining intensity with an estimation of the percentage of stained cells at each intensity. Staining intensity is evaluated as described above in a 0 to 3+ fashion, and for each category the percentage of stained target cells is recorded. The final H-score is the sum of each intensity rating multiplied by its corresponding percentage, generating a score on a continuous scale of 0 to 300.12–14
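As a concrete illustration of the weighted sum, a minimal H-score computation (function name assumed for demonstration) might look like:

```python
def h_score(pct_by_intensity):
    """H-score: sum of each intensity tier (1, 2, 3, for 1+/2+/3+)
    multiplied by the percentage of target cells staining at that
    intensity. Percentages should sum to at most 100, so the result
    falls on a scale of 0 to 300."""
    return sum(tier * pct for tier, pct in pct_by_intensity.items())

# Example: 10% of cells at 1+, 20% at 2+, 30% at 3+ (remaining 40% unstained)
h_score({1: 10, 2: 20, 3: 30})  # 1*10 + 2*20 + 3*30 = 140
```

The arithmetic is trivial; the difficulty in practice lies entirely in the manual estimation of the percentages that feed it.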
Like the H-score, the Allred score combines an assessment of staining intensity (0–3+) with the proportion of stained cells (assigned to predefined percentage categories). However, the formula for the Allred score (also termed the quick score) equals the intensity score plus the proportional score, and thus follows a noncontinuous scale comprising 0 and 2 through 8.15 This scoring paradigm is in clinical use in combination with the DAKO ER/PR pharmDx in vitro diagnostic kit.16
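A sketch of the Allred arithmetic follows, using commonly published proportion-category boundaries (an assumption here; any real use must follow the kit's own criteria):

```python
def allred_proportion_score(pct_positive):
    """Map the percentage of stained cells to the Allred proportion
    score (0-5), using commonly published category boundaries."""
    if pct_positive == 0:
        return 0
    if pct_positive < 1:
        return 1
    if pct_positive <= 10:
        return 2
    if pct_positive <= 33:
        return 3
    if pct_positive <= 66:
        return 4
    return 5

def allred_score(pct_positive, intensity):
    """Allred (quick) score = proportion score + intensity score (0-3).
    By convention the total is 0 when nothing stains, so the result
    falls on a noncontinuous scale: 0, then 2 through 8."""
    ps = allred_proportion_score(pct_positive)
    return 0 if ps == 0 or intensity == 0 else ps + intensity
```

Because the minimum nonzero components are 1 + 1, the value 1 can never occur, which is why the scale skips from 0 to 2.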
It is beyond the scope of this manuscript to discuss the comparative clinical value of these scoring paradigms. Properly deployed, all manual scoring methods have potential value. The key to their effective use is the recognition that all these approaches generate semiquantitative scores as ordinal numbers or derivatives thereof, and as such should not be used as quantitative measurements of ground truth.6,7
BIAS IN PATHOLOGY SCORING
Bias (systematic error) can affect scientific investigations and may even destroy a study's validity depending upon its degree.17,18 It can occur at any phase of research, and to some degree it is nearly always present in research studies. Bias is independent of the sample size and statistical significance.19
In the following sections, we review a number of visual and cognitive traps that can affect manual pathology scoring and that can be avoided when using digital image analysis (Table). Visual traps (also called optical illusions) are phenomena in which the perceived image differs from objective reality. In contrast, cognitive traps represent tendencies to think in a biased way that can lead to systematic errors or deviations from rational thinking. For the purpose of this review, particular focus is given to the impact of these traps in the manual scoring of IHC staining.
Visual Traps as Sources of Bias
The Illusion of Size
The extensively studied Ebbinghaus illusion of size illustrates that our perception of size is greatly influenced by the context in which an object is presented. For example, in Figure 1, A, the inner circles are the same size but appear to be different because of the distances between objects and the divergent sizes of the adjacent circles.5,20–22 This illusion applies to settings in which intermingled cells of divergent sizes must be discerned from each other.
The Delboeuf illusion of size demonstrates that dissimilar surrounding objects can affect size perception as well. In Figure 1, B, a ring drawn around 1 of 2 identical black circles makes the ringed disk appear larger. This visual illusion demonstrates why a nucleus surrounded by a labeled membrane may be perceived as having a different size from a nucleus with an unlabeled membrane (Figure 1, C). Notably, the proximity of the concentric ring matters: a larger distance between the ring and inner disk can actually make the central object appear smaller.23
The illusion of size is influenced by the observer's preconceived notions. For example, the perception of size is impacted by the pathologist's memory regarding prior samples with analogous features. The effect of prior knowledge with respect to estimates of size also extends to the words describing an object. This facet of the illusion is shown most readily by contemplating our conceptions of various animals (eg, ant versus elephant).20
The Craik-O'Brien-Cornsweet Illusion
The Craik-O'Brien-Cornsweet illusion is the perception of an illusory effect to objects or surfaces based on the optical characteristics of their edges. Adjacent objects are perceived as having different brightness based on variations in staining intensity, when in fact they are the same (Figure 1, D).24–26 This illusion may influence a pathologist's assessment of the staining of adjacent cells, where staining must be assessed for both the cytoplasm (middle) and membrane (edge); dissimilarities of staining intensity at the membrane may lead to misinterpretation of cytoplasmic staining intensity.
Checker Shadow Illusion
Similar to the Craik-O'Brien-Cornsweet illusion, this illusion demonstrates flawed perception arising from differences in brightness (Figure 1, E). Our previous knowledge of the checkerboard's arrangement influences our perception of brightness for the squares, some of which we know to be lighter even though covered by shadow; we perceive these squares as having a different brightness compared with other light and dark squares not covered by a shadow. This effect is combined with the visual system's tendency to ignore the soft edges of shadows and see only sharp edges of the squares.27,28
Distinguishing Gradients of Colors and Hues
Although the number of colors that the human eye can distinguish continues to be elusive, published estimates based on graded color strips or wheels range from 30 000 to 10 million.29 Tremendous person-to-person variation exists in differentiating colors and color hues.5,30 However, when people are shown different colors and color hues in isolation, recognition of differences drops significantly.31 In addition, age, certain disease states, and lifestyle choices can affect the ability to see and distinguish colors.32–34 In comparison, a camera capturing 12 bits per color channel can accurately and independently distinguish roughly 68 billion colors. Because of this superiority of the digital system in assessing color hues, specific attention must be given to setting a correct threshold within the digital analysis to ensure that background staining, however faint, does not result in false-positive identification.35
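The threshold point can be illustrated with a toy intensity filter (values and function name are hypothetical): the digital system resolves thousands of levels per channel, so whether faint background counts as "positive" is decided entirely by where the analyst sets the cutoff.

```python
def positive_fraction(pixel_intensities, threshold):
    """Fraction of pixels whose staining intensity exceeds a threshold,
    on a 12-bit scale (0-4095). A threshold set below the background
    level turns faint nonspecific staining into false positives."""
    if not pixel_intensities:
        return 0.0
    positives = sum(1 for p in pixel_intensities if p > threshold)
    return positives / len(pixel_intensities)

# Faint background (~100) vs genuine signal (~2000) on a 12-bit scale
sample = [90, 110, 95, 1900, 2100, 2050]
positive_fraction(sample, threshold=500)  # 0.5: only true signal counted
positive_fraction(sample, threshold=50)   # 1.0: background misread as positive
```

The same pixels yield opposite conclusions under the two thresholds, which is why threshold validation by a pathologist remains a critical step in tIA setup.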
The perceived color is known to depend not only on the spectral distribution of the light reflected by that particular color, but also on the light nearby. This phenomenon, called chromatic (or color) induction, comprises 2 components: color contrast and color assimilation. Color contrast occurs when the color appearance shifts away from the color of the nearby light: yellow light will appear yellow (580-nm wavelength) when viewed independently of other colors, but will appear greenish when surrounded by red (670-nm wavelength).36 Color assimilation, in contrast, occurs when the appearance shifts toward the color of the inducing light.37 Color assimilation has been demonstrated in several situations, including the watercolor effect (Figure 1, F).5 In the watercolor effect, 2 uneven colored contours, one darker than the other, surrounding an uncolored space make the uncolored area appear to take on the lighter color.38 For example, when a white space is surrounded by an orange inner and a purple outer border, the enclosed area appears orange tinged.39
In practical terms, these color phenomena may impact efforts to score the level of cytoplasmic staining in the context of adjacent membrane staining of different hues, or vice versa (Figure 1, G through J). Variability in color hue recognition may also impact the evaluation of multicolor bright-field staining (multiplex IHC) where one must differentiate colors and color hues that are in close proximity or colocalized (Figure 1, K and L).
Lateral Inhibition
Lateral inhibition is an optical effect resulting in altered color perception that arises from neuronal interference in the retina and optic tracts. For example, the Hermann grid illusion (Figure 1, M) produces the perception of illusory spots at the intersections of the light lines separating the dark squares of a grid. The effect occurs because excited neurons reduce the activity of their neighbors, preventing the lateral spread of action potentials. When a person looks at a light edge next to a dark edge, lateral inhibition makes the light edge appear even lighter: neurons stimulated by the darker edge exert less inhibition on their neighbors than neurons stimulated by a uniformly light surround. Similarly, the dark border next to the light border is perceived as even darker. The result is an increase in the perceived contrast and sharpness of images, at the cost of seeing certain regions as lighter or darker than they actually are.40,41
Inattentional Blindness
When we are engaged in a demanding task, our focus may blind us to obvious and salient events.42 This phenomenon is called inattentional blindness. Some reports suggest that expert observers may partly, but not completely, mitigate this effect.43 Even experts remain susceptible: 83% of radiologists asked to find lung cancer nodules in stacks of patient computed tomography images missed the faint outline of a gorilla present in 5 consecutive images, even though the gorilla was 48 times larger than the cancer nodules and eye tracking revealed that most of the radiologists had looked right at it.44 Similar studies have shown that this phenomenon also occurs when a stimulus is removed rather than added.45 For the practicing pathologist, a similar result may occur when manually evaluating tissue sections or digitized images.
Cognitive Traps as Sources of Bias
Confirmation Bias
Confirmation bias, in its psychologic definition, refers to the unwitting selectivity in the acquisition and use of evidence.46,47 In the context of research, this bias acknowledges people's tendency to seek information that is considered supportive of a favored hypothesis or existing belief, and to interpret information in ways that are partial to that hypothesis or belief. Although this bias can be deliberate or unintentional, the effect on a study is most commonly subconscious.46 In fact, the tendency to deliberately resist confirmation bias is one key way that the scientific method differs from ordinary thinking.46 Applying the principles of psychology to the work of pathologists, it appears clear that the confirmation bias of inadequate search (ie, confirming a preconceived diagnosis and not investigating further) can have an effect on data quality. Bias can be introduced at any stage of an experiment: design, analysis, or interpretation.17,19
One method used for decreasing the impact of confirmation bias is masking (also known as blinding or coding) of the observer to treatment group identities.6 The tradeoff is that masking methods may hamper a pathologist's ability to fully evaluate tissue samples/lesions. In the context of improving quality, it has been shown that blinded evaluations do not significantly improve study data.48 More recently, published best practices only recommend masking or blinding of pathologists (ie, not providing information on clinical or treatment parameters) in limited scenarios and instead generally advise that pathologists be unblinded to treatment groups and other study data when performing their analysis in a research setting.49 Within the context of clinical evaluation, blinding pathologists to patient data is neither feasible nor helpful, as total patient data evaluation (ie, histology slide set and patient file/history) is paramount for adequate assessment.
Diagnostic Drift
Diagnostic drift describes the situation in which scoring values vary slightly and in a consistent fashion throughout a study.7,8,50 This effect is not a cognitive trap per se, but the impact of subtle variations over time—especially for longer studies with multiple endpoints—nonetheless substantially affects data quality. Even proficient pathologists with extensive experience may fall prey to this trap when manually scoring lesions. The easiest way to avoid this situation is to periodically review the scoring criteria during the course of a study.
Anchoring
Anchoring, or tunnel vision, describes the phenomenon in which exclusive attention is given to only one aspect of a problem while failing to fully assess and understand the entire situation.51,52 Oftentimes, the first piece of information given or communicated on a topic (the anchor) causes fixation on the anchor; its validity or appropriateness is overestimated, whereas new and supplemental information is ignored.47 In diagnostic medicine, anchoring results in premature closure and acceptance of a diagnosis prior to its full validation.53
For example, pathologists diagnosing prostate cancer often report similar values for the Gleason score (based on tissue architecture) and the now-obsolete World Health Organization grading system (based on nuclear morphology). This concordance between independent scoring paradigms based on very distinct morphologic features has been shown to arise from cognitive bias, representing anchoring.54 In the cited study, nuclear images were scored by pathologists independent of their tissue context and compared with tIA data for the same nuclei. Grading the nuclei in isolation from the tissue context significantly improved the agreement between manual scores and tIA data, although prognostic power was lost when tissue features were not assessed.
Search Satisfaction
Search satisfaction describes a false-negative error in which a specific target or event is more likely to be missed when it occurs together with one or more additional anomalies. In other words, a visual search is abandoned once the searcher finds a target and becomes satisfied with the meaning of the image after reaching a certain information threshold.55–57 This effect has been widely studied in radiology55,57–59 as well as in various scenarios outside the medical field, including luggage screening at airports.60 In general, the rarer an event, the more likely it is to be missed.60 To our knowledge, this phenomenon has not been studied in the diagnostic pathology setting, but it is reasonable to expect that pathologists would exhibit a cognitive bias resembling that of radiologists. A possible search satisfaction dilemma in anatomic pathology is the failure to detect immune cell subsets occurring at low frequencies in a tissue, thereby missing disease- or treatment-relevant phenotypes. Although tIA may assist in the detection of rare events, the pathologist must confirm that only the appropriate tissue features are captured and measured, so as to avoid the opposite problem (false-positive data points) and overinterpretation.35
Context Bias
Observers are more likely to consider a sample as abnormal when it is reviewed in a specifically assembled sample set with high disease prevalence than when the sample is interpreted either as part of a group with lower disease prevalence or in isolation.17 This effect also has been studied in radiology.61 A recent study in which pathologists interpreted 60 biopsy specimens blindly twice with a 6-month washout period between sessions demonstrated that practitioners' diagnoses were influenced by the severity of previously interpreted cases. The extent of context bias was similar when pathologists' responses were evaluated with respect to 1 or 5 preceding cases.62
Avoidance of Extreme Ranges
The tendency to avoid extremes in scoring and grading systems is well recognized.7 Studies have shown that a majority of tumors are classified subjectively as being moderately differentiated, thus avoiding the outlying options of well or poorly differentiated.63 Similarly, in some scoring systems, normal biopsies tend to be classed as abnormal because practitioners seek to avoid the lowest end of the range—normal.64 Within IHC evaluation, reports have confirmed that the middle scoring category is overused,65 leading some researchers to suggest that the number of categories be minimized (ie, low and high grade only, instead of multiple tiers) to improve diagnostic reproducibility.7,66
Number Preference
Number preference (also known as digit preference, terminal/end digit bias, and heaping) recognizes the fact that not all numbers are created equal.67,68 Humans have been shown to prefer rounding data points to numbers ending in 0 or 5.69–71 Within the medical field, this effect has been extensively studied in the recording of blood pressure measurements and body mass index as well as in surveys that require self-reporting of numbers. In these studies, data sets also show significant heaping of data points ending in 0 and 5.72–76 For manual pathology scoring, when percentage estimates are assigned, a higher chance of recording scores that end in 0 and 5 has been described, which demonstrates that scoring is biased away from the full continuum of values between 0 and 100. This bias in scoring impacts the types of statistical analyses appropriate for manual scores relative to quantitative scores that can be generated by tIA.
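Heaping is straightforward to quantify in a set of recorded percentage scores. The sketch below (function name and example data are hypothetical) flags digit preference by comparing the observed share of scores ending in 0 or 5 with the roughly 20% expected from unbiased integer scores on a 0 to 100 continuum:

```python
from collections import Counter

def heaping_fraction(scores):
    """Fraction of integer percentage scores whose last digit is 0 or 5.
    Unbiased scoring over 0-100 would place roughly 20% of scores in
    this group; a much larger fraction suggests number preference."""
    last_digits = Counter(s % 10 for s in scores)
    return (last_digits[0] + last_digits[5]) / len(scores)

manual = [10, 25, 30, 30, 55, 70, 80, 95, 40, 15]  # hypothetical manual scores
heaping_fraction(manual)  # 1.0: every score ends in 0 or 5
```

A tIA-derived data set, by contrast, would use the full continuum of values, which is precisely why different statistical treatments are appropriate for the two kinds of scores.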
Aside from the specific effect of number preference, a pathologist's ability to accurately estimate numbers needs to be critically evaluated. In a study conducted by the College of American Pathologists, 197 participating labs were provided with 10 images of hematoxylin-eosin–stained colonic adenocarcinomas and asked to report the percentage of neoplastic cells present; this interlaboratory comparison was done in the context of molecular testing of genetic alterations, where overestimating the percentage of neoplastic cells specifically results in false-negative test results and significant consequences for patients. The survey demonstrated low interlaboratory precision among pathologists' scores, although mean estimates were somewhat accurate. For example, in some cases estimates only varied by 1% among practitioners, but in other cases the difference from the criterion standard exceeded 24%. More importantly, 50% of the evaluated cases had estimates that were different by more than 10%.77 A previous study, specifically evaluating non–small cell lung cancer biopsies, has shown that visual assessment of estimated neoplastic cell concentration overestimates the tumor burden by 10% to 20%.78
The Gambler's Fallacy
The gambler's fallacy reflects the human struggle to interpret independent events as being independent. It arises in 2 forms: it is fallacious to assume that a certain random event is less likely after a series of the same event has occurred, and it is equally fallacious to assume that a certain event is more likely after a series of different events.79,80 The most common example of this bias is predicting the outcome of a coin toss. After a string of heads, a subject will expect that tails are "due." However, the results of previous coin tosses are entirely independent of the outcome of the next toss.81 Humans consistently make suboptimal decisions regarding independent and identically distributed events.82 Interestingly, this fallacy has been reported more frequently in highly educated individuals faced with probability judgment situations.80 Within the medical field, research has shown that many residents are prone to the gambler's fallacy, indicating that their biomedical education did not provide sufficient, or even any, training to avoid this bias.81
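The independence at the heart of the fallacy can be checked directly by simulation (a sketch; the function name and parameter values are arbitrary): the empirical probability of heads on the toss immediately after a long run of heads stays near 0.5.

```python
import random

def heads_after_streak(n_tosses=200_000, streak=5, seed=1):
    """Empirical P(heads) on the toss immediately following a run of
    `streak` consecutive heads. For a fair coin this stays near 0.5,
    because prior tosses carry no information about the next one."""
    rng = random.Random(seed)
    tosses = [rng.random() < 0.5 for _ in range(n_tosses)]
    # Collect the outcome following every run of `streak` heads
    after = [tosses[i] for i in range(streak, n_tosses)
             if all(tosses[i - streak:i])]
    return sum(after) / len(after)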
In manual pathology scoring or object counting, the gambler's fallacy predicts that pathologists will struggle with seeing samples within a study as independent events to be evaluated. A frequent presentation of this scenario is that a pathologist may be reluctant to record several identical scores or counts in a row, and may subconsciously attempt to correct the observed rate by scoring differently in subsequent evaluations. The logical solution to this dilemma is that practitioners should regularly remind themselves to assign scores strictly according to the diagnostic criteria observed in the sample, without reference to any other unrelated specimens.
CONCLUSIONS
In conventional anatomic pathology practice, manually assigned lesion and biomarker scores are seen as the gold standard. Such scores generally are assigned by trained and experienced pathologists, and as such are viewed as the best estimate of ground truth (or "reality") despite their subjective and qualitative to semiquantitative nature. Nonetheless, the variability among pathologists, coupled with the inherent heterogeneity among disease phenotypes, indicates that a more objective and truly quantitative strategy should become the gold standard toward which the biomedical community strives. Achieving this ideal would diminish or even eradicate the gold standard paradox.
A movement toward replacing manual scoring with automated tIA would solve many issues associated with anatomic pathology assessment, even though these problems may be subconscious and not generally perceived as problems. For example, many aspects of manual pathology scoring can be impacted by visual traps, cognitive traps, and other biases. In contrast, an automated quantitative tIA method can be developed in a manner that minimizes or eliminates many of these traps and biases. The potential superiority of tIA is possible because a consistent and objective rule set (ie, cell classification, staining quantification) can be applied on a cell-by-cell basis in whole tissue sections across an entire study or patient population. Scores assigned by computer algorithms are not prone to biases such as number preference or avoidance of extreme ranges. Color, hue, intensity, and contrast are captured by digital cameras and evaluated by computational algorithms that quantitatively perform at levels far exceeding human visual perception. Importantly, analysis via tIA algorithms ensures that each section and each scoring event is treated as an independent event, based on predefined metrics and unaltered by previously evaluated sections or by adjacent cells. In addition, tIA algorithms overcome challenges created by interobserver variability, avoid the possibility of diagnostic drift, and provide data sets of quantitative pathology scores on a continuous scale that are readily amenable to rigorous statistical analysis.
However, the transition toward automated analysis as a future gold standard must acknowledge that computer-based algorithms have not been able to model the extreme complexity of the critically thinking human brain. Within the space of histology interpretation, the pathologist's mind contributes not only formal training, which can be emulated by artificial intelligence programs, but also the full constellation of learning experiences that preceded the evaluation of any one sample. Cumulative experience over time, coupled with the ability to interpret visual cues within a larger tissue context as well as the context of patient clinical history and other demographic data, remains a unique asset of the human practitioner that computer algorithms have been unable to mimic or replace. Despite the example described above, in which radiologists overlooked a gorilla image superimposed on lung computed tomography images, the complexity of the human mind remains necessary to process the often-complex anatomy present in a given tissue section and to understand which features should be included in, and which excluded from, the final analysis; this is especially true for very subtle gradations in phenotype. In this manner, the combination of a human observer, the pathologist, and tIA software is greatly advantageous for modern diagnostic pathology: it complements the intrinsic weaknesses of the human observer with the strengths of computer analysis and overcomes the innate weaknesses of the computer-based algorithm with the strengths of the human observer (Figure 2).35
The selection of a reference standard for validating an IHC-tIA assay must consider the strengths and weaknesses of both the computer and the human observer. When manual pathology scoring is deemed appropriate for validating a novel tIA method, it is particularly important to understand the potential visual and cognitive traps, as well as the biases, that may be encountered when scoring a specific tissue type and biomarker, so that performance assessments of tIA approaches are not skewed by error integral to a manually defined reference standard. In instances where manual scoring is severely hampered by the above-mentioned traps and biases, an alternative reference standard (eg, biomarker levels in a homogenized tissue sample) may be considered in assessing IHC-tIA assay performance. Regardless of the choice of gold standard, the rationale for its selection must be communicated completely and justified rationally when reporting pathology scores.
The authors would like to thank Matthew Steaffens for his support in creating figures.
From Flagship Biosciences Inc, Westminster, Colorado. Dr Bolon is now with GEMpath Inc, Longmont, Colorado. Ms Koegler is now with Portland Gastroenterology Center, Portland, Maine.
This article is loosely based on a presentation given at the 2016 Annual Meeting of the American Society of Investigative Pathology at Experimental Biology; April 4, 2016; San Diego, California.
All authors were full-time or part-time employees at Flagship Biosciences Inc at the time of drafting of this manuscript. The authors have no other relevant financial interest in the products or companies described in this article.