Context.—Accuracy is an important feature of any diagnostic test. There has been an increasing awareness of deficiencies in study design that can create bias in estimates of test accuracy. Many pathologists are unaware of these sources of bias.
Objective.—To explain the causes and increase awareness of several common types of bias that result from deficiencies in the design of diagnostic accuracy studies.
Data Sources.—We cite examples from the literature and provide calculations to illustrate the impact of study design features on estimates of diagnostic accuracy. In a companion article by Schmidt et al in this issue, we use these principles to evaluate diagnostic studies associated with a specific diagnostic test for risk of bias and reporting quality.
Conclusions.—There are several sources of bias that are unique to diagnostic accuracy studies. Because pathologists are both consumers and producers of such studies, it is important that they be aware of the risk of bias.
Accuracy is an important feature of any diagnostic test. Accuracy estimates play an important role in evidence-based medicine. They guide clinical decisions and are used to develop diagnostic algorithms and clinical guidelines. Poor estimates of accuracy can contribute to mistreatment, increased costs, or patient injury. Thus, it is important for accuracy estimates to be reliable.
There has been increasing awareness of deficiencies in study design and reporting in diagnostic test accuracy studies,1–5 and it is now recognized that diagnostic accuracy studies are subject to unique sources of bias. Pathologists are often involved in diagnostic accuracy studies and, as specialists in test methodology, play a key role in the generation of data on diagnostic accuracy. It is important for pathologists to understand the limitations of diagnostic studies and the methodologic issues that can lead to bias in accuracy estimates.
Over the years, there have been several efforts to make researchers aware of the methodologic issues associated with diagnostic tests. In 1999, the Cochrane Diagnostic and Screening Test Methods Working Group first convened to reduce deficiencies in diagnostic test reporting. Since then, the STARD (Standards of Reporting Diagnostic Accuracy) checklist,6,7 QUADAS (Quality Assessment of Diagnostic Accuracy Studies) instrument,8,9 and the QAREL (Quality Appraisal of Reliability Studies)10 instrument have been introduced as evidence-based quality assessment tools to use in the systematic review of diagnostic accuracy studies. The STARD initiative alone has been adopted by more than 200 journals, spanning basic research to medicine. QUADAS has been widely adopted11 and has been cited more than 500 times. It is recommended for use in systematic reviews of diagnostic accuracy by the Agency for Healthcare Research and Quality, the Cochrane Collaboration, and the National Institute for Health and Clinical Excellence12 in the United Kingdom.
The problems associated with diagnostic tests are well recognized; however, the concepts involved are often subtle and unfamiliar to many pathologists. Because they play a key role in the production and interpretation of information on diagnostic test accuracy, it is important for pathologists to understand the types of bias that arise in diagnostic accuracy studies and their impact on accuracy estimates. Accuracy estimates are increasingly obtained from meta-analysis and, as noted above, an assessment of the risk of bias is now a standard part of any review of diagnostic accuracy. Owing to the increasing emphasis on evidence-based medicine, pathologists will be required to produce or interpret findings on the risk of bias in diagnostic studies.
Our objective is to provide an explanation of the common sources of bias in diagnostic studies. This information should help pathologists to identify risks of bias in diagnostic studies, to predict the impact of bias on study outcomes and, as producers of diagnostic studies, to avoid some of the methodologic issues that commonly cause bias in diagnostic studies.
FRAMEWORK FOR APPRAISAL
To be useful, a study must address a clinical question. Such questions are formulated in the familiar PICO format. For a diagnostic study, the PICO parameters are population, index test (the test under examination), comparator or reference test (the gold standard), and outcomes. The value of a study is a function of its capacity to resolve a clinical question. A clinical question can arise in the context of clinical work (Can this study help me to diagnose this patient's condition?), or in meta-analysis (Can this study help to resolve the question of the meta-analytic study?). To answer a clinical question correctly, a study must provide information that is both reliable (internal validity) and applicable (external validity). Internal validity is a function of bias and precision. A framework for appraisal is presented in Figure 1.
Bias is defined as a systematic difference between an observed measurement and the true value. For example, miscalibration causes systematic measurement errors that lead to analytic bias. In the context of a diagnostic accuracy study, bias occurs when the overall estimates of sensitivity or specificity systematically deviate from the true values. If bias exists, a study would consistently overestimate or underestimate the true accuracy parameters were the study repeated. Thus, bias is error that does not “balance out” upon repetition. Precision, by contrast, is a function of random error, which does “balance out” upon repetition. Bias and random error (imprecision) are both sources of variation that can cause a measurement to differ from the true value; both can render measurements unreliable and thereby negatively affect internal validity. A study is said to lack internal validity when it fails to measure what it purports to measure.
Bias and precision are a function of study design. Random error (imprecision) is determined by sample size and sound experimental design. Bias can occur at several different levels: in a diagnostic accuracy study, it can arise from individual test measurements (analytic bias) or from methodologic issues related to study design. Evaluation of internal validity involves an assessment of the risk of bias and the level of variability caused by imprecision. This, in turn, requires an assessment of study design features that could lead to bias or imprecision. Our discussion will focus on bias; however, it is important to recognize that internal validity is a function of both bias and imprecision. Risk of bias can only be evaluated if sufficient information about the study design is provided to allow for an assessment. Quality of reporting therefore has a significant impact on the assessment of internal validity which, in turn, is a determinant of study value.
The other determinant of study value is applicability. This is assessed by comparing the conditions of the study under evaluation (population, index test, reference test, outcomes) with those of the clinical question (Figure 2). Changes in any of these study parameters can cause changes in test accuracy. Such changes reflect true variability in test conditions and are not due to bias. For example, accuracy of fine-needle aspiration cytology (FNAC) might depend on the experience of the pathologist. The accuracy obtained in a study with an experienced cytologist would be higher than the accuracy obtained with a relatively inexperienced cytologist. If the difference were large, the results of 2 studies conducted by pathologists with different levels of experience would not be directly comparable. Because differences in methodology can lead to different outcomes, it is important for studies to fully report the methodology associated with both the index test and the reference test so that sources of variation can be appreciated and applicability can be evaluated.
Since differences in methodology, patients, or other factors can lead to differences in accuracy measurements, studies conducted at different sites might show different levels of accuracy due to differences in the conditions at each site. Such differences cause difficulties in comparing studies, but this is again distinct from issues of bias.
Diagnostic studies often show considerable variation in outcomes. As discussed above, there are 3 possible reasons for this variation: differences in study parameters, imprecision, and bias. As an example, Figure 3 depicts the results from a recent meta-analysis on the diagnostic accuracy of FNAC for parotid gland lesions. The results show considerable variability in accuracy and are quite heterogeneous. The heterogeneity implies that the studies differ owing to deficiencies in study design (bias) or to real differences in the study parameters (PICO). Clearly, only a subset of these studies would be likely to provide valuable information with respect to a particular clinical question (eg, What is the accuracy of fine-needle aspiration [FNA] in a 1-cm lesion presenting in a US hospital, which appears benign on magnetic resonance imaging, was sampled with a 22-gauge needle with 4 passes, and was evaluated by a pathologist with 10 years of experience who specializes in head and neck tumors?). To make this determination, one would have to assess the reliability (risk of bias) and applicability of each study. This case illustrates the role of the different types of variation in study appraisal.
In discussing issues of bias, we will refer to the QUADAS-2 framework.12 QUADAS-2 is a survey used to assess the risk of bias in diagnostic studies and is organized into 4 domains: patient selection, index test, reference test, and patient flow. QUADAS-2 is closely aligned with the PICO format and assesses both risk of bias and applicability in each domain. Methodologic deficiencies can also give rise to subtle issues of applicability. We will discuss applicability issues that arise from design deficiencies but not those due to real differences in study conditions.
QUADAS-2 is designed to assess the value of a study with respect to a clinical question, but is not designed to assess reporting. The STARD guidelines are designed to ensure high-quality reporting and can also be used to assess reporting quality. The assessment of reliability and applicability requires good reporting (Figure 1). QUADAS and STARD therefore serve 2 related but distinct functions.
Differences in patient populations can affect accuracy, and the comparison of studies conducted in different populations raises questions of applicability.
Population Selection and Applicability
Study participants are obtained from a process of selection that starts from a target population and ends with the study participants (Figure 4). The study target population is conceptual and is derived from a clinical problem (PICO). In a given study, the target population describes the patients to whom the results of the study are intended to apply. The extent to which the results of the study apply to the study target population depends on how well the actual study participants match the target population defined in the study question. The internal validity of a study depends on the applicability of the study participants to the study target population (ie, the population defined by the clinical question in the current study). Assessment of applicability requires an evaluation of each step of the selection process. For example, the study participants must be representative of the study entrants in order for the results to apply to the study target population. External validity depends on the applicability of the study participants to other target populations (ie, to clinical questions other than those posed by the present study), as shown in Figure 2.
It is generally easier to detect advanced disease than early-stage disease, for which the signs are often subtle and difficult to distinguish from normal (Figure 5). The key parameter in patient spectrum is the difference in the measured test parameter between the disease and nondisease cases. We would expect diagnostic accuracy to be greater in a study conducted in a population with advanced disease than in a population with less severe disease and, for this reason, studies may not be comparable if they are conducted on populations with significant differences in disease severity. Disease severity can be influenced by many factors such as the setting, referral patterns, and prior testing. All of these factors could give rise to differences in test performance that reflect actual differences in disease severity. Thus, it is important to fully describe the severity of disease in the patient population along with other factors that could be associated with disease severity. Although differences in disease severity are often referred to as “spectrum bias,” we view these differences as issues of applicability because they reflect real differences in populations.
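The dependence of accuracy on the separation between diseased and nondiseased cases can be sketched with a toy calculation. The distributions and cutoff below are hypothetical assumptions for illustration only: the measured test parameter is taken to be normally distributed, with the diseased distribution shifted further from normal in advanced disease.

```python
from statistics import NormalDist

# Hypothetical distributions of a measured test parameter (arbitrary units).
nondiseased = NormalDist(mu=0.0, sigma=1.0)
early_disease = NormalDist(mu=1.5, sigma=1.0)     # subtle shift from normal
advanced_disease = NormalDist(mu=3.0, sigma=1.0)  # large shift from normal

threshold = 1.96  # cutoff chosen to give ~97.5% specificity

specificity = nondiseased.cdf(threshold)            # P(test- | nondiseased)
sens_early = 1 - early_disease.cdf(threshold)       # P(test+ | early disease)
sens_advanced = 1 - advanced_disease.cdf(threshold) # P(test+ | advanced disease)

print(f"specificity: {specificity:.2f}")               # 0.97
print(f"sensitivity, early disease: {sens_early:.2f}") # 0.32
print(f"sensitivity, advanced disease: {sens_advanced:.2f}")  # 0.85
```

At the same specificity, the same test appears far more sensitive in the advanced-disease population, which is why studies drawn from populations with different disease severity may not be comparable.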
Referral patterns are an important determinant of patient spectrum (Figure 6). Each stage in the referral process can produce diagnoses that remove cases from the initial distribution. Thus, the patient spectrum is altered by each referral. In general, one would expect the spectrum to narrow with each referral as “easy” cases are removed from the tails of the distribution. Test performance increases when the distribution is wide and, for that reason, diagnosis is more challenging at later stages than in the initial stages of the process, and a given test would be expected to be less accurate in later stages. Thus, prior testing and referral patterns can be an important factor when comparing test performance.
Diagnostic tests are often complex, multistep processes that can be performed in many different ways. For example, even for a simple procedure such as FNAC, there are many parameters involving the sample acquisition (needle size, number of passes, use of guidance techniques, experience of aspirator, use of rapid on-site evaluation, etc), sample processing (type of stain, use of ancillary techniques), and interpretation (number of pathologists who read the slide, experience level of the pathologist, availability of clinical information, etc). Each of these factors has the potential to affect test accuracy, and one can think of each variation as a different test with different performance characteristics. As indicated above, differences in accuracy that reflect differences in test conditions are not a source of bias but do give rise to issues of comparability. For example, is the accuracy of FNAC performed with a 22-gauge needle and ultrasound guidance equivalent to the accuracy obtained with a 26-gauge needle without guidance? Because differences in test conditions have the potential to affect results, it is important for studies to fully specify the methods.
We recently conducted a meta-analysis on the accuracy of FNAC for diagnosis of salivary gland lesions and found considerable heterogeneity in the results (Figure 3).13 The question arose as to whether the variation in accuracy could be explained by differences in methodology. Unfortunately, the methods were insufficiently reported so that the effects of differences in methods could not be explored. In a subsequent study, we looked at the way in which FNAC studies described methods and found significant variation in reporting.14 These examples illustrate the possible impact of test conditions on test results and why it is vital for studies to provide detailed descriptions of methods.
No test is perfect, and errors in the reference test cause classification bias. There are 2 types of classification bias: differential misclassification and nondifferential misclassification. In differential misclassification, the error rate is associated with the index test result. In the case of FNAC, positive FNA results may have a higher misclassification rate than negative FNA results, owing to error rates in histologic diagnosis. In nondifferential misclassification, the error rate is independent of the index test result; even so, it generally causes underestimation of both sensitivity and specificity. An example is provided in Table 1. The magnitude of the bias depends on the disease prevalence, the accuracy of the index test, and the degree of misclassification. As shown in the example, misclassification can have significant effects. The misclassification rate can vary from site to site depending on the methodology associated with the reference test (eg, skill of the pathologist, use of ancillary tests). Misclassification can be estimated by interrater reliability studies, but such studies are rarely referenced in FNAC diagnostic accuracy studies.
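The effect of nondifferential misclassification can be sketched numerically. The figures below are illustrative assumptions, not the values of Table 1: a true prevalence of 20%, true index test sensitivity of 80% and specificity of 90%, and a reference test that mislabels 10% of cases in each direction, independent of the index test result.

```python
# Illustrative assumptions (not the values tabulated in the article).
n = 1000              # cases
prev = 0.20           # true disease prevalence
se, sp = 0.80, 0.90   # true sensitivity/specificity of the index test
m = 0.10              # nondifferential misclassification rate of the reference test

# True 2x2 counts for the index test against a perfect reference.
tp = n * prev * se              # 160
fn = n * prev * (1 - se)        # 40
tn = n * (1 - prev) * sp        # 720
fp = n * (1 - prev) * (1 - sp)  # 80

# The imperfect reference relabels a fraction m of cases in each cell,
# independent of the index test result (nondifferential).
obs_tp = tp * (1 - m) + fp * m  # index+, labeled diseased
obs_fp = fp * (1 - m) + tp * m  # index+, labeled nondiseased
obs_fn = fn * (1 - m) + tn * m  # index-, labeled diseased
obs_tn = tn * (1 - m) + fn * m  # index-, labeled nondiseased

obs_se = obs_tp / (obs_tp + obs_fn)
obs_sp = obs_tn / (obs_tn + obs_fp)
print(f"observed sensitivity: {obs_se:.0%}")  # 58% (true: 80%)
print(f"observed specificity: {obs_sp:.0%}")  # 88% (true: 90%)
```

Under these assumptions, a 10% reference error drags the apparent sensitivity from 80% down to 58%; the apparent specificity falls less because the nondiseased group is much larger at this prevalence.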
Diagnostic Review Bias and Incorporation Bias
These types of bias occur when the interpretation of the reference test is not independent of the index test, a problem that particularly weakens retrospective studies.
Diagnostic review bias occurs when the pathologist interpreting the final histopathology is aware of the FNA result. This can affect results: a pathologist might search more carefully for evidence of cancer if the FNA result is positive, and a strong FNA result might influence the interpretation of a borderline histologic finding. Although it is clinically important to use all available information when making a diagnosis, the resulting bias weakens studies of diagnostic accuracy. A rigorous study would either report that the histologic interpretation was blinded to the FNA result or re-review the cases to obtain blinded diagnoses. In our experience, reporting of blinding in FNAC studies is quite poor.
In some cases, the result of the index test is explicitly used as a criterion for the reference test. Incorporation bias is best exemplified by clinical laboratory testing, specifically the evaluation of β-d-glucan for diagnosis of invasive fungal infections. Invasive fungal infections are traditionally diagnosed by culture, imaging, and biopsy. β-d-Glucan is a blood-based test that offers an opportunity for earlier diagnosis. By the European Organisation for Research and Treatment of Cancer criteria, the gold standard for invasive fungal infections includes a positive β-d-glucan test result.15 In this case, the index test comprises part of the gold standard. In FNAC studies, incorporation bias occurs when a positive FNAC result is accepted as the gold standard, as sometimes occurs in FNAC accuracy studies of the lung and mediastinum. While this criterion may be reasonable in clinical practice, it is a source of bias in diagnostic studies.
PATIENT FLOW AND OUTCOMES
Ideally, all those who are tested with the index test should receive verification by the reference test (gold standard). Failure to do so can cause bias in accuracy estimates and is known as partial verification bias. Partial verification can arise from different causes. A study may be designed so that positive cases are sampled more intensively than negative cases. Or a study may be designed so that all patients are referred for verification but, for various reasons, some patients do not present for verification. The first case represents a problem in design and the second represents a problem in study implementation. We discuss both types below.
Partial verification bias is common in FNAC accuracy studies, for which the usual gold standard (histopathology) is invasive or expensive. Furthermore, most of these studies are retrospective, with cases identified from surgery or histopathology records. Such studies fail to record the results for patients who received the index test but did not undergo surgery and histopathologic verification.
The example in Figure 7 demonstrates the effect of partial verification bias. It assumes 1000 cases, a disease prevalence of 20%, a sensitivity of 80%, and a specificity of 90%. Positive cases (ie, those with a positive FNA result) are verified at a higher rate (80%) than negative cases (20%). The results are presented in Table 2, where the observed accuracy statistics are compared with the actual accuracy (ie, the accuracy that would be obtained with full verification). The table shows that sensitivity is falsely elevated from 80% to 94% and specificity is falsely decreased from 90% to 69%. In our experience, these numbers and the associated bias are typical for FNA studies.
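The figures in this example can be reproduced with a short calculation. Only verified cases enter the observed 2 × 2 table, and because verification depends on the index test result, the observed table is distorted:

```python
# Reproduces the partial verification example (1000 cases, prevalence 20%,
# sensitivity 80%, specificity 90%, verification rates 80%/20%).
n = 1000
prev = 0.20
se, sp = 0.80, 0.90        # true accuracy of the index test
v_pos, v_neg = 0.80, 0.20  # verification rates for positive/negative index results

# Full (true) 2x2 table.
tp = n * prev * se              # 160
fn = n * prev * (1 - se)        # 40
tn = n * (1 - prev) * sp        # 720
fp = n * (1 - prev) * (1 - sp)  # 80

# Only verified cases enter the observed table.
obs_tp, obs_fp = tp * v_pos, fp * v_pos  # 128, 64
obs_fn, obs_tn = fn * v_neg, tn * v_neg  # 8, 144

obs_se = obs_tp / (obs_tp + obs_fn)  # 128/136
obs_sp = obs_tn / (obs_tn + obs_fp)  # 144/208
print(f"observed sensitivity: {obs_se:.0%}")  # 94% (true: 80%)
print(f"observed specificity: {obs_sp:.0%}")  # 69% (true: 90%)
```

The false negatives are disproportionately lost from the observed table (only 20% are verified), which inflates sensitivity; the true negatives are lost at the same rate, which deflates specificity.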
It is important to note that partial verification only creates bias when the verification rate depends on the index test result. Partial verification would not occur if the positive and negative cases were randomly sampled at the same rate. Thus, if verification is limited by cost considerations, one can prevent bias by changing the sampling plan to make the verification rate independent of the outcome of the index test.
Withdrawals can have a similar impact if the withdrawal rate depends on the result of the index test. Withdrawals are common and occur for a variety of reasons; for example, patients initially screened at a community clinic may go to a tertiary care hospital for follow-up. Withdrawals can have the same effect as partial verification by design; however, the magnitude of the bias is generally smaller when it arises from withdrawals.
Obviously, partial verification bias can be eliminated by verifying all cases; however, this would not be practical or ethical for invasive procedures. An alternative is to verify the remaining cases with a different reference test (a “brass standard”) such as clinical follow-up. The problem with this solution is that the accuracy of the 2 reference standards may differ, so the results for cases referred to the inferior test will suffer from classification bias. The overall accuracy estimates are then obtained from a combination of the biased and unbiased results; the resulting bias is called differential verification bias. To illustrate its effects, we continue the FNA example from above but apply a different reference standard (eg, clinical follow-up) to the cases that were previously unverified. We assume that the alternative brass standard has a 10% nondifferential misclassification rate. As shown in Table 3, differential verification can have a substantial effect.
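Under the stated assumptions, the combined table can be computed as follows. This is a sketch of the calculation, and the article's Table 3 may tabulate the results differently; the starting numbers are those of the partial verification example (1000 cases, prevalence 20%, sensitivity 80%, specificity 90%, verification rates 80%/20%).

```python
# Continues the partial verification example: cases not verified by the gold
# standard are verified with a "brass standard" (eg, clinical follow-up)
# assumed to have a 10% nondifferential misclassification rate.
n, prev = 1000, 0.20
se, sp = 0.80, 0.90
v_pos, v_neg = 0.80, 0.20
m = 0.10  # misclassification rate of the brass standard

tp, fn = n * prev * se, n * prev * (1 - se)                # 160, 40
tn, fp = n * (1 - prev) * sp, n * (1 - prev) * (1 - sp)    # 720, 80

# Cases verified by the gold standard are classified correctly.
g_tp, g_fp = tp * v_pos, fp * v_pos  # 128, 64
g_fn, g_tn = fn * v_neg, tn * v_neg  # 8, 144

# Remaining cases go to the brass standard, which relabels a fraction m.
b_tp, b_fp = tp * (1 - v_pos), fp * (1 - v_pos)  # 32, 16 (index-positive)
b_fn, b_tn = fn * (1 - v_neg), tn * (1 - v_neg)  # 32, 576 (index-negative)

obs_tp = g_tp + b_tp * (1 - m) + b_fp * m
obs_fp = g_fp + b_fp * (1 - m) + b_tp * m
obs_fn = g_fn + b_fn * (1 - m) + b_tn * m
obs_tn = g_tn + b_tn * (1 - m) + b_fn * m

obs_se = obs_tp / (obs_tp + obs_fn)
obs_sp = obs_tn / (obs_tn + obs_fp)
print(f"observed sensitivity: {obs_se:.0%}")  # ~63% (true: 80%)
print(f"observed specificity: {obs_sp:.0%}")  # ~89% (true: 90%)
```

In this sketch, the large pool of index-negative cases sent to the error-prone brass standard generates many spurious "false negatives," so sensitivity is now underestimated rather than overestimated, while specificity is nearly restored.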
These examples illustrate why documentation of flows is so critical in diagnostic accuracy studies. In our experience, withdrawals are often poorly documented in FNA diagnostic accuracy studies. The impact of partial verification bias can be estimated if the flows are well documented,16 but it is better to prevent partial verification by good study design and management.
Inconclusive results affect the applicability of a study to other populations. Studies often aggregate test results or exclude results in particular categories. In FNA studies, common diagnostic categories include inadequate, negative for malignancy, atypical, suspicious, and positive for malignancy. As a first step, it is important that an article provide definitions for each of the indeterminate categories. Second, to maintain applicability, it is important that researchers report all results before aggregating them into categories. We often see articles in which results are grouped in different ways: for example, one article may include inadequate results in accuracy calculations while another excludes them. The different assumptions may be valid in the context of the individual articles, but may not be applicable to other study populations.
It should be noted that the magnitude of the indeterminate rate can also affect applicability. The indeterminate rate can show significant variation between study sites. Differences in the indeterminate rate can reflect differences in criteria, differences in the sample population, or differences in methodology. Paradoxically, a study in which a cytopathologist defines 15% of cases as “indeterminate” may have better accuracy than a study with an indeterminate rate of 1% because the study with the high rate is only making a diagnosis on the easy cases.
We have explained the basis of several common types of bias that are unique to diagnostic studies. Our objective has been to provide a framework to assist consumers of diagnostic accuracy studies in critically appraising results and to assist producers of diagnostic accuracy studies in avoiding many common sources of bias. It is important to recognize that no study is perfect and that bias and applicability are a matter of degree. Assessment of a study depends on the quality of reporting: one cannot assess risk of bias or applicability unless the details of the population, methods, and outcomes are fully reported. Thus, high-quality reporting is vital. These issues are likely to become more important in the future as evidence-based medicine increasingly relies upon systematic reviews and meta-analyses to study test performance.
The authors have no relevant financial interest in the products or companies described in this article.