Coronavirus infectious disease-19 (COVID-19) diagnostics require understanding of how predictive values depend on sensitivity, specificity, and especially, low prevalence. Clear expectations, high sensitivity and specificity, and manufacturer disclosure will facilitate excellence of tests.
To derive mathematical equations for designing and interpreting COVID-19 tests, assess US Food and Drug Administration (FDA) Emergency Use Authorization and Health Canada minimum requirements, establish sensitivity and specificity tiers, and enhance clinical performance in low prevalence settings.
PubMed and other sources generated articles on COVID-19 testing and prevalence. EndNote X9.1 consolidated references. Mathematica and open access software helped prove equations, perform recursive calculations, graph multivariate relationships, and visualize patterns, including a new relationship, predictive value geometric mean-squared.
Derived equations were used to illustrate shortcomings of COVID-19 diagnostics in low prevalence. Visual logistics helped establish sensitivity/specificity tiers. FDA/Canada's 90% sensitivity, 95% specificity minimum requirements generate excessive false positives at low prevalence. False positives exceed true positives at prevalence lower than 5.3%, or if sensitivity is improved to 100% and specificity to 98%, at prevalence lower than 2%. Recursive testing improves predictive value. Three tiers emerged from these results. With 100% sensitivity, physicians can select desired predictive values, then input local prevalence, to determine suitable specificity.
Understanding low prevalence impact will help health care providers meet COVID-19 needs for effective testing. Laypersons should receive clinical performance disclosure when submitting specimens. Home testing needs to meet the same high standards as other tests. In the long run, it will be more cost-effective to improve COVID-19 point-of-care tests rather than repeat testing multiple times.
The idea of this research is to derive, create, and illustrate mathematical relationships (Table 1) that facilitate understanding of coronavirus infectious disease-19 (COVID-19) diagnostic tests in settings of low prevalence (0%–20%), while highlighting major challenges and encouraging improvement in clinical performance.
At the same time, this article strives to assist the learning process by means of a series of graphics focusing on logical classification of tests based on progressively higher tiers of sensitivity and specificity. The importance of sensitivity and specificity becomes apparent when one considers the high and low extremes of disease prevalence in a given population.
The article concludes with assessment of the implications for standards of care in hopes of informing physicians and protecting the public at large from misleading claims about diagnostic tests that could put them and the people they are around at risk of COVID-19 infection during opening/closing cycles and new waves of infection.
Table 2 summarizes the ranges of prevalence, positivity rates, and COVID-19 test volumes in select settings and regions of the United States and other countries during roughly the first half of 2020. Best available published data were obtained and collated in this table.1–17 Apparent prevalence varies widely because of hotspots of contagion, uncertain clinical diagnoses, incomplete testing, pooling of samples, delayed reporting of laboratory data, and other factors, such as poor reliability of the assays used. In the current stage of the pandemic, most regions of the United States have low prevalence in the range of 0% to 20%, and COVID-19 diagnostics must be optimized for it.
The objectives are to derive key mathematical equations, then using the math, to create visual logistics for interpreting COVID-19 test results, to assess US Food and Drug Administration (FDA) Emergency Use Authorization (EUA) specifications and Health Canada minimum requirements for COVID-19 tests, to establish design tiers for sensitivity and specificity, and to enhance diagnostic standards by illustrating the striking impact of low prevalence on the clinical performance of diagnostic tests.
PubMed, the Centers for Disease Control and Prevention (CDC), the FDA, the World Health Organization, Cable News Network (CNN), WebMD, newsprint (primarily the Wall Street Journal, New York Times, Washington Post, 360Dx, Diagnostics World, Live Science, and Medical Laboratory Observer), the World Wide Web, and other sources were explored for articles on COVID-19 molecular diagnostics, antigen and antibody testing, geographic prevalence (aka “cumulative incidence”), and the use of diagnostic tests for opening up and closing down the economy. EndNote X9.1 (Clarivate Analytics) robotically retrieved and consolidated relevant papers as URLs and PDFs. Prevalence data were obtained from The COVID Tracking Project (https://covidtracking.com/; accessed August 4, 2020), the Johns Hopkins Coronavirus Resource Center (https://coronavirus.jhu.edu/; accessed August 4, 2020), state public health agencies, and published articles.
Table 1 lists fundamental relationships used and equations derived in this research. The equations are needed for analyzing public health reports regarding testing, clinical performance of diagnostics, and documented findings regarding COVID-19 assays. Equations 1 through 6 in Table 1 represent the fundamental definitions used to derive Equations 7 through 21. These relationships reflect core concepts of evidence-based medicine.18 They form the foundation blocks for quantitative analysis and machine computations underlying visual logistics (graphic illustrations) in this article.
Equations 7 through 14 were derived in order to calculate positive predictive value (PPV) and negative predictive value (NPV), plus associated parameters through rearrangement of variables. These equations reflect the post hoc, or Bayesian (conditional probability) viewpoint of the health care provider, who must judge whether a positive test result is believable or not, and likewise, decide on the merits of a negative test result.
Thus, the health care provider might ask, “Can I count on that positive serology result? My patient wants to get back to work!” Or perhaps the clever patient will grab the positive antibody test result and run, knowing it is a ticket to restart employment following furlough, even if it is a false positive and he or she places others at risk of contagion. Nonetheless, the mathematical derivations and set of equations in Table 1 are applicable to all types of assays, including assays for direct molecular detection of SARS-CoV-2.
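As an illustrative sketch of the Bayesian viewpoint described above, the predictive values can be computed directly from sensitivity, specificity, and prevalence. The function and variable names here are assumptions introduced for illustration and are not part of Table 1; inputs are decimal fractions, and prevalence is assumed to lie strictly between 0 and 1.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) as decimal fractions for a single test."""
    tp = sensitivity * prevalence                    # true positives per person tested
    fn = (1.0 - sensitivity) * prevalence            # false negatives
    tn = specificity * (1.0 - prevalence)            # true negatives
    fp = (1.0 - specificity) * (1.0 - prevalence)    # false positives
    ppv = tp / (tp + fp)   # chance that a positive result is correct
    npv = tn / (tn + fn)   # chance that a negative result is correct
    return ppv, npv


# Example: 90% sensitivity, 95% specificity, 20% prevalence
ppv, npv = predictive_values(0.90, 0.95, 0.20)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")  # PPV = 81.8%, NPV = 97.4%
```

The same function applies to any assay type, because only the three input fractions change.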
Ratios and Rates
The ratios in Equations 15 through 17 reveal the challenges in settings of low prevalence for COVID-19 antibody tests under current FDA EUA specifications and Health Canada minimum requirements. The ratio of false positives to true positives (FP:TP), Equation 16, was derived to investigate test results obtained from patients in geographically isolated or sparsely populated community settings where coronavirus contagion is limited. There, COVID-19 prevalence can hover around 2% and may not exceed 5%.
RFP, the false positive rate (Equation 19), is quoted frequently in evaluations of COVID-19 tests. Note that RFP (ie, 1 − specificity) represents the horizontal axis of a receiver operating characteristic (ROC) curve, where the vertical axis is the true positive rate, Equation 18. Thus, the ROC curve plots sensitivity [TP/(TP + FN)] versus 1 − specificity = 1 − [TN/(TN + FP)] = FP/(TN + FP), where TP indicates true positive; FN, false negative; TN, true negative; and FP, false positive.
RFO, the false omission rate (Equation 20), is a straightforward function of NPV. That is, RFO = 1 − NPV. RFO was addressed recently in the archives by Raschke et al,19 who introduced an empirical algorithm for polymerase chain reaction (PCR) testing to help avoid missing COVID-19 diagnoses. RPOS (Equation 21) is the positivity rate commonly used to monitor the control, or lack thereof, of outbreaks, especially in regional comparisons of cities, counties, and states.
In Table 1, Equations 22 through 25 were derived in order to (1) help estimate the impact of repeated testing, ideally with a different test design (recursive formula, Equation 22), (2) approximate prevalence based on public health databases reporting only the percentage of positive test results and the number of people tested (Equation 23), and (3) determine predictive value when designing or selecting assays based on their specificity, with sensitivity set at 100% (Equation 24).
With an assumption of 100% sensitivity, the positivity rate (Equation 21) can be simplified to yield prevalence using Equation 23, although ideally, prevalence should be determined from regional or local raw data and broad-based testing. An RPOS of 5% or less for 14 days has become one de facto indicator of whether communities are doing too much or too little testing and in control or not,16 while an RPOS of 8% represents the threshold for potential return to lockdown in California and other states.17
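One way to see how a positivity rate relates to prevalence is to write the positivity rate as true positives plus false positives per person tested and then solve for prevalence. The sketch below does this from first principles; it may differ in form from Equation 23, and the function names and the clamping to the 0 to 1 range are assumptions made here. It also assumes that the tested group is representative of the population, which public health testing data often are not.

```python
def positivity_rate(sensitivity, specificity, prevalence):
    """Fraction of all tests that come back positive (TP + FP per person tested)."""
    return sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)


def prevalence_from_positivity(r_pos, specificity, sensitivity=1.0):
    """Invert the positivity rate to estimate prevalence.

    With sensitivity = 1.0 this reduces to (r_pos - (1 - specificity)) / specificity.
    The estimate is clamped to [0, 1] because false positives alone can exceed
    the observed positivity rate when prevalence is very low.
    """
    estimate = (r_pos + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return min(max(estimate, 0.0), 1.0)


# Example: 5% positivity with a 99% specific, 100% sensitive test
print(f"{prevalence_from_positivity(0.05, 0.99):.2%}")
```

When specificity is below 100%, the estimated prevalence is lower than the raw positivity rate, because part of the positivity is attributable to FPs.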
For the pandemic as a whole, test positivity is said to be approximately 3.42%.17 In most US regions, test positivity is claimed to be under 10%,14 although as the first wave progresses huge peaks are appearing (see Table 2). The overall ratio of cases to people in the United States was 4 732 418/330 066 730 = 1.4% on August 4, 2020.
Predictive Value Geometric Mean-Squared
This article introduces a new visual logistic, predictive value geometric mean-squared, or “PV GM2” (Equation 25). A geometric mean is the nth root of the product of n values, whereas the arithmetic mean is their sum divided by n. The geometric mean of PPV and NPV is therefore the square root of their product, and squaring it leaves the product itself. Here, PV GM2 (Equation 25) is created by simply multiplying PPV and NPV (PPV • NPV), that is, by multiplying the right-hand sides of Equation 7 and Equation 11, each expressed as decimal fractions with ranges from 0 to 1.0.
The purpose of introducing PV GM2 is to create visual logistics graphs useful for comparing tiered sensitivity and specificity levels and also commercial claims over the entire range of prevalence. This enhances awareness when assessing different tests across the broad range of low to high prevalence and is especially revealing for low prevalence (0%–20%), when FPs surge if specificity is poor. Likewise, it shows the weakness of poor sensitivity in settings of high (70%–100%) prevalence, when there are more FNs from suboptimal sensitivity. An advantage of PV GM2 is that it allows one to visualize the impact of both low and high prevalence in one graphic at the same time, while adjusting sensitivity and specificity thresholds to suit the clinical purpose of the test.
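A minimal sketch of the PV GM2 calculation follows, assuming only the definition given above (the product of PPV and NPV as decimal fractions). The function name and the sample prevalence points are illustrative choices; prevalence values of exactly 0 or 1 are avoided because one of the predictive values becomes undefined there.

```python
def pv_gm2(sensitivity, specificity, prevalence):
    """Predictive value geometric mean-squared: PPV multiplied by NPV."""
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv * npv


# Compare a 90%/95% test with a 100%/99% test at low, middle, and high prevalence
for sens, spec in [(0.90, 0.95), (1.00, 0.99)]:
    values = [pv_gm2(sens, spec, p) for p in (0.02, 0.50, 0.98)]
    print(sens, spec, [round(v, 3) for v in values])
```

With sensitivity of 100%, NPV is 1 and the product collapses to PPV, so only the low-prevalence end of the curve falls away.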
One common equation for diagnostic test accuracy, the last in Table 1, was not explored, because it is inherently ambiguous (see Table 1 note). For example, if x represents sensitivity, and y, specificity, then with prevalence of 50%, either of the (x, y) ordered pairs (90%, 100%) or (100%, 90%), where the values of x and y have been interchanged, will generate the same accuracy of 95%. This measure of accuracy should not be used, because a single index value can reflect more than 1 pair of sensitivity and specificity values and so cannot distinguish a test that produces FNs from one that produces FPs.
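The ambiguity can be checked numerically. The sketch below assumes the standard definition of accuracy as the fraction of all results that are correct, (TP + TN)/total, which is taken here to match the accuracy equation in Table 1; the function names are illustrative.

```python
def accuracy(sensitivity, specificity, prevalence):
    """Fraction of all test results that are correct: (TP + TN) / total."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value, TP / (TP + FP)."""
    tp = sensitivity * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return tp / (tp + fp)


# Two different tests at 50% prevalence: same accuracy, different PPV
for sens, spec in [(0.90, 1.00), (1.00, 0.90)]:
    print(sens, spec, round(accuracy(sens, spec, 0.5), 3), round(ppv(sens, spec, 0.5), 3))
```

Both pairs give accuracy of 0.95, yet the first test yields no FPs and the second yields no FNs, which the single accuracy figure hides.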
Multivariate open access software, Desmos Graphing Calculator (https://www.desmos.com/calculator; accessed July 31, 2020), was used so that readers can duplicate the graphical results and explore their own analytic goals at no expense other than time and effort. After deriving Equations 7 through 21, Mathematica (Wolfram, https://www.wolfram.com/mathematica/; accessed July 31, 2020) and open access Symbolab Math Solver (https://www.symbolab.com/; accessed July 31, 2020) were used to confirm them. Wolfram Alpha Widgets Rearrange It (https://www.wolframalpha.com/widgets; accessed July 31, 2020) enabled rapid rearrangements of variables in Equations 8 through 10, and 12 through 14.
Readers can use the relationships in Table 1 to enter the desired abscissa, ordinate, and mathematical relationship in Desmos Graphing Calculator equation boxes. Adjust the axes for percentage or integer increments to produce the appropriately ranged displays. Do this by scaling with 1/10 or 1/100 on the left-hand (dependent) and right-hand (independent) sides of the equations, then expanding or contracting the screen view to the relevant domain. Select desired points from visual inspection of the graphs. Next, confirm numerical output rounded to the nearest 10th for establishing test specifications using a governing equation and spot calculations.
Human subjects were not involved. Illustrative prevalence or rate data were obtained from public domain sources, de-identified databases, and the World Wide Web.
Assessing FDA EUA Specifications for Serologic Tests
Figure 1 (left side) illustrates the impact of FDA requirements on COVID-19 diagnostics qualifying under EUAs for serologic tests reporting the presence or absence of SARS-CoV-2 antibodies (immunoglobulin [Ig] G, or, IgG and IgM). These specifications call for sensitivity of 90% and specificity of 95%.20 Graphs were created by using Equations 16 (main curve) and 17 (inset). The left curve traces the envelope of the ratio of FP to TP test results.
When prevalence is 5.3%, PPV = 50%, and TPs equal FPs—no better than tossing a coin. With prevalence of 2%, the FP:TP ratio is nearly 3 and rising; PPV is only 26.9%. With prevalence less than 5%, PPV deteriorates rapidly as the relative proportion of FPs increases, because of suboptimal specificity. With prevalence of 20%, the FP:TP ratio is 0.22; there are 2 FPs for each 9 TPs. For prevalence of 20%, PPV is calculated by using Equation 7 as follows: PPV = [0.90 • 0.20]/[(0.90 • 0.20) + (1 – 0.95) • (1 – 0.20)], which equals 81.8%.
Prevalence across America may be as low as or lower than 2% regionally in several states and rural communities (see Table 2). With an estimate of prevalence, a health care provider or patient who receives a positive test result can use Figure 1 to determine the relative chance that the test result is an FP or TP. Only when the prevalence approaches 20% do chances of misleading test results diminish significantly. Note that for the FDA requirement of at least 70% sensitivity for IgM antibody tests20 and at 2% prevalence, the FP:TP ratio would be 3.5 (7 FPs for each 2 TPs) and PPV, only 22.2%.
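The FP:TP figures quoted above can be reproduced with a short calculation. This is a sketch built from the definitions of FP and TP per person tested, not a transcription of Equation 16; the function names, including the break-even helper, are assumptions introduced here.

```python
def fp_tp_ratio(sensitivity, specificity, prevalence):
    """Ratio of false positives to true positives among positive results."""
    fp = (1.0 - specificity) * (1.0 - prevalence)
    tp = sensitivity * prevalence
    return fp / tp


def break_even_prevalence(sensitivity, specificity):
    """Prevalence at which FP equals TP, so that PPV is exactly 50%.

    Obtained by setting (1 - spec)(1 - p) = sens * p and solving for p.
    """
    return (1.0 - specificity) / (sensitivity + 1.0 - specificity)


print(round(fp_tp_ratio(0.90, 0.95, 0.02), 2))       # 2.72
print(round(fp_tp_ratio(0.90, 0.95, 0.20), 2))       # 0.22
print(round(fp_tp_ratio(0.70, 0.95, 0.02), 2))       # 3.5
print(f"{break_even_prevalence(0.90, 0.95):.1%}")    # 5.3%
```

Below the break-even prevalence, a positive result is more likely to be false than true.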
The upper right inset in Figure 1 illustrates the relationship of FNs to TNs, that is, the ratio FN:TN, for high prevalence. One observes poor performance for prevalence above 80%, due to the increase in FNs, relative to TNs, attributable to the sensitivity of 90%. For prevalence ranges of 20% or less, the FN:TN ratio is insignificant. For example, at 5% prevalence, the FN:TN ratio is less than 1% (0.006) (Equation 17). The inset is somewhat like a reflection of the FP:TP curve, but not a mirror image, because sensitivity and specificity are not equal.
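The companion FN:TN ratio can be sketched in the same way from the definitions of FN and TN per person tested. The function name and the prevalence points chosen for display are illustrative assumptions.

```python
def fn_tn_ratio(sensitivity, specificity, prevalence):
    """Ratio of false negatives to true negatives among negative results."""
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    return fn / tn


# 90% sensitivity and 95% specificity across low and high prevalence
for p in (0.05, 0.20, 0.80, 0.95):
    print(f"prevalence {p:.0%}: FN:TN = {fn_tn_ratio(0.90, 0.95, p):.3f}")
```

The ratio stays near zero at low prevalence and climbs steeply only as prevalence approaches 100%, which is the behavior the inset displays.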
Positive predictive value reflects the impact of sensitivity and specificity for a given prevalence. Figure 2 compares predictive values under FDA EUA specifications and Health Canada minimum requirements (left panel) versus those with the more stringent “target values” of 95% sensitivity and 98% specificity (right) published by Health Canada.21,22 The line graphs were created by using Equation 7. If sensitivity were one 5% step higher at 100%, NPV would be 100%, because there are no FNs [NPV = TN/(TN + FN) = TN/TN = 1].
Figure 2 shows that under FDA EUA antibody test specifications and at a prevalence of 2%, the PPV of the first test result is only 26.9%. Then, recursive calculations using Equation 22 show improved PPVs of 86.9% and 99.2% for the second and third repeated tests, respectively. When sensitivity is increased to 95% and specificity to 98% for Canadian target values (right frame), the PPV of the first test is 49.2%, and with repeated testing, 97.9% and ∼100% on the second and third round, respectively.
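The recursive calculation can be sketched by feeding each PPV back in as the prior probability for the next test. This is an illustration of the idea rather than a transcription of Equation 22, and it assumes that repeated tests are statistically independent and share the same sensitivity and specificity; the function names are hypothetical.

```python
def ppv(sensitivity, specificity, prior):
    """Probability of disease after a positive result, given a prior probability."""
    tp = sensitivity * prior
    fp = (1.0 - specificity) * (1.0 - prior)
    return tp / (tp + fp)


def recursive_ppv(sensitivity, specificity, prevalence, rounds):
    """PPV after each of several consecutive positive results."""
    results = []
    prior = prevalence
    for _ in range(rounds):
        prior = ppv(sensitivity, specificity, prior)  # posterior becomes the next prior
        results.append(prior)
    return results


print([f"{v:.1%}" for v in recursive_ppv(0.90, 0.95, 0.02, 3)])  # ['26.9%', '86.9%', '99.2%']
print([f"{v:.1%}" for v in recursive_ppv(0.95, 0.98, 0.02, 3)])
```

If the repeated tests share a common source of error, the gains from the second and third rounds will be smaller than this idealized calculation suggests.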
Establishing Sensitivity/Specificity Tiers
Consideration of sensitivity and specificity, prevalence, and post hoc (Bayesian) diagnostic outcomes leads to designation of 3 tiers, where a tier is meant to signify the band at and above the specified sensitivity and specificity thresholds:
Tier 1—Mainly for point-of-care (POC) serologic tests reporting the presence or absence of antibodies to SARS-CoV-2 (IgG, or IgG and IgM) with minimum sensitivity 90% and minimum specificity 95%.
Tier 2—Marginally improved performance, with sensitivity 95% and specificity 97.5%, which is suitable for moderate levels of prevalence, say 20% and higher.
Tier 3—High performance, with high sensitivity of 100% and specificity 99%, for all types of tests and levels of prevalence.
These 3 tiers identify stepwise sensitivity and specificity thresholds for improving the clinical performance of COVID-19 diagnostics in the context of prevailing prevalence.
Applying Tiers and Assessing Clinical Performance
Figure 3 compares the FP:TP ratio for incrementally enhanced sensitivity and specificity of 90% and 95% (Tier 1), 95% and 97.5% (Tier 2), and 100% and 99% (Tier 3), respectively. At low prevalence of 2%, the FP:TP ratio for both Tiers 1 and 2 is higher than that of a coin toss (PPV = 50%).
Tightening specifications pushes the FP:TP curves down and to the left, thereby lowering FP:TP ratios and, at Tier 3, rendering COVID-19 tests more practical and cost-effective at low prevalence. When prevalence increases to ∼20%, the FP:TP ratio improves substantially for all 3 tiers.
The inset graph (upper right) shows PPV for the full range of prevalence from 0% to 100%, along with NPV (Equation 11) and the false omission rate, RFO (Equation 20), which reflects the chances of missing the diagnosis, which would increase the risk of contagion.
PPV and NPV form slightly asymmetrical curves (see inset) across the vertical meridian (50% prevalence), where the degree of symmetry depends on the relative magnitudes of sensitivity and specificity, illustrated by comparing the different intersection points for Tiers 1 and 2. Since RFO equals 1 − NPV, the curves for NPV and RFO are mirror images about the horizontal center line (predictive value of 50%).
For the highest tier, Tier 3, NPV = 100% (top constant line, Figure 3) and RFO = 0% (bottom constant line) across the entire range of prevalence from 1% to 100%. For this tier, NPV = TN/(TN + FN) = 1, since FN = 0, because sensitivity for Tier 3 is 100% [ie, TP/(TP + FN) = 1]. Similarly, RFO = FN/(TN + FN) = 1 − NPV = 0. Tier 2 is deemed marginal, because of the poor FP:TP ratio at low prevalence.
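To compare the tiers side by side, the sketch below tabulates the FP:TP ratio, PPV, NPV, and false omission rate for each tier at a chosen prevalence. The tier thresholds are taken from the text; the dictionary and function names are assumptions, and the false omission rate is computed as FN/(TN + FN), that is, 1 − NPV.

```python
TIERS = {
    "Tier 1": (0.90, 0.95),    # (sensitivity, specificity)
    "Tier 2": (0.95, 0.975),
    "Tier 3": (1.00, 0.99),
}


def tier_metrics(sensitivity, specificity, prevalence):
    """Return FP:TP ratio, PPV, NPV, and false omission rate as a dict."""
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return {
        "fp_tp": fp / tp,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "rfo": fn / (tn + fn),   # false omission rate, equal to 1 - NPV
    }


for name, (sens, spec) in TIERS.items():
    m = tier_metrics(sens, spec, 0.02)
    print(f"{name}: FP:TP = {m['fp_tp']:.2f}, PPV = {m['ppv']:.1%}, "
          f"NPV = {m['npv']:.1%}, RFO = {m['rfo']:.2%}")
```

At 2% prevalence only Tier 3 brings the FP:TP ratio below 1, and only Tier 3 drives the false omission rate to zero.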
Customizing Clinical Performance by Selecting Specificity
Figure 4 allows the reader to select the desired PPV, input prevalence in the relevant local setting, and then determine the specificity that would be needed. Isopleths of equal value reflect prevalence increasing to population immunity starting at approximately 60%. The range of specificity was limited to 90% to 100% because COVID-19 tests with specificity less than 90% would perform poorly in virtually all settings and in fact those with specificity less than 95% would not qualify for FDA EUA status or meet Health Canada minimum requirements.
In Figure 4, if a PPV of 90% is desired (Step 1) for serologic testing in a community with prevalence of 5% (Step 2), then the specificity would need to be 99.4% (Step 3) or greater, verified by calculation using Equation 10. Sensitivity was set at 100% to put Tier 3 within reach. Prevalence of 20% would relax the specificity requirement to 97.2%, or roughly Tier 2. The FP:TP ratio would be 0.112, and there would be ∼1 FP for each 9 TPs (Equation 16).
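The three steps can also be carried out numerically by rearranging the PPV definition to solve for specificity. This is a sketch of that rearrangement and is not claimed to be the exact form of Equation 10; the function name and the default sensitivity of 100% are assumptions that mirror the setup of Figure 4.

```python
def required_specificity(target_ppv, prevalence, sensitivity=1.0):
    """Specificity needed to reach a target PPV at a given prevalence.

    From PPV = sens*p / (sens*p + (1 - spec)*(1 - p)), solving for spec gives
    spec = 1 - sens*p*(1 - PPV) / (PPV*(1 - p)).
    """
    false_positive_rate = (
        sensitivity * prevalence * (1.0 - target_ppv)
        / (target_ppv * (1.0 - prevalence))
    )
    return 1.0 - false_positive_rate


print(f"{required_specificity(0.90, 0.05):.1%}")  # 99.4%
print(f"{required_specificity(0.90, 0.20):.1%}")  # 97.2%
```

The required specificity rises sharply as prevalence falls, which is why low prevalence settings demand Tier 3 performance.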
Tables 3 and 4 illustrate these concepts with worked examples for the FDA EUA specifications and Health Canada minimum requirements (Tier 1), while Tables 5 and 6 contrast the results for Tier 3 higher sensitivity of 100% and higher specificity of 99%.
Interpreting False Positive Rates
One of the most popular evaluators of a diagnostic test is the false positive rate, RFP, which equals 1 − specificity (Equation 19). RFP reflects the chances of a misleading positive test result. Figure 5 illustrates how to establish the PPV given the RFP and prevalence using Equations 19 and 24 with 100% sensitivity.
For example, assume that on the horizontal axis the RFP is 2.5% (Step A) and the prevalence is 5% (Step B). Then, the PPV will be 67.8% (Step C). Since RFP is FP/(TN + FP), that is, the number of FPs divided by the number of those who do not have COVID-19, the best rate would be zero, which is where the curves converge at 100% PPV, because there are no FPs.
Numerous COVID-19 antibody tests have hit the market with little or no proof of the quality level (tier). RFP as high as 15% or more has been observed.23 By inspection of Figure 5, even if the prevalence were 20%, a test with RFP of 15% would produce a PPV of only 62.5%, marginally useful at best for judgments about presumed immunity, and also raising uncertainty about what to advise the patient or worker about returning to work safely.
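The PPV values read from Figure 5 can be checked with a short calculation that assumes 100% sensitivity, as the figure does. The function name is illustrative, and the expression follows from the PPV definition with sensitivity set to 1.

```python
def ppv_from_false_positive_rate(rfp, prevalence):
    """PPV for a test with 100% sensitivity, given its false positive rate.

    With sensitivity = 1, TP per person tested equals prevalence and
    FP per person tested equals rfp * (1 - prevalence).
    """
    return prevalence / (prevalence + rfp * (1.0 - prevalence))


print(f"{ppv_from_false_positive_rate(0.025, 0.05):.1%}")  # 67.8%
print(f"{ppv_from_false_positive_rate(0.15, 0.20):.1%}")   # 62.5%
```

When the false positive rate is zero the function returns 1 for any nonzero prevalence, which is the point where the curves converge.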
Enhancing Insight Through Visual Logistics
Figure 6 illustrates the new visual logistic, predictive value geometric mean-squared, or PV GM2. This simultaneously reflects contributions of both PPV and NPV for different tiers of sensitivity and specificity. Curves were plotted by using Equations 7 for PPV, 11 for NPV, and 25 for the multiplication of the two, that is, PPV • NPV. Setting the range of predictive value from 0 to 1.0 and multiplying PPV and NPV produces characteristic patterns of clinical performance in the context of the full range of prevalence for Tiers 1, 2, and 3.
For Tiers 1 and 2, PV GM2 highlights the rapid fall-off due to the influence of degradation of NPV with higher prevalence on the right. This occurs because the number of FNs resulting from low sensitivity [TP/(TP + FN)] increases relative to the diminishing number of TNs in the denominator of NPV [which is TN/(TN + FN)] as disease prevalence or the presence of antibodies nears 100% and positive test results dominate among the subjects tested. Since the sensitivity is 100% for the top 2 curves, there are no FNs, the NPV is 100% (NPV = 1), and PV GM2 = PPV • 1 = PPV. Hence, the top 2 curves do not fall off to the right.
The curve marked “★” in the top left corner reflects both outstanding sensitivity of 100% and exceptional specificity of 99.8% (14 days post PCR confirmation), one of the highest commercial claims for an anti–SARS-CoV-2 antibody test on a mainframe chemistry analyzer.24 A competitor claims equivalent high sensitivity and specificity of 100% and 99.89%, respectively.25 The “★” curve also demonstrates progressively better clinical performance across the entire range of prevalence.
As we have seen from the other figures, the curves fall off precipitously with low prevalence on the left because the number of FPs resulting from specificity lower than 100% increases in relative frequency compared to TPs, of which there are diminishingly fewer in the mainly disease-free population as the prevalence of COVID-19 approaches zero. Hence, PPV = TP/(TP + FP) plummets as FPs in the denominator increase relative to modest numbers of TPs in the setting of prevalence of 2% or lower. Note that commercial assays with claims of specificity of nearly 100% will help minimize this FP problem in the clinical context of low prevalence.22,25
Needs and Expectations
Sensitivity and specificity tiers help clarify needs and expectations for COVID-19 diagnostics. The tiers (1) demonstrate that the FDA EUA specifications and Health Canada minimum requirements for serologic tests (Tier 1) are not practical and, according to some physicians, are misdirected26; (2) allow some flexibility (Tier 2) for drive-in, walk-up, and point-of-care testing (POCT) programs when and where COVID-19 prevalence approximates 20%; and (3) illustrate that tests with 100% sensitivity and specificity of 99% or higher (Tier 3) are badly needed, whether the tests are performed in the laboratory or at the point of care.
Table 7 compares US, Canada, and United Kingdom specifications for serologic tests in relation to the 3 tiers.21,22 Knowledge of local and regional prevalence will provide valuable information for designing sensitivity and specificity. Lax FDA EUA specifications allow production of assays subject to cross-reactivity with other coronavirus antibodies and excessive FPs.
The CDC recognizes these weaknesses for serologic tests and suggests that “… an orthogonal testing algorithm (i.e., employing two independent tests in sequence when the first test yields a positive result) can be used when the expected positive predictive value of a single test is low.”27 However, recursive testing (Figure 2) entails extra expense, time, and effort for both provider and patient. Clinical performance proven in well-designed studies using diversified populations would facilitate more cost-effective deployment of COVID-19 diagnostics.
With the current cycles of “opening up/closing down” in different states, the positivity rate (RPOS) has become a moving target (see Table 2). Early on in metropolitan and various contagion clusters like some boroughs of New York City, RPOS was 1 case in 23 persons, or 4.3%,10 and among emergency medicine residents working in New York City, as high as 70% (adjusted 69.2%).6 In one borough in late May, the positivity rate was reported as 47%, probably because of subject selection (symptomatic patients, limited testing) early in the outbreak.11 Recently, RPOS has skyrocketed in Central California. Positivity rates must be interpreted cautiously, because poor test specificity generates unwanted and unrecognized FPs, especially with low prevalence.
The approximate level of population immunity thought to limit COVID-19 transmission will occur when prevalence is 60% to 70%. These percentages are derived by setting “R naught” (R0), the “basic reproduction number” times “u,” the proportion of the uninfected population susceptible to COVID-19, equal to 1, that is, R0 • u = 1.
In other words, when an infected person can transmit the SARS-CoV-2 virus to only 1 other person, exponential growth ends, steady-state occurs, and the epidemic curve “flattens.” Since the critical percentage immune, pc, equals 1 − u at this threshold, R0 • u = R0 • (1 − pc) = 1. Solving for pc, the critical percentage of immune persons needed to curtail an epidemic is pc = 1 − (1/R0). Thus, for 60% to 70% prevalence, R0 is approximately 2.5 to 3.3.
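The threshold relationship can be checked in both directions with a few lines of code. This sketch uses only the relationship pc = 1 − (1/R0) stated above and inherits its assumption of homogeneous susceptibility; the function names are hypothetical.

```python
def critical_immune_fraction(r0):
    """Fraction immune at which each case infects 1 other person: pc = 1 - 1/R0."""
    return 1.0 - 1.0 / r0


def r0_for_threshold(pc):
    """Basic reproduction number implied by a critical immune fraction."""
    return 1.0 / (1.0 - pc)


print(round(r0_for_threshold(0.60), 2))           # 2.5
print(round(r0_for_threshold(0.70), 2))           # 3.33
print(f"{critical_immune_fraction(2.5):.0%}")     # 60%
```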
Although assumptions about the reproduction number and the homogeneity of susceptibility can alter projections, and hence, anticipated trends in prevalence significantly, one can still use the mathematical foundations in Table 1 to design and interpret COVID-19 tests. Clinical performance must meet the challenges of low prevalence seen during the initial stages of the pandemic.
Figures 1 through 5 help interpret COVID-19 diagnostics. Figure 6, PV GM2, shows the trade-offs of FPs and FNs in the clinical context of prevailing prevalence. Suboptimal tests do more harm than good and should not be marketed. COVID-19 tests must be independently evaluated in populations with wide ranges of prevalence, and if prevalence is below 20%, then focused on that range with meaningful numbers of subjects and controls.
High sensitivity (Equation 1) is used to rule out, while high specificity (Equation 2), to rule in. For SARS-CoV-2 detection assays, FPs may generate unnecessary quarantine or misguided treatment. False negatives put everyone at risk, especially if negative results qualify returning to normalcy, as suggested in plans that use weekly universal national testing, 1-month hold and quarantine, and then release back to workplace mingling.28 For serologic antibody tests, FPs may mislead patients to think they have immunity when they do not. False negatives may impede returning to normal activities and slow opening of the economy.
Visual logistics demonstrate the impact of these trade-offs of sensitivity versus specificity. One manufacturer claims 100% sensitivity (no FNs) and specificity of 99.8% for a serologic antibody test24 (curve “★” in Figure 6), and another specificity as high as 99.95%.25 Other EUA claims may state similar high specifications.29 Overall, societal risk will be reduced if both FPs and FNs are minimized, as the PV GM2 curves in Figure 6 illustrate when using a tier concept of progressive excellence.
The PV GM2 curves in Figure 6 are helpful for 2-dimensional visual pattern recognition of relative clinical performance for either detection or immune status assays. A PV GM2 curve is not intended for integration or point comparisons. As a single index, PV GM2 may not be unique when sensitivity and specificity pairs are interchanged. The continuous curves allow easy visual comparisons of clinical performance for both low (0%–20%) and high (70%–100%) prevalence. They also show that when a condition has high pretest probability (prevalence ∼50%), tests tend to work well.
Health care forecasters and civic leaders30–33 have cited the need for expanded testing as high as 5 million tests per day for 1 month, or 6 million per week continuously for safe opening, and evenly distributed national universal testing,27 as well as creative implementation of point-of-need testing34 in geospatial hotspots worldwide.35 Recently, the FDA revoked EUAs for several antibody tests, including at-home kits.36–40 EUA specifications, including those for home testing, need to be pushed up to or above the Tier 3 specificity threshold (to decrease FPs) with the caveat that until then, several current POC serologic antibody tests are not suitable for settings with prevalence below 20%.
The CDC notes, “In the current pandemic, maximizing specificity and thus PPV in a serologic algorithm is preferred in most instances, since the overall prevalence of antibodies in most populations is likely low.”27 The CDC offers 3 options: (1) a very high specificity test to be used when prevalence is 5% or higher, (2) a focus on those with history suggestive of COVID-19, and (3) “…an orthogonal testing algorithm in which persons who initially test positive are tested with a second test...testing a patient sample with two tests, each with unique design characteristics (eg, antigens or formats).”27
By definition, CDC option “1” is ruled out in settings with prevalence below 5%, and option “2” is not feasible when faced with asymptomatic subjects presenting, for example, for workplace screening. Repetitive testing with different assays (Figure 2), CDC option “3,” is doable but, when applied to millions of people, is not time- or cost-effective for care providers, patients, or the nation as a whole. The CDC termed repeated testing orthogonal, but the implied statistical independence is difficult, if not impossible, to guarantee, even with different assay methods. For example, preanalytic missteps resulting from lack of training in collection of nasopharyngeal or throat swab samples could introduce serious systematic errors across all tests.
Therefore, it would be wiser to use a Tier 3 test with sensitivity 100% and specificity 99.8% (Figure 3, purple curve/star), but currently this generally means the patient will have to submit a specimen and wait out long delays for transport and mainframe testing. Hence, it makes sense that the National Institute of Biomedical Imaging and Bioengineering is “…overseeing the Rapid Acceleration of Diagnostic Technologies (RADx Tech) program, a $500 million effort to significantly increase testing capacity and accessibility for SARS-CoV-2…and…is supporting several areas of technology development outlined in three Notices of Special Interest, which, for example, support rapid POC and home-based testing and diagnostics.”41,42
Pooling and Empowerment
Vice President Mike Pence recently promoted a positivity rate of 5% or lower as a threshold for safe opening (see Table 2) in his Wall Street Journal article.14 There are confounding issues, however, and some states, such as California, cite an operational percentage of 8% positivity for considering a return to shutdown.17 Positivity rates depend on who submits to testing, that is, whether the cohort is symptomatic or not, when in the disease course the specimen is obtained, and test specificity, which if Tier 1 or 2, creates FPs that will corrupt RPOS.
Antibody detection is best about 14+ days after onset of presumed illness, which approximates the time needed to clear the virus, such that it cannot be grown in culture media. Errors may result from preanalytic problems during sampling strategies.19 “Cases” based on diagnostic criteria other than testing are being collated with positive tests, and results from different types of assays are being merged. Database merging of different assay results obfuscates interpretation of testing results and statistical analyses.
Sample pooling (also known as “Dorfman testing,” devised during World War II to screen soldiers for syphilis) combines patient samples to screen groups, an approach used to control the “second wave” outbreak in Beijing, China.32 If a pool is positive, then individual samples are assayed. The White House has been promoting this approach, supported by claims that with an overall positivity rate in the United States of 6%, 5 or more individual specimens could be pooled for SARS-CoV-2 testing.43 Laboratories must establish limits of detection before pooling, because the dilution of targets decreases analytic sensitivity. Pooling must not increase the risk of missing infected subjects, who would then unknowingly spread contagion.
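The efficiency of two-stage pooling as described above can be estimated with the classic expected-value calculation. The sketch below assumes an idealized assay (no loss of sensitivity from dilution, which the text cautions against), independent infection status among pooled subjects, and a prevalence equal to the quoted 6% positivity rate. The function names and the upper limit on pool size are illustrative choices.

```python
def expected_tests_per_person(prevalence, pool_size):
    """Expected number of assays per subject under two-stage pooling.

    One pooled assay is shared by `pool_size` subjects; if the pool is
    positive, every member is retested individually.
    ASSUMPTIONS: perfect assay, no dilution effect, independent subjects.
    """
    if pool_size < 2:
        return 1.0  # a "pool" of one is ordinary individual testing
    p_pool_negative = (1.0 - prevalence) ** pool_size
    return 1.0 / pool_size + (1.0 - p_pool_negative)


def best_pool_size(prevalence, max_pool_size=32):
    """Pool size (1..max_pool_size) that minimizes expected assays per subject."""
    sizes = range(1, max_pool_size + 1)
    return min(sizes, key=lambda k: expected_tests_per_person(prevalence, k))


if __name__ == "__main__":
    p = 0.06  # quoted US positivity rate, used here as a stand-in for prevalence
    k = best_pool_size(p)
    print(k, round(expected_tests_per_person(p, k), 3))
```

Under these assumptions, a pool of 5 at 6% prevalence requires about 0.47 assays per subject, roughly doubling throughput. As prevalence rises the advantage shrinks and eventually disappears, which is why pool size has to be matched to local prevalence.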
Pooling could expand testing from 0.5 to 5 million per day and aid contact tracing among those asymptomatic in 3100 county communities to abate local outbreaks.43 Investigators have devised multistage and algorithmic Web-based protocols for pooling of many samples with high efficiency, but the efficacy of such pooling depends on prevalence and geospatial evaluation in context.44,45 The FDA provided developer templates, and then issued the first EUA for pooled testing, which was followed by CDC guidance.46–49 Broader testing and pooling might help alleviate mortality disparities among low-income workers and Hispanic persons hard hit by the pandemic.
According to the FDA, developers with a test that has not previously received an EUA should establish the pool size for the claim and perform a clinical validation study large enough to ensure at least 30 samples test positive with a comparator method.47 All samples must also be tested individually. Adding pooling to an existing EUA requires a clinical study large enough to include 20 positive samples. The FDA recommends developers collect samples at a minimum of 3 geographically diverse sites for both claims.47 (Subjects should be diverse too.)
Group pooling for testing can help offset the extraordinary costs incurred from reopening companies and testing employees. However, workers are declining opportunities for free testing out of concerns for privacy, retribution, missing work, and forfeiting pay.50 Test costs should be posted, so that informed people might elect to obtain testing at competitive prices. Methods for protecting privacy are badly needed to instill POC culture51–53 that will protect the elderly and vulnerable populations and give people a satisfying sense of caring for others.
Rapid response COVID-19 POCT can enable self-tracing of family, friends, and colleagues who have been around the person who performed the self-test and, if the result is positive, prompt a recommendation to quarantine. This personal empowerment sequence is: POC Test → Trace → Target, or for short, “POC•TTT.” Micro-empowerment will allow public health officers and their governments to stop short of indiscriminate and damaging general shutdowns. Instilling POC culture will empower people to help inform contacts, thereby avoiding contact tracing overload and also notorious scams.
New Epicenters and Pandemic Waves
Debate continues regarding projected second, third, or more waves occurring during 2020 or 2021, as businesses open up and people mingle professionally and socially.50 However, the first wave appears to be continuing throughout summer in some regions, and there is little disagreement that testing volumes should increase. Increases in volumes will reveal additional COVID-19 cases, which is not necessarily politically popular in an election year.
A bad test is worse than no test at all. Requirements must be tightened to make diagnostic testing results in future waves intelligible and reliable. Both FPs (higher specificity) and FNs (100% sensitivity) must be addressed. The sensitivity and specificity thresholds for an FDA EUA need to be much higher to avoid rendering a disservice to the medical profession,26 patients, and point-of-careologists54 in the United States and abroad.
Ill-advised commercial investment in low-performance testing will incur significant economic losses on an international scale during pandemic waves that cross borders. Continental epicenters are appearing in Africa and Latin America. Mexico tests 6.3/1000, and Argentina, 6.5/1000, compared to 83/1000 people in the United States.55 Deaths in these regions could overtake those in the United States.
The FDA has tightened EUA requirements for antibody tests used to screen asymptomatic subjects in revised templates published in June 2020.46,47 To add asymptomatic population screening to a test that already has an EUA, a postauthorization study may qualify if there are a minimum of 20 positive specimens and at least 100 negative specimens. The FDA expectation is that positive percent agreement should be 95% or greater and negative percent agreement should be 98% or greater,46,47 equivalent to the Health Canada target values, for which the PPV graphic is illustrated in the right frame of Figure 2.
These specifications do not qualify a test as Tier 3 and should be tightened further to 100% sensitivity and at least 99% specificity because, by definition, subsets of asymptomatic people cannot be screened clinically for symptoms and signs of COVID-19. Hence, the cohort tested will probably represent one of the lowest-prevalence groups, if not the lowest. The test must be specific enough to avoid cross-reactivity with other common coronaviruses.
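The practical difference between these specifications can be made concrete by computing the prevalence below which false positives outnumber true positives. The sketch below treats positive and negative percent agreement as if they were sensitivity and specificity, which is an assumption, and the function names are hypothetical.

```python
def break_even_prevalence(sensitivity, specificity):
    """Prevalence at which expected false positives equal expected true positives."""
    false_pos_rate = 1.0 - specificity
    return false_pos_rate / (sensitivity + false_pos_rate)


def false_positives_per_true_positive(prevalence, sensitivity, specificity):
    """Expected FP/TP ratio when the test is applied at the given prevalence."""
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_pos = sensitivity * prevalence
    return false_pos / true_pos


if __name__ == "__main__":
    # ASSUMPTION: PPA/NPA are treated as sensitivity/specificity.
    specs = {
        "PPA 95%, NPA 98%": (0.95, 0.98),
        "Se 100%, Sp 99%": (1.00, 0.99),
        "Se 100%, Sp 99.8%": (1.00, 0.998),
    }
    for label, (se, sp) in specs.items():
        threshold = break_even_prevalence(se, sp)
        ratio = false_positives_per_true_positive(0.01, se, sp)
        print(f"{label}: break-even {threshold:.4f}, FP per TP at 1% = {ratio:.2f}")
```

With these inputs, false positives exceed true positives below about 2.1% prevalence for the 95%/98% specification, below about 1.0% for 100%/99%, and below about 0.2% for 100%/99.8%. The same function reproduces the 5.3% threshold for 90% sensitivity and 95% specificity.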
Temporal Uncertainty and Trend Analysis
Notable limitations56–58 affecting the design of COVID-19 diagnostics comprise applications, operator skills, context, quality assurance, and importantly, time. Sensitivity and specificity are estimates when they are based on a subset of known subjects from the intended population.59 If an alternative subset is tested or the same subjects tested at a different time, the sensitivity and specificity obtained might differ.
Positive percent agreement and negative percent agreement for a COVID-19 test are documented by the manufacturer when processing FDA EUA credentials. However, prevalence varies as a function of time, p(t), and therefore, so do the numbers of FPs and FNs when the test is applied clinically: FP = FP(t) and FN = FN(t). That means PPV(t) and NPV(t) must be treated as dynamic, time-variable parameters subject to change as the pandemic progresses temporally and expands geospatially.
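Written out, the time dependence takes the following form. The notation is assumed for illustration: Se and Sp denote sensitivity and specificity (taken as fixed), p(t) the prevalence at time t, and N the number of subjects tested.

```latex
\begin{align}
  \mathrm{FP}(t) &= N\,(1-\mathrm{Sp})\,\bigl(1-p(t)\bigr), \qquad
  \mathrm{FN}(t) = N\,(1-\mathrm{Se})\,p(t), \\[4pt]
  \mathrm{PPV}(t) &= \frac{\mathrm{Se}\,p(t)}
                          {\mathrm{Se}\,p(t) + (1-\mathrm{Sp})\,\bigl(1-p(t)\bigr)}, \\[4pt]
  \mathrm{NPV}(t) &= \frac{\mathrm{Sp}\,\bigl(1-p(t)\bigr)}
                          {\mathrm{Sp}\,\bigl(1-p(t)\bigr) + (1-\mathrm{Se})\,p(t)}.
\end{align}
```

As p(t) falls toward zero, PPV(t) falls toward zero for any specificity below 100%, whereas NPV(t) approaches 1. With 100% sensitivity, FN(t) = 0 and NPV(t) = 1 at every prevalence.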
Rashid et al60 recommend comprehensive evaluation of serologic tests, for which they found sensitivity varies from 72.7% to 100%, and specificity, 98.7% to 100%. Torres and Rinder61 recommend that serologic tests for SARS-CoV-2 antibodies perform as well as intended and that information be provided that enables health care providers, administrators, and health officials to best interpret and apply the available evidence. Confidence intervals, significance levels, and discrimination intervals62 (not displayed in the figures) reflect statistical uncertainty and should be part of validation data analysis.
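The statistical uncertainty in a validation study can be gauged with a binomial confidence interval. The sketch below uses the Wilson score interval, which is one common choice and an assumption here, because the text does not state which interval method should be used. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Two-sided Wilson score interval for a binomial proportion.

    ASSUMPTION: the Wilson method is used for illustration; other
    interval methods (e.g., exact Clopper-Pearson) give different bounds.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Illustrative case: all 30 comparator-positive samples detected.
    low, high = wilson_interval(30, 30)
    print(f"observed 100%, 95% CI {low:.3f} to {high:.3f}")
```

Even when all 30 positive samples are detected, the lower 95% bound by this method is only about 88.6%, which shows how little a small validation set can say about true sensitivity.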
Preanalytic errors can add to COVID-19 test diagnostic uncertainty. Errors may result from improper sampling during swab collections or bronchoalveolar lavage, decreasing or fluctuating viral loads, variations in SARS-CoV-2 viral counts in blood, ineffective instrument maintenance, and environmental temperature shock of target media during transport. Self-sampling at-home testing without adequate training may degrade test results. Pooling of contact samples, an expedient used extensively in China,32 may be affected adversely by changes with time. Trend analysis of quantitative antibody levels and viral loads will allow better understanding of disease dynamics, risks associated with new strains of SARS-CoV-2, and spread of contagion.
Sites, Samples, and Companion Diagnostics
The FDA has authorized an EUA for a COVID-19 diagnostic that uses at-home sample collection63 and has issued a template for molecular and antigen diagnostic COVID-19 tests for use in nonlaboratory settings.64 For nonprescription (over the counter) tests intended for use in nonlaboratory settings, FDA recommends positive percent agreement of 90% or greater for asymptomatic and symptomatic subjects and negative percent agreement of 99% or greater with the lower bound of the 2-sided confidence interval being at least 95%. To add an asymptomatic claim postauthorization to a test already authorized for nonlaboratory use without an asymptomatic claim, the FDA expects positive percent agreement of 95% or greater and negative percent agreement of 98% or greater.
For symptomatic patients only and prescription nonlaboratory use, the FDA recommends positive percent agreement of 80% or greater and negative percent agreement of 99% or greater with a lower bound of 95% or greater.64 The rationale given is that, “…the inclusion of symptoms as a requirement for testing increases the pre-test probability of a positive result (higher prevalence) and therefore increases the PPV of the test. FDA believes that a PPV of a test with below 90% positive percent agreement would be insufficient without this mitigation (confirming symptoms).”64
These complex sets of FDA specifications are challenging for manufacturers, providers, and laypersons to understand. Simplification using a tier concept would facilitate FDA's responsibility of educating the public and health care professionals about COVID-19 diagnostics. For antibody tests, the sampling time for an individual patient may not synch with the pattern of the immune response, and in fact, trends in antibody titers and viral load are yet to be mapped out precisely.65–68 New semiquantitative tests will help.25,69 Whether or not the presence of antibodies qualifies as protection against recurrences or new infections is uncertain. Therefore, interpretation of test results must yield to these unknowns. Companion diagnostics, such as interleukin 6 and D-dimer, can help shore up clinical impressions.70
CONCLUSIONS AND STRATEGIES
Mathematical analysis and visual logistics provide a sound foundation for the design, selection, and understanding of COVID-19 diagnostics. Low prevalence is disruptive to all but the highest caliber tests. Unfortunately, at this stage in the pandemic, prevalence is unpredictable, inconsistent, and largely unknown, which adds to uncertainty.
Rapid response testing, patient access, and effective diagnosis promote realistic physician, public health, and POC decision-making. Fast POC detection can help stop transmission, quell outbreaks, and improve standards of care. Consistent consensus national guidelines for test specifications and requirements would improve standards of care, especially for POCT.
Rapid response diagnostics have fallen behind an exponentially accelerating pandemic. Investment in the development of new POC technologies, such as the NIBIB initiative,41,42 is warranted by the inevitable spread of SARS-CoV-2 and new strains to rural areas with low prevalence. De-identified demographics, assay targets, test characteristics, and test results should be collected in an open access national database that segregates molecular, antibody, and other assay concepts. The centralized database should also tell us where testing capacity is available.
People have basic rights: universal access to testing, fast test results, understanding of what they mean, and confidentiality. Fair allocation of testing will help avoid disproportionate socioeconomic effects on vulnerable groups. Vaccination is months away, possibly longer for all 330 million Americans. Population immunity is a long way off, if ever attainable. Personal and family self-testing will motivate people to wear masks, self-quarantine if necessary, and protect their children and teachers, as they experience POC culture.51–53
Individuals submitting a specimen should be given a disclosure that documents the sensitivity and specificity tier. The disclosure should be clearly illustrated for facile comprehension, understanding, and learning. Point-of-care testing should not be thought of as an excuse for inaccuracy.
Providers will need to explain the impact of low prevalence and link test metrics with diagnostic performance along the spatial care path71–73 from the home, drive-up, and emergency room to the intensive care unit in the hospital. The pretest probability of COVID-19 infection increases progressively from home to hospital, and POCT compresses time along this geospectrum. Time is of the essence in the emergency room, for example, and physicians there see more infected patients, so the pretest probability of COVID-19 increases. As a result, the effective prevalence increases as well.
Expanded access to COVID-19 diagnostics with 100% sensitivity and high specificity of 99% or higher, capable of detecting SARS-CoV-2 across the broad range of prevalence, combined with discovery of trends in antibody immune response through quantitative antibody testing, will improve the nation's approach to public health and collective welfare.
For example, airline travel corridors would be safer if testing were made widely and readily available en route. People deemed COVID-19 free might avoid lengthy quarantines upon arrival at destinations. Diagnostic solutions that are environmentally robust must be made available throughout the United States and across the globe in limited-resource countries. Partial or evanescent responses to new strains of SARS-CoV-2 may lead to underestimation of COVID-19 infections.
Progressively structured performance tiers promote high quality and realistic expectations for POC and laboratory COVID-19 tests. Mathematical modeling of transmission based on Tier 1 and 2 performance74,75 discounts diagnostics in contact tracing and pandemic management, other than showing that time to test result is critical, and hence, providing rationale for POC strategies.
Outcomes modeling should assume Tier 3 performance, in which case complex mathematical predictions will improve. In other words, design and model first for high-performance testing. Next, rule out suboptimal tests if the mathematical modeling74,75 shows they are not effective in achieving public health goals, such as mitigating transmission and qualifying re-employment. Let regulatory clearance, open competition,76 and efficacy follow suit.77
Virtual education following the themes in this article as well as training in geospatial concepts and strategies is available under open access online35,78 and at the Web site of the Next Generation Dx Virtual Interactive Global Summit.79 Public health schools can prepare students, practitioners, and providers by educating in needs assessment, diagnostics selection, and quantitative interpretation, using readily accessible POC curriculum80 plus visual logistics. Teaching the new curricula should start now.
The author thanks the creative students who participate in the POCT•CTR and contribute substantially to knowledge in mathematical analysis of diagnostics and POCT. The author also is grateful to have received a Fulbright Scholar Award 2020–2021, which supports theoretical analysis of COVID-19 and other diagnostics, strategic POC field research in ASEAN Member States, mainly Cambodia and the Philippines, community and university lectures throughout Southeast Asia, and collaboration with Professor Liu, Wuhan, China.
Figures and tables are provided courtesy and permission of Knowledge Optimization, Davis, California.
This work was supported in part by the Point-of-Care Testing Center for Teaching and Research (POCT•CTR) and by Dr Kost, its director.
The author has no relevant financial interest in the products or companies referenced in this article.