Delta checks are a powerful technique for monitoring clinical assays in many disciplines but have not been routinely used in molecular testing.
To determine whether the biologically determined kinetics of the rise and fall of BCR-ABL1 could support the development of a delta check for BCR-ABL1 testing.
Nine years of BCR-ABL1 p210 results were evaluated, and patients with 3 or more results were selected for inclusion. The kinetics of these percentages of international standard values were plotted against time along with the median and the 90th and 95th percentile lines. A Monte Carlo simulation of a batch mix-up was performed for 6 months of data to determine the efficacy of the proposed cutoff.
The median kinetics showed a 1-log drop of the percentage of international standard in 90 days, with less than 5% of cases showing a faster than 2-log drop in 90 days and less than 2.5% showing a faster than 3-log drop in 90 days (extrapolated to 1 log in 30 days). The Monte Carlo simulation of a batch mix-up showed that an average mix-up of 23 samples could routinely be flagged by this cutoff, albeit with wide variance.
These results suggest that using a drop in the percentage of international standard of greater than 1 log in 30 days can be a useful trigger in implementing a delta-check system for this molecular test.
Delta checks are a useful and recommended approach to monitoring assays, especially in clinical chemistry and hematology. In brief, patient results are monitored over time, and if too great a change (the delta) is seen, a flag is generated for subsequent review. The technique is well described,1 has been in use for many decades,2 and has several advantages, including identification of preanalytical errors such as specimen misidentification and instrumentation errors.3 Within the realm of molecular diagnostics, in contrast, most quality control has taken cues from high-volume clinical chemistry laboratories, focusing on detecting preanalytical issues such as DNA integrity and input quantities4 or analytical issues such as DNA contamination or interrun variability with Levey-Jennings plots.5,6 However, no studies that we know of have examined the utility of intrapatient variability (delta checks) in monitoring molecular assays. The lack of adoption of this technique may be due to the orders-of-magnitude-lower volumes of testing and the semiquantitative nature of many of our assays. Nevertheless, there may still be benefits to the application of this technique, and indeed a manual and primitive heuristic is sometimes used by sign-out pathologists to look for unexpected trends in sample results that might indicate a sample switch or other major issue. In the field of quantitative BCR-ABL1 p210 testing for following chronic myeloid leukemia, numerous international guidelines suggest frequent testing to follow the disease course (for example, every 3 months per the National Comprehensive Cancer Network guidelines7) and recommend using the international scale (IS) to normalize results across time and institutions.8,9 This results in a large number of tests that are presumably comparable over long periods of time.
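The kinetics-based delta check discussed in this report can be sketched in a few lines. The following is a minimal illustration, not our production implementation; the function name, the 1 log/30 d threshold constant, and the reporting floor used for undetectable results are assumptions for demonstration only.

```python
import math
from datetime import date

# Minimal sketch of a kinetics-based delta check (illustrative only).
# A result is flagged when the %IS fell faster than 1 log per 30 days
# relative to the patient's previous result. The reporting floor is an
# assumed stand-in for values near the limit of detection.
DROP_THRESHOLD = 1.0   # logs per 30 days
FLOOR_IS = 0.001       # assumed %IS reporting floor

def delta_check(prev_is, prev_date, curr_is, curr_date,
                threshold=DROP_THRESHOLD):
    """Return True if the drop in %IS exceeds `threshold` logs/30 d."""
    days = (curr_date - prev_date).days
    if days <= 0:
        return False
    drop_logs = math.log10(max(prev_is, FLOOR_IS) / max(curr_is, FLOOR_IS))
    return drop_logs * 30.0 / days > threshold

# A 2-log drop over 30 days trips the flag; a 1-log drop over ~90 days does not.
print(delta_check(10.0, date(2020, 1, 1), 0.1, date(2020, 1, 31)))  # True
print(delta_check(10.0, date(2020, 1, 1), 1.0, date(2020, 4, 1)))   # False
```

Note that normalizing the drop to a 30-day window lets the same rule cover results collected at irregular intervals.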
In this report, we describe an analysis of more than 9 years of BCR-ABL1 major transcript quantifications at our institution and propose data-driven thresholds for out-of-bounds BCR-ABL1 kinetics.
MATERIALS AND METHODS
Institutional review board approval was received for this retrospective study. Fifty-five thousand eight hundred thirty-five unique BCR-ABL1 quantifications in blood and bone marrow were identified between May 16, 2011, and August 21, 2020, belonging to 25 208 unique medical records. Of these, 4479 unique patients had 3 or more data points and were included in the study (median, 6 data points per patient; mean, 8.85). Analysis and simulations were performed using Python 3.7.3 with the numpy (v1.16.2), scipy (v1.2.1), and pandas (v0.24.2) packages; plots were generated with matplotlib (v3.0.3).
To evaluate the performance of this cutoff in a real-world application to detect specimen mix-ups, a Monte Carlo simulation was used. In brief, we developed an algorithm (delta check) that would flag a specimen if the quantitative value varied significantly among samples from the same patient over a defined period of time. Next, we simulated sample mix-ups by randomizing the link between results and patients over a whole run and then applying the delta check (1 log/30 d). We used data from a 180-day period where the number of cases per day was at least 5, and for each day's cases we mixed up the specimens and ran the simulation to determine the number of delta checks flagged. This simulation was run 10 times per day's worth of specimens. We then tabulated these results to determine an average number of delta flags.
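The shuffle-and-flag loop described above can be illustrated with a simplified sketch on synthetic data. The real analysis used 9 years of patient results, so the value ranges, kinetic parameters, and the `count_flags` helper below are invented for illustration only.

```python
import math
import random

# Simplified sketch of the batch mix-up simulation on synthetic data
# (the study used real results; all values and parameters here are invented).
random.seed(0)

def count_flags(priors, currents, days=30, threshold=1.0):
    """Count results whose drop versus the prior exceeds `threshold` logs/30 d."""
    n = 0
    for prev, curr in zip(priors, currents):
        drop_logs = math.log10(max(prev, 1e-3) / max(curr, 1e-3))
        if drop_logs * 30 / days > threshold:
            n += 1
    return n

# Synthetic batch: each new result tracks its own prior value with roughly
# median kinetics (~1-log drop per 90 days, i.e. ~0.33 log per 30 days).
priors = [10 ** random.uniform(-2, 1) for _ in range(30)]
currents = [p * 10 ** -random.gauss(0.33, 0.15) for p in priors]

baseline = count_flags(priors, currents)
shuffled = currents[:]
random.shuffle(shuffled)                # simulate a whole-batch mix-up
excess = count_flags(priors, shuffled) - baseline
print(baseline, excess)
```

Comparing flags after shuffling against the unshuffled baseline is what isolates the excess flags attributable to the simulated mix-up.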
Furthermore, to evaluate if the chosen cutoff was optimal for detection of mixed-up specimens, we ran the same Monte Carlo pipeline with differing delta check cutoffs ranging from 0.5 log/30 d (1 log in 60 days) to 1.5 logs/30 d, which encompasses cutoffs that would result in a reasonable number of cases being flagged (roughly 5% to 0.15%, respectively).
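As a toy illustration of this cutoff sweep, consider the fraction of results flagged at each candidate threshold; the drop values below are invented, whereas the actual evaluation ran the full Monte Carlo pipeline on real batch data.

```python
# Toy sweep of candidate cutoffs from 0.5 to 1.5 logs/30 d over a list of
# hypothetical normalized drops (logs per 30 days); invented values only.
def flagged_fraction(drops, cutoff):
    """Fraction of results whose drop exceeds the cutoff (logs/30 d)."""
    return sum(d > cutoff for d in drops) / len(drops)

drops = [0.2, 0.4, 0.6, 0.8, 1.1, 1.6]  # hypothetical drops, logs/30 d
for cutoff in (0.5, 1.0, 1.5):
    # A higher cutoff flags fewer cases: a lower false-positive rate
    # at the cost of a higher false-negative rate, and vice versa.
    print(cutoff, flagged_fraction(drops, cutoff))
```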
RESULTS
Of the 4479 unique patients with 3 or more data points, 994 had an initial IS value of more than 10%, which was assumed to represent the initial diagnosis of disease. Most patients responded to therapy as expected, although a significant cohort showed a persistent IS value of more than 10% (see Figure 1, a).
a, Median kinetics for all 994 patients with a starting international scale (IS) value of >10%. b, Median kinetics of 469 patients who showed a major molecular response. c, Subset kinetics of 175 patients (colored red) who showed a faster than 1 log/30 d drop, normalized to initial IS percentage. Blue lines highlight 81 patients who show rapid rebound. d, Kinetics of 448 patients (colored red) with kinetics between 1 log/30 d and 1 log/90 d. Blue lines highlight the 52 who showed rapid rebounding on subsequent testing. Abbreviation: PCR, polymerase chain reaction.
Of patients with an initial IS value of more than 10% (994), 469 (47%) eventually showed a major molecular response (a 3-log drop in the IS) at some point in their treatment course; the median kinetics was a 1-log drop in 90 days, with less than 5% showing a greater than 2-log drop in 90 days (or 1 log in 45 days), and less than 2.5% showing a greater than 3-log drop in 90 days (or 1 log in 30 days; see Figure 1, b).
In further subclassifying cases with rapidly decreasing kinetics (greater than 1% IS value followed by a drop faster than 1 log/30 d) at any point in their disease course, we found that 46% (81 of 175) showed a subsequent rebound on repeat testing on par with normal kinetics (see Figure 1, c). This is contrasted with cases showing drops between 1 log/30 d and 1 log/90 d, where a significantly smaller fraction of cases (52 of 448; 11.6%; P < .001) showed a subsequent rebound (see Figure 1, d).
In contrast to rapid decreases in the IS value, among cases with rapid increases (greater than 1 log in 30 days) after achieving a major molecular response (IS value <0.01%), only 22% (40 of 180) showed a return to within ±1 log of the initial value on the subsequent test after the rapid increase. Among cases with slower increases (between 1 log/30 d and 1 log/90 d), however, 46.7% (142 of 304) showed such a return on the subsequent test.
The Monte Carlo evaluation of daily runs with 5 or more cases per day during 180 days demonstrated an average probability of detecting a sample mix-up of 5.2% per swapped case, but with very wide variance (R2 = 0.487). Although on average a shuffling of 23 cases resulted in a delta-check flag, in some runs a mix-up of as few as 15 cases generated a flag, whereas in other runs even a shuffling of 80+ cases did not (Figure 2).
Results of Monte Carlo simulation looking at the number of excess delta-check flags versus the swapped cases in a run during 180 days of runs. Each dot represents a run of cases, with swapped specimens represented by the x-axis. The y-axis represents delta-check flags in excess of the baseline (without swapped cases). The black line represents a linear regression of these points.
In analyzing cutoff values between 0.5 log/30 d and 1.5 logs/30 d, we found a potential local minimum at 1 log/30 d, with values above and below this threshold yielding a trend toward an increased number of cases (+1 to +5) needing to be shuffled to generate a delta-check flag. These trends did not reach statistical significance given the substantial underlying variance in the data, and although more extreme cutoff values could show statistically significant changes in the number of cases swapped before detection, such cutoffs were not practically useful. As expected, the false-positive rate (the number of cases flagged in the absence of a specimen switch) increased when the cutoff was decreased to 0.5 log/30 d, whereas the false-negative rate increased when the cutoff was increased to 1.5 logs/30 d.
DISCUSSION
For clinical monitoring, BCR-ABL1 kinetics are well known to provide additional prognostic information regarding therapy. BCR-ABL1 decreases have been evaluated as a potential prognostic factor, with rapid reduction associated with sustained treatment-free remission,10 lack of change associated with worse prognosis,11 and fast doubling times associated with mutation or poor adherence as causes of lost tyrosine kinase inhibitor response.12 The underlying biology of these kinetics has been explored, with very similar results to those seen here using IS-calibrated reverse transcription polymerase chain reaction for BCR-ABL1 messenger RNA transcripts,13 suggesting our results are generalizable outside of our laboratory. Indeed, most pathologists within our large academic practice use a similar heuristic when signing out these cases, looking for a greater than 1-log drop in 30 days to raise the possibility of a sample switch or other issue; however, the origin of this useful heuristic is unclear and is not evident in the literature.
Although smaller laboratories will not be able to determine laboratory-specific cutoffs for BCR-ABL1 kinetics with statistical confidence, the existence of the IS in BCR-ABL1 testing means that our results should be applicable to other laboratories under the assumption of stereotypic biological behavior of chronic myeloid leukemia in its response to standard tyrosine kinase inhibitor therapy. Indeed, even with dissimilar patient populations (eg, a reference center seeing a large proportion of relapsed or refractory chronic myeloid leukemia), these results should hold given the underlying biology.
Within our data, the majority of presumed responders showed a stereotypic decrease of 1 log/90 d, and only a small fraction demonstrated drops faster than 1 log/30 d in their IS values. Indeed, the rapid rebound of the IS value on subsequent testing after a 1 log/30 d drop raises the possibility of a spurious result (ie, a falsely low or negative result), as nearly half of these results returned to the median kinetics. These findings therefore suggest results with drops faster than 1 log in 30 days should be flagged for additional review, including repeat testing to exclude causes of false negatives such as specimen switch, reaction inhibitors, etc. In contrast to cases with rapid decreases in the IS value, only a small fraction of cases with a rapid increase in IS value normalized, suggesting a true treatment failure (assuming a change of therapy was undertaken after the increase), whereas in cases of a slower IS value rise, a significant fraction normalized. Given these results, rises in IS value are less useful as a delta check, and our results seem to support conclusions raised by other authors.12,14
There are numerous limitations to this study. The retrospective nature, absence of clinical correlation including the actual disease diagnosis, and absence of other data such as specimen/DNA quality require major assumptions about the disease status and course of patients. For example, we assume that the data we see are from patients with chronic myeloid leukemia, which likely represents the majority of test results, but we are unable to exclude cases of B-lymphoblastic leukemia with t(9;22) or acute myeloid leukemias with t(9;22), which likely have different kinetics in a posttherapy setting, and some of these cases with fast kinetics may represent these different disease states. Furthermore, as data were from a national reference laboratory, the majority of cases lack associated clinical data about the course of therapy. The question of whether the initial IS value of more than 10% represents the true initial time point of untreated disease or whether this represents someone who failed response and then transferred to a tertiary center cannot be answered based on our database.
From a practical standpoint, this delta-check method is far from perfect: the Monte Carlo simulations suggest that a delta check will flag, on average, only once per 23 switched cases, and in some situations as many as 80 cases could be switched without a flag. On the other hand, although sample switches are unfortunate and to be avoided, given the statistics and distribution of our results, many of these presumed switches may not be clinically significant (eg, 2 or more negative patient samples swapped). Indeed, in our limited experience, single pairwise specimen swaps are unusual and nearly impossible to detect; row/column swaps or frameshift mix-ups leading to large batch errors are another type of sample switch, one that our simulations show would be more likely to be flagged with a delta-check approach. To our knowledge, there are no published data on failure modes in high-volume batch tests performed on 96-well plates.
This assay's analytical variability is not expected to significantly impact this delta-check method because, in our hands, those variances are far smaller than the 1-log delta threshold. Moreover, because the Monte Carlo simulations used real-world results, this variability is already accounted for in our analysis of the performance of the proposed cutoff. Likewise, analytical variability near the limit of detection, although increased, remains below the 1-log threshold (coefficient of variation less than 0.5 log based on our validation data), suggesting analytical variability will not be a significant source of errors. Furthermore, samples collected at longer intervals (eg, 90 days rather than 30 days after start of therapy per National Comprehensive Cancer Network recommendations7) would be even more amenable to detection of specimen switches, because extrapolating our 1 log in 30 days cutoff to 3 logs in 90 days makes analytical variation an even smaller proportion of the error margin.
The principal advantage of using Monte Carlo methods in estimating the performance of our flags lies in avoiding the need to estimate and model distributions of laboratory values. This is particularly challenging in laboratory values that are fundamentally continuous but whose measurement is limited by hard stops such as limit of detection on the low end or saturation/hook effect on the high end. Furthermore, other classical assumptions such as gaussian normality and temporal constancy do not hold true for many biological analytes.15 By using a Monte Carlo simulation to estimate error flagging rate, we can leverage real-world data to estimate performance rather than making assumptions, some of which would lack closed-form solutions.
In order to broadly harness the benefits of delta checks in a prospective manner, this proposed delta-check cutoff would be best implemented either within the laboratory information system, if functionality permits, or as a modification to middleware solutions that flag cases for review and repeat. We are exploring the potential of the latter to prospectively flag these cases at the time of diagnosis and thereby gather associated clinical data that may provide more clarity with respect to disease status and type. Because this is a relatively rare event (<2.5% of cases, or more than 2 standard deviations from the median), we expect that reflexive repeat testing along with other quality checks should not add significant cost or time and may reduce spurious (false-negative) results. Even in situations where this is not possible (smaller laboratory settings, for example), this cutoff could be used by pathologists signing out smaller numbers of cases, as long as prior results are available for review. Regardless of how cases are flagged, the follow-up procedure is likely the same: checking original orders and paperwork, checking sample and subsample labels, correlating with clinical history, and repeating extraction and amplification of the original sample from the client.
References
Author notes
The authors have no relevant financial interest in the products or companies described in this article.