## Abstract

*Context.*—Clinical laboratory assessment of test linearity is often limited to satisfying regulatory requirements rather than integrating this tool into the laboratory quality assurance program. Although an important part of quality control and method validation for clinical laboratories, linearity of clinical tests does not get the attention it deserves.

*Objective.*—This article evaluates the concepts and importance of linearity evaluations for clinical tests.

*Design.*—We describe the theory and procedural steps of each linearity evaluation. We then evaluate the statistical methods for each procedure.

*Results.*—Visual assessment, although simple, is subjective. The lack-of-fit error and the 1986 NCCLS EP6-P G test are sensitive to imprecision and assume that the data are first order. Regression analysis, in the form of the polynomial method, is partly based on the experiences of the College of American Pathologists Instrumentation Resource Committee and has proved to be a robust statistical method.

*Conclusions.*—We provide general guidelines for handling non-linear results from a linearity evaluation. Handling linearity data in an objective manner will aid clinical laboratorians whose goal is to improve the quality of the tests they perform.

According to NCCLS EP6-A, a quantitative analytical method is said to be linear when the analyte recovery from a series of sample solutions (measured value) is linearly proportional to the actual concentration or content of the analyte (true value) in the sample solutions.1 The points at the upper and lower limits of the analytic measurement range that acceptably fit a straight line determine the linear range. In some assays, the instrument response versus concentration of sample solutions is not linear; for example, competitive radioimmunoassays have a parabolic-shaped instrument response when plotted against concentration and a sigmoid-shaped curve when the response is plotted against the logarithm of the concentration. The responses may be transformed using a 4-parameter logistic formula or other formula such as logit-log. The test results from this transformation should be linearly proportional to the true value of the analyte in the sample solutions (Figure 1). Therefore, the curve of the instrument response, which can be parabolic or sigmoid-shaped, should not be confused with linearity between the measured value and the true value.
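The distinction between a curved instrument response and a linear test result can be illustrated with a minimal Python sketch. It assumes one common parameterization of the 4-parameter logistic curve; the parameter values and the function names `four_pl` and `inverse_four_pl` are hypothetical and chosen only for illustration.

```python
def four_pl(conc, a, b, c, d):
    """Instrument response predicted by a 4-parameter logistic curve.

    a: response at zero concentration
    d: response at very high concentration
    c: concentration at the inflection point
    b: slope factor
    """
    return d + (a - d) / (1.0 + (conc / c) ** b)


def inverse_four_pl(response, a, b, c, d):
    """Back-calculate the concentration (the reported result) from a response."""
    return c * ((a - d) / (response - d) - 1.0) ** (1.0 / b)


# Hypothetical calibration of a competitive assay: the signal falls as the
# concentration rises. These numbers are illustrative only.
params = dict(a=10000.0, b=1.2, c=50.0, d=200.0)

true_values = [5.0, 20.0, 50.0, 120.0, 300.0]
responses = [four_pl(x, **params) for x in true_values]
measured = [inverse_four_pl(r, **params) for r in responses]

for x, r, m in zip(true_values, responses, measured):
    print(f"true={x:7.1f}  response={r:9.1f}  measured={m:7.1f}")
```

The responses are clearly not proportional to concentration, yet the back-calculated results fall on the line of identity. It is the relationship in the last column, not the shape of the response curve, that a linearity evaluation examines.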

Some laboratory personnel confuse the requirement for calibration verification with that for evaluating linearity. The 1988 Clinical Laboratory Improvement Amendments2 define calibration as the process of testing an analytical method to establish a relationship between known concentrations of an analyte (calibrators) and the measured value, but make no specifications about linearity. Calibration verification ensures accuracy and involves measuring analytes in calibrators or other samples of known values traceable to a reference method to confirm that the established relationship has remained stable.2 In contrast, one does not need to know the absolute concentration to perform a linearity evaluation, even though knowledge of such is helpful in establishing upper and lower limits. For example, testing serial dilutions of a sample with an elevated value is an acceptable approach. It is the straight-line relationship between points that is the focus of interest in linearity evaluation. As described in NCCLS EP6-P2, this linear relationship is important to clinicians, who rely on it for easy interpolation of results.1

Prior to the availability of linearity survey programs, such as those provided by the College of American Pathologists (CAP; Northfield, Ill) and Casco (Portland, Me), individual laboratories were forced to undergo the difficult task of saving specimens from patients with elevated results for testing at the upper limits of the analytic measurement range. In the absence of testing at the upper limit of the analytic measurement range, laboratories had to narrow the range with a subsequent increase in the frequency of sample dilutions.3 In 1988, the CAP Instrumentation Resource Committee (IRC) began offering linearity surveys as a tool for evaluating linearity. The linearity program provides pre-prepared, analyte-spiked human samples (mostly serum and some urine) covering the full, expected operating range for the analytes being tested for linearity. Although lyophilized samples were used initially, analyte-spiked serum and urine samples that do not require dilution are now being used, eliminating imprecision due to manual pipetting. Since survey samples are made to specific analyte target values, the data analysis verifies calibration within preset tolerances for the participants, unlike the use of stored patient samples, which can only be used to evaluate linearity. These surveys are also useful because they allow comparison across laboratories and methods.

Participation in a linearity and calibration verification program can add a layer of quality assessment above and beyond that provided by proficiency testing programs. Generally, linearity testing has a narrower range of acceptability than proficiency testing and is more effective in detecting analytical problems. For proficiency testing, either 3 SD from the peer group mean, an absolute percentage, or an absolute percentage deviation from the peer group mean is the usual limit, which is often disparate from the true analytical value for a given sample. Acceptability in linearity testing can have much narrower, absolute limits for error based on medically- or analytically-relevant criteria. Also, linearity testing challenges the entire calibration range, including the extremes, and can detect problems such as reagent or spectrophotometer deterioration earlier than quality control or proficiency testing failures. It is also good laboratory practice to periodically demonstrate linearity to detect reagent deterioration, monitor analyzer performance, or re-confirm linearity after a major servicing of equipment.4

## PERFORMANCE OF LINEARITY STUDIES

### Preparation of Standards

The appropriate evaluation of linearity requires at least 5 different concentrations spanning the analytical range. Five or more samples are necessary because a sigmoid-shaped, nonlinear curve can be missed with a regression using fewer than 5 points. Even though in geometry 2 points define a line, empirical studies require at least 3 points so that 1 degree of freedom remains for statistical computations. Similarly, 3 points define a parabolic curve, but 4 points are needed to assess “goodness of fit”; the 1 additional point is required for statistical calculation. A sigmoid curve can be defined by 4 points if it is symmetric about its axis, but by 5 points if it is asymmetric.
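The point-count argument can be made concrete by counting residual degrees of freedom. A polynomial of order k has k + 1 fitted coefficients, so at least k + 2 distinct concentrations are needed before goodness of fit can be judged. The Python sketch below tabulates this; treating a cubic as the simplest polynomial stand-in for an asymmetric sigmoid curve is our assumption for illustration, and the function names are hypothetical.

```python
def residual_df(n_levels, order):
    """Degrees of freedom left after fitting a polynomial of the given order."""
    return n_levels - (order + 1)


def min_levels_for_fit_check(order):
    """Smallest number of distinct levels that leaves 1 residual degree of freedom."""
    return order + 2


for name, order in [("line", 1), ("parabola", 2), ("cubic", 3)]:
    print(f"{name:8s} needs {min_levels_for_fit_check(order)} levels; "
          f"with 5 levels, residual df = {residual_df(5, order)}")
```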

The samples should be spaced where analytically relevant. Frequently, equal spacing is sufficient. Spiking a biologic matrix with known amounts of analyte, making serial dilutions, or creating mixtures with different ratios of a high and low standard are all acceptable approaches that can be used to prepare the test samples.1 Typically, mixtures and serial dilutions have less error than individually prepared solutions. At least 2 replicate samples should then be run to allow for estimation of random error.1 Samples should be run in random order within the same day after establishing that the instrument is calibrated and in control.
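As a small planning aid, the expected concentrations of equally spaced mixtures of a low and a high pool, and a randomized run order for duplicate measurements, could be generated along the following lines. This is a sketch only: the pool values, the fixed random seed, and the helper name `mixture_levels` are hypothetical.

```python
import random


def mixture_levels(low, high, n_levels=5):
    """Expected concentrations for equally spaced mixes of a low and a high pool."""
    levels = []
    for i in range(n_levels):
        frac_high = i / (n_levels - 1)  # 0, 0.25, 0.5, 0.75, 1 for 5 levels
        levels.append((1 - frac_high) * low + frac_high * high)
    return levels


# Hypothetical pools: 10 and 410 concentration units.
levels = mixture_levels(low=10.0, high=410.0)

# Two replicates per level, run in random order (seeded so the plan is repeatable).
run_order = [(level_no, rep) for level_no in range(1, 6) for rep in (1, 2)]
random.Random(0).shuffle(run_order)

print(levels)
print(run_order)
```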

### Analysis of Linearity Results

A wide variety of analytic and statistical methods have been developed to estimate the departure of the linearity experiment from perfect linearity. The methods for interpretation of the data have evolved over the years from simple visual inspection to statistical regression analysis.3,5,6 The techniques have been extensively reviewed by Tholen6 and Kroll and colleagues.5,7 The newer procedures have been developed to determine whether deviation from linearity is significantly relevant to analytic or clinical goals.5 The more commonly used methods, described by Tholen,6 are adapted here for completeness, and more recently adopted methods are then described.

#### Visual Review

The most common method used to interpret the results of a linearity experiment is visual review of a plot of the replicate mean of the measured values versus the true value for each level of the sample solutions.6,8 The desideratum is for y-axis values (measured value) to be as close as possible to the x-axis values (true value). The points are connected, and the evaluation is then based on the degree to which the data follow a straight line (Figure 2). For experiments with known x-values, it may be useful to draw a line with a slope of 1 passing through the origin as a visual reference for deviation from perfect linearity. The best-fit line may be drawn for experiments using equally spaced sample solutions. Visual assessment is a simple and intuitive tool for an expert laboratorian, but it is subjective, unreliable, and poorly reproducible when used without expert understanding of the method.6,8

#### Least Squares Linear Regression

The most commonly used method for fitting a line to the data is the least squares linear regression.6 The true value is plotted on the x-axis. If the solutions are evenly spaced, unitless solution numbers may be assigned to the x-axis. The measured values are plotted on the y-axis. Least squares linear regression fits a straight line to a set of data points such that the sum of the squares of the vertical distance of the points to the fitted line is minimized. This minimization is performed in the vertical direction, since the x-axis represents true values. The equation is of the familiar form y = mx + b, where m is the slope of the line and b is the y-intercept. The y-intercept is the point where the regression line crosses the y-axis, that is, the value for y where x equals 0. This value can be either positive or negative. In more common terms, the y-intercept represents the constant systematic error or constant bias (Figure 3). The y-intercept should be as close to 0 as possible. The acceptable value of the y-intercept depends on the analyte being evaluated. Analytes for which the clinical decision points are close to the 0 point of the analytic measurement range require a y-intercept close to 0, while analytes that have clinical decision points in the middle or high end of the analytic measurement range are more tolerant of a larger y-intercept.

When the solutions have known values, the ideal value of the slope of the regression line, m, is 1. The deviation of the slope of the regression line from 1 is used as an estimate of the proportional systematic error of the testing system (Figure 3). Proportional error is most often caused by incorrect assignment of the amount of substance in the calibrator. As a result, the error is consistently high or low in proportion to the concentration of the analyte. Again, the level of acceptability will depend on the analyte tested. Because the least squares method minimizes squared deviations, it is exquisitely sensitive to outliers and weights the regression heavily toward the largest values. Thus, the main drawback of least squares linear regression is that a single outlier may “pull” the regression line steeper or flatter. Such problems can be addressed by appropriately choosing the data range and using the polynomial method (see “The Polynomial Method”).
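The slope and y-intercept described above can be computed directly from the textbook least squares formulas. The following Python sketch uses made-up data and hypothetical helper names; it also shows how the constant and proportional components combine into the bias expected at a chosen concentration.

```python
def least_squares_line(x, y):
    """Ordinary least squares fit of y = m*x + b (vertical distances minimized)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def systematic_error(slope, intercept, concentration):
    """Constant, proportional, and combined bias expected at one concentration."""
    constant = intercept
    proportional = (slope - 1.0) * concentration
    return constant, proportional, constant + proportional


# Made-up example: known target values and the mean measured value at each level.
true_values = [10.0, 110.0, 210.0, 310.0, 410.0]
measured = [14.0, 109.0, 206.0, 301.0, 398.0]

slope, intercept = least_squares_line(true_values, measured)
constant, proportional, total = systematic_error(slope, intercept, 400.0)
print(f"slope={slope:.3f}  intercept={intercept:.2f}  bias at 400={total:.1f}")
```

For these illustrative numbers the slope is about 0.96 and the intercept about 4, so the two errors nearly cancel around a concentration of 100 but combine to a bias of roughly -12 at a concentration of 400. This is one way to see why the acceptable slope and intercept depend on where the clinical decision points lie.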

Another useful tool when looking at the slope is to look at the successive deltas (ie, the difference between 2 successive points when the target values are equally spaced or the ratio of deltas when not equally spaced). If the fit is linear, the deltas should be the same for each subsequent interval. The successive deltas are the “poor man's” derivative; when a data set is linear, the first derivative is a constant (Table).
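A brief Python sketch of the successive-delta check follows. The data are invented, and the function names are hypothetical; the second function handles unequally spaced targets by dividing each measured delta by the corresponding target delta.

```python
def successive_deltas(means):
    """Differences between neighbouring level means (equally spaced targets)."""
    return [later - earlier for earlier, later in zip(means, means[1:])]


def successive_slopes(targets, means):
    """Measured delta divided by target delta, for unequally spaced targets."""
    points = list(zip(targets, means))
    return [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(points, points[1:])]


# Invented level means: one roughly linear set and one that flattens at the top.
linear_means = [14.0, 109.0, 206.0, 301.0, 398.0]
plateau_means = [10.0, 108.0, 200.0, 280.0, 340.0]

print(successive_deltas(linear_means))   # nearly constant deltas
print(successive_deltas(plateau_means))  # steadily shrinking deltas

# Unequally spaced targets: compare ratios instead of raw differences.
targets = [10.0, 50.0, 100.0, 200.0, 400.0]
means = [11.0, 51.0, 100.0, 198.0, 396.0]
print(successive_slopes(targets, means))
```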

#### Error.6

The total error around the regression line is equal to the sum of the pure error or random error and the lack-of-fit error. The components of error can be determined using an appropriately constructed linearity evaluation with 2 or more replicates at each level. The pure error is the error of duplicate samples around their common mean and estimates random error. It is the sum of the squared vertical deviations from the mean of replicated samples at all sample levels. The lack-of-fit error estimates the appropriateness of the given model. The lack-of-fit error is the sum of the squared vertical distances between the mean of replicates at a given level and the regression line. The total error is the sum of the pure and lack-of-fit error and can also be calculated as the sum of the squared vertical distances between all the measured values and the regression line. If the model is a good fit, the lack-of-fit error will be close to 0 and the total error will be all random error.
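The partition of error can be checked numerically. The Python sketch below uses invented duplicate results at 5 equally spaced levels and hypothetical function names. It follows the usual convention of weighting each level's lack-of-fit term by the number of replicates at that level, which is what makes the pure and lack-of-fit components add up to the total.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Ordinary least squares slope and intercept."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def error_components(levels, replicates):
    """Return (pure, lack_of_fit, total) sums of squares around the fitted line."""
    xs = [x for x, reps in zip(levels, replicates) for _ in reps]
    ys = [y for reps in replicates for y in reps]
    slope, intercept = fit_line(xs, ys)

    # Replicates around their own level mean: random (pure) error.
    pure = sum((y - mean(reps)) ** 2 for reps in replicates for y in reps)
    # Level means around the line, weighted by replicate count: lack of fit.
    lack_of_fit = sum(len(reps) * (mean(reps) - (intercept + slope * x)) ** 2
                      for x, reps in zip(levels, replicates))
    # Every individual result around the line: total error.
    total = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return pure, lack_of_fit, total


# Invented duplicates at solution numbers 1 to 5.
levels = [1, 2, 3, 4, 5]
replicates = [[13.0, 15.0], [108.0, 110.0], [205.0, 207.0], [300.0, 302.0], [397.0, 399.0]]

pure, lack_of_fit, total = error_components(levels, replicates)
print(pure, lack_of_fit, total)
```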

#### Error and the G Test

In 1986, the NCCLS EP6-P guidelines incorporated a statistical procedure, the G test, which can be used to determine the appropriateness of a regression model.6,8 The G statistic is defined as the ratio of the lack-of-fit error to the pure error. This ratio is an F statistic and failure is set at *P* < .05.

If the G statistic is less than the critical F value, the fit is considered linear. One limitation of the G test is that it is too sensitive when precision is good and too insensitive when precision is poor.3,9–12 Furthermore, the G test assumes that the data are first order, which is often an inappropriate assumption. Therefore, additional methods of statistical evaluation were subsequently developed.
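The dependence of the G test on precision can be seen in a short Python sketch. It assumes that the G statistic is formed as the standard lack-of-fit F ratio, that is, each sum of squares is divided by its degrees of freedom before the ratio is taken. The sums of squares are hypothetical, and the critical value is the tabulated F for *P* = .05 with 3 and 5 degrees of freedom (5 levels run in duplicate).

```python
def g_statistic(ss_lack_of_fit, ss_pure, n_levels, n_results):
    """Lack-of-fit mean square divided by pure-error mean square."""
    df_lack_of_fit = n_levels - 2    # levels minus the 2 fitted line parameters
    df_pure = n_results - n_levels   # results minus one mean per level
    return (ss_lack_of_fit / df_lack_of_fit) / (ss_pure / df_pure)


F_CRITICAL = 5.41  # tabulated F, P = .05, 3 and 5 degrees of freedom


def passes_g_test(g, f_critical=F_CRITICAL):
    """True when the statistic stays below the critical value (fit called linear)."""
    return g < f_critical


# Same hypothetical lack of fit, two different levels of random error.
g_ordinary = g_statistic(ss_lack_of_fit=2.4, ss_pure=10.0, n_levels=5, n_results=10)
g_precise = g_statistic(ss_lack_of_fit=2.4, ss_pure=0.4, n_levels=5, n_results=10)

print(g_ordinary, passes_g_test(g_ordinary))
print(g_precise, passes_g_test(g_precise))
```

With identical departure from the line, the data set with the smaller random error fails the test while the less precise one passes, which is the behavior criticized above.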

#### The Polynomial Method (The CAP IRC Method)

The 2003 approved guideline NCCLS EP6-A1 has replaced the NCCLS EP6-P guidelines of 1986,8 which did not provide methods to evaluate nonlinearity with clinically acceptable goals in mind. The approach proposed in 2001 as NCCLS EP6-P2 is based in part on the experiences of the CAP IRC. Kroll and Emancipator7,13 developed a polynomial method to evaluate data for nonlinearity. The CAP IRC developed a computer program incorporating the polynomial approach and found that first-, second-, and third-order polynomials are, as expected, the patterns commonly observed in participant data.5,7,13

The CAP linearity survey provides pre-diluted liquid samples (mostly serum, 1 urine) with analyte concentrations spanning the analytic measurement range between its lower and upper limits. Samples are run in duplicate, and results are submitted to the CAP IRC for statistical comparison with peer group results. The program fits the data to linear, quadratic, and cubic models and then assesses whether the nonlinear coefficients are statistically significant. A detailed mathematical treatment of this technique is provided elsewhere.5,7 If the nonlinear coefficients are not statistically significant, the best-fit equation is linear and the data are called “Linear 1” in the CAP IRC survey. If they are statistically significant, a quadratic or cubic equation models the data better than a line, and the data set undergoes a further check to determine whether the nonlinearity is clinically significant when tested against clinically relevant allowable error.5

This check calculates the average distance between the best nonlinear fit and the linear one and compares it against an analyte-specific bias adjusted for the random error.5 If the average deviation is within these predetermined limits, then the deviation from linearity is not clinically important and the CAP IRC surveys designate the data “Linear 2.” It is important to note that Linear 1 and Linear 2 are equally valid demonstrations of linearity. Finally, if the data yield an equation that is higher order than a line (quadratic or cubic) and the difference between the polynomial and the straight line falls outside the preset tolerance limits, the data are determined to be nonlinear.
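The decision sequence can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the CAP IRC program: the nonlinear coefficients are tested with ordinary t statistics against tabulated critical values, the best nonlinear model is taken to be the significant one with the smallest residual SD, and `allowable_error` is a hypothetical, user-supplied limit standing in for the analyte-specific criterion. The data and all function names are invented.

```python
import math


def solve(a, b):
    """Solve the linear system a*z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def polyval(coeffs, x):
    """Evaluate a polynomial whose coefficients are listed lowest order first."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def fit_poly(x, y, order):
    """Least squares fit; returns coefficients, t statistics, residual df and SD."""
    k = order + 1
    xtx = [[sum(xi ** (i + j) for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(k)]
    coeffs = solve(xtx, xty)
    df = len(x) - k
    resid_var = sum((yi - polyval(coeffs, xi)) ** 2 for xi, yi in zip(x, y)) / df
    t_stats = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        inv_jj = solve(xtx, unit)[j]  # j-th diagonal element of the inverse of X'X
        t_stats.append(coeffs[j] / math.sqrt(resid_var * inv_jj))
    return coeffs, t_stats, df, math.sqrt(resid_var)


def classify(x, y, t_critical, allowable_error):
    """Return 'Linear 1', 'Linear 2', or 'Nonlinear' for one data set."""
    line = fit_poly(x, y, 1)[0]
    significant = []
    for order in (2, 3):
        coeffs, t_stats, df, resid_sd = fit_poly(x, y, order)
        # Only the nonlinear coefficients (x squared and above) are tested.
        if any(abs(t) > t_critical[df] for t in t_stats[2:]):
            significant.append((resid_sd, coeffs))
    if not significant:
        return "Linear 1"
    best = min(significant, key=lambda item: item[0])[1]
    levels = sorted(set(x))
    mean_dev = sum(abs(polyval(best, v) - polyval(line, v)) for v in levels) / len(levels)
    return "Linear 2" if mean_dev <= allowable_error else "Nonlinear"


# Invented duplicates at solution numbers 1 to 5 (10 results in total).
x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
straight = [13.0, 15.0, 108.0, 110.0, 205.0, 207.0, 300.0, 302.0, 397.0, 399.0]
curved = [9.0, 11.0, 107.0, 109.0, 199.0, 201.0, 279.0, 281.0, 339.0, 341.0]

# Two-sided critical t values at P = .05 for the residual df of each model.
T_CRITICAL = {7: 2.365, 6: 2.447}

print(classify(x, straight, T_CRITICAL, allowable_error=5.0))
print(classify(x, curved, T_CRITICAL, allowable_error=5.0))
print(classify(x, curved, T_CRITICAL, allowable_error=15.0))
```

The same curved data set is reported as nonlinear under a tight limit and as Linear 2 under a looser one, which mirrors the point that the final judgment depends on the clinically relevant allowable error and not on statistical significance alone.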

All methods for evaluating linearity must have enough statistical power to detect nonlinearity, which requires that the data have a minimum level of precision. The CAP polynomial method uses a statistically derived, formal approach for assessing whether the data are precise enough to attain appropriate statistical power.5 Therefore, the data are put through a test of precision prior to initiating the linearity test. If the data are not sufficiently precise (poor repeatability), then linearity is not assessed and the data are labeled “imprecise.” In other words, the data set has too much variability to assess accurately the departure from linearity, and there is insufficient statistical power to evaluate the data.5 Imprecision is screened by calculating the ratio of the SD around the best-fit polynomial to the mean concentration for all assay solutions. This ratio is compared to a quantity based on the clinically relevant tolerance limit for the analyte, the number of measurements made (number of solutions times the number of replicates), and a constant that depends on the degree of the best-fit polynomial. This statistical approach has gained general acceptance as the best statistical method to evaluate linearity of quantitative tests and has been adopted as an approved guideline (NCCLS EP6-A).1

## TROUBLESHOOTING

Laboratories must ensure the reliability of test results when nonlinearities are discovered during a linearity evaluation. The specific actions will depend on the analyte, method, extent of nonlinearity, and the individual laboratory. Since each situation is different, we suggest some general guidelines for evaluating nonlinearity.

During the pre-analytic steps, human error is often involved. If the materials are not prepared or stored properly, the amount of analyte may be incorrect or may deteriorate over time. Systematic error can be introduced by incorrect pipette calibration, while an imprecise pipette can result in imprecise results. Samples may be incorrectly labeled or otherwise incorrectly matched for content and testing.

Analytic error can arise from various steps in the analytic process. A good starting point is to check maintenance, quality control, and calibration logs during the period prior to the linearity testing to identify measurements that are out of control. This may provide some evidence and direction for finding and correcting the source of poor performance on the linearity evaluation. Potential, common analytic problems include wavelength shift of the spectrophotometer, dirty or aging optics, and dirty or elevated background in scintillation counting wells.4 Problems with reagents can occur as well. Improperly prepared reagents, expired reagents, or reagents near the end of their shelf life can cause nonlinearities. Transformation formulas, such as those used for immunoassays, can also be inappropriate for the binding characteristics of the antibody system being evaluated, leading to a nonlinear evaluation.

One common cause of post-analytic error for this and all other surveys is incorrect transcription of the data onto the submission forms. Transcription should be verified prior to submitting data and should be confirmed if an outlying point is seen on the initial visual inspection.

## COMMENT

Methods for the determination of linearity have evolved over time, from simple visual assessment to the polynomial method. Although visual assessment is an important step in interpretation of a linearity experiment, it is subjective and poorly reproducible. Linear regression techniques were next used to fit the data to a regression line, and the G test was added to determine appropriateness of this model. However, this method is dependent upon precision and assumes that the data set is linear. It was subsequently deemed an inappropriate test for evaluating linearity. Kroll and Emancipator7,13 developed a method for comparing first-, second-, third-, and higher-ordered polynomials, which the CAP IRC adopted for its linearity survey program. The polynomial method uses clinically relevant goals in assessing linearity.

The benefits of the CAP IRC and other linearity surveys are that the assessment picks up problems before quality control or proficiency testing failures occur. Therefore, enrollment in a linearity program can help detect problems, so that corrections can ensure accurate and reliable laboratory results. The linearity assessment can detect reagent deterioration, monitor analyzer performance, or re-confirm linearity after a major servicing.4 Peer group comparisons allow the clinical laboratory physician to compare the laboratory's instrument to the same instruments of other enrollees.

The CAP IRC monitors the survey participant reports to evaluate the adequacy of both the survey materials and the statistical tools. Ongoing considerations include adding new analytes to be tested, refining the statistical methods, deciding how to best supply the materials, and determining how to best process the samples once the product is received in a laboratory. Through proficiency testing, calibration verification, and other means, one can determine the accuracy of a method at specific points. Accuracy within all points in the analytic measurement range requires that one be able to show that the mathematical relationship between input and output (concentration, activity, etc) is continuous and acceptably linear. If that relationship is nonlinear, then one must know it exactly, which requires empirical study. If it is linear, then with 2 determined points, one can generate the rest of the response curve and report a consistent, reliable, and clinically meaningful value from the entire analytical range.

## Acknowledgments

The authors extend their thanks to William J. Castellani, MD (Department of Pathology, Truman Medical Center, Kansas City, Mo), and the members and staff of the College of American Pathologists Standards and Instrumentation Committee for their review of this manuscript.

## Author notes

Reprints: Martin H. Kroll, MD, VA Medical Center, Pathology and Laboratory Medicine, SVC (113), 4500 Lancaster Rd, Dallas, TX 75216-0000