Context.—Since 1988, the College of American Pathologists has been offering materials for calibration verification coupled with the surveys for linearity, called the linearity (LN) surveys.

Objective.—To determine whether successful completion of the College of American Pathologists LN surveys provides a benefit in terms of improved proficiency testing (PT) performance.

Design.—In this study, we used information from LN surveys LN1/2, LN3, and LN5 and from the PT surveys C, Z, and K administered and analyzed in the year 2000. For the PT data, we calculated 4 measures of performance: passing PT, results exceeding 2 SDs, sum of absolute SD intervals, and the absolute sum of SD intervals. For the LN data, we classified laboratories as participants versus nonparticipants in LN surveys and by whether or not LN survey performance was successful.

Results.—LN enrollees had fewer unacceptable PT results than did nonenrollees. Additionally, for many analytes there was a significant positive association between LN performance and PT performance.

Conclusions.—For most analytes studied, there was strong evidence linking performance on PT surveys with performance on LN surveys. Eight of 13 analyses (62%) demonstrated improved performance with successful calibration verification.

Federal regulations (Clinical Laboratory Improvement Amendments of 1988)1 specifically state that calibration verification, that is, the determination of analyte in materials composed of a matrix similar to that of patient samples, must be performed every 6 months; at every change in lot numbers of reagents; after major preventive maintenance; when controls show unusual trends or are out of acceptable limits; or more frequently if recommended by the manufacturer or the laboratory. Sometimes the term validation of the analytical measurement range is used instead of the term calibration verification. Calibration verification must cover the analytical measurement range, thereby requiring that the concentrations or activities being evaluated include the lowest and the highest values of this range. Since 1988, the College of American Pathologists (CAP) has offered materials for calibration verification coupled with the surveys for linearity, called the linearity (LN) surveys. The concentrations or activities evaluated in these surveys span most reportable ranges and, for most analytes, exceed the ranges found in proficiency testing (PT) as offered by the CAP. Calibration verification in the LN surveys determines whether the instruments or methods are properly calibrated against those of a peer group: at least 4 of 5 consecutive solution averages must fall within specified limits derived from the analytical goal for error for that analyte. Successful completion of calibration verification surveys validates the analytical measurement range. We investigated whether successful calibration verification also improves results of PT.

In this study, we used information from LN surveys LN1/2, LN3, and LN5 and from the PT surveys C, Z, and K. We studied the analytes sodium, potassium, glucose, albumin, iron, creatinine, alanine aminotransferase, digoxin, carcinoembryonic antigen, cortisol, β–human chorionic gonadotropin, folate, and vitamin B12. All surveys had been administered and analyzed in the year 2000. Table 1 lists the details of these CAP surveys, including the number of participants. For the PT data, we calculated 4 measures of performance for each of the analytes listed in Table 1. For the LN data, we grouped laboratories into a simple 2-stage hierarchy for each analyte. In this hierarchy, laboratories were first classified as either participants or nonparticipants based on LN enrollment. Among those enrolled in both PT and LN surveys, we further classified laboratories into 2 groups based on LN performance: successful (the laboratory met the criteria for validation of the analytical measurement range) and unsuccessful (the laboratory's results fell outside of those limits).

Table 1.

Summary of College of American Pathologists Survey Data Used in Analysis*


Measuring PT Performance

We evaluated PT performance using the grading rules mandated by the Clinical Laboratory Improvement Amendments for each analyte and using additional measures intended to identify laboratories that may be at risk for failing PT. Our 4 PT performance variables are defined as follows.

Presence of Unacceptable PT Results

The first measure of laboratory PT performance is a binary variable indicating whether there are any unacceptable PT results in a single mailing for each analyte examined in this study, using the PT grading rules. Typically, a laboratory completes 5 challenges per analyte, and the grading rules determine an interval of acceptable results based on the peer group mean and in some instances the peer group SD. The binary variable defined as the presence of unacceptable results is 1 for the ith laboratory when any results fall outside of the grading interval for that analyte and 0 otherwise.
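As a concrete illustration of this binary flag, the following is a minimal Python sketch; the function name, results, and acceptance intervals are hypothetical and do not represent CAP data or grading software.

```python
def has_unacceptable_result(results, intervals):
    """Return 1 if any PT result falls outside its grading interval, else 0.

    results   -- the laboratory's results for one analyte (typically 5 challenges)
    intervals -- (low, high) acceptance limits per challenge, derived from the
                 peer group mean (and, for some analytes, the peer group SD)
    """
    return int(any(not (lo <= x <= hi)
                   for x, (lo, hi) in zip(results, intervals)))

# Hypothetical mailing of 5 challenges with peer-derived acceptance limits
results = [92.0, 148.5, 201.0, 255.2, 310.7]
intervals = [(87, 97), (141, 156), (191, 211), (242, 268), (295, 326)]
print(has_unacceptable_result(results, intervals))  # 0: all results within limits
```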

We defined 3 other measures of PT performance to identify laboratories that may be at risk for failing PT based on deviations from their respective peer group means. These remaining 3 variables are intended to measure within-laboratory variability relative to a peer group standard without incorporating specific PT grading rules. In each case, results are standardized using the corresponding peer group mean and SD. We then calculated 3 different measures of laboratory variability using these standardized variables.

Results Greater than 2 SDs

For a given peer group and analyte, let Zij be the standardized result, or SD interval (SDI), for the jth challenge of the ith participant. That is, if X̄j is the peer group mean and SDj is the peer group SD for the jth challenge, then Zij = (Xij − X̄j)/SDj. For the first variable using the standardized results, we construct a binary variable to indicate whether any of the standardized results are less than −2 or greater than +2. If any challenges are greater than 2 SDs from the participant's peer group mean, we assign a value of 1. Otherwise, the value is 0.
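The standardization and the 2-SD flag can be sketched in Python as follows; the function names and all numeric values are illustrative assumptions, not actual survey data.

```python
def sdi(x, peer_mean, peer_sd):
    """Standardized result (SD interval): Z = (X - peer mean) / peer SD."""
    return (x - peer_mean) / peer_sd

def any_beyond_2_sd(results, peer_means, peer_sds):
    """Return 1 if any standardized result is < -2 or > +2, else 0."""
    zs = [sdi(x, m, s) for x, m, s in zip(results, peer_means, peer_sds)]
    return int(any(abs(z) > 2 for z in zs))

# Hypothetical mailing: the third challenge is (212 - 205) / 3 = +2.33 SDI
results    = [101.0, 150.0, 212.0, 249.0, 300.0]
peer_means = [100.0, 152.0, 205.0, 250.0, 302.0]
peer_sds   = [2.0, 3.0, 3.0, 4.0, 5.0]
print(any_beyond_2_sd(results, peer_means, peer_sds))  # 1: one result exceeds 2 SDs
```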

Sum of Absolute SDIs

The SDI is the laboratory result standardized by the corresponding peer group mean and SD. For the ith laboratory, the sum of the absolute SDIs is defined for a given analyte as

Σj |Zij|, (Equation 1)

where the summation is taken over the number of challenges, usually 5, per analyte. This variable is intended to identify laboratories with large variability from results either substantially greater or less than the peer group average.

Absolute Sum of SDIs

The final variable, the absolute sum of the SDIs, is intended to identify laboratories with persistent bias, ie, with results that are consistently greater than or less than the peer group mean. Here, the summation is taken first, and then the absolute value is calculated. The variable is formally defined as

|Σj Zij|, (Equation 2)

where the summation is taken over the number of challenges, usually 5, per analyte.
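A short sketch contrasts the two summary measures; the example SDI vectors are invented to show how scattered deviations inflate the sum of absolute SDIs while one-sided deviations inflate the absolute sum.

```python
def sum_abs_sdi(zs):
    """Sum of absolute SDIs: large whenever results scatter widely in either direction."""
    return sum(abs(z) for z in zs)

def abs_sum_sdi(zs):
    """Absolute sum of SDIs: large only when deviations share one sign (persistent bias)."""
    return abs(sum(zs))

imprecise = [2.1, -1.8, 1.9, -2.2, 0.3]   # scattered: imprecision with little net bias
biased    = [1.2, 1.0, 1.4, 0.9, 1.1]     # consistently high: persistent positive bias

print(round(sum_abs_sdi(imprecise), 2), round(abs_sum_sdi(imprecise), 2))  # 8.3 0.3
print(round(sum_abs_sdi(biased), 2), round(abs_sum_sdi(biased), 2))        # 5.6 5.6
```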

Comparison of LN and PT Performance

In the first set of analyses, we compared LN enrollees and nonenrollees using these 4 PT performance measures. For the 2 binary variables, we compared results using either a χ2 test or a Fisher exact test when expected cell counts were small. For the 2 continuous variables defined in Equations 1 and 2, we used the Wilcoxon 2-sample test2 to evaluate whether the locations of the distributions of the enrollee and nonenrollee performance measures differed. In the “Results” section, we have listed specific results with P values of <.10. Generally, when the significance level of a comparison was >.10, we have indicated that the test was done but that no significant differences were found. We have also listed all results for iron.

In the second set of analyses, we compared successful and unsuccessful LN enrollees using the 4 PT performance measures. Again, we used either the χ2 or Fisher exact test for the binary variables and the Wilcoxon 2-sample test for the continuous variables to formally evaluate the significance of the observed differences in performance measures for the different groups of participants. All comparisons with P values of <.10 are listed in the “Results” section, and nonsignificant results are noted. We have listed all results for iron, which is the single analyte that showed no suggestion of association between PT performance and LN enrollment or success.
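For illustration, the 2 × 2 comparison of unacceptable-result rates can be sketched with a hand-computed Pearson χ2 statistic; the counts below are hypothetical, and a full analysis would also convert the statistic to a P value (and fall back to the Fisher exact test when expected cell counts are small).

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction) for the table
    [[a, b], [c, d]], e.g. rows = LN enrolled / not enrolled,
    columns = unacceptable PT result present / absent.

    Also returns whether all expected cell counts are >= 5, a common rule of
    thumb for preferring the chi-square test over the Fisher exact test.
    """
    n = a + b + c + d
    expected_ok = all((row * col) / n >= 5
                      for row in (a + b, c + d)
                      for col in (a + c, b + d))
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, expected_ok

# Hypothetical counts: enrollees vs nonenrollees, with/without unacceptable results
stat, large_enough = chi_square_2x2(30, 470, 60, 440)
print(round(stat, 2), large_enough)  # 10.99 True
```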

LN Participants Versus Nonparticipants

For many analytes, laboratories enrolled in the corresponding LN surveys demonstrated superior PT performance, as measured by the presence of unacceptable results among all challenges, compared with laboratories not enrolled in the corresponding LN surveys (Table 2). Specifically, LN enrollees had lower rates of unacceptable PT results. The intralaboratory variability of the PT results, however, was generally of similar magnitude for LN enrollees and nonenrollees, as measured by the remaining 3 performance measures; the exceptions were albumin and alanine aminotransferase, for which enrollees had fewer results deviating excessively from participant means but increased SDIs. Table 2 lists the results for analytes with significant differences for LN enrollees versus nonenrollees and indicates which analytes had no significant differences. The summary measures in column 4 are taken over all LN enrollees, regardless of their performance in the LN surveys.

Table 2.

Results of Statistical Analyses Comparing Proficiency Testing (PT) Performance for Linearity (LN) Survey Participants and Nonparticipants


PT Performance by LN Performance

For many analytes, there was a significant positive association between LN performance and PT performance. Table 3 lists the analytes and measures of PT performance, providing evidence that successful LN performance, as measured by calibration verification, translates into successful PT performance. For most analytes, the presence of unacceptable results using the PT grading rules illustrates this association. For folate and vitamin B12, the association appears only for the measures of intralaboratory variability (results more than 2 SDs from the participant mean, the sum of absolute SDIs, or the absolute sum of SDIs). There were no significant differences in PT performance for alanine aminotransferase, glucose, and iron.

Table 3.

Results of Statistical Analyses Comparing Proficiency Testing (PT) Performance for Successful Linearity (LN) Survey Participants (Results Calibrated and Linear) Versus Unsuccessful LN Participants (Either Not Calibrated or Not Linear)


Table 4 lists the complete set of results for iron, the single analyte with no suggestion of association between PT performance and LN enrollment or performance. There was no trend in the direction of association between the measures of PT performance and LN enrollment or success.

Table 4.

Results for Iron as the Single Analyte Showing No Significant or Marginally Significant Differences in Proficiency Testing (PT) Performance Based on Linearity (LN) Survey Enrollment or Performance*


Eight (62%) of 13 analytes demonstrated a reduced number of unacceptable results with successful calibration verification, and 5 analytes demonstrated a simultaneous reduction in both the number of unacceptable results and variability (Table 5). All analytes except iron showed at least some suggestion of a positive association between PT performance and LN enrollment or success.

Table 5.

Summary of Results Comparing Proficiency Testing and Linearity (LN) Survey Performance


In this analysis, we demonstrated that for most analytes, there is evidence linking performance on PT surveys with performance on LN surveys. Only iron did not demonstrate better PT performance with LN enrollment or performance.

Compared with the previous study by Lum et al,3 we demonstrated significant differences for cortisol, creatinine, β–human chorionic gonadotropin, and potassium. In that previous study, 19 (58%) of 33 analyses demonstrated improved performance (a reduced number of unacceptable results) with successful calibration verification.3 In the present study, 8 (62%) of 13 analyses demonstrated improved performance with successful calibration verification. The biggest change in the calibration verification portion of the LN survey since the previous study has been a change in the LN materials from lyophilized to liquid. Participants are no longer required to make dilutions for most of the analytes, resulting in fewer poor admixtures and thus fewer false assessments of failed calibration verification. In addition, these results may reflect better performance by the liquid material, as indicated by accuracy within peer groups.

In addition to outright unacceptable results, we also studied other conditions related to poor performance with PT. Many laboratories investigate PT results when the participant result exceeds the group mean by more than 2 SDs (increased variability). Because such investigations require additional resources and may be a source of concern, a survey that can reduce the incidence of increased variability would be useful.

Many laboratories investigate PT when the sum of the SD indices (SDIs) is greater than 5. Usually the SDIs in such cases are all positive or all negative, and such results indicate a strong positive or negative bias. In cases where they are not all positive or negative but the sum of the absolute values of the SDIs is greater than 5, then increased imprecision in the methodology should be suspected. These occurrences are not failures but rather represent warnings that a method is not performing up to expectations. For these types of cases, we measured the number of laboratories that experienced an increase in the sums of absolute SDIs or an increase in the absolute sum of the SDIs. By reducing the number of instances of these non–failure-oriented warnings, the LN calibration verification surveys can save valuable laboratory resources. Lum and his colleagues3 did not examine the effect of successful calibration verification on variability, as was done in the present study.

There are several reasons why successful performance with LN surveys may translate into better performance (both reduced number of unacceptable results and reduced variability) with PT. Failure to validate the analytical measurement range may alert a laboratory to problems before they are experienced with PT, thus allowing the time to make appropriate corrections. The LN material frequently covers a much wider range of values than do the PT materials, providing a greater challenge for a particular method's performance. The allowance for error is tighter for the LN surveys than for PT.

References

1. Clinical Laboratory Improvement Amendments of 1988: Final Rule. 42 CFR Part 405, et al (§493.1217). Washington, DC: US Dept of Health and Human Services, Health Care Financing Administration, Public Health Service; Federal Register. February 28, 1992:7165.
2. Fisher LD, van Belle G. Biostatistics: A Methodology for the Health Sciences. New York, NY: John Wiley & Sons; 1993.
3. Lum G, Tholen DW, Floering DA. The usefulness of calibration verification and linearity surveys in predicting acceptable performance in graded proficiency tests. Arch Pathol Lab Med. 1995;119:401–408.

The authors have no relevant financial interest in the products or companies described in this article.

Author notes

Reprints: Martin H. Kroll, MD, Department of Clinical Chemistry, Dallas VA Medical Center, Room 113, 4500 Lancaster Rd, Dallas, TX 75216 ([email protected])