ABSTRACT:
To identify risk and protective factors associated with physician performance in practice; to use this information to create a risk assessment scale; and to test the use of the risk assessment scale with a new population of assessed physicians.
Physician assessments that were completed by community-based physicians between March 2016 and February 2022 (n = 2708) were gathered to determine what professional characteristics and practice context factors were associated with poor peer practice assessment (PPA). The predictive capacity of the resulting model was then tested against a new sample of physician assessments completed between March 2022 and February 2023 (n = 320).
A total of 2401 physicians were eligible for inclusion in a logistic regression analysis, which resulted in an empirical model containing 11 variables that accounted for 21.6% of the variance in the likelihood of receiving a poor PPA generated by the College of Physicians and Surgeons of British Columbia. The resulting model, when tested against 320 new cases, was able to predict good versus poor PPA performance with a sensitivity of 0.79 and a specificity of 0.75. Not having undertaken peer review (OR = 1.47) conferred a risk comparable to that arising from a full decade passing since completion of medical school (OR = 1.50).
In addition to being the largest known study of its type, this work builds on similar studies by demonstrating the capacity to use regulator-mandated peer review to empirically identify physicians who are at risk of substandard performance using factors that are safe from claims of violating Human Rights Codes; that emphasize modifiable aspects of practice; and that can be readily updated to account for change over time.
Introduction
Professional regulation remains a challenging and controversial process.1 There can be no doubt that incompetent practice poses a risk to patient safety, making it critical that some form of quality assurance is undertaken. How to mount a process that achieves the right cost-benefit ratio, however, is not nearly as clear. Even if one accepts the most pessimistic estimates of dyscompetence (failing to maintain professional standards), which range from 3% to 15%,2,3 most physicians perform well and are motivated to stay that way. Furthermore, not all safety issues that do arise are attributable to the capabilities of the responsible physician because organizational and systemic factors play a substantial role.4 As a result, it is easy to waste finite resources subjecting physicians to elaborate assessment processes when their time would be better spent in other ways.
If designed carefully, with an emphasis on quality improvement supported by peer coaching, assessment practices can have value even if they do not reveal major deficiencies.5,6 Making good practice better, however, does not eliminate the importance of identifying and remediating those who do not maintain acceptable standards and, thus, are at greater risk of causing patient harm. Risk, in fact, has become a central principle in regulatory practice (ie, Right-Touch Regulation) as various regulators strive to tailor their oversight of individual practitioners in a way that is proportionate to the likelihood of substandard performance.7,8
Efforts to engage in risk-based regulation have identified several factors that are associated with various forms of under-performance but have not resolved into a clear means of prioritizing the review of individual physicians. Various factors that have been identified (eg, age and sex) are protected in jurisdictions like Canada, meaning they cannot legally be used to discriminate between individuals.8,9 Others may encroach on protected factors (eg, location of training as a proxy for ethnicity); are too broad to enable the identification of individual physicians who are at risk (eg, specialty of practice); or are likely to become less relevant as time passes because they reflect point-in-time measurements (eg, certification exam scores) in a context where individual practices and professional standards will evolve.8,9 Further, most studies stop at the identification of associations between risk factors and physician performance without taking the next step of determining whether or how factors can be combined to make effective decisions. That is a key omission given the constant risk of overfitting (analyses that correspond too closely to a particular data set, resulting in inflated impressions of how well the resulting model can predict future cases).10 What is needed to ensure an assessment system that both fairly and meaningfully identifies (and enables support of) at-risk physicians is a model that can account for the dynamic nature of practice and that focuses on practice-related factors over which the physician has some degree of agency.
To that end, we have built on previous work by examining the weight that should be given both to factors known to be significantly associated with substandard performance and to previously unstudied markers of clinical activity, to determine which variables create a best-fit model predictive of an extensive multi-component peer assessment. With nearly 3000 cases included, this study is larger than any other known study of its kind, thus enabling separation of cases into those used for building a model and those used for testing it.
Purpose
The purpose of this study was to identify risk and protective factors associated with physician performance in practice; to use this information to create a risk assessment scale; and to test the use of the risk assessment scale with a new population of assessed physicians.
Methods
Context
The College of Physicians and Surgeons of British Columbia's Physician Practice Enhancement Program (PPEP) is a quality assurance program that has been in operation for over 10 years, using the current assessment format since 2016.11 Efforts focus on providing comprehensive peer practice assessments of community-based family physicians and specialists using a national competency-based framework that guides performance improvement activities.12 The main focus of the program (and the main outcome used for this study) is a peer practice assessment (PPA), constructed through chart review (with charts selected by both an assessor and the assessee, including both random and targeted patient types), a review of prescribing patterns (gathered through a provincial database), a multi-source feedback report (capturing numerical ratings and narrative from peers, non-medical colleagues, and patients), and self-reflection.
All collected data are discussed in an interview between a peer assessor and assessee and then described in a written assessor's report. The report is then reviewed by a PPEP medical advisor who uses a detailed decision-making matrix to assign a PPA score based on a 5-point scale ranging from 1 (“satisfactory record-keeping and clinical care was seen in the records reviewed, no serious deficiencies, and file closed with no further actions required”) to 5 (“serious deficiencies found that can affect patient safety and immediate remediation activities are required”).
Design and participants
Physician assessments that were completed by community-based physicians between March 2016 and February 2022 (n = 2708) were used to conduct Phase 1 of this study. Disciplines included family medicine, psychiatry, pediatrics, internal medicine, and dermatology. Cases were excluded from analyses if they (a) had incomplete data; (b) were conducted as part of an intervention shortly following a poor assessment; or (c) were duplicates, as would occur if an original finding was contested and a second assessor had to be employed. After applying these rules, 2401 cases remained.
Data sources included College registration data, the College's annual license renewal form, any history of complaints, a PPEP pre-assessment questionnaire, Medical Services Plan (MSP) billings (a record of the annual fees for service each physician billed to the provincial funding plan), and a provincial database of prescriptions filled.
We conducted multivariate logistic regression modelling in SPSS, including both categorical (binary) and continuous independent variables, as detailed below, to determine which were associated with the likelihood of a poor PPA.
To test the resulting model, Phase 2 involved compiling a sample of newly completed assessments conducted by PPEP from March 2022 through February 2023 (n = 320). We created a numeric risk score for each individual based on the regression-derived Beta values and examined the extent to which this score predicted poor PPA performance, as detailed below.
Independent variables
Professional characteristics: Binary variables included specialty certification; having had a previous peer assessment; holding a research, academic, or administrative role in addition to undertaking community-based clinical practice; and engagement in hospital work in addition to undertaking community-based clinical practice. Continuous variables included years since completing medical school; number of College complaints in the last 5 years; and prescribing volumes (number of units prescribed in the year before the assessment) of controlled/obsolete/high-risk drugs (as classified by the College's Drug Program for high-risk prescribing) and antibiotics (both general antibiotics and amoxicillin, specifically).
Practice context factors: Binary variables included solo practice (a physician who practices alone in a private office) versus multi-physician practice (2 or more physicians sharing a practice); locale (rural/small population center versus medium/large urban city, as classified by Statistics Canada censuses)13; having an electronic medical record (EMR) system (versus paper records); working as a locum tenens or not; and whether the physician has a walk-in component to their practice. Continuous variables included practice volume (number of patients per week according to physician self-report) and annual billing in the year before the assessment was conducted (measured by MSP blue book listings).14
Dependent variables
PPA scores: For all analyses, PPA scores were dichotomized into good (a score of 1 or 2) and poor (3 to 5) because that is the threshold at which an assessee is required to participate in remedial activities following their assessment. That is, a “good” PPA means that no major deficiencies were found and no further activities were required before the file could be closed. A “poor” PPA means remedial activities (eg, self-directed charting improvements, enrollment in formal continuing medical education courses) were required of the physician to address identified deficiencies; such a file would not be closed until the deficiencies were adequately addressed.
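As an illustration of this dichotomization, the following minimal Python sketch (using a hypothetical ppa_score column name; the program's actual data pipeline is not published) codes a poor PPA as 1 for use as the binary outcome:

```python
import pandas as pd

# Hypothetical example of medical advisor ratings on the 5-point PPA scale.
df = pd.DataFrame({"ppa_score": [1, 2, 3, 4, 5, 2, 1, 3]})

# Scores of 1-2 close the file with no further action ("good");
# scores of 3-5 trigger mandatory remedial activities ("poor"),
# coded here as 1 for use as the binary regression outcome.
df["poor_ppa"] = (df["ppa_score"] >= 3).astype(int)
print(df)
```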
The scoring is completed by a program medical advisor who uses a decision-making matrix developed by the program to guide their decision. The team of medical advisors uses this tool both for training/onboarding and as a guide for assessment scoring. They participate in annual inter-rater reliability/consensus-building exercises, whereby they each independently review, score, and decide next steps on several assessment cases. Analyses of these exercises indicate that inter-rater reliability is consistently greater than 0.7, and any differences of opinion are discussed and challenged until a satisfactory consensus is reached.
Phase 1: Model building
Inferential statistics (chi-squared and t-tests) were calculated to compare the scores received on each variable as a function of whether registrants were evaluated as having a poor PPA or a good PPA. Such analyses, however, are susceptible to confounding given that variables can covary with one another. For example, in the studied context, the more time that has passed since medical school, the more likely physicians are to have undergone a previous assessment.
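A minimal Python sketch of these bivariate screens follows (simulated data and hypothetical variable names; the study itself used SPSS):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated stand-in for the assessment data set; column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "poor_ppa": rng.integers(0, 2, 500),                # dichotomized PPA outcome
    "solo_practice": rng.integers(0, 2, 500),           # example binary predictor
    "years_since_med_school": rng.normal(25, 10, 500),  # example continuous predictor
    "patients_per_week": rng.normal(100, 30, 500),      # example continuous predictor
})

# Chi-squared test: binary predictor vs good/poor PPA.
table = pd.crosstab(df["solo_practice"], df["poor_ppa"])
chi2, p_chi, _, _ = stats.chi2_contingency(table)

# Independent-samples t-test: continuous predictor vs good/poor PPA.
good = df.loc[df["poor_ppa"] == 0, "years_since_med_school"]
poor = df.loc[df["poor_ppa"] == 1, "years_since_med_school"]
t, p_t = stats.ttest_ind(good, poor)

print(f"chi-squared p = {p_chi:.3f}; t-test p = {p_t:.3f}")
```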
As a result, logistic regression was then conducted with all independent variables using a backwards stepwise (conditional) method to determine the unique association between each variable and the likelihood of receiving a poor PPA. To check for collinearity (ie, variables that are too closely related), correlations were examined for any pair of variables that correlated at r = .7 or greater; none were identified. Continuous variables were examined for outliers by comparing Mahalanobis distances to a chi-square distribution with the same degrees of freedom. A total of 103 cases with a probability of less than .001 were identified, but examination of those cases revealed no obvious errors or values of concern, so all apparent outliers were retained.
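In Python terms, these steps could be expressed as follows (again a sketch: the analysis was run in SPSS, whose backwards conditional method eliminates predictors via likelihood-ratio tests; the simpler p-value-based elimination below only approximates it, and predictor names continue the simulated data frame above):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Continues from the simulated 'df' above; predictor names are hypothetical.
predictors = ["solo_practice", "years_since_med_school", "patients_per_week"]

# 1. Collinearity screen: flag any pair of predictors with |r| >= .7.
corr = df[predictors].corr().abs()
flagged_pairs = [(a, b) for i, a in enumerate(predictors)
                 for b in predictors[i + 1:] if corr.loc[a, b] >= 0.7]

# 2. Outlier screen: Mahalanobis distances of the continuous variables,
#    compared to a chi-square distribution with df = number of variables.
cont = df[["years_since_med_school", "patients_per_week"]].to_numpy()
diff = cont - cont.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(cont, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
outliers = stats.chi2.sf(d2, df=cont.shape[1]) < 0.001  # reviewed, then retained

# 3. Backward stepwise logistic regression: drop the weakest predictor
#    until every remaining p-value is below the removal threshold (.10).
kept = predictors.copy()
while kept:
    model = sm.Logit(df["poor_ppa"], sm.add_constant(df[kept])).fit(disp=0)
    pvals = model.pvalues.drop("const")
    if pvals.max() <= 0.10:
        break
    kept.remove(pvals.idxmax())
print("collinear pairs:", flagged_pairs, "| retained predictors:", kept)
```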
Phase 2: Creating and testing a risk score
Beta values from the regression model were used as multipliers to derive a risk assessment score for each physician in the test sample. That is, each physician's risk score was calculated by summing the Beta values for all categorical variables possessed by the physician and the product of Beta x the physician's score for each continuous variable. The proportion of poor PPA cases that would be identified by selecting physicians based on their risk score was then calculated with the threshold for risk set at the 50th, 75th, and 90th percentiles. Finally, a Receiver Operating Characteristic (ROC) curve was generated to examine the sensitivity and specificity of identifying someone as “high risk” as a function of where the threshold for that label was set.
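The score construction reduces to a weighted sum. A hedged Python sketch (with illustrative Beta values and rows, not the published Table 2 coefficients or real cases) follows:

```python
import numpy as np
import pandas as pd

# Illustrative Beta values only; the actual coefficients appear in Table 2.
betas = {"solo_practice": 0.40, "years_since_med_school": 0.04}

# Hypothetical test-sample rows (binary variables coded 0/1).
test = pd.DataFrame({
    "solo_practice":          [1, 0, 1, 0, 1],
    "years_since_med_school": [30, 10, 22, 5, 40],
    "poor_ppa":               [1, 0, 0, 0, 1],
})

# Risk score = sum of Beta x value; for a 0/1 variable this simply adds
# the Beta whenever the characteristic is present.
test["risk"] = sum(test[v] * b for v, b in betas.items())

# Proportion of physicians above each percentile threshold with a poor PPA.
for pct in (50, 75, 90):
    cut = np.percentile(test["risk"], pct)
    above = test[test["risk"] > cut]
    if len(above):
        print(f"{pct}th percentile: {(above['poor_ppa'] == 1).mean():.0%} poor PPA")
```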
Ethics review
This study received ethics approval from the Behavioral Research Ethics Board of the University of British Columbia based on the Tri-Council Policy framework governing research ethics review in Canada.
Results
Identifying Risk Factors
The physician sample included in the analysis was composed of 79% family physicians (n = 1902), 8% pediatricians (n = 185), 6% internal medicine doctors (n = 155), 6% psychiatrists (n = 140), and 1% dermatologists (n = 19). Among the 2401 cases, 78% (n = 1878) were classified as having a good PPA and 22% (n = 523) were classified as having a poor PPA. Bivariate analyses showed a statistically significant relationship between all independent variables and the dependent variable (good/poor PPA), apart from MSP billing (Table 1), which was therefore excluded from the regression model.
Table 2 lists the variables that were retained by stepwise elimination of predictors, with the standard p = .05 for entry and p = .10 for removal from the model. Three variables were excluded because they could be considered protected grounds under the Human Rights Code in British Columbia: age, sex, and country of origin (defined simplistically as whether the physician completed medical school in North America). The remaining 11 variables account for 21.6% of the variance in PPA outcome. Retaining the 3 excluded variables would have accounted for 22.4% of the variance, so their exclusion had minimal impact on the strength of the model.
Creating a Risk Score and Testing the Model
All the variables included in the final regression model were present in the follow-up data set of n = 320 more recently reviewed physicians, 15.0% (n = 48) of whom were assessed as having a poor PPA. To test the model illustrated in Table 2, each physician's risk score was calculated as detailed above. The resulting risk scores ranged from 0.94 to 4.70 with a mean of 2.6 and a median of 2.4.
A comparison of means showed that those with a good PPA had a lower risk score (2.40; 95% CI: 2.31, 2.50) than those with a poor PPA (3.43; 95% CI: 3.20, 3.66; t = -8.2, p < .001), with a very large effect size (d = 1.3). When cases were ranked from lowest to highest risk score and various thresholds were set for “high-risk,” it was clear that the higher the threshold, the larger the proportion of flagged physicians who had received a poor PPA (Table 3). That is, 62.5% (20/32) of physicians above the 90th percentile of generated risk scores received a poor PPA, whereas the same was true for 40% (32/80) of physicians above the 75th percentile and 25.6% (41/160) of physicians above the median. Table 3 also illustrates that the higher the threshold, the larger the number of poor performers who would be missed by virtue of being considered “low-risk,” but those proportions did not rise at the same rate.
To consider where the optimal threshold lies (ie, the point that maximizes both sensitivity and specificity despite the trade-off between them), a Receiver Operating Characteristic curve was created. The area under the curve in Figure 1 is 81.7% and the curve reveals that sensitivity and specificity are optimally balanced when the threshold is set at a risk score of 2.94 (sensitivity of .79, specificity of .75) which corresponds to the 67th percentile of risk scores.
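A Python sketch of this ROC analysis (simulated outcomes and scores standing in for the 320-case test sample, with Youden's J used to locate the balance point between sensitivity and specificity):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Simulated stand-in for the test sample: 1 = poor PPA.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 320)
risk = 0.9 * y + rng.normal(2.4, 0.6, 320)  # scores loosely tied to outcome

fpr, tpr, thresholds = roc_curve(y, risk)
auc = roc_auc_score(y, risk)

# Youden's J (sensitivity + specificity - 1) identifies the threshold at
# which the two are optimally balanced, analogous to the reported 2.94.
j = np.argmax(tpr - fpr)
print(f"AUC = {auc:.3f}; threshold = {thresholds[j]:.2f}; "
      f"sensitivity = {tpr[j]:.2f}; specificity = {1 - fpr[j]:.2f}")
```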
Discussion
Given the pedagogical value of assessment,15 the importance of investing in maintenance of competence over the course of one's career,16,17 and the insufficiency of self-assessment as a means for effectively guiding such maintenance,18,19 health professionals would be well advised to routinely undergo evaluations of their practice for the sake of continuous quality improvement.20,21 In a world of competing priorities, even well-intentioned and high-performing physicians may have difficulty doing so.22,23,24 Regulatory authorities, as a result, continue to play a vital role in protecting the public through mandated quality assurance reviews. In a world of finite resources, however, they must be judicious in the way such practices are implemented.
To facilitate opportunities to do so, this paper reports on the largest evaluation of risk factors known to have been conducted to date. In performing this work, we were able to identify 11 variables that sum together to provide an assessment of the risk of poor performance (defined through an extensive peer review protocol) with a high degree of sensitivity and specificity. Several features of the generated list are noteworthy. First, when the unique contribution of each independent variable was determined through regression analysis, not having undergone a previous peer assessment increased the odds of a poor PPA by 47%, reinforcing the educational value of participation in a peer review program. That becomes especially apparent when one considers that time since medical school is itself a highly predictive factor: in terms of odds, the detriment induced by not undergoing a peer review appears to be equivalent to the detriment that arises from being a decade removed from training. Second, many of the variables reflect modifiable aspects of practice, thereby offering physicians prospective guidance as to how they might take steps to reduce their risk over the course of their careers. Engaging in scholarly work, monitoring one's prescribing habits, finding a community of practice, using an electronic medical record, and reducing one's patient volume all appear to be protective factors that are well within a physician's control. That is not to say that simply doing these things will overcome all practice deficits, especially given the limitation that these findings are all correlational, but each reflects a means by which one can proactively take steps to stay up to date, discover areas of weakness, and re-invest in one's expertise to continue to strengthen one's competence. Third, many of these variables can be expected to evolve as one's practice changes, thereby allowing for assessments of physician risk that naturally update to reflect current practice rather than historical experiences. Finally, while years since medical school is correlated with age, the model achieves a high degree of sensitivity and specificity despite deliberately excluding factors that could make a regulator vulnerable to claims of discrimination.
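One way to see this equivalence, as a back-of-the-envelope reading that assumes the reported OR of 1.50 applies per decade of elapsed time: an odds ratio of 1.50 per decade implies a per-year odds ratio of

$$1.50^{1/10} \approx 1.04,$$

so ten years of accumulated time-related risk ($1.04^{10} \approx 1.48$) carries roughly the same weight as the one-time OR of 1.47 attached to never having undergone peer review.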
Testing the resulting risk scale against a new sample of physicians increases confidence that the model can be put to practical use to increase the likelihood of reviewing physicians who are more likely to be underperforming. At no point along the scale are the sensitivity and specificity perfect, but that is actually beneficial to a degree because physician review programs should take steps to reduce the extent to which they are seen as punitive for the sake of increasing engagement as a means of learning.6,25 It is good, in other words, to not create stigma and threat by focusing solely on those who have performance issues.16 That said, where a regulator chooses to set its threshold for review should depend on the resources available and the degree to which the cost inherent in reviewing high-performing physicians is offset by increasing the number of poorly performing physicians who can be identified and offered support. While identification of a threshold was useful to test the model, we suspect that the most practical solution will be to start at the top of one's risk list and go down as far as one can with the resources available rather than abiding by an absolute threshold. By doing so, one can reserve more intensive assessment and continuing professional development resources for those with a higher risk of needing assistance while directing those with lower degrees of risk to less intensive resources. This is in keeping with Cayton's Right-Touch Regulation7 and suggests the need for future work aimed at determining whether the performance benefits observed from undertaking a peer review can be generalized to less resource-intensive reviews of those at lower risk.
Limitations
Limitations of this work include the fact that the risk scale may not be implementable in the same way in all jurisdictions given that not all regulators will have access to the same information and not all regulatory bodies primarily assess physicians with community-based practices. In those cases, however, we believe this work provides some direction regarding where resources might be best spent to begin generating new data sources.
It also provides a process that can be used to determine what other variables might be more beneficially included in other contexts. The variables included in this dataset accounted for 22% of the variance in the peer assessment outcome, which was clearly enough to meaningfully identify at-risk physicians in the test sample; how much, if any, of the remaining 78% of the variance is simply unaccountable (eg, due to random factors), and whether any gains in prediction can be made by adding further variables (as opposed to simply overfitting the data), remains to be seen. Further, the heterogeneity of the sample (ie, the inclusion of 5 distinct specialties) may have obscured differences in terms of what would best identify at-risk physicians within any given specialty. The advantage of having adopted this approach, however, is the creation of a single model that can be applied to all physicians rather than attempting to maintain distinct risk scales for all specialty groups, some of which would be less robust due to smaller numbers of physicians. We were reassured in the decision to pursue that goal by the fact that specialty certification was not retained as an important factor in the empirically derived model of risk (Table 2). Finally, while performance on the extensive and data-informed peer review mounted by the College's PPEP offers an important outcome, it is undoubtedly insufficient with respect to capturing the full scope of physician performance. As a result, this work would be beneficially replicated with additional outcomes guided by the breadth of competencies expected in frameworks like CanMEDS.12
References
Funding/support: N/A
Other disclosures: N/A
Ethics statement: This study received ethics approval from the Behavioral Research Ethics Board of the University of British Columbia based on the Tri-Council Policy framework governing research ethics review in Canada.
Author contributions: Study concept and design (all); Acquisition of data (all); Drafting of the manuscript (all); critical revision of the manuscript for important intellectual content (all).