Background Although entrustment-supervision ratings are more intuitive than other rating scales, it is not known whether their use accurately assesses the appropriateness of care provided by a resident.

Objective To determine, when the scripted resident performance level is known, how often faculty assign incorrect entrustment ratings and whether the accuracy of an entrustment-supervision scale differs by resident performance level.

Methods Faculty participants rated standardized residents in 10 videos using a 4-point entrustment-supervision scale. We calculated the frequency of rating a resident incorrectly. We performed generalizability (G) and decision (D) studies for all 10 cases (768 ratings) and repeated the analysis using only cases with an entrustment score of 2.

Results The mean score assigned by 77 raters across all videos was 2.87 (SD=0.86), with means of 2.37 (SD=0.72), 3.11 (SD=0.67), and 3.78 (SD=0.43) for scripted levels 2, 3, and 4, respectively. Faculty ratings differed from the scripted score for 331 of 768 (43%) ratings. Most errors were ratings higher than the scripted score (223, 67%). G studies estimated the variance proportions of rater and case to be 4.99% and 54.29%, respectively. D studies estimated that 3 raters would need to watch all 10 cases. When the analysis was restricted to cases scripted at level 2 entrustment, the variance proportion attributable to raters was 8.5%, and 15 raters would need to watch the 5 cases.

Conclusions Participants underestimated residents’ potential need for greater supervision. Overall agreement between raters and scripted scores was low.

Workplace-based assessment (WBA) is vital for evaluating resident performance in clinical settings.1,2  However, rating errors, particularly those stemming from inconsistent raters, pose a significant challenge.3,4  These errors can lead to suboptimal patient care and educational outcomes.5  This study addresses this issue by emphasizing the critical need to understand and mitigate rating errors in WBAs, providing essential insights for program directors.

While increasing the number of assessments helps mitigate poor interrater reliability, it is also important to understand other sources of error. This involves estimating how much of the score variation is attributable to residents, raters, or other factors through generalizability (G) and decision (D) studies. This psychometric approach aims to identify sources of variability (G studies) and to determine how assessment results can be used to make decisions about the learner (D studies). However, it does not guarantee accurate assessment in individual patient encounters, which raises concerns about competency determinations and the appropriateness of care. For example, a resident may be deemed competent across several observations but may not have performed well, or may not have been accurately assessed, in one or more of those encounters. Therefore, aggregating assessments does not address the competency level or appropriateness of care provided by the resident, or the accuracy of the observation, in a single patient encounter, potentially affecting the quality of care a patient receives.

Medical educators aim to improve WBA reliability with entrustment-supervision rating scales. These scales, based on decreasing resident supervision needs, are often more intuitive for faculty and residents.6-9  Early research suggested faculty can more easily identify with the concept of entrustment versus competency (thereby improving interrater reliability).9  While these scales may reduce the number of needed observations for acceptable reliability, questions about their enhanced effectiveness have emerged.9-11 

It is not known whether the use of entrustment-supervision ratings improves the accuracy of single observations and thereby addresses the appropriateness of care provided by a resident to a patient in a single encounter. While programmatic determination of the overall competency of a resident is important, it is equally important to ensure that each patient encounter provides safe, effective, and patient-centered care under the right amount of supervision.12

We aimed to measure the accuracy of single-encounter, entrustment-supervision scale WBAs. The main objective of this study was to determine the frequency of entrustment rating errors when the scripted resident performance is known, where we define error as a participant rating differing from the scripted rating. The second objective was to determine whether the accuracy of an entrustment rating differed by resident skill level. To compare the individual observation assessments to a more programmatic view, we also performed G and D studies to understand the performance of the WBA across all observations.

What Is Known

Use of entrustment scales is growing, yet the psychometric implications of their use still need to be understood.

What Is New

Entrustment decision accuracy was measured using standardized resident performance, and levels of agreement were not always optimal.

Bottom Line

This study adds to the growing body of literature on how entrustment decisions should be used for high-stakes purposes.

Setting and Participants

All program directors from Accreditation Council for Graduate Medical Education-accredited family and internal medicine programs within a 5-hour drive of our study sites in Chicago and Philadelphia (324 programs from 6 Midwest and 5 Mid-Atlantic states) were invited via email to recommend eligible faculty who might be interested in participating.13  All potential participants, whose email addresses were provided by program directors, were practicing clinicians who trained and assessed residents in the outpatient setting, had been on faculty for at least 1 year, provided care for their own panel of patients in the outpatient setting, had not yet taken the course or participated in one of the studies about direct observation, and were available for a 2-day session. At the time of the trial, a power calculation called for a sample size of 25 per group.13  We oversampled to account for potential participant attrition. The final 77 participants were asked to independently rate 10 standardized resident-patient video encounters using a modified 4-point prospective entrustment-supervision scale (Table 1).13  Raters were given the scripted level of training (ie, postgraduate year) of the resident depicted in each video case but were blinded to the scripted level of performance (entrustment scale rating). All participants completed a demographic survey.

Table 1

Modified 4-Point Prospective Entrustment-Supervision Rating Scale


Development of Trigger Videos and Expert Assessment

The 10 video cases used in this study were developed for a previously published randomized controlled trial and depict a standardized resident obtaining a history from, or counseling, a standardized patient.13  As described in the original manuscript, each case was rigorously scripted using the best available evidence to represent specific supervision-based entrustment levels for residents performing a history or counseling a patient across a variety of diagnoses, ensuring that the patient in each scenario receives high-quality care.14  Six physicians with expertise in physician-patient communication and trainee assessment, along with study authors, worked together to create a matrix of observable behaviors and skills necessary to display a given resident skill level. One investigator (J.K.) wrote the trigger video scripts using these observable behaviors and skills. The experts and 2 study investigators (E.S.H., L.C.) reviewed the scripts for accuracy before filming. To finalize the entrustment level portrayed by the standardized residents after filming, the videos were reviewed by one expert who had reviewed the original script and 2 experts who had not seen the script and were blinded to the scripted performance level. Of the 10 videos, 5 depicted a resident performing at entrustment level 2 (learner can practice the skill with direct supervision), 3 depicted level 3 (learner can practice the skill with indirect supervision), and 2 depicted level 4 (unsupervised practice allowed).15,16

Data Analysis

To best evaluate the common approach residency programs use to assess residents (combining multiple ratings across raters), we compared both the individual rater’s and the group assessments to the scripted score for each case. We first calculated the mean score obtained from raters across all 10 cases and for cases representing each entrustment level. We compared the observed mean score across all cases and for cases at each entrustment level to the scripted score using 2-sided t tests. We calculated the frequency of errors, which we defined as an entrustment rating higher or lower than scripted, within and across cases. We then calculated kappa coefficients to determine the level of agreement between raters and experts.
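For illustration, a minimal sketch of this error-frequency and agreement calculation is shown below. It is not the authors' SPSS procedure; the file name and column names (rater, case, rating, scripted) are hypothetical.

```python
# Minimal sketch (not the authors' SPSS analysis): rating error frequency and
# agreement with the scripted score. Assumes a hypothetical long-format file
# with one row per rating and columns: rater, case, rating, scripted.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

ratings = pd.read_csv("ratings.csv")  # hypothetical file

# An error is any rating above or below the scripted entrustment level
ratings["error"] = ratings["rating"] != ratings["scripted"]
ratings["rated_high"] = ratings["rating"] > ratings["scripted"]

print("Error rate:", round(ratings["error"].mean(), 2))
print("Rated higher than scripted:", round(ratings["rated_high"].mean(), 2))

# Agreement between participant ratings and the scripted (expert) score
kappa = cohen_kappa_score(ratings["rating"], ratings["scripted"])
print("Cohen's kappa vs scripted:", round(kappa, 2))
```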

We then performed G and D studies to mirror how a residency program may attempt to overcome poor interrater reliability.3,4  G studies can estimate the source of variation in scores, that is, how much of the score variation is explained by the rater versus the resident skill level. We performed G studies for all cases, with the cases (ie, standardized residents) as the object of measurement. We used a one-facet crossed design (rater x case model) where raters represent the participants and cases represent the standardized residents in the videos. Since G studies use the difference of the score from the overall mean to estimate variance components, the rater variance component was recalculated using the scripted instead of the population mean to determine if this would impact the score variance attributable to the raters.3 
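The following is a minimal sketch of variance component estimation for a one-facet crossed (case × rater) design using the classical expected-mean-squares approach; it is an illustration under stated assumptions rather than the urGENOVA procedure, and it assumes a complete cases-by-raters matrix X.

```python
# Minimal sketch of a one-facet crossed (case x rater) G study using the
# classical ANOVA / expected-mean-squares method (illustration only, not
# urGENOVA). X is a complete cases-by-raters matrix of entrustment ratings.
import numpy as np

def variance_components(X: np.ndarray) -> dict:
    n_c, n_r = X.shape
    grand = X.mean()
    case_means = X.mean(axis=1)
    rater_means = X.mean(axis=0)

    ss_case = n_r * np.sum((case_means - grand) ** 2)
    ss_rater = n_c * np.sum((rater_means - grand) ** 2)
    ss_total = np.sum((X - grand) ** 2)
    ss_resid = ss_total - ss_case - ss_rater  # case x rater interaction + error

    ms_case = ss_case / (n_c - 1)
    ms_rater = ss_rater / (n_r - 1)
    ms_resid = ss_resid / ((n_c - 1) * (n_r - 1))

    return {
        "case": (ms_case - ms_resid) / n_r,   # object of measurement
        "rater": (ms_rater - ms_resid) / n_c,
        "residual": ms_resid,                 # interaction confounded with error
    }
```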

Simulated D studies demonstrate how the score precision changes based on changing the number of observations; the results can be used to determine how many observations need to be obtained before a residency program can make a reliable determination of a resident’s performance. We performed D studies to estimate the number of raters needed to accurately assign an entrustment rating to a case. Since raters describe more difficulty and discomfort with assessing struggling or poor performing residents,17  G and D studies were repeated for cases scripted with a level 2 entrustment rating.
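A minimal D-study sketch follows, projecting a dependability coefficient as the number of raters changes. It assumes an absolute-decision (Φ-type) coefficient for the crossed design described above; the variance components shown in the usage example are hypothetical, not the study's results.

```python
# Minimal D-study sketch: project the dependability (phi) coefficient for a
# given number of raters, using variance components estimated as above.
# This assumes an absolute-decision coefficient; a relative (generalizability)
# coefficient would omit the rater component from the error term.
def phi(var_case: float, var_rater: float, var_resid: float, n_raters: int) -> float:
    absolute_error = (var_rater + var_resid) / n_raters
    return var_case / (var_case + absolute_error)

# Example: dependability as raters are added (hypothetical components)
components = {"case": 0.40, "rater": 0.04, "residual": 0.32}
for n in (1, 2, 3, 5):
    print(n, round(phi(components["case"], components["rater"], components["residual"], n), 2))
```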

We used SPSS (Version 28.0.1) for all descriptive and comparative analyses and urGENOVA (Version 2.1) for the G and D studies.

The institutional review board at the University of Pennsylvania approved this study.

A total of 221 faculty were recommended by program directors. Of these, 31 did not respond, 40 were unable to participate, and 56 were ineligible. Fourteen dropped out after randomization and 3 after baseline data collection (due to scheduling and personal conflicts). Participant demographics are shown in Table 2. There were 768 entrustment ratings in the sample (99.7% of the expected 770); ratings were missing from 2 participants for 2 cases.

Table 2

Baseline Characteristics of the 77 Participants


The mean entrustment rating across all 10 cases was 2.87 (SD=0.86), which differed significantly from the mean scripted score of 2.70 (SD=0.78; P<.001) (Figure). Observed scores also differed significantly from the scripted score for cases at each entrustment level: 2.37 (SD=0.72) vs 2 (P<.001), 3.11 (SD=0.67) vs 3 (P=.015), and 3.78 (SD=0.43) vs 4 (P<.001).

Figure

Comparison of Scripted versus Actual Mean for each of the 10 Videos

Note: The actual mean of scores assigned by 77 raters of 10 standardized residents interacting with a standardized patient is compared to the scripted score for each case (5 counseling [CA1, 4, 5, 6, and 8] and 5 history-taking [HA1, 3, 4, 7, and 9] cases).


Of the 768 total ratings, 331 (43%) were incorrect, with 223 (29%) rated higher than the scripted score (Table 3). Of the 384 ratings of the 5 cases scripted at level 2 entrustment, half (192) were incorrect, and most of these errors (157, 82%) were ratings higher than the scripted score. The overall kappa was -0.19 for all cases (-0.26 for cases scripted at level 2, -0.18 at level 3, and -0.14 at level 4).

Table 3

Number and Percent of Incorrect and Correct Ratings by Faculty Participants


To conduct the G studies, we replaced the 2 missing values with the mean rating from the other 76 raters for the respective cases. The variance component for raters was 0.039, explaining 4.99% of the observed variation; cases explained 54.29% (variance component 0.424), and the residual error explained 40.72% (variance component 0.318) (Table 4). D studies demonstrated that 3 raters watching all 10 cases would be needed for a G coefficient of 0.78 (30 total observations). The rater variance proportion increased to 9.85% when the scripted score rather than the observed mean was used to calculate variance components. When the G studies were repeated using only the level 2 scripted cases (Table 4), raters explained 8.5% of the observed variance, and D studies estimated that 15 raters would be needed to rate all 5 level 2 cases to reach a G coefficient of 0.81 (75 total observations).
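As a consistency check (assuming the reported value is the absolute, Φ-type dependability coefficient for the crossed design), the D study result for 3 raters follows directly from the reported variance components:

$$\Phi(n_r = 3) = \frac{0.424}{0.424 + (0.039 + 0.318)/3} \approx 0.78$$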

Table 4

The Percentage of the Contribution of Raters as Estimated by Using Generalizability Studies and Results of Decision Studies


In 29% of ratings, participants underestimated residents’ future supervision needs, and overall agreement with the scripted scores was low, as reflected in the kappa values. Notably, this overrating was more frequent (41%) for low-performance cases (entrustment level 2), potentially leading to inadequate supervision for 157 of 384 patients.

While entrustment rating errors were frequent in individual observations, our findings, supported by the G and D studies, affirm the validity of aggregating observations for trainee assessment. Notably, a high-stakes decision regarding supervision levels for patient care could be made with input from just 3 faculty members observing 10 cases, although this still amounts to 30 total observations. When we re-evaluated generalizability using only the cases scripted at entrustment level 2, both the rater contribution and the D study results varied substantially by resident performance level: reaching an acceptable G coefficient required 3 raters when all 10 cases were considered but 15 raters when only the 5 lowest-performing cases were considered. This underscores the need for caution in using entrustment scales to assess history taking and counseling, as generalizability varies widely by performance level. These findings reinforce the challenge faculty face in assessing and providing feedback to struggling residents compared with those performing at a higher level.18

Interestingly, the variation attributed to raters in our study was substantially lower than in previous G studies using an entrustment-supervision WBA scale, in which raters typically explained 40% to 60% of the observed variation.19  The factors underlying this unexpected finding are unclear. Possibilities include (1) ratings in our study occurred in a controlled setting without typical contextual factors20,21; (2) each scripted case level displayed relatively consistent rating patterns among participating faculty: almost all faculty correctly rated the high-performing (level 4) videos, whereas the majority rated the lowest-performing (level 2) videos incorrectly; or (3) study participants first narratively assessed what the resident did well and what needed improvement before completing the entrustment scale. The low variation attributed to raters, however, suggests that incorrect assignments of future entrustment may be even more common in clinical learning environments, where rater variation is higher.

Programs often rely on G and D studies to estimate how many observations are needed to determine resident competence.3  These calculations use dispersion, or deviation from the population mean, to make the estimates. Our study is unique in that the scripted score of each video case is known; we were therefore able to compute deviations from the scripted score rather than from the calculated population mean. When we recalculated the G studies using the scripted score, raters explained more of the variation (8.50% vs 4.99%) than with the calculated or population mean, suggesting that the rate of errors in supervision decisions is even higher.
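One plausible way to implement this re-anchoring is sketched below. This is an assumption-laden illustration only (the authors' exact urGENOVA procedure is not reproduced): the rater deviations are taken from the mean scripted score rather than from the observed grand mean when computing the rater sum of squares.

```python
# Illustration only (an assumption, not the authors' exact urGENOVA procedure):
# re-anchor rater deviations on the mean scripted score rather than the
# observed grand mean when estimating the rater variance contribution.
import numpy as np

def rater_ss(X: np.ndarray, anchor: float) -> float:
    """Rater sum of squares with deviations taken from `anchor`."""
    n_c, _ = X.shape
    rater_means = X.mean(axis=0)
    return n_c * float(np.sum((rater_means - anchor) ** 2))

# X: cases-by-raters rating matrix; scripted: scripted level per case
# observed anchor -> standard G study; scripted anchor -> re-anchored estimate
# ss_observed = rater_ss(X, X.mean())
# ss_scripted = rater_ss(X, scripted.mean())
```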

There are several limitations. Clinical Competency Committees (CCCs) and program directors often use multiple types of evaluations to determine residents’ performance level. Nevertheless, CCC decisions typically rely heavily on faculty members’ direct observations of residents caring for patients. CCCs also use multiple observations over time to make decisions about trainees, which may increase the accuracy of the pooled information. Our study was limited to internal medicine and family medicine physicians observing videos of standardized residents in outpatient encounters; our findings may therefore not be generalizable to other specialties, other care contexts, or evaluations of actual patient encounters. It is possible that the scripted video entrustment levels were not accurate. In addition, the video creation focused on content and response process evidence rather than other sources of validity evidence.

The accuracy of entrustment scale ratings varied significantly by resident performance level, with more errors occurring at lower levels of resident performance. Residents who perform well are more likely to be evaluated accurately.

1. Norcini J, Burch V. Workplace-based assessment as an educational tool: AMEE guide no. 31. Med Teach. 2007;29(9-10):855-871.
2. Prentice S, Benson J, Kirkpatrick E, Schuwirth L. Workplace-based assessments in postgraduate medical education: a hermeneutic review. Med Educ. 2020;54(11):981-992.
3. Brennan R. Generalizability Theory. Springer-Verlag; 2001.
4. Monteiro S, Sullivan GM, Chan TM. Generalizability theory made simple(r): an introductory primer to G-studies. J Grad Med Educ. 2019;11(4):365-370.
5. Holmboe ES, Kogan JR. Will any road get you there? Examining warranted and unwarranted variation in medical education. Acad Med. 2022;97(8):1128-1136.
6. ten Cate O, Chen HC. The ingredients of a rich entrustment decision. Med Teach. 2020;42(12):1413-1420.
7. Weller JM, Coomber T, Chen Y, Castanelli DJ. Key dimensions of innovations in workplace-based assessment for postgraduate medical education: a scoping review. Br J Anaesth. 2021;127(5):689-703.
8. Dudek N, Gofton W, Rekman J, McDougall A. Faculty and resident perspectives on using entrustment anchors for workplace-based assessment. J Grad Med Educ. 2019;11(3):287-294.
9. Eltayar AN, Aref SR, Khalifa HM, Hammad AS. Do entrustment scales make a difference in the inter-rater reliability of the workplace-based assessment? Med Educ Online. 2022;27(1):2053401.
10. Robinson TJG, Wagner N, Szulewski A, Dudek N, Cheung WJ, Hall AK. Exploring the use of rating scales with entrustment anchors in workplace-based assessment. Med Educ. 2021;55(9):1047-1055.
11. ten Cate O. When I say…entrustability. Med Educ. 2020;54(2):103-104.
12. Kogan JR, Conforti LN, Iobst WF, Holmboe ES. Reconceptualizing variable rater assessments as both an educational and clinical care problem. Acad Med. 2014;89(5):721-727.
13. Kogan JR, Dine CJ, Conforti LN, Holmboe ES. Can rater training improve the quality and accuracy of workplace-based assessment narrative comments and entrustment ratings? A randomized controlled trial. Acad Med. 2023;98(2):237-247.
14. Calaman S, Hepps JH, Bismilla Z, et al. The creation of standard-setting videos to support faculty observations of learner performance and entrustment decisions. Acad Med. 2016;91(2):204-209.
15. Chen HC, van den Broek WES, ten Cate O. The case for use of entrustable professional activities in undergraduate medical education. Acad Med. 2015;90(4):431-436.
16. ten Cate O, Schwartz A, Chen HC. Assessing trainees and making entrustment decisions: on the nature and use of entrustment-supervision scales. Acad Med. 2020;95(11):1662-1669.
17. Boileau E, St-Onge C, Audétat MC. Is there a way for clinical teachers to assist struggling learners? A synthetic review of the literature. Adv Med Educ Pract. 2017;8:89-97.
18. Colletti LM. Difficulty with negative feedback: face-to-face evaluation of junior medical student clinical performance results in grade inflation. J Surg Res. 2000;90(1):82-87.
19. Wang XM, Wong KFE, Kwong JYY. The roles of rater goals and ratee performance levels in the distortion of performance ratings. J Appl Psychol. 2010;95(3):546-561.
20. Yeates P, Moult A, Cope N, et al. Measuring the effect of examiner variability in a multiple-circuit objective structured clinical examination (OSCE). Acad Med. 2021;96(8):1189-1196.
21. Park YS, Hyderi A, Heine N, et al. Validity evidence and scoring guidelines for standardized patient encounters and patient notes from a multisite study of clinical performance examinations in seven medical schools. Acad Med. 2017;92(11S, Association of American Medical Colleges Learn Serve Lead: Proceedings of the 56th Annual Research in Medical Education Sessions):12-20.

Funding: The authors report no external funding source for this study.

Conflict of interest: The authors declare they have no competing interests.

The preliminary findings of this study were presented as an abstract at the Association for Medical Education in Europe conference, August 26-30, 2023, Glasgow, Scotland.