ABSTRACT
The entrustable professional activity (EPA) assessment framework allows supervisors to assign entrustment levels to physician trainees for specific activities. Limited opportunity for direct observation of trainees hampers entrustment decisions, in particular for infrequently performed activities. Simulation allows for direct observation, so tools to assess performance of EPAs in simulation could potentially provide additional data to complement clinical assessments.
We developed and collected validity evidence for a simulation-based tool grounded in the EPA framework.
We developed E-ASSESS (EPA Assessment for Structured Simulated Emergency ScenarioS) to assess performance in 2 EPAs among pediatric residents participating in simulation-based team training in 2017–2018. We collected validity data, applying Messick's unitary view of validity. Three raters used E-ASSESS to assign entrustment levels based on performance in simulation. We compared those ratings to entrustment levels assigned by clinical supervisors (different from the study raters) for the same residents on a separate tool designed for clinical practice. We calculated intraclass correlation (ICC) for each tool and Spearman's rank correlation coefficients to compare ratings between tools.
Twenty-eight residents participated in the study. The ICC among the 3 raters for entrustment ratings on E-ASSESS ranged from 0.65 to 0.77, while the ICCs among raters of the clinical tool were 0.59 and 0.57. We found no significant correlations between E-ASSESS ratings and clinical practice ratings for either EPA (rho = -0.35 and 0.38, P > .05).
Assessment following an EPA framework in the simulation context may be useful to provide data points to inform entrustment decisions as part of resident assessment.
What Was Known and Gap: The entrustable professional activity (EPA) assessment framework allows supervisors to assign entrustment levels to trainees for specific activities, but there are few opportunities for direct observation of trainees.
What Is New: A simulation-based tool grounded in the EPA framework.
Limitations: Study conducted at a single institution, limiting generalizability. Only 2 EPAs were studied; better alignment might exist with other EPAs.
Bottom Line: The E-ASSESS tool was easy to use and had reasonable interrater reliability, but there was no clear correlation with performance ratings for the same EPAs in clinical practice.
Introduction
Entrustable professional activities (EPAs) are gaining popularity as a framework for competency-based assessment in medical education. EPAs, “units of professional practice that constitute what clinicians do as daily work,”1 help supervisors assess trainee competency by determining how much they entrust a trainee to perform a specific activity independently. EPAs operationalize competencies by focusing on activities and associated tasks that can be observed in specific clinical contexts.1,2 Specialty-specific EPAs have been developed for graduate medical education in several fields, including pediatrics, obstetrics and gynecology, surgery, psychiatry, internal medicine, and family medicine.3
One challenge with clinical performance assessments is that opportunities for direct observation in the clinical setting are declining4; therefore, a supervisor might be asked to make entrustment decisions without sufficient observation of a trainee's performance in a particular EPA. Simulation-based education is frequently used to augment clinical learning experiences and allow for direct observation and assessment.5,6 Numerous tools exist for skill assessment in simulation.7 These tend to focus on technical or non-technical skills, using checklists to identify whether the learner performed certain steps, rather than on informing decisions about a learner's readiness for independent practice. It has been suggested that simulation can be used to inform entrustment decisions around specific EPAs, but this is controversial and largely untested.8,9 To our knowledge, no published assessment tool for use in simulation has applied the EPA framework to align episodic performance evaluations in simulation with longitudinal evaluation data from clinical contexts. If we can gain reliable information about trainees' performance of specific EPAs in simulations, this may provide additional data points to inform entrustment decisions. We therefore developed the E-ASSESS (EPA Assessment for Structured Simulated Emergency ScenarioS) tool and collected validity evidence to support the use of simulation to provide assessment information that can potentially contribute to entrustment decisions.
Methods
Setting and Participants
We conducted this project in the pediatric residency program at the University of California, San Francisco (UCSF). In July 2017, this program introduced American Board of Pediatrics (ABP) EPA-based assessments for clinical supervisors to assign entrustment levels to residents they worked with during clinical rotations.10 We modeled our E-ASSESS tool after the residency's EPA clinical practice assessment tools11 and pilot tested it among residents who participated as leaders in an interprofessional simulation-based team training program at our institution, described in detail in a prior publication.12 The program's learning objectives include management of acutely deteriorating patients, application of resuscitation algorithms, and effective teamwork and leadership during emergency situations. Simulation scenarios reflect common pediatric emergencies: seizure/status epilepticus, anaphylaxis, shock (hypovolemic, hemorrhagic, septic), cardiac arrest (pulseless electrical activity or arrhythmia), and respiratory failure (bronchiolitis, pneumonia, asthma exacerbation, respiratory depression).
We recruited 3 pediatricians at our institution with relevant content expertise as raters to assist with the E-ASSESS pilot. In the first phase of our study, raters reviewed video-recorded performances of a previous cohort of residents participating as team leaders in the simulation program. In the second phase, we prospectively recruited residents who participated as team leaders during the 2017–2018 academic year, video-recorded their performances for review by the study raters, and accessed their clinical practice EPA assessments provided by clinical supervisors (different from study raters).
Instrument Development
We reviewed the ABP EPAs and chose 2 applicable to activities covered in our simulation program: EPA 4, “Manage patients with acute, common diagnoses,” and EPA 15, “Lead an interprofessional health care team.”10 We modeled the E-ASSESS tool (provided as online supplemental material) after our residency program's workplace-based EPA assessment tools.11 The latter were developed by our residency leadership and use frequency-anchored questions about tasks and behaviors essential to the ABP EPAs, along with a supervision scale adapted from Chen et al.13 E-ASSESS uses the same structure as the residency workplace-based EPA tools and consists of 3 parts: (1) behavioral items to assess specific skills integral to each EPA; (2) an entrustment scale; and (3) a free-response item for the assessor to explain their reasoning. Because a longitudinal relationship between rater and trainee is uncommon in the simulated setting, we replaced the frequency ratings on the first part of the tool with behavioral anchors based on the associated milestones.
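To make the 3-part structure concrete for readers who think in code, the sketch below shows one way such an assessment form could be represented. This is purely illustrative: the class names, fields, and item wording are hypothetical and are not drawn from the actual E-ASSESS instrument.

```python
from dataclasses import dataclass, field

@dataclass
class BehavioralItem:
    """A skill integral to the EPA, rated against a milestone-based behavioral anchor."""
    description: str
    anchor_level: int = 0  # anchor chosen by the rater (hypothetical numeric coding)

@dataclass
class EpaAssessment:
    """Schematic of the tool's 3 parts: behavioral items, entrustment scale, free response."""
    epa: str                                              # e.g., "EPA 4" or "EPA 15"
    behavioral_items: list = field(default_factory=list)  # part 1: behavioral items
    entrustment_level: int = 0                            # part 2: 0 (observe only) to 8 (supervise others)
    rationale: str = ""                                   # part 3: rater's free-text reasoning

# Example use with a made-up item and rating
form = EpaAssessment(
    epa="EPA 15",
    behavioral_items=[BehavioralItem("Assigns team roles and closes the loop on orders")],
)
form.entrustment_level = 5
form.rationale = "Led the resuscitation with minimal prompting."
print(form.epa, form.entrustment_level)
```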
Procedures to Collect Validity Evidence
Content Validity:
In addition to mapping the instrument to the ABP EPAs, we developed E-ASSESS through an iterative process involving review by experts in medical education and simulation at our institution. These included pediatric subspecialists in hospital medicine, intensive care, and emergency medicine, as well as educators with PhD and Master's degrees.
Response Process:
At the beginning of the study, the principal investigator (C.A.) briefed the raters on the intended use of E-ASSESS. Next, the 3 raters watched 5 video-recorded simulation scenarios and used E-ASSESS to evaluate each scenario's resident team leader. The principal investigator met with the raters and reviewed the videos using a “think-aloud protocol”16 to explore reasons for discrepancies in ratings. We subsequently refined E-ASSESS, and the raters used the revised tool to assess resident performance in an additional 5 videos. We repeated this process for a total of 3 rounds, using different video-recorded scenarios with different resident leaders for each round.
Internal Structure:
The E-ASSESS entrustment scale allows raters to score trainees on a scale from 0 to 8, with each level corresponding to an increasing degree of trust in the trainee's ability to perform autonomously (from 0, trust the trainee to observe only, to 8, trust the trainee to supervise others; additional information provided as online supplemental material). We used intraclass correlation (ICC) to examine interrater reliability among the 3 raters who completed the E-ASSESS tool in both study phases.18,19
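For readers interested in what a two-way random-effects ICC computation involves, the following minimal Python sketch implements the Shrout and Fleiss formulas on a hypothetical matrix of entrustment ratings. The study analyses themselves were performed in SPSS; the data below are invented for illustration only.

```python
import numpy as np

def icc_two_way_random(ratings: np.ndarray):
    """Two-way random-effects ICC (Shrout & Fleiss) for a complete
    n_subjects x k_raters matrix (no missing ratings)."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)  # per-resident means
    col_means = ratings.mean(axis=0)  # per-rater means

    # ANOVA sums of squares
    ss_rows = k * ((row_means - grand_mean) ** 2).sum()
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()
    ss_total = ((ratings - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Single-rater ICC(2,1) and average-of-k-raters ICC(2,k)
    icc_single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    icc_average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return icc_single, icc_average

# Hypothetical entrustment ratings (0-8 scale): rows = residents, columns = the 3 raters.
ratings = np.array([
    [3, 4, 3],
    [5, 5, 6],
    [2, 3, 2],
    [6, 6, 7],
    [4, 5, 4],
], dtype=float)

icc_2_1, icc_2_k = icc_two_way_random(ratings)
print(f"ICC(2,1) = {icc_2_1:.2f}, ICC(2,k) = {icc_2_k:.2f}")
```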
Relationship to Other Variables:
In the second study phase, we compared ratings on E-ASSESS with entrustment ratings given by clinical supervisors on the clinical practice EPA assessment instruments for EPAs 4 and 15 during the same time frame (January–June 2018). Clinical supervisors participated in 20-minute faculty development sessions on EPAs provided by residency leadership in the year prior to the study. As the number of raters for the clinical practice tool varied for each resident and each EPA, we examined interrater reliability among these raters with a 2-way random effects model ICC. We calculated mean entrustment scores for each resident across all raters for each tool and each EPA. We used Spearman's rank correlation coefficient to examine the relationship between the 2 sets of ratings separately for each EPA. We used SPSS Statistics 26 (IBM Corp, Armonk, NY) for all statistical analyses.
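Similarly, the correlation step could be reproduced in Python with scipy.stats.spearmanr, as in the sketch below; the mean entrustment scores shown are hypothetical and do not correspond to the study data.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical mean entrustment scores (0-8 scale) per resident for one EPA,
# averaged across raters for each tool; these are not the study data.
e_assess_means = np.array([3.3, 4.7, 2.7, 5.0, 4.3, 3.7, 5.3, 2.3])
clinical_means = np.array([6.0, 5.5, 6.5, 7.0, 5.0, 6.0, 7.5, 5.5])

rho, p_value = spearmanr(e_assess_means, clinical_means)
print(f"Spearman's rho = {rho:.2f}, P = {p_value:.2f}")
```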
The UCSF Institutional Review Board approved the study.
Results
A total of 28 residents participated in the study: 15 in the first study phase and 13 in the second. In the second phase, 3 residents participated as simulation team leaders twice, for a total of 16 video-recorded performances in this phase. The number of ratings per resident from supervisors in the clinical setting ranged from 0 to 8 for each EPA. Two residents received no clinical practice ratings on EPA 15, and 2 residents had no ratings for EPA 4.
The table summarizes the E-ASSESS ICC. Using commonly cited cut-offs,20 overall agreement among the 3 raters on E-ASSESS was good for all entrustment levels. For specific behaviors within each EPA, agreement ranged from fair to excellent. ICC among raters of the clinical practice tool was fair for both EPA 4 and EPA 15 (0.59 and 0.57, respectively).
The figure shows entrustment levels on E-ASSESS versus the clinical practice instruments. The correlations between E-ASSESS ratings and clinical practice ratings were not statistically significant (rho = -0.35, P = .25 for EPA 4; rho = 0.38, P = .18 for EPA 15).
Figure: Correlations Between Entrustment Levels on E-ASSESS and Clinical Practice Tools
Note: Correlation between entrustment levels assigned on the E-ASSESS tool and the clinical practice tool for EPA 4 (panel A) and EPA 15 (panel B). Scale 0–8 for both instruments; Spearman's rho = -0.35, P = .25 for EPA 4 and rho = 0.38, P = .18 for EPA 15.
Discussion
Our E-ASSESS tool, developed to assess resident performance of 2 EPAs during simulation, appeared easy to use and had reasonable interrater reliability, but we did not find significant correlations between ratings on E-ASSESS and the clinical practice assessment tools. This finding has several potential explanations worth exploring. It is possible that either E-ASSESS or the clinical practice tool (or both) does not provide a reliable assessment of the underlying constructs, at least not in the contexts in which they were used or in the hands of the raters who used them. Based on the ICC data, E-ASSESS had reasonable interrater reliability, but this was less evident for the clinical practice tool. Reliability may have been compromised if ratings on the clinical practice tool were not based on actual observation.
In addition, unlike the raters who used E-ASSESS, raters using the clinical practice tool received limited training. Even among the trained simulation raters, however, the ICC for some of the specific behaviors remained fair at best. These persistent differences of opinion among raters were likely due to differences in professional background and expertise, which led to different expectations of learners; this highlights that rater agreement depends on rater characteristics.21 Of note, entrustment ratings on the clinical tool were on average much higher than ratings assigned to the same residents using E-ASSESS in simulation. This may be due to leniency bias, the phenomenon of supervisors giving overly positive assessments, typically to avoid difficult conversations or out of fear of retribution.17,22,23 Clinical supervisors knew their evaluations would be viewed by the residents and therefore may have been prone to leniency bias, whereas study raters of simulations were told that residents would not see the ratings, as they were generated for study purposes only.
A second explanation for the lack of correlation may be that E-ASSESS does not measure the same constructs as the clinical practice tool. Although both tools aim to assess the same EPAs, differences between the simulation and clinical contexts may lead to varying tasks and behaviors that can be observed. In most simulated scenarios there are clear learning objectives, and the focus tends to be on the application of algorithms and/or team leadership skills within the crisis resource management framework. Real-life emergency scenarios are more variable and unpredictable, and what is expected from team members may vary. In addition, teamwork and team leadership in clinical practice do not always center on emergencies and more often take place in low-acuity settings. While there is overlap between teamwork and team leadership skills in low- and high-acuity settings, they are not the same.24 Considering the stakes, clinical supervisors may more easily entrust a resident with leading a team in a low-acuity setting, which is an alternative explanation for the higher ratings on the clinical practice tool for EPA 15. However, a different study in the context of our pediatric residency program found similarly high ratings of leadership skills in low-acuity settings, suggesting that leniency bias may be important.25 It is also possible that raters in the simulated setting focused on different aspects of performance than clinical supervisors did, a factor found to be a major contributor to interrater variability in a study examining assessment of clinical performance.26 Rater viewpoint as well as context play an important role in how raters assess learner performance. The complexity of the clinical environment, with a broad variety of sociocultural factors influencing both rater and learner performance, may not lend itself well to the psychometric-based, reductionist approach of a rating scale.27 This further complicates comparison between performance in clinical and simulation contexts.
Our study's limitations include its single-institution setting and small sample of pediatric residents, which limited both the power and the generalizability of our study. We also examined only 2 EPAs; other EPAs may show better alignment between simulation and clinical practice. Lastly, the residency administration provided the clinical practice EPA ratings without disclosing the raters' identities; thus, it is possible that some residents received multiple ratings from the same supervisor.
While there is some evidence that performance of procedural skills in the simulated setting may translate to real patient care settings,6 this is less clear for other competency domains.28 Whether simulation can be used to inform entrustment decisions is therefore controversial. The strength of simulation is that it allows for structured scenarios, a controlled environment, and limited variability, facilitating both rater training and benchmarking. Performance in one simulated scenario does not necessarily predict performance in other scenarios, and certainly not in the complexity of clinical practice. Thus, serial assessments in multiple contexts are likely needed to inform entrustment decisions in a programmatic approach to resident assessment.29 Such an approach relies on multiple data points. If additional studies provide validity evidence, E-ASSESS and similar tools may be useful adjuncts to clinical practice assessments.30 The number of data points needed to predict future performance and the relative weight one can give simulation-based assessments will require further study.
Conclusions
In this study, the E-ASSESS tool used to assess pediatric residents' performance in 2 EPAs in a simulation setting was easy to use and had reasonable interrater reliability, although there was no clear correlation with performance ratings for the same EPAs in clinical practice. The E-ASSESS tool may be a model for other similar tools to inform entrustment decisions about resident readiness for independent practice.
References
Author notes
Editor's Note: The online version of this article contains the E-ASSESS (EPA Assessment for Structured Simulated Emergency ScenarioS) tool.
Funding: The authors report no external funding source for this study.
Competing Interests
Conflict of interest: The authors declare they have no competing interests.
The authors would like to thank the staff of the University of California, San Francisco Kanbar Simulation Center; the faculty and participants in the University of California, San Francisco Health Professions Education Pathway Program; and all the residents who participated in the study.