Workplace-based assessment (WBA) is a key assessment strategy in competency-based medical education. However, its full potential has not been realized because of concerns about reliability, validity, and accuracy. Frame of reference training (FORT), a rater training technique that helps assessors distinguish between learner performance levels, can improve the accuracy and reliability of WBA, but the effect size is variable. Understanding the benefits and challenges of FORT may help improve this rater training technique.
To explore faculty's perceptions of the benefits and challenges associated with FORT.
Subjects were internal medicine and family medicine physicians (n=41) who participated in a rater training intervention in 2018 consisting of in-person FORT followed by asynchronous online spaced learning. We assessed participants' perceptions of FORT in post-workshop focus groups and an end-of-study survey. Focus groups and survey free text responses were coded using thematic analysis.
All subjects participated in 1 of 4 focus groups and completed the survey. Four benefits of FORT were identified: (1) opportunity to apply skills frameworks via deliberate practice; (2) demonstration of the importance of certain evidence-based clinical skills; (3) practice that improved the ability to discriminate between resident skill levels; and (4) highlighting the importance of direct observation and the dangers of using proxy information in assessment. Challenges included time constraints and task repetitiveness.
Participants believe that FORT serves multiple purposes, including helping them distinguish between learner skill levels while demonstrating the impact of evidence-based clinical skills and the importance of direct observation.
To explore faculty's perceptions of the benefits and challenges associated with frame of reference rater training.
Participants felt frame of reference training offered an opportunity to apply skills frameworks via deliberate practice, demonstrated the importance of evidence-based clinical skills, improved their ability to discriminate between resident skill levels, and highlighted the importance of direct observation.
It is uncertain how findings generalize to other specialties or rater training for other skills.
This study provides an understanding of how frame of reference training impacts faculty's beliefs about and approach to workplace-based assessment.
Workplace-based assessment (WBA) is a key assessment strategy in medical education, particularly in competency-based medical education. Improving WBA quality requires faculty development,1-4 which is necessary to improve faculty's ability to observe, synthesize observations into a judgment, encode judgments into an entrustment rating, provide feedback, and coach learners.3-5 However, the impact of faculty development interventions on the reliability and accuracy of WBA in medical education has been small to negligible.6-12 As such, more work is needed to delineate the unique effects of the various components and implementation designs of rater training.4,5
Performance dimension training (PDT) and frame of reference training (FORT) are 2 rater training techniques that can improve performance appraisal assessments.13,14 PDT trains assessors to recognize the appropriate behaviors or dimensions of a given competency or skill using evidence-informed definitions supported by examples using written vignettes, videos, or role-plays.13,14 FORT helps assessors discriminate between variations in the quality of demonstrated skills by having participants individually assess multiple videos of individual learners with different skill levels and then together discuss discrepancies in observations and ratings.13,14 For example, during PDT, assessors are asked to identify the behaviors that constitute aspirational shared decision-making and are shown examples of these behaviors in video vignettes. Then, during FORT, facilitators ask team members to individually assess several stimulus videos of a resident counseling a patient about starting a statin, in which the resident demonstrates a range of skills from poor to aspirational. Assessors then discuss the videos as a group using a compare-contrast approach, sharing assessments and discussing discrepancies to create a shared mental model to guide assessment judgments.13,14 In a business setting, PDT and FORT can improve assessment accuracy and minimize rating variability by decreasing rater errors or biases unrelated to the targeted performance behaviors.13,14 The impact of these techniques in medical education has been more variable.9-12
We previously published a rater training study that incorporated emerging rater cognition theories (the use of variable frames of reference, inference, and uncertainty in how to translate observations into numerical ratings) to determine which components of rater training might improve WBA.15-19 In that study we explored how PDT affected participants' approach to WBA.19 More recently, we demonstrated that rater training which included FORT increased the accuracy and specificity of observations.20 Clarifying the mechanisms of FORT in medical education may provide further insights into rater cognition and how best to conduct FORT to improve WBA quality and accuracy.21,22 To our knowledge there has not been a study examining faculty perceptions of FORT in medical education and the mechanism by which it influences assessors' judgments.
The purpose of this study was to understand assessors' perceptions of FORT, in particular its benefits and challenges.
We previously published a randomized controlled trial of a longitudinal rater training intervention to improve WBA.20 This current study focuses only on the intervention group of that study and on the perceptions of FORT.
Between December 2017 and August 2018, we emailed family and internal medicine residency program directors at 138 programs in 6 Midwest states and 186 programs in 5 Mid-Atlantic states soliciting their interest to enroll faculty in the study. Program directors provided email addresses for potential participants. Eligible participants needed to be general practitioner faculty who (1) were responsible for outpatient clinical training and evaluation of residents; (2) provided outpatient care for their own patient panel; (3) held a faculty position for at least 1 year; and (4) were available for a 2-day study session. Participants who agreed to enroll were initially randomized to the control (n=45) or intervention arm (n=49). This current study focuses only on the 41 intervention group participants who completed the study (7 dropped out post-randomization and 1 dropped out mid-trial secondary to illness). The control group participants are not included in the current study. Participants received a modest $150 honorarium from the Accreditation Council for Graduate Medical Education. The intervention group was eligible for up to 14.25 Continuing Medical Education and Maintenance of Certification credits.
Development of Stimulus Videos
Between June 2016 and June 2018, with the assistance of 6 experts in physician-patient communication and trainee assessment, and using evidence from the literature, we created stimulus videos depicting residents taking a history from or counseling patients.23-28 We created 27 videos (9 scenarios, each demonstrating 3 different levels of resident skill) for rater training using recommended guidelines.29 In the online supplementary data, we summarize the steps we used to create the videos and their associated answer keys with the expert-informed consensus entrustment rating and narrative assessment.
Intervention and Assessments
At baseline, participants completed a self-administered, web-based demographic questionnaire and assessed 10 stimulus videos (5 history taking and 5 counseling) using an online rater assessment form asking them to identify what the resident in the video did well, what required improvement, and how they would supervise the resident going forward (prospective entrustment decision).20 Participants attended 1 of 4 rater trainings in the fall of 2018. Each rater training consisted of two 3-hour, in-person workshops held over 2 days (6 hours in total), immediately following the baseline assessment. The rater training content and format (online supplementary data) were informed by our prior research and included PDT and FORT.9,15-19 During PDT, participants created frameworks of the skills required for history taking and counseling. They then reviewed an evidence-based framework for each skill and revised their framework as needed.23-28 During FORT, participants applied the framework to the 2 remaining stimulus videos in the series, comparing their assessments to the answer keys as a group.
After each of the 4 rater training workshops, 1 of 4 individuals with expertise in focus group facilitation (but otherwise unrelated to the study) led a focus group about the rater training (focus group guide provided as online supplementary data). Focus groups occurred immediately after the rater training so that participants could provide specific feedback about the rater training. Study investigators were not present during the focus groups, which were audio-recorded, transcribed, and de-identified.
Four weeks after the in-person workshops, participants started 3 asynchronous, online, spaced learning FORT modules with timed deliverables spaced 6 weeks apart using the Canvas Learning Management System (Instructure Inc). In spaced learning, a course is divided into short duration modules with breaks between the sessions. We included spaced learning because knowledge retention is enhanced when learning sessions are spaced in time.30 During each spaced learning module, participants were prompted to watch a series of history taking and counseling stimulus videos, each depicting 2 different levels of resident skill. We instructed participants to use their frameworks to guide observations. After rating each video, participants reviewed the answer key and identified similarities and differences between their own observations and those of the expert. A third video for each series was available for optional review (online supplementary data). A study investigator (J.K. or E.H.) moderated the spaced learning discussion boards.
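The timing described above can be laid out as a simple schedule. The following is a minimal sketch and not part of the study materials: the workshop date is a hypothetical placeholder (the source says only "fall of 2018"), and the function name `spaced_learning_schedule` and its parameters are assumptions introduced for illustration. It reads "spaced 6 weeks apart" as the interval between module start dates and measures the final gap from the start of the last module.

```python
from datetime import date, timedelta


def spaced_learning_schedule(workshop_end, n_modules=3, first_gap_weeks=4,
                             spacing_weeks=6, final_gap_weeks=4):
    """Return (module start dates, earliest final-assessment date)."""
    # First module opens a fixed number of weeks after the workshop.
    first_start = workshop_end + timedelta(weeks=first_gap_weeks)
    # Later modules open at a constant interval after the first.
    module_starts = [first_start + timedelta(weeks=spacing_weeks * i)
                     for i in range(n_modules)]
    # Assumption: the final assessment gap is counted from the last module's start.
    earliest_final = module_starts[-1] + timedelta(weeks=final_gap_weeks)
    return module_starts, earliest_final


# Hypothetical workshop date, for illustration only.
modules, final = spaced_learning_schedule(date(2018, 10, 15))
for number, start in enumerate(modules, start=1):
    print(f"Module {number} opens {start.isoformat()}")
print(f"Earliest final assessment: {final.isoformat()}")
```

Changing the default intervals allows the same sketch to model alternative spacing, such as the shorter, more frequent modules that some participants suggested.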
At a minimum of 4 weeks after the last spaced learning module (March to May 2019), participants watched and rated 10 stimulus videos (5 history taking and 5 counseling). Participants completed a 19-item end-of-study survey focused on the intervention (online supplementary data). Questions asked about the benefits of spaced learning (rated on a 5-point Likert scale where 1=strongly disagree, 5=strongly agree) and spaced learning timing and work effort (rated on a 3-point scale). There were 3 open-ended questions eliciting (1) strengths of spaced learning to improve skills in direct observation; (2) ways spaced learning could be improved; and (3) barriers to spaced learning participation.
We used descriptive statistics to summarize demographic and end-of-study survey data. Two investigators (J.K., L.C.) independently coded focus group transcripts and open-ended survey questions using thematic analysis.31 The investigators began by familiarizing themselves with the data, coding data relevant to the research question, and generating initial themes. The investigators met multiple times to review, discuss, reconcile, and name the themes identified. Themes were then shared with the third investigator (E.H.) for additional input and clarification.
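As an illustration of the kind of descriptive summary mentioned above, the sketch below summarizes responses to a single 5-point Likert item (1=strongly disagree, 5=strongly agree). It is not the authors' analysis code: the responses are made-up placeholder values rather than study data, and the function name `summarize_likert` is an assumption introduced here.

```python
from collections import Counter
from statistics import mean, median


def summarize_likert(responses, scale=(1, 2, 3, 4, 5)):
    """Summarize one Likert item: n, mean, median, and percent per scale point."""
    if any(r not in scale for r in responses):
        raise ValueError("response outside the declared scale")
    counts = Counter(responses)
    n = len(responses)
    return {
        "n": n,
        "mean": round(mean(responses), 2),
        "median": median(responses),
        "percent": {point: round(100 * counts[point] / n, 1) for point in scale},
        # Assumes a 5-point scale on which 4 and 5 indicate agreement.
        "percent_agree": round(100 * (counts[4] + counts[5]) / n, 1),
    }


# Placeholder responses for illustration only (not study data).
example = [5, 4, 4, 5, 3, 4, 5, 5, 2, 4]
summary = summarize_likert(example)
print(summary)
```

Reporting the full distribution alongside the mean is a deliberate choice in this sketch, since Likert responses are ordinal and a mean alone can hide how responses are spread across the scale.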
The Institutional Review Board of the University of Pennsylvania Office of Regulatory Affairs approved this study. All participants provided informed consent.
Table 1 summarizes participant demographics. All participants attended 1 of 4 post-workshop focus groups. From the focus groups we identified 4 themes describing FORT benefits and 6 themes describing challenges. Table 2 provides example quotes for each theme.
Benefits of FORT
First, FORT allowed participants to practice applying their previously created frameworks to different videos. Participants described how watching multiple videos of the same encounter enabled them to apply the frameworks to videos demonstrating a progression of resident skill level. Participants explained how watching the same encounter at 3 different resident skill levels promoted deliberate practice. They explained how writing out their observations required them to commit to their assessments, further facilitating deliberate practice.
Second, watching videos of the same encounter at 3 different resident skill levels helped participants gain clarity about the importance of specific clinical skills. Initially some participants doubted that a particular framework behavior was important and necessary for safe, effective, patient-centered care (for example, asking patients to prioritize their agenda at the beginning of a visit). However, when participants watched the video in which the resident started the encounter with agenda-setting, they were able to see why that behavior was beneficial and important. Therefore, seeing aspirational performance helped highlight the value and importance of certain clinical behaviors that may have initially been dismissed as unimportant or minimally important.
Third, seeing the same clinical encounter performed at 3 resident skill levels helped participants discriminate between performance levels. While it was sometimes difficult for participants to distinguish between poor and satisfactory performance (calling for direct and indirect supervision respectively) or between satisfactory and aspirational performance (no supervision needed), watching the sequence of 3 videos helped participants better distinguish between resident skill levels. Participants described how comparing and contrasting residents across the video series enabled them to better understand the range of behaviors for, or variable execution of, a given skill. For example, across the video series, participants could see a range of how much and how well a resident explored the physical, psychological, and emotional impact of a symptom on a patient.
Fourth, participants described how watching the 3-video series emphasized the importance of direct observation. Multiple participants described having an “ah ha” moment when they realized that a resident's oral case presentation might be identical after all 3 video encounters and might not represent what occurred during the patient visit. As such, FORT underscored how a resident's patient presentation was an incomplete and inadequate proxy for what occurred during an office visit. Participants recognized that, while the information a resident obtained from the patient might be the same, the patient's experience during the encounter and subsequent outcome of the visit could substantially differ. This realization further reinforced the value of direct observation.
Challenges With FORT
Participants described several challenges with FORT. Given the similarities between each video in a series, participants described how it was confusing to recall what occurred in each video. Therefore, they questioned whether the same or a different actor should portray the resident in all 3 videos. However, participants also recommended making the differences between the videos more subtle, with the poorly skilled resident a little better and the aspirational resident a little less skilled. Some felt that writing down their observations for each video in the series was tedious, particularly given the similarity of videos in the series. Participants also described how watching videos as a group may have caused cueing when other participants had verbal or non-verbal reactions to a video, potentially promoting groupthink. Participants knew that the residents in the videos were portrayed by actors, so at times they were uncertain if the behaviors they observed should be attributed to the resident's skill or the actor's performance. Finally, during FORT we asked participants if there were any behaviors on the framework that were not essential for safe, effective, patient-centered care. Several participants described their uncertainty taking behaviors off the framework when the frameworks were presented as being evidence-based.
End-of-Study Survey Results About FORT Spaced Learning
All participants (n=41) completed the end-of-study survey. Participants valued spaced learning as an approach to improve their skills in direct observation and feedback (Table 3) and were favorable regarding the number and timing of spaced learning modules (data not shown).
Almost all participants answered the open-ended questions on benefits of (n=39), areas to improve (n=39), and barriers to participating in (n=38) FORT spaced learning. Survey responses reiterated many of the focus group themes. Participants described how spaced learning afforded them additional opportunity for repeated practice, which helped refresh skills in direct observation while bringing intervening real-world experience to practice. Participants described how repeated practice mitigated losing previously acquired skills, promoting longer-term learning. Practice also reinforced the frameworks, thereby building an “internal model” for the competency being observed. As a result, applying the frameworks started to become “second nature.”
Participants also described the benefit of comparing their assessments to the answer keys during spaced learning. The answer keys helped participants differentiate between skill levels across the video series. The combination of seeing more cases and comparing assessments to the answer keys made it easier for participants to differentiate between good and fair resident skill levels. With time, participants described being better able to identify the more subtle differences in resident skills. Additionally, participants valued the ability to compare their assessments to the answer keys to see how they could further improve as evaluators. Finally, the FORT spaced learning continued to serve as a reminder that resident behaviors in the room with a patient may not translate into their oral case presentations.
The greatest challenge with spaced learning was time. Participants described how it was challenging to complete the spaced learning given their competing responsibilities. Some participants wondered if it would be better to assign fewer videos per module or make each video shorter but have more modules. Several commented that typing their observations was tedious. Finally, some participants recommended expanding the modules to skills beyond history taking and counseling.
In this study we explored the mechanism by which FORT, delivered through 2 in-person workshops and 3 asynchronous online spaced learning modules, impacted participants' approach to WBA. We found that FORT provided participants an opportunity to practice applying assessment frameworks, highlighted the importance of specific evidence-based clinical skills in patient care, helped participants improve their ability to discriminate between skill levels, and emphasized the importance of direct observation and the dangers of using proxy information in assessment. There are ongoing calls to improve assessment across the undergraduate to graduate medical education continuum.32-34 Importantly, the ability to assess effectively requires training and practice.2-5,21 To our knowledge this is the first study to explore benefits and challenges of WBA FORT in medical education.
The participants in our study recognized how FORT provided an opportunity to engage in deliberate practice of assessment through repetition, reflection, and feedback using the answer keys. The acquisition of expertise requires deliberate practice,35 and acquiring expertise in assessment is likely no different. The ability to compare individual assessments to those of expert raters has previously been identified as an important FORT technique.13 Deliberate practice requires motivation and endurance, and participants described how this practice sometimes felt tedious and time consuming. As such, more research is needed to determine how best to optimize stimulus video content and delivery.
Faculty evaluation of learners is likely related to faculty's own clinical skills.15,16 Our findings highlight how FORT may help faculty shift from using their own clinical skills as the standard when evaluating residents. Furthermore, if faculty do not routinely perform a particular skill, or if they do not believe that behavior is important for effective patient care, they are unlikely to comment on it or assess it in their learners. While reviewing evidence-based clinical skills frameworks may suffice to convince a few faculty members of the importance of these specific clinical skills,19 we found that faculty could better appreciate why certain skills were important and beneficial after seeing the 3-video series, particularly the video of the resident with aspirational skills. Therefore, FORT may highlight the importance of specific clinical skills and increase the likelihood that faculty would use them in criterion-referenced resident assessment.
Successful implementation of WBA requires that both individuals and the organizational culture value direct observation.5,36 It can be difficult to get faculty to buy into the importance of direct observation.37 Watching the 3-video series reinforced the importance of direct observation. This was an unexpected finding. Watching the video series helped participants appreciate, in a more salient and visceral way, how a resident might present a patient the same way despite 3 very different patient encounters. This realization reinforced the importance of direct observation and the dangers and limitations of using proxy information. Going forward, using a video series may be an effective strategy to increase buy-in to direct observation.
The rater training in this study was longer than that in most previously published studies (6 hours of in-person workshops and 3 online asynchronous spaced learning modules). Effective rater training takes time, and this raises the real challenge of how to integrate more intensive rater training into medical education assessment programs given faculty's limited time and competing responsibilities. That said, if we want to improve assessment, we need to recognize that there are no easy fixes. Effective assessment takes time, practice, and faculty development.5,21,37 Furthermore, iterative practice is important given that prior rater training interventions have shown drift effects in which assessors show high levels of interrater reliability initially but poor reliability in subsequent performance assessments.38,39 Successful implementation of WBA requires that individuals and the organizational culture value direct observation.5,36 While it can be difficult to get faculty to buy into the importance of direct observation, training assessors on resident observation helps establish a sense of trust, reliability, and validity in the feedback that faculty provide to learners after conducting an observation, thereby shifting the cultural view of WBA as time well spent. Future research will need to explore how best to implement FORT to maximize effectiveness while balancing feasibility.
There are several limitations to our findings. First, we needed to recruit from multiple residency programs to identify faculty for this study, and the participants' motivation may differ from that of faculty who chose not to participate (volunteer bias). Second, it is uncertain how findings generalize to other specialties or rater training for other skills (for example, procedural skills). Third, there was variable participation in the spaced learning. Fourth, participants may have described more benefits to training to justify their time participating. Finally, although the impact of WBA is maximized when it is followed by feedback, this study did not address the impact of training on feedback.
Individuals participating in FORT describe how it not only enables deliberate practice to improve discrimination between skill levels but also reinforces the evidence-informed skill frameworks and the importance of direct observation.
The authors would like to thank the following individuals: Denise LaMarra, MS, CHSE, Janice Radway, BA, Marc Shalaby, MD, and the Perelman School of Medicine Standardized Patient Program, including the actors for assistance developing stimulus videos; Carol Chou, MD, Nicole Defenbaugh, PhD, Richard Frankel, PhD, Benjamin Kinnear, MD, MEd, Denise LaMarra, MS, CHSE, and Leigh Simmons, MD, for assistance developing the stimulus videos; Stephanie Taitano and the Perelman School of Medicine Faculty Affairs and Professional Development Staff for their assistance creating the spaced learning modules; Anthony R. Artino Jr, PhD, for his help with the end of study survey; and focus group leaders Elizabeth Bernabeo, MPH, Justin Bittner, Laura Hirshfield, PhD, and Judy Shea, PhD.
Funding: The authors report no external funding source for this study.
Conflict of interest: The authors declare they have no competing interests.
Editor's Note: The online version of this article contains an overview of creation of the training videos and expert answer keys, components of rater training faculty development, the focus group interview guide, and the survey used in the study.