Competency-based medical education requires frequent assessment to tailor learning experiences to the needs of trainees. In 2012, we implemented the McMaster Modular Assessment Program, which captures shift-based assessments of resident global performance.
We described patterns (ie, trends and sources of variance) in aggregated workplace-based assessment data.
Emergency medicine residents and faculty members from 3 Canadian university-affiliated, urban, tertiary care teaching hospitals participated in this study. During each shift, supervising physicians rated residents' performance using a behaviorally anchored scale that hinged on endorsements for progression. We used a multilevel regression model to examine the relationship between global rating scores and time, adjusting for data clustering by resident and rater.
We analyzed data from 23 second-year residents between July 2012 and June 2015, which yielded 1498 unique ratings (65 ± 18.5 per resident) from 82 raters. The model estimated an average score of 5.7 ± 0.6 at baseline, with an increase of 0.005 ± 0.01 for each additional assessment. There was significant variation among residents' starting score (y-intercept) and trajectory (slope).
Our model suggests that residents begin at different points and progress at different rates. Meta-raters such as program directors and Clinical Competency Committee members should bear in mind that progression may take time and learning trajectories will be nuanced. Individuals involved in ratings should be aware of sources of noise in the system, including the raters themselves.
Clinical Competency Committees (CCCs) rely on work-based ratings of trainees to make decisions about competence and progress in the program.
Shift-based assessments of emergency medicine residents showed variation in their level of competence at the start of the second year and the rate at which they progressed.
The single-institution, single-specialty design limits generalizability.
Differences among trainees and “noise” in ratings have implications for program directors and CCCs.
Ensuring high-quality patient care in the face of increasing patient volumes1 and duty hour restrictions2,3 is increasingly challenging. These pressures raise concerns about safe clinical care as residents transition to unsupervised practice. The ultimate goal of assessment in medical education is to determine when graduate trainees are ready for unsupervised practice.4 Competency-based medical education is an outcomes-based approach to physician training.5,6 Assessment is used to determine when residents achieve expected abilities, mapped to a staged progression of responsibility (ie, junior to senior).6 Such programmatic assessment7 uses multiple representative “biopsies” linked to a master blueprint, with staged criterion-based standards such as milestones.8–13
To date, the model for graduate medical education has been time based, where time spent on service served as a surrogate for the attainment of competence.14 Locally, we have noted that learners tend to value individual pieces of feedback more than trends in global performance.15 While it may be the case that individual observation encounters fit within an assessment as learning framework16 and precipitate learning encounters between faculty teachers and trainees, this approach alone may not be sufficient for defensible advancement or remediation decisions.17 If decision makers (such as program directors or Clinical Competency Committees [CCCs]) are to make defensible decisions using available data, it is incumbent on the designers of the assessment system to identify patterns of advanced and remedial performance within large assessment data sets and to determine how those data should be combined.17 Understanding the nature of information acquired from longitudinal data sets is imperative for educators responsible for interpreting available trends and rendering decisions derived from programmatic assessment data systems.
This study describes the patterns arising from longitudinal aggregate assessments of performance toward global competence for intermediate-level residents (ie, postgraduate year 2 [PGY-2]).
The study environment consists of 3 publicly funded, university-affiliated teaching hospitals associated with 1 residency training program. Since 2012, this training program has used a workplace-based assessment system called the McMaster Modular Assessment Program (McMAP).18 Residents are asked to gather daily digital faculty assessments of their stage-specific global performance and specific sentinel clinical tasks relevant to the practice of emergency medicine. We have previously shown that the McMAP system has internal consistency19 and is superior to traditional end-of-rotation reports.18
During PGY-1, residents complete a rotating internship that includes a 2-block introductory rotation in emergency medicine as well as multiple off-service rotations, including general surgery, internal medicine, pediatrics, obstetrics and gynecology, orthopedics, and anesthesia. In PGY-2, residents complete ten 4-week blocks of emergency medicine, during which their performance is rated every shift using the McMAP system. This allows our program to examine the performance of our PGY-2 residents as they transition from highly heterogeneous off-service experiences into clinical rotations in emergency medicine.
In addition to a workplace-based assessment portfolio of specific emergency medicine task assessments, residents' daily global performance is rated using a global rating score (figure). The global rating score is completed by supervising physicians using a behaviorally anchored, competency-based scale (the CanMEDS 2015 framework).18,20–22 A multilevel regression model was developed to examine the relationship between the global rating score and time (ie, sequential shifts), adjusting for data clustering by resident and rater. This allowed us to partition variance between the resident and the rater, while also modeling variation among residents with respect to learning trajectory and beginning point. The dependent variable was the global rating score (1 to 7) of resident performance for each shift. The independent variable was time (ie, when the shift took place chronologically). Both the y-intercept (or beginning point) and time were included as random factors in the model. The mean score for each consecutive 4-week period (ie, a single block) was calculated for each resident. Analyses were performed using Stata/SE version 13.1 (StataCorp LLC, College Station, TX).
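The random-intercept, random-slope structure described above can be sketched as follows. This is a minimal illustration in Python using statsmodels rather than Stata, fit to simulated ratings; the resident count, baseline, and slope values mirror the figures reported in this article, while the rater clustering term is omitted for simplicity.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulate shift-level ratings: 23 residents, 65 shifts each.
# Per-resident baselines and trajectories are drawn around the
# values reported in the study (5.7 +/- 0.6; 0.005 +/- 0.01).
rows = []
for resident in range(23):
    intercept = rng.normal(5.7, 0.6)   # resident-specific baseline
    slope = rng.normal(0.005, 0.01)    # resident-specific trajectory
    for shift in range(65):
        score = intercept + slope * shift + rng.normal(0, 0.5)
        rows.append({"resident": resident, "shift": shift, "score": score})
df = pd.DataFrame(rows)

# Multilevel model: fixed effect of time, with a random intercept
# and random slope for each resident (rater clustering omitted here).
model = smf.mixedlm("score ~ shift", df, groups=df["resident"],
                    re_formula="~shift")
fit = model.fit()
print(fit.params[["Intercept", "shift"]])
```

The fixed-effect estimates correspond to the average starting score and average per-shift gain; the estimated variance components describe how much residents differ in those two quantities.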
The McMaster University/Hamilton Integrated Research Ethics Board granted this study an exemption.
The study included 82 individual raters (57 faculty members and 25 senior [PGY-4 and PGY-5] residents). Fourteen resident raters joined the faculty during the study period. The average number of years in practice postresidency was 6.4 ± 9.5.
From July 2012 through June 2015, data were collected on 23 (of a total of 23, 100%) PGY-2 residents from 3 resident classes. This yielded 1498 unique ratings (65 ± 18.5 per resident; 18.3 ± 15.7 per rater). Data on the number of shifts assessed and mean global rating score (overall, first 4-week block, last block) for each resident are presented in table 1.
Unadjusted Resident Performance Analytics
Not accounting for the effect of different raters, the mean global rating score at the beginning of the year was 5.3 ± 0.6. The average score increased for 19 of 23 residents between their first and last blocks (average mean increase of 0.32; table 1). However, only 12 of 23 residents (52%) achieved an average global rating score of more than 6.25 in the final block (the a priori criterion for progression to senior-resident status, based on pilot data). This criterion had been defined by the program director and the CCC, and the global rating data informed competency committee proceedings and judgments.
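As a small illustration, the block-level averaging and the 6.25 progression criterion could be computed along these lines. The residents and scores here are hypothetical; only the threshold and the 4-week block structure come from the text.

```python
import pandas as pd

# Hypothetical shift-level ratings for two residents, labeled by
# 4-week block (1 = first block, 10 = last block of PGY-2).
df = pd.DataFrame({
    "resident": ["A"] * 6 + ["B"] * 6,
    "block":    [1, 1, 1, 10, 10, 10] * 2,
    "score":    [5.0, 5.5, 5.2, 6.4, 6.3, 6.5,
                 5.8, 5.6, 5.9, 6.0, 6.1, 6.2],
})

# Mean global rating per resident per block.
block_means = df.groupby(["resident", "block"])["score"].mean().unstack()

# A priori criterion: a final-block mean above 6.25 supports
# progression to senior-resident status.
progressed = block_means[10] > 6.25
print(progressed)
```

In this sketch, resident A clears the threshold in the final block and resident B does not, which is the kind of summary a program director or CCC would review alongside the full trajectory.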
Adjusted Resident Performance Analytics: A Proposed Model
The model estimated an average global rating score of 5.7 ± 0.6 at the start of PGY-2 (ie, y-intercept). This score was estimated to increase 0.005 ± 0.01 with each additional assessment (ie, slope). There was significant variation among residents with respect to the intercept and slope, suggesting that residents differ significantly in ability at the start of their first block and progress toward competence at different rates. The model showed an interaction between resident intercept and slope; as the intercept increased, the slope decreased, suggesting a ceiling effect for those with a high global rating score at the start of the year.
The analyses suggest significant variation within and between individual residents and individual raters. The largest source of variance in the global rating score was between residents, as denoted by the intercept. Once time and rater effects were accounted for, within-resident variation remained substantial (table 2).
The determination of competence requires the aggregation of many observations from multiple observers to make a judgment (ie, create a meta-rating). In this exemplar study, we demonstrated certain patterns in aggregated data that may be important for those using multiple data points derived from assessment programs.
First, after a common, time-based year of training (the internship year), individuals begin at different observed levels of competence, and frequent, criterion-based assessments of authentic performance showed that trainees progress at different rates. Second, we described a learning trajectory that allows systems designers to anticipate the number of shifts an “average” resident requires to transition from intermediate to senior resident, thereby allowing educational administrators and designers to allocate resources and plan residents' rotations. Finally, we found confirmatory evidence that raters can introduce a fair degree of noise (ie, variance) into the system.
Previously, data used to assess performance during rotations were typically collected via retrospective surveys of single faculty members (ie, post hoc in-training assessments of performance over the entire rotation), without a systematic process to ensure direct observation of resident performance.23,24 Systems like McMAP overcome this by contemporaneously gathering prospective data,18 which may then be evaluated.
At the same time, large data sets introduce new problems. The program director and/or CCCs now must interpret data sets that contain hundreds of data points. Thus, a competency-based medical education decision maker is a meta-rater, combining data from multiple sources into a specific judgment about competence. In this discussion, we highlight key points that such meta-raters should consider when making global judgments.
Nuances of Individualized Baselines and Progression
The observed range of resident baseline global rating score (4.3 to 6) suggests that even after a full year of “common” training, our residents did not enter their second year of residency with the same level of competence. Traditional education models assume that all learners progress equally. End-of-year examinations and end-of-rotation assessments are presumed to identify trainees who are not advancing along a standard measure of progression. This leaves the educator with only two options: holding the resident back a year or advancing him or her in the hope that the resident can catch up. Differential learning trajectories for individual trainees suggest that there is no standard number of shifts at which individuals achieve the threshold score required for advancement. Such modeling may help educators anticipate the need for additional clinical exposures with relevant educational interventions before the end of PGY-2.
Over multiple years of training, modest differences can become substantial with respect to global competence. Gradual trajectories suggest educators have time to act. Analyzed carefully, learning trajectories may allow educators to intervene earlier in the learning process, initiating individualized learning plans with small changes and attention to neglected areas, rather than drastic remediation plans when significant gaps are identified late in residency. Residents who begin the year at a higher score and then trend downward may warrant closer observation and feedback, and they may be given added challenges to create desirable difficulty.25,26 Residents who excel may similarly be identified by these trends, permitting earlier progression toward unsupervised practice.
Noise in the System: Raters and Other Sources
Curiously, our observational data suggested that time is only a minor contributor to score variance. This may suggest that competency-based advancement is more appropriate than automatic time-based progression. The largest sources of variance were individual differences between and within residents and the effect of raters on the system.
The variance within a resident from shift to shift is to be expected, since context and performance will vary from day to day. Raters, however, present a particular challenge to decision makers. Our longitudinal, pragmatic data set demonstrates significant rater variance, consistent with experimental studies on rater cognition and variance.27–30 Despite our attempts to create a shared mental model via a behaviorally anchored scale, we saw evidence of interrater variability, which is consistent with previous literature.31 Furthermore, the range of the number of assessments gathered by each resident may also pose a problem.32
This study has limitations. It was based in a single program and specialty: local culture and context may limit the generalizability of our findings. Moreover, the interaction between intercept and slope suggests regression to the mean for some residents' ratings over the course of the training year. Our data set is not large enough to make robust conclusions about facets that contribute to the variance in our model. Our forms may suffer from problems shared with other CanMEDS-based assessment forms, including impressions of performance from one role spilling over to affect another.33 Our global scale was designed to combat this phenomenon by asking raters one integrated question rather than multiple questions, which has been associated with rater variance.34 As the amount of data increases, novel techniques for visualizing and analyzing data will need to be explored. Machine learning algorithms may further enhance data visualization for decision makers.35
Aggregated ratings can show the tailored progression of learner competence and document its achievement. In our study, emergency medicine PGY-2 residents did not enter their second year with the same assessed abilities, and their progression toward competence over the year varied. Some of these differences could be attributed to rater variability, which introduced noise into the system. Other nuances and trends in these data can inform rotation planning and anticipate needs for remediation or advancement.
Funding: Dr. Chan holds a McMaster University Department of Medicine Internal Career Research Award for her work on this project. Drs. Chan and Sherbino have also previously received funding from the Royal College of Physicians and Surgeons of Canada for various unrelated projects.
Conflict of interest: The authors declare they have no competing interests.