ABSTRACT
Background
The Family Medicine (FM) Milestones are a framework designed to assess the development of residents in key dimensions of physician competency. Residency programs use the milestones in semiannual reviews of resident performance from entry toward graduation.

Objective
To examine the functioning and reliability of the FM Milestones and to determine whether they measure the amount of a latent trait (eg, knowledge or ability) possessed by a resident or simply indicate where a resident falls along the training sequence.

Methods
This study applied the Rasch Partial Credit model to academic year 2014–2015 ratings for 10 563 residents from 476 residency programs (postgraduate year [PGY] 1 = 3639; PGY-2 = 3562; PGY-3 = 3351; PGY-4 = 11).

Results
Reliability was exceptionally high at 0.99. Mean scores were 3.2 (SD = 1.3) for PGY-1; 5.0 (SD = 1.3) for PGY-2; 6.7 (SD = 1.2) for PGY-3; and 7.4 (SD = 1.0) for PGY-4. Keyform analysis showed that a resident's rating on 1 item was likely to be similar to that resident's ratings on all other items.

Conclusions
Our findings suggest that the FM Milestones largely function as intended. The lack of spread in item difficulty and the lack of variation in category probabilities show that the FM Milestones do not measure the amount of a latent trait possessed by a resident, but rather describe where a resident falls along the training sequence. The high reliability indicates that residents are being rated in a stable manner as they progress through residency, and individual residents deviating from this rating structure warrant consideration by program leaders.
What was known and gap
The Family Medicine (FM) Milestones were designed to assess resident progression toward unsupervised practice.

What is new
A study of the properties of FM Milestone ratings for 10 563 residents from 476 programs.

Limitations
As residency programs become more experienced with producing milestone ratings over time, scoring patterns may change.

Bottom line
The lack of variability in the FM Milestones suggests that they describe resident progress through the educational program; deviations from this structure by individual residents warrant consideration by program leadership.
Introduction
In 2012, the Accreditation Council for Graduate Medical Education (ACGME) introduced the Next Accreditation System, of which a primary component is the educational milestones.1 The milestones are organized around the 6 ACGME core competencies and describe attributes that residents are expected to demonstrate as they progress through their program. The Family Medicine (FM) Milestones were implemented in July 2014 and were designed to provide a framework to assess the development of residents in key dimensions of physician competency. Each of the 22 items consists of 6 levels representing the progression from preresidency to master physician, with Level 4 marking the target for independent practice.
There has been little research conducted examining the functioning of the milestones. Swing et al2 examined the development process and content validity claims for the Emergency Medicine (EM) Milestones, and while they found sufficient evidence for content validity, they called on future research to utilize milestone data to provide further validity evidence. Beeson et al3 examined the validity and reliability of the EM Milestone ratings to determine the degree to which subcompetencies were interrelated. They found that the ratings discriminated between residency program years and concluded that the EM Milestones “demonstrated validity and reliability as an assessment instrument for competency acquisition.”3
Our study builds on this body of research by examining the functioning and reliability of the FM Milestones. It is important for both residency program directors and the ACGME to understand how the milestones function, and their capacity for providing useful information about resident and residency performance. In particular, this study seeks to determine whether the FM Milestones are measuring the amount of a latent trait (eg, knowledge or ability) possessed by a resident or simply indicating where along the training sequence a resident falls.
Methods
Participants
This study used end-of-year FM Milestone ratings for all residents in ACGME-accredited FM programs for the 2014–2015 academic year (10 563 residents from a total of 476 programs: postgraduate year [PGY] 1 = 3639; PGY-2 = 3562; PGY-3 = 3351; PGY-4 = 11) provided by the ACGME to the American Board of Family Medicine.
Instrumentation
The FM Milestones encompass 22 items across 6 domains: patient care (5 items), medical knowledge (2 items), systems-based practice (4 items), practice-based learning and improvement (3 items), professionalism (4 items), and interpersonal and communication skills (4 items).4 For each item, there are 10 rating scale categories: 6 full levels (0 to 5) and 4 half-point categories (a half-point indicates that a resident demonstrates all of the milestones in the lower level and some of the milestones in the higher level). The FM Milestones can be found online.4
This research was approved without restrictions by the American Academy of Family Physicians Institutional Review Board.
Analysis
We applied a Rasch measurement model5 to the FM Milestone data. Because each item has its own distinct set of category descriptors, the analysis was conducted using the Rasch Partial Credit model6,7 available in Winsteps Rasch Measurement Software version 3.81.0 (Winsteps, Beaverton, OR). We examined data model fit using information weighted (INFIT) and unweighted (OUTFIT) mean square values (MNSQ).
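For reference, the Partial Credit model expresses the probability that resident $n$ receives a rating in category $k$ of item $i$ in terms of the resident's ability $\theta_n$ and item-specific step difficulties $\delta_{ij}$ (this is the standard form of the model; the notation here is ours, not Winsteps output):

$$
P_{nik} = \frac{\exp\left[\sum_{j=0}^{k}\left(\theta_n - \delta_{ij}\right)\right]}{\sum_{m=0}^{M_i}\exp\left[\sum_{j=0}^{m}\left(\theta_n - \delta_{ij}\right)\right]}
$$

where $M_i$ is the highest category of item $i$ and the $j = 0$ term of each sum is defined to be 0. INFIT and OUTFIT assess how well the observed ratings conform to this model.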
These statistics provide an indication of the amount of useful information provided by an item. Although there are no concrete rules about the acceptable thresholds for INFIT MNSQ and OUTFIT MNSQ, values between 0.5 and 1.5 are generally considered acceptable for use.8,9 In addition, reliability and mean scores by resident program year are provided.10,11
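Although Winsteps reports these statistics directly, the underlying arithmetic is straightforward. A minimal sketch, assuming arrays of observed ratings, model-expected ratings, and model variances for a single item (variable names are ours, not Winsteps output):

```python
import numpy as np

def fit_statistics(observed, expected, variance):
    """Rasch INFIT and OUTFIT mean squares for one item.

    observed, expected, variance: 1-D arrays over persons holding the
    observed rating, the model-expected rating, and the model variance
    of the rating (hypothetical inputs; in practice these come from a
    fitted Rasch model).
    """
    sq_resid = (observed - expected) ** 2
    # OUTFIT: unweighted mean of squared standardized residuals;
    # sensitive to unexpected ratings from poorly targeted persons.
    outfit = np.mean(sq_resid / variance)
    # INFIT: information-weighted mean square; dominated by persons
    # whose ability is well targeted to the item.
    infit = np.sum(sq_resid) / np.sum(variance)
    return infit, outfit
```

Both statistics have an expected value of 1 when the data fit the model; the 0.5 to 1.5 range cited above is the usual screening band.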
Measurement implies a linear, hierarchical construct that is subdivided into equal-interval units along which some objects of measurement (eg, individuals) possess more of the construct and others less. Because Rasch models place item difficulty calibrations and person ability estimates on the same scale, we were able to examine this relationship visually using an item-person map. The distribution of people is shown on the left side and the distribution of items is shown on the right side. Ideally, these distributions will be sufficiently well targeted to each other to allow for adequate discrimination.10 Items located at a similar place along the continuum may be redundant, and large gaps may indicate a place where an additional item is needed.
Increasing amounts of the latent trait correspond to an increasing probability of a person receiving a rating in a higher category, such that, as a person advances along the ability spectrum, each category in turn must be the most probable.11 For this, category probability curves were created showing person ability relative to item difficulty on the x-axis and the probability of observing each ordered category plotted on the y-axis, such that each category has its own probability distribution. As the person's ability increases from left to right along the x-axis, each category should at some point be the most probable; that is, each category should have its own distinct peak and the entire chart should resemble a mountain range. Categories that are never the most probable, often referred to as “submerged” categories because they are located beneath other categories, contribute little to the rating scale. These category probability curves allow us to visually determine which rating scale categories are providing useful information.
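A minimal sketch of how such curves can be generated under the Partial Credit model, using illustrative step difficulties rather than estimates from the FM Milestone data; a category is submerged exactly when it is never the modal category anywhere along the ability grid:

```python
import numpy as np

def pcm_probabilities(theta, thresholds):
    """Category probabilities for one item under the Partial Credit model.

    theta: ability values in logits; thresholds: step difficulties
    delta_i1..delta_iM (illustrative values, not estimates from the FM
    Milestone data). Returns an array of shape (len(theta), M + 1).
    """
    theta = np.asarray(theta, dtype=float)[:, None]
    # Log-numerator of category k is sum_{j<=k}(theta - delta_j);
    # category 0 has an empty sum, hence the leading 0.0.
    steps = np.concatenate(([0.0], -np.cumsum(thresholds)))
    log_num = theta * np.arange(len(steps)) + steps
    log_num -= log_num.max(axis=1, keepdims=True)  # numerical stability
    num = np.exp(log_num)
    return num / num.sum(axis=1, keepdims=True)

theta_grid = np.linspace(-10, 10, 2001)
# Disordered step difficulties (0.8 before 0.2) make category 2 submerged:
probs = pcm_probabilities(theta_grid, thresholds=[-1.0, 0.8, 0.2, 1.5, 2.5])
submerged = sorted(set(range(probs.shape[1])) - set(np.argmax(probs, axis=1)))
print(submerged)  # -> [2]
```

Plotting each column of probs against theta_grid reproduces the mountain-range pattern described above.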
Finally, a Keyform was created to illustrate the relationship between the expected category responses for each item. Person ability estimates were placed on the x-axis, and items were placed along the y-axis with expected rating categories for each item plotted. Keyforms help us to visually predict someone's rating on an unobserved item based on their ratings on observed items or to identify anomalous response patterns.
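In model terms, each Keyform column is the expected rating, that is, the probability-weighted mean category; extending the pcm_probabilities sketch above:

```python
# Expected rating at each ability value for one item: category
# probabilities weighted by category number. A Keyform plots these
# expected ratings for every item along the shared ability axis.
expected_rating = probs @ np.arange(probs.shape[1])
```

When items differ in difficulty, these expected-rating curves are offset from one another, producing the step-like structure discussed below.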
Post Hoc Analysis
Because the 10-point rating scale produced a substantial number of submerged categories, a post hoc analysis was conducted in which half-point categories were collapsed down into the nearest full-level category. This collapsed 6-category rating scale reflected the original theoretical progression levels rather than the extended 10-category rating scale structure.
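The recoding itself is a simple floor operation; a minimal sketch:

```python
import math

def collapse_half_points(rating):
    """Collapse a half-point category into the nearest lower full level
    (1.5 -> 1, 2.5 -> 2, ...), leaving full levels unchanged, mirroring
    the post hoc recoding described above."""
    return math.floor(rating)

print([collapse_half_points(r) for r in [1, 1.5, 2, 2.5, 3]])  # [1, 1, 2, 2, 3]
```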
Results
Individual fit statistics for each item are shown in the table. There were no items for which INFIT or OUTFIT values were lower than 0.5 or higher than 1.5, indicating acceptable data model fit. The mean scores were 3.2 (SD = 1.3) for PGY-1, 5.0 (SD = 1.3) for PGY-2, 6.7 (SD = 1.2) for PGY-3, and 7.4 (SD = 1.0) for PGY-4. Reliability, an index of internal consistency similar to Cronbach's alpha or the KR-20, was 0.99.
Abbreviations: INFIT, information weighted; MNSQ, mean square; OUTFIT, unweighted.
Note: INFIT and OUTFIT MNSQ are chi-square statistics divided by their degrees of freedom and reported as ratios with an expected value of 1 and a range of 0 to infinity.
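For readers unfamiliar with the Rasch reliability index reported above: it is the proportion of observed variance in person measures that is not attributable to measurement error. A minimal sketch, assuming arrays of person ability estimates and their standard errors from a Rasch run (hypothetical inputs):

```python
import numpy as np

def person_reliability(ability, se):
    """Rasch person reliability: (observed variance - mean error
    variance) / observed variance, analogous to Cronbach's alpha or
    the KR-20."""
    observed_var = np.var(ability)
    error_var = np.mean(np.asarray(se) ** 2)
    return (observed_var - error_var) / observed_var
```

A value of 0.99 means the person measures are almost perfectly reproducible, a point taken up in the Discussion.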
The item-person map (figure 1) illustrates the construct of the FM Milestones. Person ability estimates ranged from –10 to 10 logits with a mean of 0.77 logits and a standard deviation of 2.73 logits. The item difficulty calibrations ranged from –1 to 1 logits with a mean of 0.00 logits (as imposed by the model) and a standard deviation of 0.41 logits. Figure 1 shows very little spread in item difficulty, meaning that all items are of similar difficulty.
The Keyform (figure 2) illustrates the relationship between the expected rating categories for each item. Keyforms typically have a step-like structure because the difficulty of the items usually varies to a noticeable extent; however, when items do not vary in difficulty, the categories look more like columns than steps. This Keyform is more column-like than step-like, indicating that the items and rating scales are functioning in a near identical manner.
Figure 3 provides an illustration of the category probability curves for examining the assumption that, at some point along the ability spectrum, each category will be the most probable. Only 3 items met this assumption. Of the remaining 19 items, 11 had 1 category, 6 had 2 categories, and 2 had 3 categories that were never the most probable. Of the 29 instances of these never-most-probable (submerged) categories, 28 (97%) were half-point categories.
Post Hoc Analysis
Because 19 of the 22 items had at least 1 submerged category and 97% (28 of 29) of the submerged categories were half-point categories, we conducted a post hoc analysis in which the half-point categories were collapsed such that a 1.5 became a 1, 2.5 became a 2, and so on. The resulting analysis provided a reliability of 0.98 and new category probability curves, as shown in figure 4. The collapsed categories produced category probability curves with no submerged categories.
Discussion
The FM Milestones were designed “to create a logical trajectory of professional development in essential elements of competency.”1 Our findings suggest that they function largely as the designers intended. The lack of spread in item difficulty (figure 1) and the near-deterministic usage of the rating scale by raters (figure 2) indicate that the FM Milestones are not measuring the amount of a latent trait (eg, knowledge or ability) possessed by a resident, but rather indicate where along the training sequence a resident falls. The extraordinarily high reliability suggests that the ratings capture little about residents other than their year in residency, and that residents whose ratings deviate from their year of residency warrant additional consideration by program leaders.
Rating Scale Functioning
Our findings suggest that the half-point categories provide little additional information and should be either eliminated or given richer descriptors in order for raters to effectively discriminate between categories. Although no items in the original analysis exhibited misfit to a degree that they should be removed, there were 3 items that caused some concern: SBP_Q1 (Provides cost-conscious medical care), PBLI_Q3 (Improves systems in which the physician provides care), and PROF_Q2 (Demonstrates professional conduct and accountability). When the categories were collapsed, the OUTFIT MNSQ for each of these items improved, indicating that the ratings became more in line with expected responses.
Structure of the FM Milestones
An item-person map (figure 1) places items and people with common residency progression estimates at the same point on the continuum. Typically, items are spread along the distribution of the people in order to articulate the range of the construct and to accurately measure the entire continuum. However, the 22 FM Milestone items were designed to represent different aspects of progression through residency into practice, such that the items are of similar difficulty and nearly all of the variation in difficulty is driven into the rating scale categories. This functioning of the FM Milestones can be seen in the Keyform (figure 2), which shows the relationship between the expected response categories for each item. By drawing a vertical line from the item-response category in question through the other categories, one can see that the rating a resident receives on any 1 item can be expected to be the same for all other items. For example, a resident who received a rating of 2 on PBLI_Q3 would be expected to receive a 2 on PBLI_Q1, a 2 on SBP_Q2, and so on down the list. This suggests that residents are not being rated on each item individually, but rather on a single global trait.
Dependencies
Even after reducing the number of rating scale categories, reliability remained exceptionally high at 0.98. This is likely due to internal dependencies built into the FM Milestones. For example, to achieve Level 4 on medical knowledge question 1 (MK_Q1), a resident needs to successfully complete the American Board of Family Medicine requirements for certification, and this certification is only open to PGY-3s; thus, a PGY-2 can never receive this score. Dependencies like these yield a high level of reproducibility (reliability) in the data because the answers to the questions are driven by a single deterministic process and are really collecting the same piece of information by asking the same question in cosmetically different ways. Since the FM Milestones were designed as a framework to inform and guide curriculum development,12 these dependencies are not a flaw in, but rather a feature of, their design. However, using the FM Milestone scores as a representation of knowledge or ability in any subsequent analysis would prove problematic, since the variation in scores seems to occur due to progression in residency rather than other characteristics of the resident or residency.
These dependencies, and the lack of stochasticity they cause, make any use of the FM Milestone scores as measurement in the strict sense problematic, but these scores can be useful for identifying residents who deviate from the expected progression. The FM Milestones have an average standard deviation of 1.3, so a PGY-1 would typically receive a rating of 2, 3, or 4 on most items. A PGY-2 would largely receive a 4, 5, or 6, and a PGY-3 would receive a 6, 7, or 8. In this sense, a PGY-3 who received a 4 on any item would probably be in need of remediation. Some have noted that program directors and members of the Clinical Competency Committees often have little direct observation of residents on which to base their ratings.13,14 The exceptionally high reliability may support the claim that residents are being rated solely on their year of residency.
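As a concrete illustration of this screening use, a minimal sketch with hypothetical rating data, using year-based bands derived from the means and standard deviations reported above (the exact cutoffs are our illustrative choice, not an official rule):

```python
# Typical rating bands by postgraduate year, roughly mean +/- 1 SD
# from the Results (3.2, 5.0, and 6.7 with SDs near 1.3).
TYPICAL_RANGE = {1: (2, 4), 2: (4, 6), 3: (6, 8)}

def flag_items(pgy, ratings):
    """Indices of items whose rating falls outside the typical band for
    the resident's year -- a hypothetical screening rule, not an ACGME
    or ABFM procedure."""
    lo, hi = TYPICAL_RANGE[pgy]
    return [i for i, r in enumerate(ratings) if not lo <= r <= hi]

# A PGY-3 with one rating of 4 is flagged for program-level review.
print(flag_items(3, [7, 6, 4, 8]))  # -> [2]
```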
Sklar15 commented that residents may be rated a little above or below their training year, but his statement had the subtext that they were largely rated by their year in residency, and our findings are largely congruent. In examining the EM Milestones, Beeson et al3 claim that their analysis “demonstrates a practice of rating residents across a broad range of the scale, independent of the year of training.” We interpret their results somewhat differently, that for nearly all residents, mean milestone scores are indeed equivalent to their training year. A visual inspection of the EM Milestones shows that the questions are written with dependencies similar to those in the FM Milestones, so we have little reason to believe the EM Milestones should function substantially differently.
Our study is subject to limitations. First, we used the first set of national FM Milestone data, and as residency programs become more experienced with producing ratings over time, scoring patterns may change. Second, each rating is determined by the institution's Clinical Competency Committee; a deeper understanding of the variation in ratings could be gained by analyzing the individual ratings that factor into the final committee score. However, these data are not available.
Conclusion
In a national study of all family medicine residents in ACGME-accredited programs, we found that the lack of spread in item difficulty and the lack of variation in category probabilities indicate that the FM Milestones function as a framework describing progression through residency; however, in their current form they are not suitable for measuring residents or programs because of the lack of independence in the ratings. If year of residency is indeed the primary factor in assigning ratings, then the utility of the FM Milestones seems to be that of an educational framework for identifying residents who need remediation.
References
Author notes
Conflict of interest: The authors declare they have no competing interests.
A portion of this study was presented at the Accreditation Council for Graduate Medical Education Annual Educational Conference, National Harbor, Maryland, February 28, 2016.