Background To date, the Accreditation Council for Graduate Medical Education (ACGME) Milestones have been used formatively for low-stakes purposes, and transitioning to higher-stakes applications in a truly competency-based medical education (CBME) system requires extensive validity evidence. Surgical specialties, with their unique demands for procedural skills and operative experience, represent a critical context for evaluating the validity of Milestones.
Objective To synthesize studies reporting validity evidence for the ACGME Milestones in surgical specialties.
Methods This systematic review was conducted based on Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. A systematic literature search was conducted across 8 databases and references to identify studies that reported validity evidence for Milestones in surgical specialties. Literature was reviewed for inclusion using Covidence and coded based on Messick’s framework. The quality of the studies was evaluated using the Medical Education Research Study Quality Instrument.
Results A total of 114 studies were included from 2013 to 2023. The primary source of validity evidence (n=45, 39.5%) was relations to other variables (knowledge and skills, learner characteristics, patient/health care, social-emotional variables), followed by response processes (n=38, 33.3%: interrater reliability, rating processes, structure of Clinical Competency Committee, rater training, longitudinal reliability, straightlining) and consequences (n=29, 25.4%: value and utility, intended use, anticipated impact). Only 12 studies (10.5%) reported internal structure evidence.
Conclusions This study provides insights into understanding what constitutes validity evidence within the context of ACGME Milestones in surgical specialties. This review highlights areas where further research is needed to support the moderate to high-stakes use of Milestones in a CBME system.
Introduction
Medical education has undergone a shift with the advent of competency-based medical education (CBME), an outcomes-based approach that centers on the evaluation of education programs and assessment of learners, all structured around a framework of competencies.1,2 In response to the CBME movement, the Accreditation Council for Graduate Medical Education (ACGME), in collaboration with surgeon experts, crafted specialized Milestones for assessing trainees in surgical specialties. This formative assessment not only provides information regarding trainees’ readiness for unsupervised practice but also serves as a tool for identifying at-risk trainees and offering timely feedback during their training journey.3
The intended use of Milestones needs to be supported by validity evidence for the interpretation and use of ratings. However, given the significant time and resources required to generate and report Milestones data, many in the surgical community have questioned their value.4-6 As such, a more comprehensive and robust collection of validity evidence would enhance understanding of what constitutes validity evidence for Milestones and aid in the interpretation of ratings.
Messick’s framework of construct validity, as adopted by the Standards for Educational and Psychological Testing, provides a comprehensive approach to evaluating multiple facets of validity evidence for assessment tools.7,8 This framework encompasses 5 aspects: content, response processes, internal structure, relations to other variables, and consequences.7,8 Content evidence examines whether the questions or tasks in an assessment truly reflect the underlying construct of interest. Evidence supporting content validity can be obtained through experts’ judgment or theoretical analysis of performance assessment items. Response process evidence involves understanding how raters think or engage cognitively when making an assessment rating. Internal structure validity examines the extent to which the individual items align with different dimensions of the underlying construct. Internal structure validity can be assessed using measures of reliability or factor analysis. Relations to other variables can be assessed by examining whether ratings correlate with other construct-relevant measures that have been measured independently. Consequential validity concerns evidence regarding the use of assessment ratings, and their resulting impact, encompassing both intended and unintended outcomes.
As performance decisions transition from formative and low-stakes to summative and high-stakes, the need for robust validity evidence becomes increasingly important to ensure that Milestones accurately reflect trainee competencies and support fair and effective assessments, ultimately enhancing the credibility of competency-based systems and decisions in surgical education. A systematic and comprehensive collection of validity evidence would enhance understanding of what constitutes validity evidence for Milestones in surgical specialties and help identify gaps where future research is needed. However, to date, we are not aware of any systematic reviews on validity evidence of ACGME Milestones in surgical specialties. The current study aims to conduct a systematic review of validity evidence for ACGME Milestones in surgical specialties, employing Messick’s framework of construct validity adopted by the Standards for Educational and Psychological Testing.7
Methods
The study was conducted based on Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) standards of quality for reporting systematic reviews.9
Study Eligibility
We included conceptual or empirical studies published in peer-reviewed journals, including full texts and abstracts. Excluded from consideration were reviews and studies from non–peer-reviewed journals, as well as other forms of literature such as dissertations/theses, conference presentations, and reports. To be eligible, studies had to report at least one aspect of validity evidence for the ACGME Milestones, as defined by Messick’s framework and the Standards for Educational and Psychological Testing.7,8 In terms of population, our focus was on trainees in surgical specialties or subspecialties, as classified by the ACGME Data Resource Book 2022-2023 (online supplementary data 1).10 Lastly, we restricted our analysis to studies published in English.
Search Terms and Strategies
To ensure a thorough search and comprehensive inclusion of pertinent literature, a combination of electronic and manual search methods was employed. The initial step involved searching multiple online databases, including PubMed, Web of Science, Scopus, Embase, and 4 databases indexed in EBSCO Information Services (APA PsycInfo, ERIC, Education Full Text, and Academic Search Premier) using the following search terms in the title and abstract: ACGME, Accreditation Council for Graduate Medical Education, Clinical Competency Committee, CCC, resident, residents, residency, medical education, milestones, and fellow* (* denotes search for multiple spellings or word endings). We used no beginning date cutoff, and the last date of the search was August 9, 2023. We supplemented the electronic database search with forward and backward reference searches to examine the references of key articles identified during the initial search. Additionally, we reviewed the most recent online Milestones Bibliography, published in December 2021, for potentially relevant studies.11
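For readers who wish to adapt the strategy, the sketch below shows one possible way to assemble a title/abstract query in PubMed syntax. The grouping of terms into concepts and the Boolean logic joining them are illustrative assumptions, not a reproduction of the strategy actually used across the 8 databases.

```python
# A minimal sketch of assembling a title/abstract search string in PubMed syntax.
# Concept groupings and the AND/OR logic are assumptions for illustration only.
concept_groups = {
    "program": [
        "ACGME", "Accreditation Council for Graduate Medical Education",
        "Clinical Competency Committee", "CCC",
    ],
    "learner": ["resident", "residents", "residency", "fellow*", "medical education"],
    "assessment": ["milestones"],
}

def or_block(terms):
    # Quote multi-word phrases and restrict each term to title/abstract fields.
    tagged = [
        f'"{t}"[Title/Abstract]' if " " in t else f"{t}[Title/Abstract]"
        for t in terms
    ]
    return "(" + " OR ".join(tagged) + ")"

query = " AND ".join(or_block(terms) for terms in concept_groups.values())
print(query)
```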
Screening Process
All identified studies were imported into Covidence for data screening, resulting in 2836 studies after duplicate checking. A 2-phase screening process was then applied to determine the eligibility of those studies, consisting of title and abstract screening, followed by a full-text review. A subset of studies (696 for the first phase and 49 for the second phase) was randomly selected and independently reviewed by 2 researchers. Any discrepancies were addressed through discussions among the team members. The final sample consisted of 114 studies. Figure 1 provides a visual representation of the search and screening workflow.9
PRISMA Flowchart for Literature Search and Screening
Data Coding
A coding protocol was devised in Microsoft Excel to facilitate the extraction of study characteristics and validity evidence from the primary studies (online supplementary data 2). The coding protocol included the following descriptors: (1) Identification of Studies, (2) Trainee Characteristics, (3) Method and Milestones Characteristics, and (4) Validity Evidence. The validity evidence section was designed to extract information related to the validity framework established generally7,8,12 and specifically for Milestones.13 This process engaged measurement and subject matter experts, operating in an iterative manner that incorporated multiple rounds of refinement and revisions. To ensure coding consistency, a subset of 15% of the included studies was randomly selected and independently coded by 2 researchers, with 83% agreement. Instances of discrepancies were resolved through discussions involving team members. Notably, it is possible for one study to fall into multiple validity evidence categories or elements. In instances where a study did not fit into any of the validity elements, new elements were created to adequately capture the extracted information.
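As an illustration of how coding consistency can be quantified, the following sketch computes simple percent agreement between 2 independent coders; the category labels and data shown are hypothetical and do not reproduce the study's coding sheet.

```python
from typing import List

def percent_agreement(coder_a: List[str], coder_b: List[str]) -> float:
    """Proportion of items on which two coders assigned the same code."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must rate the same set of items.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Hypothetical codes assigned by 2 researchers to a subset of studies
coder_1 = ["relations", "response", "content", "consequences", "relations", "internal"]
coder_2 = ["relations", "response", "content", "relations", "relations", "internal"]
print(f"Agreement: {percent_agreement(coder_1, coder_2):.0%}")  # Agreement: 83%
```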
Methodological Quality
Since the quality of a systematic review is determined by the methodological rigor of the primary studies,14 the quality of all included studies was evaluated. We employed the Medical Education Research Study Quality Instrument (MERSQI), which was designed to evaluate the rigor and credibility of quantitative research in medical education.15 Each study was evaluated across 6 domains: study design, sampling, type of data, validity of evaluation instrument, data analysis, and outcomes, with different scores assigned to each item within these domains. The MERSQI was applied only to quantitative studies for which the full text was available.
Data Synthesis and Analysis
Descriptive statistics, including frequencies and percentages, were reported for each coded element. The mean and standard deviation of each item in the MERSQI were calculated across all the included studies. In addition, the total MERSQI scores for all the included studies were computed and reported.
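The descriptive synthesis can be reproduced with standard tools; the sketch below illustrates computing item-level means and standard deviations and total MERSQI scores. The item names and values are entirely hypothetical and are not taken from the included studies.

```python
import statistics

# Hypothetical MERSQI item scores for 3 studies (item names and values are illustrative).
mersqi_scores = {
    "study_01": {"design": 1.0, "sampling": 1.5, "data_type": 3.0,
                 "validity": 1.0, "analysis": 2.0, "outcomes": 1.5},
    "study_02": {"design": 2.0, "sampling": 2.0, "data_type": 3.0,
                 "validity": 2.0, "analysis": 2.0, "outcomes": 1.5},
    "study_03": {"design": 1.0, "sampling": 1.0, "data_type": 3.0,
                 "validity": 0.0, "analysis": 1.0, "outcomes": 3.0},
}

# Mean and standard deviation for each MERSQI item across studies
for item in mersqi_scores["study_01"]:
    values = [scores[item] for scores in mersqi_scores.values()]
    print(f"{item}: mean={statistics.mean(values):.2f}, sd={statistics.stdev(values):.2f}")

# Total MERSQI score for each study
totals = [sum(scores.values()) for scores in mersqi_scores.values()]
print("Total MERSQI scores:", totals)
```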
Results
Characteristics of Included Studies
The final sample comprised 114 primary studies published from 2013 to 2023 (Table 1). A full list of references is included in online supplementary data 3. Most studies pertained to the specialty of general surgery (n=34, 29.8%), employed quantitative methods (n=76, 66.7%), and examined overall or all Core Competencies (n=94, 82.5%). More than one quarter of the included studies (n=32, 28.1%) were published in the Journal of Surgical Education, followed by the Journal of Graduate Medical Education (n=10, 8.8%) and Academic Medicine (n=5, 4.4%). The publication trend is depicted in online supplementary data 4. When aggregated by year, the total number of studies showed a gradual increase, except for the year 2020.
Validity Evidence
Validity evidence is summarized in Table 2. The primary source of evidence was relations to other variables (n=45, 39.5%), followed by response processes (n=38, 33.3%) and consequences (n=29, 25.4%). Only 10.5% (n=12) of the included studies reported validity evidence related to internal structure.
The frequency distribution of validity evidence stratified by research methods is displayed in Figure 2. Notably, conceptual studies contributed more to content validity. In the 45 studies reporting validity evidence concerning relations to other variables, quantitative methods were predominantly employed (n=42, 93.3%). Similarly, quantitative studies were relatively more likely to provide validity evidence related to response processes (28 of 38, 73.7%) and internal structure (9 of 12, 75.0%). Consequences evidence was primarily provided by conceptual or quantitative studies employing survey-based approaches with descriptive statistics.
Content Validity
Content validity evidence involves a detailed description of the relevance of assessment content to the underlying construct and comprehensive coverage of Milestone subcompetencies within a specialty. A total of 23 (20.2%) studies reported validity evidence in this category. Specifically, 14 (12.3%) studies focused on the development process of Milestones, 9 (7.9%) examined the representativeness of core competencies and/or subcompetencies, 7 (6.1%) described pilot testing and revision, and 4 (3.5%) provided definitions for Milestones and competencies.
Response Processes
Evidence of validity based on response processes is demonstrated when members of a Clinical Competency Committee (CCC) understand the construct in the way intended by the ACGME and consistently apply relevant criteria when assessing trainees. In this category, 18 (15.8%) studies explored interrater reliability, 15 (13.2%) examined the rating process, 11 (9.6%) focused on the development and structure of the CCC, and 7 (6.1%) examined rater training. Only one study (0.9%) investigated longitudinal reliability, and one (0.9%) examined the phenomenon of straightlining (Table 2). Studies did not consistently report high and significant interrater reliability. For example, while general surgery programs showed high interrater reliability between CCC ratings and self-assessment,16-18 significant correlations were not observed for a majority of subcompetencies in ophthalmology programs.19 Regarding longitudinal reliability, one study revealed that ratings of half of the subcompetencies for first-year residents from 118 urology programs exhibited significant downward trends over a 4-year period.20 This increasing stringency has been attributed to raters' growing comfort in rating struggling residents lower early in training.20
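The reviewed studies used a range of interrater reliability indices. As one illustration only, the sketch below computes a weighted Cohen's kappa between 2 hypothetical raters assigning Milestone levels to the same residents; the data and the choice of statistic are assumptions for demonstration, not the analysis of any included study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical Milestone levels (1-5) assigned to the same residents by two raters.
rater_a = [2, 3, 3, 4, 2, 5, 3, 4]
rater_b = [2, 3, 4, 4, 2, 4, 3, 4]

# Quadratic weights penalize larger disagreements more, which suits ordinal levels.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")
```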
Internal Structure
Internal structure validity examines the extent to which the relationships among subcompetencies align with the conceptual framework of trainees' competency. Evidence from this source was reported in only 12 studies. The most common element contributing to internal structure was the analysis of ACGME Milestone rating variability attributable to various sources (5 of 114, 4.4%). Only 2 studies (1.8%) provided evidence of internal consistency reliability, and 1 study (0.9%) examined the factorial structure of Milestones. Milestone ratings have differentiated among trainees of different postgraduate years (PGYs), among subcompetencies, and across meaningful levels of competency.21-23 Milestones have also shown satisfactory internal consistency reliability, with Cronbach's alpha for all Milestone items ranging from 0.93 to 0.99.24 A study using factor analysis revealed that ratings on Milestones in an obstetrics and gynecology program reflected a 6-competency framework, with an additional 3-factor structure for the Patient Care subcompetency.25
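To make the internal consistency statistic concrete, the following sketch computes Cronbach's alpha from a trainees-by-subcompetencies matrix of ratings using the standard alpha formula; the ratings shown are hypothetical and do not reproduce any specific study's analysis.

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a trainees-by-items matrix of Milestone ratings."""
    ratings = np.asarray(ratings, dtype=float)
    n_items = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1)       # variance of each subcompetency
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of trainees' total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical ratings: 5 trainees rated on 4 subcompetencies (0-9 scale)
ratings = np.array([
    [3, 4, 3, 4],
    [5, 5, 6, 5],
    [6, 7, 6, 7],
    [4, 4, 5, 4],
    [8, 8, 7, 8],
])
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
```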
Relations to Other Variables
Validity evidence based on relations to other variables can be established by examining the associations of Milestone ratings with other construct-relevant measures suggested by empirical evidence or a theoretical framework. Variables external to Milestone ratings include performance criteria and categorical variables such as learner characteristics, when underlying theories of a proposed use of Milestone ratings imply such a relationship. Among the studies examined, 45 reported validity evidence based on relations to other variables. These can be further categorized into relations to knowledge and skills (n=27, 23.7%), learner characteristics (n=12, 10.5%), patient/health care outcomes (n=4, 3.5%), and social-emotional variables (n=3, 2.6%). Overall, the significance and strength of the correlation between Milestone ratings and other construct-relevant variables varied considerably depending on specialty, measure, and subcompetency. For example, among vascular surgery trainees, all Milestone competencies were found to be associated with first-attempt passage of the vascular qualifying and certifying examinations, and with vascular in-training examination performance in PGY-4 and PGY-5.26 However, a study of general surgery residents revealed that a large number of Milestone competencies were not significantly correlated with United States Medical Licensing Examination (USMLE) Step 1, Step 2, and USMLE delta scores.27 Mean Milestone competency ratings of general surgery residents showed no significant correlation with early career patient outcomes.4
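As a minimal illustration of how relations to other variables are typically quantified, the sketch below computes a rank correlation between mean Milestone ratings and examination scores for the same residents; the data, variable names, and choice of a Spearman correlation are assumptions and do not reproduce any included study.

```python
from scipy.stats import spearmanr

# Hypothetical data: mean Milestone ratings and in-training examination scores
# for the same residents.
mean_milestone_ratings = [5.5, 6.0, 7.0, 4.5, 6.5, 8.0, 5.0, 7.5]
exam_scores            = [480, 510, 560, 450, 530, 600, 470, 580]

rho, p_value = spearmanr(mean_milestone_ratings, exam_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```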
Consequences
Evidence for validity with respect to consequences is gathered by evaluating how intended consequences support, or unintended consequences weaken, the interpretations or uses of assessments. The predominant element of validity evidence in this category came from studies (n=15, 13.2%) that inquired into program directors' and/or trainees' experiences with the implementation of Milestones to assess the tool's utility and helpfulness. A total of 13 studies (11.4%) analyzed the intended use of Milestones, including formative use (tracking of progress trajectories, early identification of underperforming trainees, and design of individualized learning plans), evaluative use (reviewing aggregated data for curriculum reform and educational quality improvement), self-assessment, and the use of a supplemental guide.
Methodological Quality
The methodological quality of the 67 quantitative studies for which the full text was available is detailed in online supplementary data 5. A significant portion of these studies (61 of 67, 91.0%) employed a single-group cross-sectional or retrospective design. Approximately half of the studies (n=37, 55.2%) used a single-institution design, and a small number (n=6, 9.0%) relied on descriptive statistics only. The primary focus of outcomes was knowledge and skills (n=55, 82.1%), followed by satisfaction, attitudes, and perceptions (n=10, 14.9%). The overall scores for the included studies ranged from 5 to 11.5, with a median of 8 and an interquartile range from 6.8 to 9.5; the maximum achievable score is 18.
Discussion
This systematic review synthesized studies that report validity evidence for ACGME Milestones in surgical specialties. Our findings revealed that most studies provided validity evidence based on relations to other variables, while very few examined the internal structure of the Milestones. CBME requires that surgical trainees attain a specified level of competence before they are entrusted to care for patients without supervision.28 The accurate and reliable reporting of ACGME Milestones becomes increasingly important to determine readiness for unsupervised practice or to make high-stakes decisions regarding promotion, graduation, or certification.1 Valid Milestone ratings give surgeon educators confidence that it is worthwhile to conduct robust CCC processes and to share the resulting formative feedback with surgical trainees. This systematic review identifies the existing validity evidence for the ACGME Milestones across surgical specialties. In addition, we have identified gaps in the validity evidence for surgical Milestones, namely internal structure, that should be addressed by future research.
We observed a notable pattern in the types of validity evidence reported across the reviewed studies: most studies provided evidence based on relations to other variables, whereas very few examined the internal structure of Milestones. This observation mirrors that of a 2014 systematic review of simulation-based assessment by Cook et al,12 which similarly noted a prevalence of validity evidence based on relationships with other variables. Overreliance on this aspect of validity may stem from the convenience of data and the simplicity of analysis methods. Furthermore, it is improper to correlate Milestone ratings with scores on an external measure and interpret the results without considering whether the two instruments measure the same underlying construct.
Although the aim of this review is to summarize validity evidence from primary studies rather than to evaluate the valid use and interpretation of Milestone ratings in each study, we did detect notable variability within each validity category. For example, regarding response process validity, substantial variability has been noted in CCC formation and structure, rater training, rating processes, and interrater reliability across programs within the same specialty,29,30 posing challenges to the comparability of assessment ratings collected from different programs. This underscores the need for a shared mental model for understanding Milestones and for guidance to standardize the implementation process.31 Similarly, for validity evidence regarding relations to other variables, the significance, strength, and even the direction of the relationship between Milestone ratings and performance on other measures varied considerably. The heterogeneity in the observed effect sizes may be attributable to inconsistencies in study design, context, sample, and statistical methods. A future meta-analysis is warranted to investigate which factors (eg, specialty, training level, competency, outcome) explain this heterogeneity.
The study has implications for primary researchers and research methodologists, serving as a valuable resource by highlighting areas where research has yet to be conducted. There is a notable gap in research on the internal structure of Milestones, which may reflect the limited understanding of this source of validity in the medical education community. Further investigation is needed to explore or confirm the factorial structure of Milestones in other specialties using factor analysis. Additionally, assessing various sources of measurement error using generalizability theory, or evaluating the measurement properties of each subcompetency using item response theory, is also needed, though prior research suggests caution when applying these approaches to Milestones data because qualitative findings show consistent differences between programs.30,32 Regarding response process validity, while most studies have focused on interrater reliability, there is a need to examine intrarater reliability and the cognitive processes of raters, perhaps by employing techniques such as think-aloud protocols. While there has been extensive investigation into the correlation of Milestone ratings with trainees' knowledge and skills assessed by concurrent measures, evidence for the predictive validity of Milestones for future patient/health care outcomes is limited.
There are several limitations to this study. First, the current review focuses on surgical specialties and may not capture nuances specific to hospital-based and medical specialties or the larger context of CBME, potentially limiting the scope of analysis and applicability of the results. Second, the search terms may not have captured all relevant studies, and there may be potential bias in the screening and coding process. Finally, while our review includes a range of surgical specialties, we acknowledge that these specialties have unique Milestones subcompetencies and behavioral language. This limitation highlights the need for specialty-specific studies to further validate the use of Milestones in distinct contexts.
Conclusions
Most studies reported validity based on relations to other variables for the ACGME Milestones in surgical specialties. This review highlights the need for further research on the internal structure of Milestones and for evidence supporting the predictive validity of Milestones in relation to future patient and health care outcomes, which is crucial for their use in moderate- to high-stakes decisions within a CBME system.
References
Editor’s Note
The online supplementary data contains a list of surgical specialties and subspecialties, the coding protocol used in the study, references for all studies included in the systematic review, the number of studies during the study period by year, and the methodological quality of included quantitative studies.
Author Notes
Funding: The authors report no external funding source for this study.
Conflict of interest: The authors declare they have no competing interests.