Background The utility of traditional academic factors in predicting residency candidates’ performance is unclear. Many programs utilize holistic review processes that assess applicants on an expanded range of application and interview characteristics. Research is needed to determine which characteristics predict performance-related difficulty in residency.
Objective We aimed to identify factors associated with performance-related difficulty in a large academic internal medicine residency program.
Methods In 2022, we conducted a retrospective cohort study of Electronic Residency Application Service and interview data for residents matriculating between 2018 and 2020. The primary outcome was a composite of performance-related difficulty during residency (referral to the Clinical Competency Committee; any rotation evaluation score of 2 out of 5 or lower; and/or a confidential “comment of concern” to the program director). Logistic regression models were fit to assess associations between resident characteristics and the composite outcome.
Results Thirty-eight of 117 residents met the composite outcome. Members of the Gold Humanism Honor Society (odds ratio [OR] 0.24, 95% confidence interval [CI] 0.16-0.87) or Alpha Omega Alpha (OR 0.36, 95% CI 0.14-0.99) were less likely to have performance-related difficulty, as were residents with higher United States Medical Licensing Examination Step 2 Clinical Knowledge scores (OR 0.97, 95% CI 0.94-1.00). One-point increases in general faculty overall interview score, leadership competency score, and leadership overall score were associated with 41% to 63% lower odds of meeting the composite outcome. Interview or file review “flags” were associated with higher odds of the composite outcome (OR 2.82, 95% CI 1.37-5.80).
Conclusions Seven metrics were associated with the composite outcome of resident performance-related difficulty.
Introduction
Selection of candidates into residency programs remains a high-stakes process. As recently as 2021, programs participating in the National Resident Matching Program (NRMP) reported prioritizing traditional selection factors that reflect past academic success, including United States Medical Licensing Examination (USMLE) Step 1 and Step 2 Clinical Knowledge (CK) scores, class ranking, and clerkship scores.1 Past NRMP and Association of Program Directors in Internal Medicine surveys confirm that internal medicine programs utilize similar academic performance characteristics for interview invitation decisions.2,3
Recently, holistic candidate review, which focuses on applicants’ experiences and attributes in addition to academic performance, has been promoted and increasingly adopted across specialties.1,4 While the literature examining resident selection factors and resident performance is extensive, many questions remain. For example, a recent systematic review noted that while Step 2 CK scores predict future in-training and board examination performance, they do not effectively predict subjective performance, with only a weak correlation with overall resident performance, patient care, professionalism, and interpersonal and communication skills.5 Similarly, although membership in Alpha Omega Alpha (AOA) and the Gold Humanism Honor Society (GHHS) was endorsed by 36% and 29%, respectively, of 2021 NRMP program director survey respondents as a consideration in applicant ranking decisions (averaged across all specialties),1,6 studies reporting correlation with resident performance or performance-related difficulties in residency are inconclusive.7-10 Interviews, while crucial for residency selection,11-13 have demonstrated mixed results in prediction of performance, problems with professionalism, or attrition.14-21
KEY POINTS
In the era of increasingly holistic residency application review, additional data are needed on how all aspects of the residency application correlate with eventual residency performance.
In one internal medicine program, several components of the application review process and interview were associated with fewer performance-related difficulties during residency.
Residency programs may be interested in devising a similar system to collect data on their applicants’ outcomes to better understand who could be at risk for difficulties in their program.
Methods
Setting and Participants
This was a retrospective cohort study of all residents matching and matriculating into categorical residency positions at a large academic internal medicine residency program between 2018 and 2020; the study was conducted in 2022.
Applicant Metrics
During the interview process, most applicants interviewed with one “leadership” faculty member (eg, an associate program director, program director, vice chair of education) and one “general” faculty member. Leadership faculty are trained in behavioral-based interviewing (BBI), in which applicants are asked to describe past experiences to facilitate a conversation on traits desirable to the residency program. Each leadership faculty member asks one BBI question during each interview; these questions are the same for an individual faculty member during a single recruitment season and focus on the core areas of teamwork, empathy, and conflict resolution. Leadership faculty give 2 scores: one for the BBI response and one for overall impression. Due to scheduling variances, a few residents had 2 leadership interviews rather than one general faculty interview and one leadership interview.
General faculty interviewers receive a preparatory article on hiring for emotional intelligence.24 They use an unstructured interview format, asking about topics including, but not limited to, the applicant’s experiences, career interests, and interest in the program. Each general faculty member gives the applicant 4 scores in these competency areas: emotional intelligence, communication skills, career goals, and overall impression.
For both leadership and general faculty interviews, a standardized scoring tool provided a consistent scoring system for each interview conducted. Each competency area was scored from 1 (evidence that a skill or competency is not demonstrated, regardless of guidance provided) to 10 (very strong evidence that a skill or competency is present and used effectively). Career goals and overall interview scores also ranged from 1 to 10: 1 indicated that an applicant should not be ranked, 6 indicated an applicant who would be a good fit for the program, and 10 indicated that an applicant should be ranked highly and would be an exceptional resident and excellent fit for the program.
File reviews, including demographics and medical school performance data, were obtained from the Electronic Residency Application Service (ERAS).25 After the interview, associate program directors and core faculty, who previously completed dedicated training on file abstraction and were not involved in the applicant’s interview, abstracted data from each applicant’s ERAS file. Abstracted data included self-identified race/ethnicity, self-identified gender, internal medicine clerkship and internal medicine sub-internship grades, most recent USMLE Step 1 and Step 2 CK scores, and whether the applicant was elected to their school’s chapter of GHHS and/or AOA, if available at their school. GHHS and AOA election were measured as separate variables. Students who did not have GHHS or AOA available at their schools were categorized as GHHS or AOA “not available” for the purposes of this analysis. As clerkship grading scales vary by institution, a simple normalization of scales and aggregation of the normalized data was used to convert clerkship grades across institutions to a 4-point scale modeled on the most common grading scale of Fail/Pass/High Pass/Honors. Race/ethnicity was categorized as a dichotomous variable: underrepresented in medicine (UIM; Black, Latinx, American Indian or Alaskan Native, Native Hawaiian or Pacific Islander) or not UIM. The definition of UIM for this study is based on the AAMC’s 2004 updated definition of the term,26 along with the defined categories provided by ERAS for applicants to self-identify.
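To make the normalization concrete, the sketch below shows one way such a rescaling could be implemented in SAS (the software used for the study’s analysis). This is a minimal illustration only; the variable names (grade_rank, n_tiers, grade_4pt) and the linear rescaling rule are assumptions for the example, not the study’s actual procedure.

data grades_norm;
  set grades_raw;
  /* Rescale a school's ordinal grade (rank grade_rank out of n_tiers
     tiers) onto the common 4-point Fail/Pass/High Pass/Honors scale. */
  if n_tiers > 1 then grade_4pt = 1 + 3 * (grade_rank - 1) / (n_tiers - 1);
  else grade_4pt = .;  /* a single-tier (pass/fail-only) scale cannot be rescaled */
run;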
Free-text “flags” could be noted during the file review or interviews: leadership and general faculty interviewers could identify flags during the interview, and file reviewers could identify flags during the file review. While the criteria for a flag are not explicitly defined, typical flags reflect concerns such as course or clerkship failure, repeated course or clerkship, USMLE failure, legal issues, portfolio inconsistencies (eg, spelling errors, date inconsistencies, references to residency applications for other specialties), or concerns noted in the interview (eg, narrative-application discrepancy, disengaged conversationalist). For this study, each author independently reviewed free-text comments in the flags field of the file review and interview forms to determine whether communicated flags were present and reflected actual concerns. Disagreements or questions were resolved by author consensus. For the analysis, flags were categorized as present/absent, and an individual could have more than one flag identified in their application.
Outcomes Measured
The primary outcome of interest was a composite outcome of performance-related difficulty during residency, defined as meeting any of these criteria: (1) referral to the Clinical Competency Committee (CCC) for a specific competency concern; (2) any score on any rotation evaluation competency-based question of 2 or lower (on a 5-point Likert scale: 1-unacceptable/never, 2-marginal, 3-satisfactory, 4-very good, 5-excellent); or (3) a confidential “comment of concern” submitted to the program director on any evaluation. A composite outcome was used to achieve a sufficient number of performance-related difficulty events for analysis. The CCC is an Accreditation Council for Graduate Medical Education (ACGME)-mandated committee that reviews resident performance, including specific competency concerns. Rotation evaluation scores were available from 2018 to 2020 for all inpatient and outpatient clinical experiences and included milestone-based questions in all key areas of competency. Peer evaluations were not included as part of the evaluation score component.
Confidential comments, offered as an optional free-text entry on each resident evaluation, were reviewed by the program director and were labeled as positive, neutral, related to wellness, or a comment of concern. Of these, only comments of concern were included as part of the composite outcome. Examples of a comment of concern include additional details regarding deficiency concerns in any of the 6 ACGME core competencies, or behaviors that raised concern for a deficiency in performance that the evaluator might not have known how else to categorize.
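For illustration, the composite indicator could be derived from its 3 component criteria as in the SAS sketch below; the dataset and indicator variable names (ccc_referral, min_eval_item_score, n_concern_comments) are hypothetical.

data outcomes;
  set residents;
  /* 1 if any criterion is met: CCC referral, any rotation evaluation
     item scored 2 or lower, or any confidential comment of concern. */
  composite = (ccc_referral = 1) or (min_eval_item_score <= 2)
              or (n_concern_comments >= 1);
run;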
Analysis of the Outcomes
Fisher’s exact tests were used to compare categorical variables, and t tests and Wilcoxon rank sum tests were used to compare normally and nonnormally distributed continuous variables, respectively. Unadjusted logistic regression models were created for each file review and interview covariate. Receiver operating characteristic curve values were also reported for each unadjusted model. Covariates that achieved statistical significance were entered into adjusted models by category (file review and interview) to determine which file review and interview variables were associated with the outcome of interest; matriculation year was also included in all adjusted models. Correlation between covariates in multivariable models was assessed using the CORRB option in SAS (SAS Institute Inc). Strongly correlated covariates (correlation >0.4 or <-0.4) were combined or dropped from the model.
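A minimal sketch of this modeling approach in SAS, assuming an analysis dataset with the composite outcome and illustrative covariate names, might look as follows; the actual model specifications may have differed.

/* Unadjusted model, fit separately for each covariate; the reported
   c statistic equals the area under the ROC curve. */
proc logistic data=analysis;
  model composite(event='1') = step2ck;
run;

/* Adjusted model for the significant interview covariates plus
   matriculation year; the CORRB option prints the correlation matrix
   of the parameter estimates used to screen for strong correlation. */
proc logistic data=analysis;
  class matric_year / param=ref;
  model composite(event='1') = faculty_overall leader_competency
        leader_overall matric_year / corrb;
run;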
This study was approved by the Institutional Review Board of the Emory University School of Medicine and conducted in compliance with ERAS data policy.27
Results
Between 2018 and 2020, 117 residents matched into the residency program. Thirty-eight (32.5%) met the composite outcome of referral to the CCC, receiving a rotation evaluation score of 2 or lower, and/or having a confidential comment of concern submitted to the program director (Table 1).
Of the 38 residents who met the composite outcome, 3 (7.9%) were members of GHHS, compared to 19 of 79 residents (24.1%) who did not meet the composite outcome (P=.048; odds ratio [OR] 0.24; 95% confidence interval [CI] 0.16-0.87; Table 1). Residents who met the composite outcome had lower average Step 2 CK scores (247.6 vs 253.6 for residents who did not meet the outcome, P=.04; OR 0.97; 95% CI 0.94-1.00). Mean faculty overall impression score, leadership behavioral interview score, and leadership overall impression score were all significantly lower for residents who met the criteria for the composite outcome (faculty overall score 7.1 vs 7.8, P=.02, OR 0.69, 95% CI 0.47-1.00; leadership behavioral interview score 7.5 vs 8.1, P=.008, OR 0.52, 95% CI 0.32-0.87; leadership overall impression score 7.6 vs 8.2, P=.004, OR 0.47, 95% CI 0.27-0.81). Residents who met the composite outcome were also more likely to have had one or more flags identified during their interview or file review (18 of 38 [47.4%] vs 15 of 79 [19.0%], P=.003, OR 2.82, 95% CI 1.37-5.80; Table 1). The online supplementary data show a comparison of descriptive statistics between residents who met the composite outcome once (n=8) and residents who met it more than once (n=30).
Table 1 includes ORs for each potential covariate. A 1-point increase in the general faculty interview overall score, leadership competency score, and leadership overall score was associated with between 41% and 63% lower odds of being referred to the CCC, receiving a low evaluation, or having a comment of concern submitted to the program director. Identification of any flag on file review or interview was associated with nearly 3 times the odds of the composite outcome (OR 2.82, 95% CI 1.37-5.80).
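(As a reminder of the conversion used here, the percent change in odds implied by an OR is (OR - 1) × 100%: an OR of 2.82 corresponds to (2.82 - 1) × 100% = 182% higher odds, ie, nearly 3-fold odds, and an OR of 0.59 corresponds to 41% lower odds.)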
Table 2 shows the results of the adjusted models, including significant covariates obtained from file review (Step 2 CK score, GHHS election, AOA election); from interview scores (overall faculty interview score, leadership competency score, leadership overall score); and flags, as well as an indicator variable for matriculation year. The online supplementary data includes the correlation matrix for the covariates in the fully adjusted model; only leadership competency score and leadership overall score were strongly correlated (correlation=-0.453), so an average leadership score was created and included in adjusted models. When these 7 covariates were included in a fully adjusted model, both the general faculty overall interview score (adjusted OR [aOR] 0.58, 95% CI 0.35-0.99) and the average leadership score (aOR 0.44, 95% CI 0.20-1.00) remained significant. The results of the adjusted models with the leadership overall score and leadership competency score entered separately are displayed in the online supplementary data; in this model, only the general faculty overall score remained significantly associated with the composite outcome (aOR 0.59, 95% CI 0.35-0.99).
Discussion
We sought to understand associations between residency selection factors and performance-related difficulty during residency. Three residency selection factors from the application (higher Step 2 CK scores, election to AOA, and membership in GHHS) were more common among residents who did not experience performance-related difficulty. Higher interview scores, particularly those resulting from behavioral interviews conducted by experienced faculty, were also associated with lower odds of performance-related difficulty, and flags identified during the interview or on file review were associated with performance-related difficulty in residency.
Our study adds to previous studies that have demonstrated a correlation between USMLE Step 2 CK scores and resident performance.5 Previous literature examining the association between AOA membership and resident performance has been mixed,16,28,29 but our results support an association. To our knowledge, a correlation between GHHS membership and less performance-related difficulty has not been previously described. Likewise, our study contributes to the growing literature demonstrating that behavioral-based selection interviews correlate with resident performance; in our case, with performance-related difficulties during training.21,30
Finally, faculty identification of flags during ERAS application review or interview was associated with nearly 3 times the odds of experiencing performance-related difficulty in residency. This suggests that holistic review of applications by faculty may detect subtle areas of concern not otherwise reflected in large-scale review processes, thereby identifying residents who may experience performance-related difficulty, even when scores on standardized, numeric components of the application fall within a program’s typical range.
Flag notations by experienced faculty may signal a mismatch between candidates and programs. This may also be reflected in the observed association between general faculty “overall impression” interview scores and resident performance outcomes. Indeed, this study suggests that behavioral-based interviewing and holistic review may be valuable tools for identifying residents who may experience performance-related difficulty during training. The results of our study illustrate how investment in faculty development in these key areas supports more effective resident selection, especially when interviews and file reviews are employed together in a multifaceted approach.
The goal of identifying residents who are at higher risk for performance-related difficulty during training is not meant to exclude those residents from selection. Rather, in this era of holistic review, this process affords an opportunity to assess the alignment of a single candidate’s goals and likelihood of success with the mission, attributes, and resources of a specific program. Furthermore, learners are not static, and difficulties old and new may arise and resolve throughout medical school, residency, and beyond. Many residents who face performance-related difficulty successfully remediate, contribute positively to our programs and communities, and proceed to fulfilling careers.
There are limitations to our study. As this is a single-institution study, our results may not be generalizable. Some applicant characteristics, such as scores from previous USMLE attempts, may not be available in our data. Because the study cohort includes residents who matriculated between 2018 and 2020, not all residents had the same amount of time to achieve the outcome; however, we included matriculation year as a covariate in adjusted models. Faculty receive yearly training on interview and evaluation best practices, but the interview and evaluation scores have not been reviewed for interrater reliability. Additionally, we did not attempt to measure resident strengths or factors that might offset identified flags or mitigate negative evaluations. Further qualitative work should explore how programs that utilize holistic review balance candidate strengths and weaknesses during the application process.
Conclusions
GHHS membership, AOA membership, higher Step 2 CK scores, higher behavioral-based interview scores, and absence of flags identified through holistic review of residency candidate applications were associated with lower odds of performance-related difficulty in our program.
The authors would like to thank Mary Ann Kirkconnell Hall, MPH, for editing assistance.
References
Editor’s Note
The online version of this article contains further data from the study.
Author Notes
Funding: The authors report no external funding source for this study.
Conflict of interest: The authors declare they have no competing interests.