Abstract
A core objective of residency education is to facilitate learning, and programs need more curricula and assessment tools with demonstrated validity evidence.
We sought to demonstrate concurrent validity between performance on a widely shared, ambulatory curriculum (the Johns Hopkins Internal Medicine Curriculum), the Internal Medicine In-Training Examination (IM-ITE), and the American Board of Internal Medicine Certifying Examination (ABIM-CE).
We conducted a cohort study of 443 postgraduate year (PGY)-3 residents at 22 academic and community hospital internal medicine residency programs using the curriculum through the Johns Hopkins Internet Learning Center (ILC). Total and percentile rank scores on ILC didactic modules were compared with total and percentile rank scores on the IM-ITE and with total scores on the ABIM-CE.
The average score on didactic modules was 80.1%; the percentile rank was 53.8. The average IM-ITE score was 64.1% with a percentile rank of 54.8. The average score on the ABIM-CE was 464. Scores on the didactic modules, IM-ITE, and ABIM-CE correlated with each other (P < .05). Residents completing greater numbers of didactic modules, regardless of scores, had higher IM-ITE total and percentile rank scores (P < .05). Resident performance on modules covering back pain, hypertension, preoperative evaluation, and upper respiratory tract infection was associated with IM-ITE percentile rank.
Performance on a widely shared ambulatory curriculum is associated with performance on the IM-ITE and the ABIM-CE.
What was known and gap: Effective resident education and evaluation require validated curricula and assessment tools.
What is new: Assessment on a widely shared internal medicine ambulatory care curriculum was correlated with performance on the in-training examination and the American Board of Internal Medicine (ABIM) examination.
Limitations: Participation rate was low, raising the potential for selection bias.
Bottom line: Performance on the widely shared ambulatory curriculum was associated with performance on the in-training examination and the ABIM examination.
Introduction
Evaluating residents is best done with methods that are both feasible and psychometrically robust.1 The Accreditation Council for Graduate Medical Education (ACGME) has provided a toolbox of methods for program directors to incorporate into their evaluation processes.2 Advocates of the toolbox call for the development of additional measures of resident assessment.3,4 Ideally, these tools should be supported by validity evidence for their assessment results.
One resident assessment tool with validity evidence is the Internal Medicine In-Training Examination (IM-ITE), which assesses knowledge among internal medicine residents.5–8 Validity evidence for the IM-ITE includes high internal consistency, improvement in scores among more advanced trainees, and association with performance on the American Board of Internal Medicine Certifying Examination (ABIM-CE). As a result, the IM-ITE is a valuable assessment tool for program directors and is used annually in most programs.5–8
With the advent of competency-based education and milestone interval evaluations, residency training programs need more assessment tools with validity evidence. One type of validity evidence is the correlation of assessment results with those of other instruments that have good validity evidence (ie, concurrent validity).9 The Johns Hopkins Internal Medicine Curriculum is a widely used curriculum on topics in ambulatory care, distributed online via the Johns Hopkins Internet Learning Center (ILC).10,11 Educational outcomes on the ILC undergo extensive reliability testing, and validity evidence is gathered on an ongoing basis. Trainees are provided with real-time feedback on their performance, including a ranking relative to others at the same level of training.12,13 We hypothesized that individual performance on the ILC would be an indicator of individual performance on 2 key benchmarks of medical knowledge, the IM-ITE and the ABIM-CE, and we compared knowledge outcomes data to test for correlations.
Methods
We performed a cohort study of postgraduate year (PGY)-3 residents at internal medicine residency training programs that subscribed to the ILC Internal Medicine Curriculum in the 2009–2010 academic year. The ILC Internal Medicine Curriculum consists of 41 modules on topics in ambulatory care, including chronic disease management (eg, diabetes, hypertension, depression); acute symptom management (eg, headache, back pain); and preventive care (eg, cancer screening, immunizations).13 Modules consist of a pretest, a didactic section, and a posttest, and are disseminated online. On the ILC pretests and posttests, item discrimination is calculated for each test item, and Cronbach α is calculated for each test using the method of Ferguson and Takane.14 For the purpose of this study, we analyzed posttest performance. For each PGY-3 posttest score, we calculated a 2-digit percentile rank relative to other PGY-3 residents who had completed the same module: we computed the mean and standard deviation of scores on the module, determined the standardized score for the resident, and converted that standardized score to a 2-digit percentile rank. An overall 2-digit percentile rank was then calculated for each resident by comparing the resident's mean percentile rank score with the mean and standard deviation of all residents' percentile rank scores.
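Stated compactly, and assuming (as the standardization above implies) that standardized scores are converted to ranks via the normal distribution, the module-level rank for resident i on module m is:

```latex
z_{im} = \frac{x_{im} - \bar{x}_m}{s_m}, \qquad PR_{im} = 100 \, \Phi(z_{im})
```

where x_im is the resident's posttest score, x̄_m and s_m are the mean and standard deviation of PGY-3 scores on that module, and Φ is the standard normal cumulative distribution function (the notation is ours, not the ILC's). The overall rank applies the same standardization a second time, with x replaced by the resident's mean of the module-level PR values and the reference distribution replaced by all residents' percentile rank scores.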
In the 2009–2010 academic year, 109 internal medicine residency training programs used the ILC curriculum, and we made 3 attempts, by e-mail or phone, to invite each program to participate in the study. Of those 109 programs, 38 responded, with 22 (20.2%) agreeing to participate and 16 (14.7%) declining; programs that did not respond were categorized as declining to participate. Programs that agreed to participate were sent a score sheet for each PGY-3 resident in their program, containing his or her ILC module scores and percentile rank scores (or no scores for residents who had not completed any modules). Programs were asked to enter each resident's IM-ITE total score and percentile rank and ABIM-CE total score and total percentile rank, and then to remove the resident's name from the score sheet to preserve anonymity. Several programs expressed confusion about which score on the ABIM-CE report represented the total percentile rank; as a result, the ABIM-CE total percentile rank was excluded from the study. The IM-ITE results are always available to training programs, whereas residents may decline to share their ABIM-CE scores with their training program. As a result, we received fewer ABIM-CE scores than IM-ITE scores.
The study was approved by the Johns Hopkins Medicine Institutional Review Board.
Statistical Analysis
Distributions of resident performance on the IM-ITE, the ABIM-CE, and the ILC modules were summarized as means and standard deviations. Associations between IM-ITE, ABIM-CE, and ILC module performance were examined by pairwise Pearson correlation. We used a linear regression model to determine whether there was an association between the number of modules completed and performance on the IM-ITE and ABIM-CE; β coefficients were calculated to estimate the change in performance per additional module completed. The number of ILC modules completed was also grouped into thirds for comparison. The Student t test was used to compare mean IM-ITE total and rank scores, as well as ABIM-CE total scores, in residents completing 2 to 7 modules and 8 or more modules against residents completing 0 or 1 module. In addition, we used a nonparametric test for trend across the ordered groups (0–1, 2–7, and 8 or more modules); this test is an extension of the Wilcoxon rank-sum test. Finally, we investigated whether a resident's performance on a specific topic correlated with IM-ITE rank performance. For this analysis, pairwise Pearson correlations were calculated to assess the strength of associations for modules completed by at least 100 residents. We also performed linear regression with IM-ITE rank as the dependent variable and module performance as the independent variable. Effect size was calculated using the regression R2 as the proportion of variability shared between the 2 variables.15 All tests of significance were 2-tailed, with an α level of .05. Analyses were performed using Stata/SE version 12.0 (StataCorp LP).
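As a minimal, non-authoritative sketch of this analytic pipeline, the main steps could be reproduced in Python with pandas and SciPy (the study itself used Stata/SE 12.0; the data frame, column names, and values below are hypothetical stand-ins for the score-sheet data, and Kendall's τ is substituted for Stata's nptrend, which has no direct SciPy equivalent):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
n = 443  # cohort size reported in this study

# Hypothetical per-resident records; synthetic values stand in for the
# score-sheet data described above.
df = pd.DataFrame({
    "ilc_score": rng.normal(80, 8, n),       # average ILC posttest score (%)
    "imite_total": rng.normal(64, 7, n),     # IM-ITE total score (%)
    "modules_done": rng.integers(0, 20, n),  # number of ILC modules completed
})

# Pairwise Pearson correlation (2-tailed), as reported in table 2.
r, p = stats.pearsonr(df["ilc_score"], df["imite_total"])

# Linear regression: the slope (beta) estimates the change in IM-ITE
# performance per additional module completed.
fit = stats.linregress(df["modules_done"], df["imite_total"])

# Group module counts into thirds and compare each upper group against the
# 0-1 group with Student t tests.
groups = pd.cut(df["modules_done"], bins=[-1, 1, 7, np.inf],
                labels=["0-1", "2-7", "8+"])
low = df.loc[groups == "0-1", "imite_total"]
for g in ["2-7", "8+"]:
    t, p_t = stats.ttest_ind(df.loc[groups == g, "imite_total"], low)

# Rank-based test for trend across the ordered groups; Kendall's tau between
# the group index and the outcome stands in for Stata's nptrend command here.
tau, p_trend = stats.kendalltau(groups.cat.codes, df["imite_total"])

# Effect size as shared variability: R^2 from the simple regression.
print(f"r = {r:.2f} (P = {p:.3f}); beta = {fit.slope:.2f}; "
      f"trend P = {p_trend:.3f}; R^2 = {fit.rvalue ** 2:.2f}")
```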
Results
Respondent Characteristics
Reports were received from 22 internal medicine residency training programs, including 16 community hospitals (72.7%) and 6 academic medical centers (27.3%). There were 506 PGY-3 residents at the 22 programs, and we received score reports for 443 (87.5%) of them. We received IM-ITE total scores for 305 residents (68.8%), IM-ITE rank scores for 323 (72.9%), and ABIM-CE scores for 182 (41.1%). Of the 443 residents, 313 (70.7%) completed at least 1 module, and 130 (29.3%) completed none. The mean number of modules completed was 7.5. Mean and rank scores are shown in table 1.
Respondents Versus Nonrespondents
We compared module performance between the 22 participating and the 87 nonparticipating programs. The average score across all modules was 80.1% at participating programs and 81.3% at nonparticipating programs, a difference that was not statistically significant (P = .11). The average resident rank score also did not differ between participating and nonparticipating programs (54.0 versus 54.4, P = .82). At participating programs, rank score did not differ between residents for whom we had IM-ITE scores and those for whom we did not (53.7 versus 55.2, P = .61), nor between residents for whom we had ABIM-CE scores and those for whom we did not (52.3 versus 55.2, P = .27).
Associations
Associations among IM-ITE, ABIM-CE, and ILC module performance are shown in table 2. In a post hoc analysis using linear regression with ABIM-CE total score as the dependent variable and IM-ITE rank score as the independent variable, R2 was 23%. Adding the ILC module total score to that model increased the R2 to 26%, but the improvement was not statistically significant (P = .67).
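A hedged sketch of this kind of nested-model comparison, under the same caveats as the sketch above (statsmodels in place of Stata; names and data are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(1)
n = 182  # residents with ABIM-CE scores in this study
df = pd.DataFrame({"imite_rank": rng.normal(55, 25, n).clip(1, 99),
                   "ilc_total": rng.normal(80, 8, n)})
df["abim_total"] = 300 + 2.5 * df["imite_rank"] + rng.normal(0, 60, n)

# Base model: ABIM-CE total on IM-ITE rank; the full model adds ILC total score.
base = sm.OLS(df["abim_total"], sm.add_constant(df[["imite_rank"]])).fit()
full = sm.OLS(df["abim_total"],
              sm.add_constant(df[["imite_rank", "ilc_total"]])).fit()

# The P value on the added coefficient tests whether ILC score improves the model.
print(f"R^2 base = {base.rsquared:.2f}; R^2 full = {full.rsquared:.2f}; "
      f"P(ilc_total) = {full.pvalues['ilc_total']:.2f}")
```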
Table 2. Pairwise Correlations Among the Internal Medicine In-Training Examination (IM-ITE), the American Board of Internal Medicine Certifying Examination (ABIM-CE), and the Johns Hopkins Internet Learning Center (ILC) Module Performance
We next looked for associations between the number of modules completed (regardless of performance on those modules) and IM-ITE total and percentile rank scores. When we categorized the number of modules completed into tertiles (0 to 1, 2 to 7, and 8 or more modules completed), residents completing at least 8 modules had higher IM-ITE total scores (P < .01) and percentile rank scores (P = .03) than those completing 0 to 1 module. The IM-ITE total and percentile rank scores improved with greater numbers of modules completed (table 3), and mean IM-ITE percentile rank scores increased in residents who completed more modules relative to those who completed fewer (P = .03 for trend). Although ABIM-CE scores also improved with additional modules completed, these differences were not significant (P = .07).
Table 3. Associations Among the Johns Hopkins Internet Learning Center (ILC) Modules Completed, the Internal Medicine In-Training Examination (IM-ITE) Performance, and the American Board of Internal Medicine Certifying Examination (ABIM-CE) Performance
Finally, among the 18 modules completed by at least 100 of the 323 residents with IM-ITE rank scores, performance on back pain, hypertension, preoperative evaluation, and upper respiratory tract infection modules was statistically associated with IM-ITE rank scores (P < .05; table 4).
Discussion
We showed that evaluative data generated by an interactive ambulatory curriculum have concurrent validity with IM-ITE and ABIM-CE performance. For an individual learner, ILC module performance correlated with IM-ITE performance when at least 8 modules had been completed. Higher numbers of completed modules were associated with better performance on the IM-ITE. The Johns Hopkins Ambulatory Curriculum thus offers evaluative information that may predict performance on the IM-ITE and ABIM-CE.
It is not entirely clear why performance on an ambulatory curriculum correlates with performance on tests that broadly cover internal medicine. The IM-ITE test blueprint assigns questions to general internal medicine,8 whereas the ABIM-CE test blueprint does not assign a specific portion of content to general internal medicine.16 ILC performance may be an indicator of something other than specific knowledge of ambulatory care topics. Residents who perform well on the ILC may have self-directed study habits that cover all topics in internal medicine, and research has shown that self-directed reading is associated with better IM-ITE performance.17 Our findings are consistent with this: regardless of module performance, residents who completed greater numbers of modules performed better on the IM-ITE. ILC module completion likely serves as a marker of resident study habits.
We also found that, for some ILC module topics (ie, back pain, hypertension, preoperative assessment, upper respiratory tract infections), individual performance was associated with IM-ITE performance. The reasons are unclear. Preoperative assessment is a general topic that requires comprehension of cardiovascular risk and other comorbidities and might therefore reflect broad comprehension of internal medicine; however, the same could not be said for knowledge of back pain or upper respiratory tract infections.
If a major thrust of evaluating residents is to determine who is competent to care for patients, assessment tools must have validity evidence.1 We developed an ambulatory curriculum, provided validity evidence that can enhance a program director's ability to evaluate residents, and demonstrated the feasibility of establishing validity evidence for an evaluation instrument by testing correlations between its results and those of the IM-ITE and ABIM-CE.
Our study has several limitations. Most programs using the curriculum declined to participate, introducing possible selection bias; however, module performance did not differ between participating and nonparticipating programs. We did not have access to IM-ITE and ABIM-CE results from nonresponding programs to compare performance on those metrics. We examined only PGY-3 learners, and it is possible that ILC performance among PGY-1 and PGY-2 learners is not associated with IM-ITE or ABIM-CE performance. We did not assess pretest performance and thus could not assess the effect of the curriculum itself on IM-ITE or ABIM-CE performance. Finally, we did not assess the association of ILC module performance with clinical outcomes, which would provide powerful validity evidence for the ILC as an assessment tool.
Conclusion
Our study showed that performance on a widely shared ambulatory curriculum for internal medicine residents was associated with performance on the IM-ITE and the ABIM-CE.
Author notes
All authors are at Johns Hopkins University School of Medicine. Stephen D. Sisson, MD, is Associate Professor, Division of General Internal Medicine, Department of Medicine; Amanda Bertram, MS, is Senior Research Program Coordinator, Division of General Internal Medicine, Department of Medicine; and Hsin-Chieh Yeh, PhD, is Associate Professor, Departments of Medicine and Epidemiology.
Funding: The authors report no external funding source for this study.
Conflict of interest: Dr Sisson receives an annual stipend for editorial duties related to the Johns Hopkins Internet Learning Center.