In-training examinations (ITEs) are intended for low-stakes, formative assessment of residents' knowledge, but are increasingly used for high-stakes purposes, such as to predict board examination failures.
The aim of this review was to investigate the relationship between performance on ITEs and board examination performance across medical specialties.
A search of the literature for studies assessing the strength of the relationship between ITE and board examination performance from January 2000 to March 2019 was completed. Results were categorized based on the type of statistical analysis used to determine the relationship between ITE performance and board examination performance.
Of 1407 articles initially identified, 89 articles underwent full-text review, and 32 articles were included in this review. There was a moderate to strong relationship between ITE and board examination performance, and ITE scores significantly predicted board examination scores in the majority of studies. Performing well on an ITE predicts a passing outcome for the board examination, but there is less evidence that performing poorly on an ITE will result in failing the associated specialty board examination.
There is a moderate to strong correlation between ITE performance and subsequent performance on board examinations. That the predictive value for passing the board examination is stronger than the predictive value for failing calls into question the “common wisdom” that ITE scores can be used to identify “at risk” residents. The graduate medical education community should continue to exercise caution and restraint in using ITE scores for moderate to high-stakes decisions.
In-training examinations (ITEs) have been used as an objective measure of residents' and fellows' medical knowledge since the 1970s. ITE scores and reports provide program directors with information on the strengths and weaknesses of their trainees' medical knowledge in various content areas, which can be used in a low-stakes, formative fashion to support development of individualized learning plans. ITE scores may also be utilized by program directors at the program level, with areas of poor performance across trainees suggesting potential gaps in program curricula and identifying areas on which to focus for continuous program improvement. Ultimately, graduate medical education (GME) programs are responsible for ensuring their trainees are equipped to succeed in passing the qualifying examination (QE) and/or certifying examination (CE), administered by their respective specialty board, at the conclusion of their training. It is unclear, however, if ITEs are predictive of trainees' success in the board certification process.
Validity evidence for the interpretation of scores from assessment tools can be organized into 5 categories, based on Messick's unified framework, including content, response process, relationship to other variables, internal structure, and consequences.1 The category most relevant to gather evidence for ITE scores is relationship to other variables. If the ITE and respective specialty board examinations had similar test content, ITE scores would share a strong relationship with board examination scores. The predictive ability of ITEs has been an area of interest since the early 1990s, and the number of investigations of this topic has continued to increase in recent years. Furthermore, some specialties and programs have begun to expand the use of ITEs beyond the original low-stakes formative intent to more high-stakes decisions, including formal academic actions, such as formal remediation, probation, non-advancement, and non-retention within the training program, which has significant implications for the consequences of ITE scores.2–4
Given that ITEs could be utilized in a manner that impacts a trainee's future in terms of promotion and program completion, ensuring that there is validity evidence for the relationship between ITE scores and board examination scores is of the utmost importance. To date, there has been neither a review synthesizing the literature on the use of ITEs across medical specialties nor a synthesis of correlation and prediction results between ITE scores and board examination scores. Thus, the purpose of this study was to complete a systematic review of the literature on relationships to other variables' evidence for interpretation of GME ITE scores, with the other variable being performance on board examinations. A secondary aim of the study was to identify current use of ITEs across specialties.
Selection of Studies
We conducted a systematic review of the research on the association between ITEs and board examinations published from January 2000 to March 2019 using the following databases: PubMed, Embase, Cochrane Library, and Scopus. Major medical subject heading terms used for the systematic review included: in-training examination, in-service examination, medical education, and certification. Two authors (B.K.S. and H.C.M.) independently reviewed titles, abstracts, and full-text articles to determine if they met inclusion criteria. This process was completed with the assistance of systematic review software (Covidence, 2019). Phase 1 included screening of titles and abstracts for relevance. Phase 2 included evaluation of the full text. The search methods are reported using relevant items of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist (Figure).
Studies were included if: (1) they reported quantitative analysis of an association between performance on the ITE and performance on the respective specialty board examinations; (2) the study population included US GME trainees (residents or fellows); (3) manuscripts were available in the English language; (4) the full-text article was able to be obtained; and (5) articles were published after the year 2000. The criterion to include only studies published after 2000 was established based on our assessment of the availability of literature, which increased substantially after the year 2000.
Title/Abstract and Full-Text Review
Two authors (B.K.S. and H.C.M.) independently reviewed the titles and abstracts of all 1407 articles captured by the search, removing duplicates and articles obviously not meeting predetermined eligibility criteria. All 4 authors participated in the full-text review. Discrepant opinions at either stage were discussed until consensus was reached. A full-text review of 89 articles determined eligibility for inclusion in the final review, with a total of 32 articles ultimately included (Figure).
Relationship to Other Variables' Evidence
In the Messick validity evidence framework, relationship to other variables evidence refers to gathering information to show that assessment scores relate to scores from similar assessments. Such evidence generally takes 3 forms: a correlation coefficient, a regression equation, or an area under the receiver operating characteristic (ROC) curve (AUC). For continuous scores (eg, 0%–100%), relationships are measured with a correlation coefficient, where a strong positive correlation value is a metric for validity evidence. For educational purposes, correlation values > 0.50 are considered strong, 0.30–0.49 moderate, and < 0.30 low.5 A significant regression equation is another potential metric for validity evidence where either continuous scores or dichotomous outcomes (eg, pass/fail) are used to predict future performance on another variable measured on a continuous scale (linear regression) or as dichotomous outcomes (logistic regression). Finally, an AUC with good accuracy/predictive value is a third potential metric for validity evidence where a particular score (eg, cut score) or outcome is used to discriminate between true positives and false positives of future performance.
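As a rough illustration of the first and third metrics described above, the following minimal Python sketch computes a Pearson correlation, labels it with the cutoffs quoted in the text, and computes an AUC as the probability that a trainee who passed outscored a trainee who failed. The function names and all data values are hypothetical and were invented for illustration; they do not come from any included study, and the handling of boundary values (such as exactly 0.50) is an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of continuous scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def correlation_strength(r):
    """Label r with the cutoffs quoted in the text.

    Assumption: exactly 0.50 counts as strong and values between 0.49 and
    0.50 count as moderate; the text leaves these boundaries open.
    Negative values fall into "low" because only a positive relationship
    counts as validity evidence here.
    """
    if r >= 0.50:
        return "strong"
    if r >= 0.30:
        return "moderate"
    return "low"


def auc(scores_of_passers, scores_of_failers):
    """AUC as the probability that a randomly chosen passer outscores a
    randomly chosen failer on the ITE (ties count as half)."""
    wins = 0.0
    for p in scores_of_passers:
        for f in scores_of_failers:
            if p > f:
                wins += 1.0
            elif p == f:
                wins += 0.5
    return wins / (len(scores_of_passers) * len(scores_of_failers))


# Hypothetical data, invented for illustration only.
ite_percent = [52, 58, 61, 66, 70, 75, 81]
board_score = [410, 395, 450, 440, 480, 470, 530]
passed = [False, False, True, True, True, True, True]

r = pearson_r(ite_percent, board_score)
passers = [s for s, ok in zip(ite_percent, passed) if ok]
failers = [s for s, ok in zip(ite_percent, passed) if not ok]
print(f"r = {r:.2f} ({correlation_strength(r)})")
print(f"AUC = {auc(passers, failers):.2f}")
```

The pairwise AUC used here is equivalent to the area under the ROC curve, but it is only one way to compute it; the included studies may have used other software or methods.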
Data Extraction and Analysis
Results were categorized based on the type of statistical analysis used to determine the relationship between ITE performance and board examination performance: correlation, linear regression, logistic regression, and/or AUC. Additionally, the type of ITE performance data (eg, percent score or rank) used for the analysis was extracted. Data were also collected from publicly available websites for each specialty society in terms of the format and number of ITE questions, and national pass rates for board examinations (Table 1).
Two authors (B.K.S. and H.C.M.) independently assessed the quality of the studies included in the final analysis using the Medical Education Research Study Quality Instrument (MERSQI). The MERSQI scoring system includes 10 items that are used to evaluate the quality of medical education research, including study design, institutions, response rate, type of data, validity, appropriateness of analysis, sophistication of analysis, and outcome.6 Each item is scored (total possible score of 18), with Reed et al citing the mean as 9.6 in a cross-sectional study of 100 medical education research studies.6 The validity and response rate items were not applicable to the studies included in our analysis; thus, these criteria were discarded, resulting in a total possible score of 13.5 points. Any discrepancies in scoring were resolved through group consensus. Importantly, the MERSQI scoring system is not intended to generate an absolute indicator of the validity or reliability of the research results. Furthermore, “cut-points” for “excellent” or “poor” quality have not been defined. Rather, the scores can be used to compare the quality of evidence between studies within a specific body of literature.
Given that there are differences in language across specialties in terms of what QE and CE mean, the term board examination will henceforth refer to the written examination for each given specialty, unless a study evaluated how the ITE compared with oral board examination results. This study is consistent with the definition of non-human subjects research; therefore, no Institutional Review Board review was sought.
Thirty-two articles were included in the final review, representing 21 medical specialties. National first-time pass rates for specialty board examinations are high across these specialties, ranging from 83% to 99% (Table 1). Table 2 includes a summary of the characteristics, results, and quality assessment of all studies included in our final analysis.
ITE Performance Data
The statistical analyses in the studies utilized a variety of quantification methods for ITE performance. Two studies (5%) grouped ITE performance into stanines (scaling of test scores on a 9-point scale with a mean of 5 and standard deviation of 2), 14 studies (38%) used ITE absolute scores, 11 studies (30%) used ITE percentiles, and 10 studies (27%) used both absolute scores and percentile rank. A total of 16 studies used board examination pass/fail rates (43%), 13 studies (35%) used absolute or percentile board examination scores, and 8 (22%) used both absolute and percentile scores.
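For readers unfamiliar with stanines, the following minimal sketch maps a percentile rank to a stanine using the conventional cumulative cutoffs (4, 11, 23, 40, 60, 77, 89, and 96), which yield a 9-point scale with a mean of 5 and a standard deviation of approximately 2. The function name is hypothetical, the treatment of percentiles that fall exactly on a cutoff is an assumption, and individual studies may have derived stanines differently.

```python
from bisect import bisect_right

# Conventional cumulative percentile cutoffs separating stanines 1 through 9.
STANINE_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine_from_percentile(percentile):
    """Map a percentile rank (0-100) to a stanine (1-9).

    Assumption: a percentile exactly on a cutoff is placed in the higher
    stanine; conventions differ on this point.
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    return bisect_right(STANINE_CUTOFFS, percentile) + 1


# Hypothetical percentile ranks, for illustration only.
for pct in (3, 22, 50, 97):
    print(pct, stanine_from_percentile(pct))
```

Under these cutoffs, the bottom 3 stanines correspond to roughly the lowest 23% of examinees.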
Relationship to Other Variables' Validity Evidence
About half of the studies (17, 53%) conducted a single type of statistical analysis to show relationship to other variables' evidence, 8 (25%) conducted 2 types of statistical analyses, 6 (18%) conducted 3 types of statistical analyses, and 1 (3%) conducted all 4 types of analyses. Nineteen studies used correlations, 12 used linear regressions, 18 used logistic regressions, and 6 used AUC values for the statistical analysis. Two studies reported sensitivity and specificity values, but did not provide an AUC value and thus were not included in the AUC category.
Forty-seven percent (9) of the 19 correlation studies found a strong relationship2,7–14 between ITE performance and board examination performance for all residents and fellows in the respective study samples, and 1 found a moderate relationship (Withiam-Leitch and Olawaiye, obstetrics and gynecology15) for all residents. The other 9 correlation studies found mixed results by postgraduate year (PGY) or specialty.16–24 Eleven of the 12 studies using linear regression found that ITE scores significantly predicted board examination performance.4,7,9,10,13,25–29 The remaining study showed significant prediction for PGY-3–PGY-4 residents, but not for PGY-1–PGY-2 residents (Swanson et al, orthopaedic surgery21).
For logistic regression analysis, studies either used ITE scores as a predictor on a continuous scale or categorized ITE scores into 2 categories (eg, < 10th percentile, > 10th percentile). AUC analysis was used to determine the precision in prediction as a complement to logistic regression results or was done without logistic regression analysis. For predicting a board examination passing outcome, 6 studies showed ITE scores significantly predicted who would pass the board examination.4,9,13,26,27,29 Three additional studies showed that a particular high score, quartile, or stanine significantly predicted who would pass the board examination (Pucas 2012, otolaryngology30), along with good AUC accuracy/predictive value (Lingenfelter et al, obstetrics and gynecology31; Pucas 2018, otolaryngology32). O'Neill et al (family medicine)14 also found good AUC accuracy/predictive value for a particular high ITE score. Two additional studies showed that passing the ITE predicted passing the board examination (Johnson et al, ophthalmology33) with good AUC accuracy/predictive value (Indik et al, cardiovascular disease fellows10).
For predicting a board examination failing outcome, 2 studies showed ITE scores significantly predicted who would fail the board examination (Swanson et al, orthopaedic surgery21), but only with a moderate AUC accuracy/predictive value (Withiam-Leitch and Olawaiye, obstetrics and gynecology15). Three studies showed that a particular low score or quartile significantly predicted who would fail the board examination (de Virgilio et al, surgery3; Kay et al, internal medicine11), with a good AUC accuracy/predictive value for PGY-2 and PGY-3 residents' ITE scores, but poor predictive value for PGY-1 ITE scores (Carey and Drucker, ophthalmology16). Babbott et al (internal medicine)34 did not perform logistic regression and found good AUC accuracy/predictive value for a low quartile score. Only 1 study showed that failing an ITE significantly predicted failing the board examination (Carey and Drucker, ophthalmology16), but with a low positive predictive value, and the finding applied only to PGY-2 and PGY-3 residents' ITE scores. McClintock and Gravlee (anesthesiology)29 applied a logistic regression to see how well the model predicted board examination fail/pass outcomes. The accuracy in prediction value was low-moderate for predicting a fail outcome and moderate-high for predicting a pass outcome. Finally, 2 studies found ITE scores had weak to no prediction for board examination pass/fail outcomes (Collichio et al, hematology and oncology8; Monaghan et al, hematology35). Additionally, Pucas (otolaryngology)32 and O'Neill et al (family medicine)14 were not able to predict who would fail the board examination based on their respective AUC analyses.
In terms of quality assessment of the articles included in this study, the average MERSQI score was 7.9 out of a possible 13.5 points (range 7–9). This is within the range of reported MERSQI scores of medical education research more broadly.36 All the included studies were retrospective cohorts; no studies were randomized controlled trials.
This systematic review finds there is generally strong evidence that strong trainee performance on ITEs is predictive of subsequent passing performance on specialty board examinations. However, there is limited evidence that poor performance on the ITE predicts subsequent failure on board examinations, which calls into question the appropriateness of programs using the ITE to make high-stakes decisions. These results are important, as performance on ITEs has been widely accepted as predictive of subsequent performance on specialty board examinations, with pervasive beliefs that low-scoring residents are at risk of failing their board examination, resulting in some specialties reporting high-stakes use of ITE performance.
National first-time pass rates for specialty board examinations are high across specialties, which makes it difficult to predict which trainees will fail the examination (Table 1). In a cohort of otolaryngology residents, even those who scored in the bottom 3 stanines for each of the 4 years they took the ITE still had an 82% pass rate on their board examination.32 If a nephrology program director simply predicted that all nephrology fellows would pass the nephrology board examination, they would be correct 89% of the time; using the ITE to make the same prediction, they would be correct 90% of the time. This suggests that, despite correlations between ITE and board performance, prediction of board examination pass/fail using the ITE for an individual resident is of little practical benefit.26 Even residents who perform very poorly on the ITE have a reasonable likelihood of passing their board examination.
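The base-rate argument above can be made concrete with a short sketch. The 89%, 90%, and 82% figures are taken from the text; the function names and the group size of 100 low-scoring residents are hypothetical and are used only to turn the 82% pass rate into counts.

```python
def always_pass_accuracy(pass_rate):
    """Accuracy of predicting that every trainee passes: it equals the pass rate."""
    return pass_rate


def failure_ppv(flagged_who_failed, flagged_who_passed):
    """Share of trainees flagged as at risk who actually failed the board examination."""
    return flagged_who_failed / (flagged_who_failed + flagged_who_passed)


# Nephrology figures quoted in the text.
baseline = always_pass_accuracy(0.89)
with_ite = 0.90
gain_points = round((with_ite - baseline) * 100, 1)

# Hypothetical group of 100 residents in the bottom 3 stanines, using the
# 82% board pass rate quoted in the text for that group.
ppv = failure_ppv(flagged_who_failed=18, flagged_who_passed=82)

print(f"Gain in accuracy from using the ITE: {gain_points} percentage point(s)")
print(f"Flagged residents who actually failed: {ppv:.0%}")
```

In this framing, flagging every resident in the lowest stanines as likely to fail would be wrong for about 4 of every 5 residents flagged.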
The studies that did find a significant outcome of failing may not generalize to all trainees taking that particular ITE; thus, those results may only be useful for the individual program, whereas the studies that found a significant outcome of passing were more likely to use national samples of all residents and fellows. Additionally, since the number of trainees who fail an ITE is small, accurately predicting whether all of them will fail their boards is statistically difficult: having just one of these trainees pass the board examination will greatly impact whether the outcome is significant. The number of trainees who pass the ITE is much larger, so a few of them can fail the board examination without preventing a significant finding for the prediction of passing.
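The small-sample point above can be illustrated with a one-sided Fisher's exact test on a 2 × 2 table of ITE outcome by board examination outcome. This is a minimal sketch: Fisher's exact test is used here only as a convenient example and is not necessarily the test used in the included studies, and the function name and all counts are hypothetical.

```python
from math import comb


def fisher_one_sided(ite_fail_board_fail, ite_fail_board_pass,
                     ite_pass_board_fail, ite_pass_board_pass):
    """One-sided Fisher's exact p-value for the 2 x 2 table.

    Returns the probability of seeing at least this many board failures
    among ITE failers if ITE and board outcomes were unrelated
    (hypergeometric upper tail).
    """
    a, b = ite_fail_board_fail, ite_fail_board_pass
    c, d = ite_pass_board_fail, ite_pass_board_pass
    n_total = a + b + c + d
    board_failures = a + c
    ite_failures = a + b
    upper = min(board_failures, ite_failures)
    tail = sum(
        comb(board_failures, k) * comb(n_total - board_failures, ite_failures - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, ite_failures)


# Hypothetical program: 5 trainees failed the ITE and 95 passed it.
# Scenario 1: 3 of the 5 ITE failers also fail the board examination.
p_three_fail = fisher_one_sided(3, 2, 5, 90)
# Scenario 2: one of those 3 trainees passes the board examination instead.
p_two_fail = fisher_one_sided(2, 3, 5, 90)

print(f"3 of 5 ITE failers fail boards: p = {p_three_fail:.4f}")
print(f"2 of 5 ITE failers fail boards: p = {p_two_fail:.4f}")
```

Because only 5 trainees fall in the ITE-fail group, moving a single trainee from one cell to another changes the p-value substantially, whereas the same change among the 95 trainees who passed the ITE would have a much smaller effect.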
It is important to note the different formats of board examinations. Specialties including pediatrics, family practice, pathology, preventive medicine, neurology, internal medicine (and associated subspecialties), and psychiatry typically have 1 written examination that serves as the CE. Thus, evaluating the relationship between the ITE and CE in these fields may represent a more accurate comparison. Within surgical specialties, obstetrics and gynecology, ophthalmology, and anesthesiology, there are 2 separate examinations. The QE is a written examination designed to evaluate knowledge in principles and applied science in a given specialty.37 The CE among these specialties is an oral examination with the intent of evaluating a candidate's clinical judgement, reasoning skills, and problem-solving skills.38 The ITE has limited ability to predict performance on oral board examinations. Additional tools that specifically assess application of knowledge and demonstration of clinical judgement in an oral format are needed to predict passage of oral CEs.
ITEs were originally developed as a formative assessment tool to assist learners and programs in identifying deficiencies in medical knowledge. Scores were meant to be used for no-stakes or low-stakes decisions and to guide development of individualized learning plans. To maintain the original intent of these examinations, further efforts at delineating “cut-scores” that predict board examination failure should not be undertaken. It remains similarly challenging to predict who will fail board examinations, with few studies designed to address this issue. Even if a significant fail outcome is found, the predictive value is low. The paucity of data regarding ITE prediction of board examination failure suggests that program directors should exercise caution in the interpretation and use of low ITE scores at the individual resident level, particularly regarding high-stakes uses to inform formal academic actions (probation, repeating PGY, and requiring remediation) within a program. The majority of studies describe the use of ITE performance as low-stakes and formative for trainees or GME programs, with 2 (6%) studies in pediatrics and ophthalmology using the information for continuous program improvement.2,33 Three studies (9%) in pediatrics and general surgery describe moderate to high-stakes use of ITE performance, including decisions regarding formal academic actions.2–4 Finally, as expected, ITE performance increases with PGY. Therefore, when a resident is in their final year of training, when the correlations between ITE and board examination performance are strongest, it may be too late to help struggling residents “catch up” in time to pass board examinations.
This study has several limitations. First, the heterogeneity of the assessment instruments and specialties limited our ability to perform a pooled meta-analysis of the data. Furthermore, the studies included in this review vary in population size, from single institutions to a national review of how ITEs correlated with board examinations. There were also variations in study design, with some studies including data on interventions performed within a given residency versus large national data on how ITEs correlate with board examination scores. Future studies should involve national samples and investigate precision in predicting failing or passing board examinations utilizing other assessment data and contextual variables in addition to ITE scores.
This systematic review demonstrates that strong performance on ITEs is associated with passing subsequent board examinations, while the reverse is not necessarily true. Ultimately, this suggests that the GME community should continue to exercise caution and restraint in using ITE scores for moderate to high-stakes decisions.
Funding: The authors report no external funding source for this study.
Conflict of interest: The authors declare they have no competing interests.