ABSTRACT
The clinical learning environment (CLE) is frequently assessed using perceptions surveys, such as the AAMC Graduation Questionnaire and ACGME Resident/Fellow Survey. However, these survey responses often capture subjective factors not directly related to the trainee's CLE experiences.
The authors aimed to quantify these subjective factors as "calibration bias" and to show how it varies by health professions education discipline and co-varies with program, patient-mix, and trainee factors.
We measured calibration bias using 2011–2017 US Department of Veterans Affairs (VA) Learners' Perceptions Survey data to compare medical students and physician residents and fellows (n = 32 830) with nursing (n = 29 758) and allied and associated health (n = 27 092) trainees.
Compared to their physician counterparts, nursing trainees (OR 1.31, 95% CI 1.22–1.40) and allied/associated health trainees (1.18, 1.12–1.24) tended to overrate their CLE experiences. Across disciplines, respondents tended to overrate CLEs when reporting 1 higher level (of 5) of psychological safety (3.62, 3.52–3.73), 1 SD more time in the CLE (1.05, 1.04–1.07), female gender (1.13, 1.10–1.16), 1 level lower (of 7) academic level (0.95, 1.04–1.07), and having seen the lowest tercile, for their respective discipline, of patients who lacked social support (1.16, 1.12–1.21), had low income (1.05, 1.01–1.09), had co-occurring addictions (1.06, 1.02–1.10), or had mental illness (1.06, 1.02–1.10).
Accounting for calibration bias when using perception survey scores is important to better understand physician trainees and the complex clinical learning environments in which they train.
While subjective factors are believed to influence how residents and fellows rate their clinical learning environments, how to calibrate for these influences when using such ratings to rank programs by their performance is not well understood.
We measure calibration bias and show how biases vary by discipline, the trainee's program and facility factors, and the mix of patients that trainees see.
Study data were limited to Department of Veterans Affairs medical centers and to a restricted set of predictor factors.
Educators must integrate calibration bias metrics into their perceptions survey results to better understand their residents and fellows and the complex clinical learning environments in which they train.
Introduction
A critical component of medical education is the clinical learning environment (CLE) where trainees engage in supervised patient care to acquire competencies necessary to enter independent practice.1 To evaluate CLEs for program accreditation, faculty evaluations, and program rankings, education leaders turn to perceptions surveys,2 such as the AAMC Medical School Graduation Questionnaire3 and the Accreditation Council for Graduate Medical Education (ACGME) Resident/Fellow Survey.4 These surveys ask respondents to rate items on 5-point scales (satisfaction, agreement, excellence), with item responses grouped into domains reflecting CLE constructs such as supervision, interaction with faculty, clinical experience, scut work, research opportunities, working environment, personal experiences, and professionalism.2,5,6 While perception surveys can reflect CLE qualities, critics charge that responses may also vary with how questions are framed,7 surveys are designed,8 and response options are quantified.9 Importantly, respondents' subjective characteristics,10 including personality traits, perceptions of personal support, peer morale, and autonomy,11–13 and how respondents retrieve information, make judgments, and interpret survey questions,14–16 have also been shown to impact perception survey responses.
In this study, we propose a theoretical framework that defines calibration bias as the difference between a trainee's self-reported rating and the rating that trainee would have given had they responded with the subjective characteristics of the average trainee respondent. If calibration bias were fully controlled, trainees would rate the same experience in exactly the same way. Well-validated data from the Department of Veterans Affairs (VA) national CLE surveys10 are used to approximate calibration bias and assess: (1) whether such biases exist in well-validated survey data, and if so, (2) whether calibration bias varies by discipline and (3) by trainee and CLE factors.
Conceptual Model
Derived from the 9-criteria evaluation framework,10 figure 1 shows calibration bias as a mediator between the CLE as the object to be assessed, and domain scores used to assess the CLE. The bias is a result of subjective17–19 factors that impact how a respondent's experiences are perceived, and threshold9,20 factors that impact how respondents value the 5-point response options they must select to rate those perceptions. These biases can lead respondents to over- or under-rate experiences compared to an “average” rater who, by definition, has no calibration bias. Based on our framework, possible remedies include changes in survey design, administration, scoring, and analyses.
Role of Calibration Bias on Relationship Between Clinical Learning Environment and Trainee Perceptions Survey Scores
a Facility-level calibrating items for this example include parking, convenience of the facility location, and electronic health record. Respondent satisfaction with these calibrating items is scored as the calibration index.
Calibration bias is not directly observable. To test for its presence in validated data, we created a calibration index where respondents rate their satisfaction with selected CLE facility-level “calibrating items,” such as parking, facility location, and electronic health record (EHR), where experiences are not likely to vary among trainees reporting on the same facility and academic year. The index equals the respondent's calibrating item score minus the average of all such scores from respondents to the given facility and academic year. If calibrating item experiences are invariant, then from figure 1 any variation in index scores must be the result of mediating subjective and threshold factors. A second test does not depend on strict invariant item experiences. As shown in figure 1, associations between the index score and trainee, patient, program, and other facility-level factors that are not expected to impact a trainee's calibrating item experiences, can only be observed if the respondent answered the survey in the presence of calibration biases and such biases are influenced by such factors.
Methods
Data Setting and Sample
Data came from the Department of Veterans Affairs (VA) Learners' Perceptions Survey (LPS) for physician, nursing, and allied and associated health trainees,21 collected from July 1, 2010, through August 30, 2017. Validated elsewhere,10 the LPS is an anonymous, voluntary, Office of Management and Budget approved, web-based perceptions survey administered annually to trainees who rotate through a VA medical center as part of the required curriculum of an accredited health professions education program. LPS respondents were solicited through advertising, capturing only 11% of all VA trainees. However, LPS findings have been widely published,10 with physician resident respondents shown to be comparable by specialty, academic level, international status, and gender to US physician residents in ACGME-accredited non-pediatric and non-OB-GYN programs.22
Calibration Bias
Calibration bias is estimated by a proxy index based on how respondents rated their satisfaction with 3 calibrating items on a 5-point scale. The calibrating items are parking, location convenience, and the EHR. Item responses are scored as 1 for "very dissatisfied," 2 "somewhat dissatisfied," 3 "neutral," 4 "somewhat satisfied," and 5 "very satisfied." The index, Cindex, is computed by taking the average of the 3 calibrating item scores and subtracting the mean of such averages computed for all survey respondents at the given facility.22,23 Cindex is also expressed in standard deviates (Cz = [Cindex − mean(Cindex)]/SD(Cindex)) and in binary form (Cbinary = 1 if Cz > 0, and 0 if Cz ≤ 0). Positive Cindex values indicate trainees whose subjective factors put them at risk of overrating their experiences compared to an average respondent, while negative values indicate trainees at risk of underrating their experiences.
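The index computation described above can be sketched in Python. This is a minimal illustration with hypothetical data and column names, not the LPS data schema, and it centers by facility only (the study also centers within academic year):

```python
import pandas as pd

# Hypothetical survey records; ratings are 5-point satisfaction scores
# for the 3 calibrating items (parking, location convenience, EHR)
df = pd.DataFrame({
    "facility": ["A", "A", "A", "B", "B", "B"],
    "parking":  [4, 2, 5, 3, 3, 1],
    "location": [5, 3, 4, 4, 2, 2],
    "ehr":      [4, 2, 4, 3, 3, 2],
})

# Average the 3 calibrating item scores for each respondent
df["item_mean"] = df[["parking", "location", "ehr"]].mean(axis=1)

# Cindex: respondent's average minus the facility-level mean of those averages
df["Cindex"] = df["item_mean"] - df.groupby("facility")["item_mean"].transform("mean")

# Cz: Cindex in standard deviates; Cbinary: 1 if above the sample mean, else 0
df["Cz"] = (df["Cindex"] - df["Cindex"].mean()) / df["Cindex"].std()
df["Cbinary"] = (df["Cz"] > 0).astype(int)
```

Because this sketch centers only within facility, the overall mean of Cindex is exactly 0 here; in the actual data, centering within facility and academic year leaves a small nonzero overall mean (−0.06, as reported below).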
The psychometric properties of Cindex have been estimated for VA trainees previously.10 Calibration index values were found to have a mean of −0.06, a range of −3.60 to 2.00, an SD of 0.84, facility-level clustering (ICC = 0.05), and test-retest reliability (ICC = 0.86). We also reported modest scalability (H = 0.38) and internal consistency (Cronbach's alpha = 0.59) among the 3 calibrating items. This is not surprising, as parking and location fall under the working environment, while the EHR falls under the clinical environment. Calibration bias is expected to reflect the subjective properties of trainees, so combining different items is tantamount to measuring illness severity by counting comorbidities, even though such diseases are clinically distinct and unrelated.24
Covariates
Trainee and CLE covariates were computed from LPS survey responses, previously shown to have high internal consistency (Cronbach's alpha) and test-retest reliability.10 Trainee covariates included professional discipline across 26 professions, academic level in years since high school, and gender. CLE covariates included the percent of time the trainee spent in VA; psychological safety,25 computed from 5-point agreement with the statement "Members of the clinical team of which I was a part are able to bring up problems and tough issues"; a 5-point VA facility service complexity score26; and mix of patients seen, ranked into terciles by discipline, for patients "age 65 and over," with "chronic mental illness," "chronic medical illness," "multiple illnesses," "substance dependence," and "low income," and those who "lacked social support."
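The within-discipline tercile ranking of patient mix can be sketched as follows (hypothetical column names and values; `pd.qcut` splits each discipline's own distribution into thirds, so a trainee is ranked relative to peers in the same discipline):

```python
import pandas as pd

# Hypothetical: percent of each trainee's patients with chronic mental illness
df = pd.DataFrame({
    "discipline": ["medicine"] * 6 + ["nursing"] * 6,
    "pct_mental_illness": [10, 20, 30, 40, 50, 60, 5, 15, 25, 35, 45, 55],
})

# Rank each trainee into low/middle/high terciles within their own discipline
df["mi_tercile"] = df.groupby("discipline")["pct_mental_illness"].transform(
    lambda s: pd.qcut(s, 3, labels=["low", "middle", "high"])
)
```

Ranking within discipline matters because disciplines see systematically different patient mixes; an absolute cutoff would conflate discipline with patient-mix exposure.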
Analyses
Independent associations were estimated by regressing the calibration index on trainee and program factors using SPSS generalized linear models, with an identity link function and Gaussian distribution for Cindex and Cz, and a logit link function and binomial distribution for Cbinary.
Results
Table 1 describes sample means, SDs, and frequencies of all study variables. Cindex ranged from −3.563 to 1.628 with an SD of 0.7966, consistent with the theory that calibration biases exist and vary by trainee.
Table 2 shows how calibration, computed as the percent of respondents with positive index values (Cz > 0), varied by discipline (P < .001), ranging from 43.0% (psychology) to 66.9% (physical therapy). Table 3 shows allied and associated health trainees and nursing trainees were 17.8% and 30.5% more likely, respectively, to overreport favorable ratings (Cbinary = 1), and had average index scores higher by 0.085 and 0.142 standard deviates, compared to their physician counterparts. These findings are consistent with our hypothesis that calibration biases vary by academic discipline.
Percent Trainees with a Positive Calibration Index Score and a High Psychological Safety Rating, and Their Associations, by Discipline

Independent Associations Between Trainee and CLE Factors and Calibration Index, by How Calibration Is Scoreda

Table 2 shows that psychological safety, computed as the percent of trainees who strongly agreed that their CLE was psychologically safe, was associated with calibration within each discipline (except chiropractic). As with calibration, psychological safety also varied by discipline (P < .001). Figure 2 shows disciplines with a higher percentage of trainees who had a positive calibration index value also had a higher percentage of trainees who strongly agreed their CLE was psychologically safe (r = 0.786, P < .001). Table 2 also reveals that the associations between calibration and psychological safety varied by discipline (P < .001). Across disciplines, the size of the association between calibration and psychological safety was negatively correlated with the percent of a discipline's trainees who strongly agreed their CLE was psychologically safe (r = −0.564, P = .003).
Percent Trainees Reporting Positive Calibration Index Valuesa and Psychologically Safeb Clinical Learning Environment by Professional Discipline
a Percent of respondents with Cz > 0 (above sample mean).
b Percent of respondents who “strongly agreed” that their VA clinical learning environment was psychologically safe.
Table 3 describes the independent associations between trainee and CLE factors and calibration across all study variables, by how bias was scored. Psychological safety had by far the largest association: a 1-level increase on the 5-point psychological safety scale was associated with an average increase in calibration (Cz) of 0.476 standard deviates. Calibration was also positively associated with lower academic level, female gender, percent of time the trainee spent in VA, more complex facilities, and seeing fewer patients than expected for the respondent's discipline with chronic mental illness, chronic medical illness, alcohol/substance dependence, low income, or a lack of social/family support.
Discussion
Our findings highlight the importance of accounting for calibration bias when interpreting CLE perceptions survey scores. Calibration bias is viewed as subjective and threshold factors mediating between a trainee's CLE experience and their satisfaction rating of that CLE. We measured bias severity using an index score that averages how trainees rated the 3 calibrating items, mean-centered by facility and academic year. We observed that these index values varied by trainee and discipline, suggesting calibration biases exist, but only to the extent that trainee experiences were invariant by facility and academic year. In exploratory analyses, we found that patient, program, and trainee factors not expected to impact trainee experiences with parking, location, and the EHR were consistently associated with index values, further suggesting the presence of calibration biases.
Our findings are consistent with studies that have shown adjusting for Cindex, under different names, has led to significant changes in reported results when assessing primary care continuity clinics,27 psychological safety,23 trainee preferences for education program elements,22 and interprofessional team care.28 Of note, calibration bias was accounted for by including the index as a covariate to explain a CLE domain score of interest.
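The covariate-adjustment strategy noted above can be illustrated with a minimal simulation (all names and values hypothetical): when the calibration index correlates with a program indicator, the unadjusted program comparison is inflated, and including Cindex as a covariate recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical data: program B trainees tend to have higher calibration indices,
# and the observed domain score is contaminated by calibration bias
program = rng.integers(0, 2, size=n)                 # 0 = program A, 1 = program B
cindex = 0.5 * program + rng.normal(0.0, 0.8, n)     # calibration index
domain = 3.5 + 0.2 * program + 0.6 * cindex + rng.normal(0, 0.3, n)

# Program effect estimated without vs. with the calibration index as a covariate
X_unadj = np.column_stack([np.ones(n), program])
X_adj = np.column_stack([np.ones(n), program, cindex])
b_unadj = np.linalg.lstsq(X_unadj, domain, rcond=None)[0][1]
b_adj = np.linalg.lstsq(X_adj, domain, rcond=None)[0][1]
```

Here the unadjusted estimate absorbs the bias pathway (roughly 0.2 + 0.6 × 0.5 = 0.5), while the adjusted estimate recovers the true program effect near 0.2.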
The finding of a strong association between psychological safety and calibration bias is also consistent with the role psychological safety has been seen to play in the workplace29–31 and in health professions education,32,33 together with its connection to CLE satisfaction,23 work-related communication,34 team tenure,35,36 perceived care,37 self-awareness, burnout, civility,38 and mental health.39 Our findings suggest trainees who believe their CLE is psychologically unsafe will tend to underrate their CLE experiences below what a rater unaffected by psychological safety would otherwise have rated those same experiences.
Overall, resident physicians reported more negative calibration bias and lower levels of psychological safety compared to their nursing and allied and associated health trainee counterparts. These findings are consistent with the pressures physician trainees face when engaged in the care of complex patients in situations with high levels of ambiguity and uncertainty,40–42 where trainees assume the role of an apprentice with expectations of becoming independent practitioners.43 Resident physicians advance their professional development by engaging in patient care in a supervised environment,44,45 where risk-taking behavior frequently occurs in complex hierarchical situations fraught with uncertainty.38,46–48
There are study limitations. Our methods do not allow separate estimates of subjective and threshold biases. The computation rests on the assumption that calibrating item experiences are either invariant or, at minimum, not affected by other trainee and CLE factors. We also assumed respondents both comprehended the meaning of, and recalled information relevant to answering, the 3 index questions. Similarly, trainee responses may be subject to additional biases when assessing sensitive topics.46–48 However, we believe the impact of such pressures may be minimal because the survey was administered nationally, with only aggregate scores reported to program directors. We also assumed calibration bias is a property of trainees. An alternative approach is to construct separate indices derived from experience-invariant items that are related to the CLE construct of interest.
Readers are also cautioned against extrapolating results to different clinical settings, as VA medical centers can differ from non-VA clinical settings. In addition, with an 11% sampling rate, it is unlikely this convenience sample represents all VA trainees.49,50 However, our purpose here is to compare trainees by discipline, and the LPS resident physician sample has been shown to be comparable to US residents in non-pediatric and non-OB-GYN programs by international status, PGY level, and specialty.22
Finally, our list of Cindex predictors was limited. Future studies should consider the prevalence of depressive disorder, burnout, and chronic anxiety among trainees and teaching faculty that have been shown to be associated with high pessimism, negative perceptions,51 negative-selective memory,52 lower satisfaction intensity,53,54 increased frequency of medical errors,55 and higher rates of medical negligence and malpractice litigation.56,57
Conclusions
This study offers evidence that a trainee's subjective and threshold factors introduce calibration biases that impact how responses to CLE perceptions surveys should be scored, analyzed, and interpreted. The integration of calibration bias metrics into CLE perceptions surveys should be an integral element in the quest to better understand medical trainees and the complex clinical learning environments in which they train.
References
Author notes
Funding: This study was funded by the Department of Veterans Affairs, Veterans Health Administration, in the Office of Research and Development, Health Services Research and Development Service IIR#14-071 and IIR#15-084, and by the MacPherson Society, Loma Linda University School of Medicine.
Competing Interests
Conflict of interest: The authors declare they have no competing interests.
The authors would like to thank the following individuals for their help on the project: Christopher T. Clarke, PhD, David Bernett, BA, and Laura Stefanowycz, Office of Academic Affiliations Data Management and Support Center, St. Louis, MO; Grant W. Cannon, Utah VA Medical Center, University of Utah School of Medicine; Catherine P. Kaminetzky, Puget Sound VA Healthcare System, University of Washington School of Medicine; Sheri A. Keitz, University of Massachusetts School of Medicine; and Shanalee Tamares, MLIS, and Daniel Reichert, MD, Loma Linda University.