The standardized letter of evaluation (SLOE) is the application component that program directors value most when evaluating candidates to interview and rank for emergency medicine (EM) residency. Given its successful implementation, other specialties, including otolaryngology, dermatology, and orthopedics, have adopted similar SLOEs of their own, and more specialties are considering creating one. Unfortunately, for such a significant assessment tool, no study to date has comprehensively examined the validity evidence for the EM SLOE.
We summarized the published evidence for validity for the EM SLOE using Messick's framework for validity evidence.
A scoping review of the validity evidence of the EM SLOE was performed in 2020. A scoping review was chosen to identify gaps and future directions, and because the heterogeneity of the literature makes a systematic review difficult. Included articles were assigned to an aspect of Messick's framework and determined to provide evidence for or against validity.
There have been 22 articles published relating to validity evidence for the EM SLOE. There is evidence for content validity; however, there is a lack of evidence for internal structure, relation to other variables, and consequences. Additionally, the literature regarding response process demonstrates evidence against validity.
Overall, there is little published evidence in support of validity for the EM SLOE. Stakeholders need to consider changing the ranking system, improving standardization of clerkships, and further studying relation to other variables to improve validity. This will be important across GME as more specialties adopt a standardized letter.
The standardized letter of evaluation (SLOE) was developed by a Council of Emergency Medicine Program Directors (CORD) task force in 1995 for use in medical students' applications to emergency medicine (EM) residency.1 In the 23 years since its inception, the SLOE has become the most important piece of information that program directors use to determine which candidates they will select to interview and how they will rank students for the Match.2–4 The SLOE consists of the following (see online supplementary data for an example SLOE):
Grade (honors, high pass, pass, fail, with some institutions choosing to select only pass/fail)
“Global ranking” in which writers are instructed to rate the student against all other EM bound rotators, placing them in the top 10%, top third, middle third, or bottom third
Predicted placement on the institution's match list, again from top 10% to top, middle, and bottom third
Qualities necessary for success in EM ranked against peers
An early study comparing the SLOE to the narrative letter of recommendation (NLOR) was favorable, indicating that the SLOE was significantly more user friendly: it decreased both writing and reviewing time and was easier to interpret, with high interrater reliability.5 Other specialties, including otolaryngology, dermatology, and orthopedics, have adopted a SLOE as well. Citing these advantages of the SLOE over the NLOR, a recent commentary in Academic Medicine suggested that the SLOE be adopted by all specialties for use during the residency application process.6 Across specialties, program directors cite letters of recommendation as highly important, ranking them the second most important factor for interview invites, only after failed USMLE Step 1 attempts.7 Thus, increased use of the SLOE across specialties will have a significant effect on the transition from undergraduate to graduate medical education.
While there are demonstrated benefits of the SLOE over the NLOR, there has not been a comprehensive study of the validity evidence of the SLOE. Messick defines validity as the “inductive summary of both the existing evidence for and the potential consequences of score interpretations and use.”8 Providing evidence for the validity of an assessment tool is therefore necessary for the meaningful use of the tool. Here we present a scoping review of the published validity evidence of the EM SLOE, using Messick's framework for construct validity.8 A scoping review was chosen to identify gaps and future directions, and because the heterogeneity of the literature makes a systematic review difficult.
A scoping review of the validity evidence of the EM SLOE was performed. Methods were developed following previously published guidance for conducting scoping reviews.9
In 2020, PubMed, Medline, Google Scholar, Web of Science Core Collection, and Embase were searched for “(sloe OR slor) emergency medicine” and all variations of the phrase “standard/standardized letter of recommendation/evaluation.” We included any study in which the EM SLOE was the subject of study. Citations were then assessed as to whether the study question was related to validity and were excluded if not; abstracts were also excluded. The initial search was conducted by a single author (P.K.), erring on the side of inclusivity. Included citations were reviewed separately for exclusion criteria by both authors. Any disagreements were resolved by discussion.
Messick's framework for validity includes the following aspects: Content, Response Process, Internal Structure, Relation to Other Variables, and Consequences.8 The study question in each article was reviewed by each author and placed into 1 of the 5 categories that seemed the best fit. There were no disagreements.
To determine whether a study provided evidence for or against each aspect of validity, each author again independently assessed the results and conclusions of the study. Any disagreement between the authors was resolved with a discussion.
The initial search terms returned 212 citations. After application of the inclusion and exclusion criteria, 22 articles were included in our review. The majority of studies assessed a single question with a dichotomous outcome. One study with multiple questions was determined to have “mixed” evidence. There is no published literature examining the evidence for content validity.
Response Process
Fourteen studies address the response process of the SLOE, making it the most studied aspect of validity.5,10–22 Three of the 14 studies provided evidence for validity, and 11 provided evidence against validity of the SLOE.
In favor of the SLOE, one study found that the interrater reliability was 0.97, in contrast to NLORs, which had an interrater reliability of 0.78.5 The second study looked at gender bias in the narrative portion of the SLOE at one institution and found that the narrative was “relatively free of gender bias.”10 The third, published in 2019, again looked at gender differences in the narrative portion and determined that there was no difference in word type frequency.11
Eleven studies provided evidence against response process validity.12–22 Six studies have shown that authors do not adhere to the ranking guidelines and that ranking inflation is rampant on the SLOE.12–17 One review found that “nearly all” applicants were ranked near the top and that only 2% of letters used the bottom rankings.12 Another study demonstrated that students were ranked in the “top 10%” 40% of the time, 83% of students were “above the level of their peers,” and more than 95% of SLOEs ranked the students in the “top third” compared to their peers in the “qualifications for EM” section.13 Similarly, a survey of SLOE writers found that only 39% admitted to using the full scale to rank applicants.14 However, the most recent study in this area does show improvement from these 3 earlier studies, demonstrating a more even distribution between the categories of top 10% and top, middle, and bottom third.15 Even with the demonstrated improvement, writers still exhibited a reluctance to use the full scale as students were still ranked in a top-heavy fashion.15 Additionally, 68% of SLOE writers do not follow the given SLOE instructions, and 67% of writers were not formally instructed on how to fill out a SLOE.16
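The ranking inflation described above can be expressed as a departure from the distribution the scale implies. The following is a minimal sketch, not a method used by any of the cited studies: it computes a chi-square goodness-of-fit statistic comparing observed category counts with nominal shares. The counts are invented for illustration, and the nominal shares assume that “top third” excludes students already counted in “top 10%.”

```python
# Nominal share of each SLOE ranking category if writers used the full scale.
# Assumption: "top third" excludes the students already counted in "top 10%".
NOMINAL_SHARES = {
    "top 10%": 0.10,
    "top third": 1 / 3 - 0.10,
    "middle third": 1 / 3,
    "bottom third": 1 / 3,
}


def chi_square_vs_nominal(counts):
    """Chi-square goodness-of-fit statistic against the nominal shares."""
    total = sum(counts.values())
    statistic = 0.0
    for category, share in NOMINAL_SHARES.items():
        expected = share * total
        observed = counts.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts for 100 letters (invented, not data from any cited study).
inflated = {"top 10%": 40, "top third": 35, "middle third": 20, "bottom third": 5}
statistic = chi_square_vs_nominal(inflated)

# Critical value of the chi-square distribution, 3 degrees of freedom, alpha = 0.05.
CRITICAL_VALUE_DF3 = 7.815
print(f"chi-square = {statistic:.1f}; exceeds critical value: {statistic > CRITICAL_VALUE_DF3}")
```

A statistic far above the critical value indicates that the observed rankings are unlikely to have come from writers using the scale as designed; it does not indicate why.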
Another study examining grading differences found wide variability in grading practices between clerkships.18 The percentage of students who received an honors grade at a specific clerkship varied from 1% to 87%; some schools used 3-point grade scales while others used 5-point scales, and some schools graded pass/fail.18 The grade is included on the SLOE.
Furthermore, studies have shown that variables specific to the letter writer can affect the SLOE. One study found that higher ratings were given by less experienced writers and by writers who had known the student for a longer period of time.19 Similarly, student scores were consistently higher on letters written by their home institution than on those written after visiting clerkships.20 Moreover, while the 2 studies described above found no gender effect in the SLOE, 2 other studies did find one.21,22 One study found that a student was significantly more likely to receive the highest possible ranking if both the student and the writer were female; no differences existed for any other gender pairing.21 Finally, female students were found to have statistically significantly higher scores than male students on the SLOE.22
The majority of studies regarding response process provide robust evidence against validity. Additionally, studies regarding gender differences provide conflicting conclusions. This aspect of validity has been studied the most, and while the evidence against validity is discouraging, the most recent and largest study does show a significantly more even distribution of rankings across the top 10% and the top, middle, and bottom thirds than older studies did.
Internal Structure
One published study relates to the internal structure of the SLOE. A 2001 study correlated the rank of “guaranteed match” (the highest possible ranking prior to SLOE revision in 2002) with other parts of the SLOE.23 The authors demonstrated that the guaranteed match ranking was correlated with the honors grade, a ranking of “outstanding” on differential diagnosis, a ranking of “outstanding” on work ethic, and a ranking of “outstanding” on the global assessment, all as one would expect, providing some evidence for internal structure.23 However, guaranteed match also correlated with the author's position and with whether the author and student had a relationship outside of the emergency department.23 This single 2001 study provides very little overall evidence either way for internal structure, demonstrating that this aspect of validity of the SLOE needs further study.24
Relation to Other Variables
Four studies have been published regarding the SLOE's relation to other variables.2,24–26 The first study compared rankings on the SLOR (this study was undertaken before the instrument's name was changed to SLOE) to a ranking of residents' “final success” upon graduation, with “final success” being defined after the faculty ranked each graduating resident against all previous residents at one institution.24 The SLOR was not strongly correlated with this measure of success in residency.24 The next study examined whether the SLOE category “predicted rank on the match list” correlated with the actual match list and found that the assessment accurately predicted the final rank order 26% of the time.25 The authors found that the students' positions on the SLOE were overestimated 66% of the time and underestimated 8% of the time. A later study showed that the global assessment portion of the SLOE was positively correlated with the final rank list, with a Spearman's correlation of 0.332.2 Finally, the most recent article compared the individual's SLOE to their performance as a graduating resident; institutions grouped the residents into thirds based on a score created from the numerical values on their Accreditation Council for Graduate Medical Education Milestone assessments. The authors found that the residents' “final ability” was correlated with the SLOE's global assessment as well as the SLOE's ranking of competitiveness.26 In summary, there is minimal study regarding relation to other variables, making it hard to draw conclusions in either direction. While the results from the 4 studies are mixed, the 2 most recent studies are trending in the correct direction for validity.
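For readers less familiar with the statistic reported above, Spearman's correlation is the Pearson correlation computed on ranks rather than raw values. The sketch below is illustrative only: the function names and the numeric coding of SLOE categories are our own assumptions, and the data are invented rather than drawn from any cited study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y (undefined if either is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return covariance / (var_x * var_y) ** 0.5


# Hypothetical applicants: SLOE global assessment coded 1 = top 10%, 2 = top third,
# 3 = middle third, 4 = bottom third; rank_position is the place on a program's list.
global_assessment = [1, 2, 2, 3, 3, 3, 4, 4]
rank_position = [2, 1, 5, 3, 8, 4, 6, 7]
print(f"Spearman's rho = {spearman_rho(global_assessment, rank_position):.3f}")
```

Because the SLOE has only 4 categories, ties are unavoidable, which is why the sketch assigns tied values their average rank.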
Consequences
Two articles have been published regarding the consequences of the SLOE.3,4 Both are surveys of EM program directors that found the SLOE to be the most important piece of data when choosing whom to interview and, subsequently, rank.3,4 These studies provide evidence that the consequences of the SLOE are high; however, no studies have examined how the high-stakes nature of the SLOE may affect letter writers or students' behavior during a clerkship. While we can predict with some degree of certainty that the consequences of the SLOE are very high, studies are necessary to uncover their exact relation to the validity of the SLOE. Currently, it is not possible to conclude how the high consequences of the SLOE affect its validity.
See the Table for a summary of the evidence for validity of the EM SLOE.
Overall, we found that the evidence for validity for the EM SLOE is lacking. While the SLOE has good evidence for content validity owing to its creation process, there is not strong evidence for any other aspect of validity.
We believe the development process for the SLOE provides evidence for content validity. CORD initially convened a task force in 1995 to create the SLOE after concerns that usual NLORs were not adequate.1 The task force comprised a representative sample of CORD membership, consisting of program directors, assistant program directors, and clerkship directors. In 1999, Keim et al described the initial creation process and how the task force determined what to include on the form.1 In 1996 and 1999, the SLOR was edited by the task force based on unpublished surveys that had been distributed to program directors throughout the country.1 The task force was reconvened in 2011 to update and improve the SLOE. Changes were made after 2 published studies and one unpublished survey, which included a change to the name from the Standardized Letter of Recommendation to the Standardized Letter of Evaluation.3,16 Additional categories were added to the “Qualifications for EM” section, including teamwork, ability to communicate a caring nature to patients, how much guidance an applicant would need in residency, and predicted success in residency. Further, CORD has shown that it can adapt quickly when necessary; the task force reconvened in 2020 to address SLOE issues related to changes due to the COVID-19 pandemic. This process provides continuing evidence for content validity, as the content of the SLOE changes to reflect the changing informational needs of program directors. We therefore conclude that the content of the SLOE represents what the SLOE is intended to assess and that there is evidence for content validity.
Response process has been the most studied, and the evidence overall currently argues against validity. Studies on the dermatology SLOR, otolaryngology SLOR, and orthopedic SLOR have all demonstrated similar rank inflation.27–29 The overall theme emerging from the literature is that better rater training will improve adherence to ranking distribution; however, there may not be evidence to support this claim. Multiple studies do show that rater training can improve the quality of assessment reports and improve the ability of faculty to assess residents.30,31 Nevertheless, other studies show that rater training has no effect, even on standardized clinical examinations.32,33 On the EM SLOE, adherence to the rating system has improved over the years, and the authors of the most recent study suggest that rater training is the reason for the improvement.15 While an increased focus on rater training may have improved adherence to the rankings on the EM SLOE, the questionable effect of rater training in general and the number of years the EM SLOE has existed lead us to believe that rater training is unlikely to yield further improvement to the SLOE's response process.
Concern about the consequences of the SLOE may limit adherence to the ranking scale despite any additional rater training. A survey presented at the 2016 CORD Academic Assembly shows that 40% of EM program directors do not match students ranked in the lower third.34 Further, current instructions on the electronic SLOE (eSLOE) state that when choosing a comparative ranking, writers should consider only “candidates you have recommended in the last academic year” (see online supplementary data). If an institution writes a small number of SLOEs, an otherwise competitive student may receive an unfavorable designation. For example, an outstanding student who is slightly outperformed by a handful of others should technically be rated as “lower third” even though the writer knows the performance was outstanding. Based on the above survey data, the current SLOE asks writers to choose between adhering to the ranking scale and potentially consigning outstanding students to a lower likelihood of matching. Therefore, the consequences of a “lower third” ranking may dampen any positive effect that rater training may have on ranking scale adherence.
Thus, rather than continuing to study whether or not there is strict adherence to the ranking system or pushing for further rater training, we submit that a reconsideration of the current ranking system and instructions is necessary. Rather than using norm-based percentiles that create difficulties in compliance, criterion-based descriptors may help writers faithfully assign students to a category. The current norm-based ranking system uses strict percentile cutoffs, meaning absolute adherence could cause 2 students of almost identical ability to be placed into different rankings. Proper norm-based ranking would use standard deviation from the mean,35 which is not feasible for the EM SLOE, as it requires precise numerical scores, such as with multiple-choice tests. Criterion-based rankings with descriptions would not eliminate ranking inflation, but writers may have an easier time placing students into categories that contain a description of the typical student in that category (eg, “independently creates treatment plans that do not require modification”). This would add more meaningful contextualization of the applicant for residency programs as well as create a more equitable evaluation system for students.
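The cutoff problem described above can be illustrated with a small worked example. This is a sketch under stated assumptions: the numeric scores are invented (SLOE writers do not assign such scores, which is the feasibility problem noted above), the function names are our own, and the category boundaries are one plausible reading of strict percentile cutoffs.

```python
from statistics import mean, pstdev


def percentile_category(scores, index):
    """Strict rank-based category for scores[index] (assumes no tied scores)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    position = order.index(index) + 1  # 1 = best performer in the cohort
    # Integer comparisons avoid floating-point surprises at the boundaries.
    if position * 10 <= n:
        return "top 10%"
    if position * 3 <= n:
        return "top third"
    if position * 3 <= 2 * n:
        return "middle third"
    return "bottom third"


def z_score(scores, index):
    """Distance of scores[index] from the cohort mean, in standard deviations."""
    return (scores[index] - mean(scores)) / pstdev(scores)


# Hypothetical cohort of 12 rotators with invented clerkship scores.
scores = [95, 93, 92, 91.0, 90.9, 88, 87, 85, 84, 80, 78, 75]

for i in (3, 4):
    print(scores[i], percentile_category(scores, i), round(z_score(scores, i), 2))
```

In this invented cohort, the students scoring 91.0 and 90.9 fall on opposite sides of the top-third cutoff even though their standardized scores are nearly identical. A criterion-based descriptor would place both students in whichever category matched their observed performance.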
Switching from a norm-referenced to a criterion-based system may also help to combat bias on the SLOE. A study of language use in narrative assessments found that female and underrepresented in medicine (UiM) medical students had significantly more personality attributes described, compared to competency-based language used for male and non-UiM students.36 Changing to a criterion-based system grounded with competency descriptors will force writers to consider the chosen competencies when assessing students rather than relying on personality attributes and may therefore decrease implicit bias in ranking. This would need to be further studied but would present an opportunity to examine a potential method to systematically reduce bias in medical assessments.
Whether or not the evaluation system changes, bias on the SLOE requires further study. Gender bias has been examined by multiple studies, with mixed results, trending toward favoring female applicants. However, racial bias in SLOE rankings has not been examined. Studies in other domains, including induction into the Alpha-Omega-Alpha (AOA) honor society, MSPE letters, and clerkship grades have all shown evidence of racial bias that negatively affects UiM groups.37–39 Due to the documented existence of bias and the outsized importance the SLOE has on residency applications, future studies must assess what effect race has on the SLOE rankings.
Further complicating the response process is the lack of interrater reliability. While there will always be a degree of variability in workplace-based assessment, the large differences in each institution's clerkship make a standardized comparison between them difficult. While there is a published national curriculum for EM clerkships,40 significant differences between clerkships remain.41 Importantly, differences include how assessments are performed, with variations in whether residents are allowed to assess students; if a written test is used for assessment and, if so, which one; and whether direct observation is a requirement of assessment.41 Key clerkship differences are further illustrated by the wide variability of grading practices, in which some clerkships are pass/fail, some give grades but not honors, and some use a range of 3- to 5-point scales.18 These factors make creating a “standardized” letter to compare students across the country very difficult, if not impossible. To address this, stakeholders need to push for further standardization among clerkship curricula. Additionally, consensus on how assessments are performed and by whom should be published. Finally, using a standardized shift assessment, so that SLOEs are based on the same inputs across clerkships, will create a more reliable assessment. The National Clinical Assessment Tool created by a consensus conference at CORD is a potential tool that could be widely adopted to assist with this process.42 This tool will need further evidence for validity prior to its widespread use. Leaders in EM education need to push for study of this tool and, if it demonstrates evidence for validity, for its adoption. They should also push for the inclusion of an item on the SLOE indicating whether the tool was used during the clerkship, so that application reviewers can make their own assessment about validity.
Next, relation to other variables for the EM SLOE remains understudied. Without larger, more robust studies in this domain, it is difficult to know whether the SLOE is actually predictive of future success in residency and therefore serving its original purpose. Our results demonstrate that the focus of study on the EM SLOE has been weighted heavily toward the inputs, despite the predictive value perhaps being even more important. The new eSLOE format creates a large database to perform multi-institutional studies comparing it to other variables; performing these studies will be a necessary step to provide further evidence for validity for the EM SLOE.
Taking steps to improve and study the EM SLOE will become even more important, both to EM and to all specialties using or considering a standardized letter, after the recent decision by the Federation of State Medical Boards and National Board of Medical Examiners to report USMLE Step 1 as pass/fail.43 Previous surveys have shown that Step 1 was either the third most important factor or a factor of “middle importance” to interviewing and ranking for matching.3,4 It would be reasonable to predict that by removing another objective variable, the SLOE will gain even more importance to program directors and future residents. This could have even more significant effects in other specialties currently using or considering adopting the SLOE, as each specialty values the USMLE Step 1 score differently. If the SLOE continues to be utilized by program directors as the most important factor in medical students' applications, further improvement to make it the best tool possible is required.
There are limitations to our study's findings. First, during our data collection process we did not include poster presentations and abstracts, meaning there could be further evidence for validity for the EM SLOE that was not discovered. Second, many studies examining the same aspect of the SLOE have differing results, which can make consistent conclusions on these aspects of validity difficult. Third, the nature of this review is inherently subjective regarding each individual study examined. Despite this limitation, applying Messick's framework for validity evidence to the whole should add reliability to our results.
Other specialties should take note of the current challenges facing the EM SLOE and edit or create their own standardized letters accordingly. First, stakeholders should consider the drawbacks of using norm-based percentile rankings and consider using criterion-based descriptive categories. Next, evaluators must be aware of the implicit and systemic biases that exist within assessments and work to address them in any standardized letter. Additionally, specialties need to examine current clerkship differences and advocate for the standardization of the clerkship experience, particularly the assessment portion. Finally, specialties should perform early study on the relation to other variables to provide further evidence for validity for their standardized letters.
There is little evidence for validity for the EM SLOE regarding response process, internal structure, or relation to other variables.
Editor's Note: The online version of this article contains an example of the standardized letter of evaluation.
Funding: The authors report no external funding source for this study.
Conflict of interest: The authors declare they have no competing interests.
Findings from this study were previously presented as an abstract at the Council of Emergency Medicine Program Directors Academic Assembly, New York, NY, March 8–11, 2020.