Past decisions about teaching often were based on the “PHOG” approach: “prejudices, hunches, opinions, and guesses.”1 In the last decade, major advancements have occurred in the development and understanding of new evidence to guide medical education decisions. The formation of the Best Evidence in Medical Education (BEME) international working groups is an example of this new approach.2 BEME work groups systematically search for studies to answer key questions, with a rigorous approach to evaluating the quality of evidence. Other groups have examined the quality of methods and of reporting in English-language education research studies.3–10 Despite these developments, many decisions in medical education are still based on “persuasion and politics.”11
One of the primary goals of the Journal of Graduate Medical Education (JGME) is to improve the quality of graduate medical education research. Systematic reviews of education research have identified areas of concern.3–6,10 One of our strategies will be to improve readers' understanding of these areas. In each issue, the Journal plans to provide a summary about one aspect of research quality. For this issue, the subject is reliability and validity of assessment instruments used for research outcomes (pp 119–120). This topic is particularly relevant to our readers, as validity and reliability evidence for assessments is routinely underreported in manuscripts submitted to JGME.
In this editorial, we introduce areas of concern in the quantitative methodologies delineated in systematic reviews of English-language publications and instruments available to examine the quality of education research. These instruments include the Medical Education Research Study Quality Instrument (MERSQI),5,7 the BEME global scale,3 and the Newcastle–Ottawa Scale (NOS) for assessing quality of nonrandomized studies.12 These instruments are based in part on Kirkpatrick's hierarchy of educational outcomes,3,13 which provides a valuable conceptual framework for planning and evaluating educational initiatives. Standards are also available to assess the quality of methods reporting,14,15 but they will not be discussed here. Similarly, other topics, such as quality of research questions and overinterpretation of results, will not be addressed in this editorial.16
The NOS was developed to rate the quality of nonrandomized studies included in systematic reviews and has data to support its validity.12 Although the NOS was created for clinical research, it has been modified and used in systematic reviews of educational research.6,10 Examining one's own research for the presence or absence of specific items may be instructive (table 1).
In 2 studies, the modified NOS was highly correlated with the MERSQI (table 2) and the BEME global rating scale (table 3).7,10 Of the 3 scales, the MERSQI may be most useful for researchers wishing to examine their work for methodologic rigor, as it includes a comprehensive list of review items and also has a growing body of validity evidence.5,7 Less evidence is available for the BEME global rating scale, which includes a modified version of Kirkpatrick's hierarchy17 of the outcomes of educational interventions. Kirkpatrick's hierarchy of levels is also included in the MERSQI scale, with higher points assigned to higher levels of outcomes. Kirkpatrick's hierarchy, also termed Kirkpatrick's pyramid (figure), is employed widely by education experts to characterize the level of outcomes in an educational intervention. Authors could enhance the quality of their papers by including a discussion of their work in relation to the BEME global or Kirkpatrick frameworks. To date, these discussions rarely occur in JGME submissions.
The levels of Kirkpatrick's outcomes include (1) participation rates or learner satisfaction; (2) changes in attitudes (2a) or in knowledge and skills (2b); (3) changes in behaviors; and (4) changes to the care system (4a) or patient outcomes (4b). For example, in a study comparing an interactive web-based program with readings and lectures for teaching residents techniques for smoking cessation, potential outcomes could be classified as follows:
Level 1: The percentage of residents completing each intervention; resident satisfaction with the interventions
Level 2a: Residents' attitudes about smoking cessation counseling
Level 2b: Residents' performance of smoking cessation counseling with standardized patients
Level 3: Documentation of smoking cessation counseling in clinic charts; clinic patients' reports of smoking cessation counseling
Level 4b: Number of patients who quit smoking in residents' clinics.
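The classification above can also be written as a small lookup, which may help authors check where each of their planned outcomes sits in the hierarchy. The following is a minimal sketch, not part of any published instrument: the names `KIRKPATRICK_LEVELS`, `Outcome`, and `highest_level` are hypothetical, the sublevel labels follow the example above, and the label "4a" for care-system changes is an assumption inferred from the "4b" label used for patient outcomes.

```python
from dataclasses import dataclass

# Level labels follow the smoking cessation example in the text.
# "4a" is an assumed label for care-system changes (the text names only "4b").
# Dictionary order runs from the lowest level to the highest.
KIRKPATRICK_LEVELS = {
    "1": "Participation rates or learner satisfaction",
    "2a": "Changes in attitudes",
    "2b": "Changes in knowledge and skills",
    "3": "Changes in behaviors",
    "4a": "Changes to the care system",
    "4b": "Changes in patient outcomes",
}
LEVEL_ORDER = list(KIRKPATRICK_LEVELS)


@dataclass(frozen=True)
class Outcome:
    """One study outcome tagged with a Kirkpatrick level (hypothetical helper)."""
    description: str
    level: str

    def __post_init__(self):
        if self.level not in KIRKPATRICK_LEVELS:
            raise ValueError(f"Unknown Kirkpatrick level: {self.level!r}")


def highest_level(outcomes):
    """Return the highest Kirkpatrick level label among the given outcomes."""
    return max((o.level for o in outcomes), key=LEVEL_ORDER.index)


# The outcomes from the smoking cessation example above.
SMOKING_STUDY = [
    Outcome("Percentage of residents completing each intervention", "1"),
    Outcome("Resident satisfaction with the interventions", "1"),
    Outcome("Residents' attitudes about smoking cessation counseling", "2a"),
    Outcome("Counseling performance with standardized patients", "2b"),
    Outcome("Documentation of counseling in clinic charts", "3"),
    Outcome("Number of patients who quit smoking", "4b"),
]

if __name__ == "__main__":
    top = highest_level(SMOKING_STUDY)
    print(f"Highest level studied: {top} ({KIRKPATRICK_LEVELS[top]})")
```

A study that measured only the first three outcomes in this list would reach level 2a at most, which is the pattern the systematic reviews describe.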
In systematic reviews of education research, the majority of studies reported outcomes at Kirkpatrick levels 1 and 2.5,7 Although outcomes at these levels are undoubtedly easier to study, achieving them may not translate into effective, sustained changes in behaviors or improved patient outcomes. In general, reported outcomes were more often subjective than objective. Of greater concern, some studies reported no outcomes at all: 19% in one 2008 review.7 On average, the data analysis portion of reviewed papers received the highest quality ratings, while validation of assessment instruments received the lowest quality ratings.5,7,9,10
Other areas of methodologic concern found in literature reviews include (1) predominance of single-site studies; (2) small studies that are underpowered to find a difference between intervention and comparison groups; (3) lack of a comparison group or lack of description of the intervention for the comparison group (eg, description of usual teaching); (4) inadequate description of multifactorial interventions; (5) overconfidence in randomization to eliminate the influence of confounding variables (ie, bias); and (6) overuse of the single-group pretest/posttest strategy to assess differences, with resulting potential overestimates of the magnitude of the effect of the intervention.16,18 In future issues of JGME, we will examine some of these concerns in greater detail.
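To make the concern about underpowered studies concrete, the sketch below estimates how many learners per group a two-group comparison would need. It is an illustration under stated assumptions, not a method drawn from the reviews cited here: it uses the standard normal approximation for comparing two means with equal group sizes and a two-sided test, the function name `n_per_group` is hypothetical, and the effect size is a standardized mean difference supplied by the user.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-group comparison of means.

    Assumes equal group sizes, a two-sided test, and the normal approximation:
    n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2, rounded up,
    where d is the standardized mean difference.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Illustrative standardized differences only; they are not taken from any study.
    for d in (0.2, 0.5, 0.8):
        print(f"standardized difference {d}: about {n_per_group(d)} per group")
```

Under these assumptions, a moderate standardized difference of 0.5 calls for roughly 63 participants per group, more than many single-program studies can enroll. An exact calculation based on the t distribution gives slightly larger numbers.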
JGME editors suggest that authors consider evaluating their planned and ongoing work with the instruments described above (the MERSQI, NOS, and BEME global scale) and with other quality scales developed for specific interventions, such as online teaching modules.19 In addition, authors should consider Kirkpatrick's hierarchy when formulating studies and selecting outcome measures. These additional steps in reflection may produce a study and eventual manuscript that require fewer revision cycles and are ultimately of greater value to consumers of medical education research.
Gail M. Sullivan, MD, MPH, is Editor-in-Chief, Journal of Graduate Medical Education.