Past decisions about teaching often were based on the “PHOG” approach: “prejudices, hunches, opinions, and guesses.”1 In the last decade, major advancements have occurred in the development and understanding of new evidence to guide medical education decisions. The formation of the Best Evidence in Medical Education (BEME) international working groups is an example of this new approach.2 BEME work groups systematically search for studies to answer key questions, with a rigorous approach to evaluating the quality of evidence. Other groups have examined the quality of methods and of reporting in English-language education research studies.3–10 Despite these developments, many decisions in medical education are still based on “persuasion and politics.”11 

One of the primary goals of the Journal of Graduate Medical Education (JGME) is to improve the quality of graduate medical education research. Systematic reviews of education research have identified areas of concern.3–6,10 One of our strategies will be to improve readers' understanding of these areas. In each issue, the Journal plans to provide a summary about one aspect of research quality. For this issue, the subject is reliability and validity of assessment instruments used for research outcomes (pp 119–120). This topic is particularly relevant to our readers, as validity and reliability evidence for assessments is routinely underreported in manuscripts submitted to JGME.

In this editorial, we introduce areas of concern in the quantitative methodologies delineated in systematic reviews of English-language publications and instruments available to examine the quality of education research. These instruments include the Medical Education Research Study Quality Instrument (MERSQI),5,7 the BEME global scale,3 and the Newcastle–Ottawa Scale (NOS) for assessing quality of nonrandomized studies.12 These instruments are based in part on Kirkpatrick's hierarchy of educational outcomes,3,13 which provides a valuable conceptual framework for planning and evaluating educational initiatives. Standards are also available to assess the quality of methods reporting,14,15 but they will not be discussed here. Similarly, other topics, such as quality of research questions and overinterpretation of results, will not be addressed in this editorial.16 

The NOS was developed to rate the quality of nonrandomized studies included in systematic reviews and has data to support its validity.12 Although the NOS was created for clinical research, it has been modified and used in systematic reviews of educational research.6,10 Examining one's own research for the presence or absence of specific items may be instructive (table 1).

TABLE 1

Modified Newcastle–Ottawa Scale12,20


In 2 studies, the modified NOS was highly correlated with the MERSQI (table 2) and the BEME global rating scale (table 3).7,10 Of the 3 scales, the MERSQI may be most useful for researchers wishing to examine their work for methodologic rigor, as it includes a comprehensive list of review items and also has a growing body of validity evidence.5,7 Less evidence is available for the BEME global rating scale, which includes a modified version of Kirkpatrick's hierarchy17 of the outcomes of educational interventions. Kirkpatrick's hierarchy of levels is also included in the MERSQI scale, with higher points assigned to higher levels of outcomes. Kirkpatrick's hierarchy, also termed Kirkpatrick's pyramid (figure), is employed widely by education experts to characterize the level of outcomes in an educational intervention. Authors could enhance the quality of their papers by including a discussion of their work in relation to the BEME global or Kirkpatrick frameworks. To date, these discussions rarely occur in JGME submissions.

FIGURE

Kirkpatrick's Levels of Learning13 


TABLE 2

Medical Education Research Quality Instrument5 for Quantitative Studies

TABLE 3

Best Evidence in Medical Education Global Scale3 


The levels of Kirkpatrick's outcomes include (1) participation rates or learner satisfaction; (2) changes in attitudes, knowledge, and skills; (3) changes in behaviors; and (4) changes to the care system or patient outcomes. For example, in a study comparing an interactive web-based program with readings and lectures for teaching residents techniques for smoking cessation, potential outcomes could be classified as follows:

  • Level 1: The percentage of residents completing each intervention; resident satisfaction with the interventions

  • Level 2a: Residents' attitudes about smoking cessation counseling

  • Level 2b: Residents' performance of smoking cessation counseling with standardized patients

  • Level 3: Documentation of smoking cessation counseling in clinic charts; clinic patients' reports of smoking cessation counseling

  • Level 4b: Number of patients who quit smoking in residents' clinics.
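As an illustration only, the classification above can be written down as a small data structure, which makes it easy to check the highest outcome level a planned study would reach. This is a minimal sketch: the names `Outcome`, `highest_level`, and `only_low_level` are hypothetical and are not part of any of the instruments discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One study outcome and its Kirkpatrick level label (e.g. "2a")."""
    description: str
    level: str

    @property
    def major_level(self) -> int:
        # "2a" and "2b" both belong to major level 2.
        return int(self.level[0])


# The smoking cessation example from the text, one entry per outcome.
SMOKING_CESSATION_OUTCOMES = [
    Outcome("Percentage of residents completing each intervention", "1"),
    Outcome("Resident satisfaction with the interventions", "1"),
    Outcome("Residents' attitudes about smoking cessation counseling", "2a"),
    Outcome("Residents' counseling performance with standardized patients", "2b"),
    Outcome("Documentation of counseling in clinic charts", "3"),
    Outcome("Clinic patients' reports of counseling", "3"),
    Outcome("Number of patients who quit smoking in residents' clinics", "4b"),
]


def highest_level(outcomes):
    """Return the highest major Kirkpatrick level among the outcomes."""
    return max(o.major_level for o in outcomes)


def only_low_level(outcomes):
    """True when no outcome goes beyond levels 1 and 2."""
    return highest_level(outcomes) <= 2


if __name__ == "__main__":
    print(highest_level(SMOKING_CESSATION_OUTCOMES))
    print(only_low_level(SMOKING_CESSATION_OUTCOMES[:4]))
```

A study that measured only the first four outcomes would be flagged by `only_low_level`, which is one simple way to prompt consideration of behavior or patient-level measures during planning.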

In systematic reviews of education research, the majority of studies reported outcomes at Kirkpatrick levels 1 and 2.5,7 Although undoubtedly easier to study, achievement of outcomes at these levels may not translate into effective, sustained changes in behaviors or improved patient outcomes. In general, reported outcomes were more often subjective than objective. Of greater concern is that outcomes are entirely absent in many studies: 19% in one 2008 review.7 On average, the data analysis portion of reviewed papers received the highest quality ratings, while validation of assessment instruments received the lowest quality ratings.5,7,9,10

Other areas of methodologic concern found in literature reviews include (1) predominance of single-site studies; (2) small studies that are underpowered to find a difference between intervention and comparison groups; (3) lack of a comparison group or lack of description of the intervention for the comparison group (eg, description of usual teaching); (4) inadequate description of multifactorial interventions; (5) overconfidence in randomization to eliminate the influence of confounding variables (ie, bias); and (6) overuse of the single-group pretest/posttest strategy to assess differences, with resulting potential overestimates of the magnitude of the effect of the intervention.16,18 In future issues of JGME, we will examine some of these concerns in greater detail.
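To make the concern about underpowered studies concrete, the sketch below estimates how many participants per group a two-group comparison would need. It is an illustration under stated assumptions, not a method from the reviews cited: it uses the standard normal-approximation formula for comparing two means with equal group sizes and a two-sided test, and the function name `n_per_group`, the standardized effect sizes, and the default alpha and power values are all chosen here for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed in EACH of two equal groups.

    Assumes a two-sided comparison of two means and a standardized
    effect size (difference in means divided by the common SD).
    Uses the normal approximation: n = 2 * (z_alpha + z_beta)^2 / d^2.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = standard_normal.inv_cdf(power)           # quantile for desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    # Hypothetical effect sizes, from small to large.
    for d in (0.2, 0.5, 0.8):
        print(f"effect size {d}: about {n_per_group(d)} per group")
```

Under these assumptions, a moderate standardized effect of 0.5 calls for roughly 63 participants per group, more than many single-program residency studies can enroll. An exact calculation based on the t distribution gives slightly larger numbers.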

JGME editors suggest that authors consider evaluating their planned and ongoing work with the instruments described above (the MERSQI, NOS, and BEME global scale) and with other quality scales developed for specific interventions, such as online teaching modules.19 In addition, authors should consider Kirkpatrick's hierarchy when formulating studies and choosing outcome measures. These additional steps in reflection may produce a study and eventual manuscript that require fewer revision cycles and are ultimately of greater value to consumers of medical education research.

1. Harden RM, Lilley PM. Best evidence medical education: the simple truth. Med Teach. 2000;22(2):117–119.
2. Harden RM, Grant J, Buckley G, Hart IR. BEME guide no. 1: best evidence medical education. Med Teach. 1999;21(6):553–562.
3. Littlewood S, Ypinazar V, Margolis SA, Scherpbier A, Spencer J, Dornan T. Early practical experience and the social responsiveness of clinical education: systematic review. BMJ. 2005;331(7513):387–391.
4. Price EG, Beach MC, Gary TL, et al. A systematic review of the methodological rigor of studies evaluating cultural competence training of health professionals. Acad Med. 2005;80(6):578–586.
5. Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA. 2007;298(9):1002–1009.
6. Cook DA, Levinson AJ, Garside S, Dupras DM, Erwin PJ, Montori VM. Internet-based learning in the health professions: a meta-analysis. JAMA. 2008;300(10):1181–1196.
7. Reed DA, Beckman TJ, Wright SM, Levine RB, Kern DE, Cook DA. Predictive validity evidence for medical education research study quality instrument scores: quality of submissions to JGIM's medical education special issue. J Gen Intern Med. 2008;23(7):903–907.
8. Reed DA, Beckman TJ, Wright SM. An assessment of the methodologic quality of medical education research studies published in The American Journal of Surgery. Am J Surg. 2009;198(3):442–444.
9. Windish DM, Reed DA, Boonyasai RT, Chakraborti C, Bass EB. Methodologic rigor of quality improvement curricula for physician trainees: a systematic review and recommendations for change. Acad Med. 2009;84(12):1677–1692.
10. Cook DA, Levinson AJ, Garside S. Method and reporting quality in health professions education research: a systematic review. Med Educ. 2011;45(3):227–238.
11. Norman G. Research in medical education: three decades of progress. BMJ. 2002;324(7353):1560–1562.
12. Wells GA, Shea B, O'Connell D, et al. The Newcastle–Ottawa Scale (NOS) for assessing the quality of non-randomised studies in meta-analyses. Available at: http://www.ohri.ca/programs/clinical_epidemiology/oxford.asp. Accessed February 22, 2011.
13. Best Evidence in Medical Education. BEME Guide No. 8: A systematic review of faculty development initiatives designed to improve teaching effectiveness in medical education. Steinert Y, Mann K, Naismith L, Chin K (reviewers).
14. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; for the STROBE Initiative. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med. 2007;147(8):573–577.
15. American Educational Research Association. Standards for reporting on empirical social science research in AERA publications.
16. Colliver JA, McGaghie WC. The reputation of medical education research: quasi-experimentation and unresolved threats to validity. Teach Learn Med. 2008;20(2):101–103.
17. Kirkpatrick D. Revisiting Kirkpatrick's four-level model. Training and Development. 1996;50(1):54–59.
18. Cook DA, Beckman TJ. Reflections on experimental research in medical education. Adv Health Sci Educ Theory Pract. 2010;15(3):455–464.
19. Shortt SED, Guillemette J, Duncan AM, Kirby F. Defining quality criteria for online continuing medical education modules using modified nominal group technique. J Contin Educ Health Prof. 2010;30(4):246–250.
20. Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA. 2007;298(9):1002–1009.

Author notes

Gail M. Sullivan, MD, MPH, is Editor-in-Chief, Journal of Graduate Medical Education.