Imagine the sequence of events: You have designed a 6-station simulation-based objective structured clinical examination (OSCE) to assess the resuscitation skills of your trainees. At each station, trainees are assessed on professionalism, communication, leadership, and technical skills relevant to the scenario. These individual scores are averaged to create a single score for each station. Cognizant that a trainee's gender may play a role in how raters assess performance, you wish to examine your OSCE for reliability and sources of variance, to ensure that gender effects are not driving competency decisions. How can you gather validity evidence to support your decisions?
Later, you read an article describing a new OSCE for assessing resident resuscitation skills. Eureka! This is exactly what you need for your program. However, the authors used Generalizability theory to examine validity evidence for the OSCE. What exactly is G-theory, and is it a credible approach?
If you can visualize yourself in the above situations and need an introduction or refresher in G-theory, read on. This article explains the basics and provides examples for further study.
Measuring Clinical Competence
In the age of competency-based medical education (CBME), we increasingly make decisions based on our assessment processes, which necessitates ensuring the reliability and validity of our assessments.1–3 Can we rely on the score to discriminate between trainees based on competence? Can we trust the process? When we create new assessments, or modify older ones, to measure relevant constructs (ie, specific aspects of clinical competence), we must determine whether our assessment data maintain validity for decision-making. Any assessment can be considered a measurement tool; thus, we can apply measurement principles to examine validity.3 Validity is not a stable characteristic of any measurement tool; it can be threatened by a multitude of construct-irrelevant factors that potentially introduce measurement error.3 Reliability, similarly, is not a stable characteristic of any measurement tool, and it is sensitive to changes in context within competency-based assessments.
If you search the literature for methods to reduce measurement error, you will find a dizzying multitude of study designs and analysis approaches. Of these, readers are likely most familiar with Cronbach's alpha for examining measurement reliability. Calculating Cronbach's alpha can tell us about test score reliability, but not whether systematic rater bias has influenced scores. That is, if trainee gender influenced a rater's assessment of performance, we could not discover this from Cronbach's alpha alone.3
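As a concrete illustration of what Cronbach's alpha does (and does not) capture, the sketch below computes alpha for a hypothetical trainees-by-stations score matrix. The function, layout, and data are invented for this example; note that nothing in the output would flag a systematically biased rater.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a trainees x stations score matrix."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)       # variance of each station's scores
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of trainees' total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical scores: 8 trainees x 6 stations, with a shared "ability" component
rng = np.random.default_rng(0)
ability = rng.normal(0, 1.0, size=(8, 1))                # trainee ability
scores = 7 + ability + rng.normal(0, 1.0, size=(8, 6))   # station scores with noise
print(round(cronbach_alpha(scores), 2))                  # internal consistency only
```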
A highly useful theory that informs reliability, validity, elements of study design, and data analysis is Generalizability theory (G-theory).3–6 G-theory is a statistical framework for examining, determining, and designing the reliability of various observations or ratings.3–6
Using G-theory we can design Generalizability studies (G-studies) to better understand the composition of assessment scores (ie, what contributes to the actual score that you get at the end of an OSCE). We can then design Decision studies (D-studies) to help predict the reliability of the same data collected under different conditions.
In performance-based assessments we need to consider potential influences on assessment scores, such as rater bias, the relative difficulty of items or stations, the rater's or examinee's attention or mood, the abilities of standardized patients, and the overall environment.3,7 G-theory offers a way to quantify the variance contributed by these factors, which G-theory refers to as facets.3,5,6 Each form of a given facet is called a condition.3 In our example vignette, trainee is one facet (the object of measurement) and gender is another facet with 2 conditions (for this example, we assume only 2 genders). Let's use the above example to review important terminology and concepts, which are supplemented by definitions in table 1.
Example: Application of G-Theory to Assessment
A standard OSCE design is ideal for a G-study because there are repeated measurements of the same construct (ie, clinical skills), much like using a measuring tape to measure a piece of lumber several times before making a cut (or better yet, taking the average of measurements from 5 people using different measuring tapes). Collecting repeated measures of the same construct improves reliability, because random variance tends to cancel out across multiple measurements of the same construct. However, systematic sources of measurement error may remain. For example, the rater cognition literature is filled with examples of different kinds of rater bias and their impact on assessment scores and decisions.8
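A brief simulation can make this distinction concrete. The numbers below are invented; the point is simply that averaging repeated measurements shrinks random error but leaves a systematic bias (eg, a stretched measuring tape, or a consistently harsh rater) untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
true_length = 245.0  # the "true" length of the lumber, in mm (hypothetical)

# One noisy measurement vs the mean of 5 independent noisy measurements
single = rng.normal(true_length, 2.0, size=10_000)
averaged = rng.normal(true_length, 2.0, size=(10_000, 5)).mean(axis=1)

# A systematically biased tape: averaging does not remove the +3 mm offset
biased = rng.normal(true_length + 3.0, 2.0, size=(10_000, 5)).mean(axis=1)

print(single.std().round(2))                   # ~2.0  (random error of one measurement)
print(averaged.std().round(2))                 # ~0.9  (random error shrinks with averaging)
print((biased.mean() - true_length).round(2))  # ~3.0  (systematic error persists)
```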
In experimental designs the focus is on minimizing error of all types to find true differences between groups.3,9 In G-studies the goal is to identify and quantify sources of error (variance) in order to determine whether we can trust the critical measurement.3 In our vignette, all 6 station scores are combined or averaged to determine the total score for the entire OSCE.10 The use of multiple stations and an average OSCE score is one way to deal with measurement error introduced by raters.8 However, each station score contributes variance toward the total score: each station is more or less difficult, each rater is more or less lenient, and each has a unique influence on the trainee's score. Rather than ignore these influences, we can use G-theory to quantify them. The figure demonstrates this concept. G-theory allows us to develop a recipe. Variance is like the distribution of ingredients in a single slice of pie; sometimes the variance works in a trainee's favor, and sometimes it works against the trainee. Sometimes the variance acts in predictable ways (systematic error) and other times in unpredictable ways (random error).3 It is essential to consider these aspects of variance in order to fully evaluate reliability, which ultimately sets an upper limit on the validity (accuracy) of the assessment.3
Figure. Analogy (a Pie Recipe) for Determining Score Composition for a Total OSCE
In our vignette, we could use G-theory to consider the variance among scores of professionalism, communication, leadership, and technical skills. However, the real power of G-theory is that we can also consider additional sources of variance, such as the gender of the trainee. We could, if we were interested, add in sources of variance like the time of day, different standardized patients, or number of OSCE stations. Table 2 shows how we can keep track of various facets in our vignette to evaluate them for different purposes.
To proceed with our G-study, we first identify all likely sources of variance, or facets, in the resuscitation OSCE and determine whether each is fixed or random. To ensure we have all the pieces to conduct a G-study, data are best organized as in table 2. As a reminder, we are interested in (1) determining whether we collected reliable assessment data in the resuscitation OSCE, and (2) determining whether gender, a factor that should be irrelevant to the evaluation of a clinical skill, contributes any variance to the overall assessment.
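Table 2 tracks the facets themselves; in practice, the raw scores also need a tidy layout before any variance components can be estimated. The sketch below shows one plausible long-format arrangement (one row per trainee-station observation). The column names and values are hypothetical, not data from the vignette.

```python
import pandas as pd

# Hypothetical long-format layout: one row per trainee x station observation.
# "score" is the station score (the average of the 4 domain ratings).
records = [
    ("T01", "F", "S1", "R1", 7.5),
    ("T01", "F", "S2", "R2", 6.8),
    ("T02", "M", "S1", "R1", 8.1),
    ("T02", "M", "S2", "R2", 7.9),
]
osce = pd.DataFrame(records, columns=["trainee", "gender", "station", "rater", "score"])
print(osce)
```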
Generalizability Theory
“Any one measurement from an individual is viewed as a sample from a universe of possible measurements.”11
In G-theory we first define the universe of scores and facets we wish to generalize from and to. In a G-study, the facets being considered are designated in advance as fixed or random. We then conduct one or more G-studies to calculate G-coefficients; each G-coefficient evaluates the reliability of a given aspect of the measurement tool, for example, interrater reliability. In D-studies we can evaluate the impact of changing a facet's designation, such as from fixed to random, and use these calculations to make predictions about performance in a similar assessment situation. For example, we can ask how the G-coefficient would be affected by reducing the number of OSCE stations, or whether adding raters at each station would increase the G-coefficient. Typically, as shown in table 3, adding stations improves reliability, whereas adding raters at each station has little impact.
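To illustrate how a D-study projection works, the sketch below computes a relative G-coefficient for a fully crossed trainee x station x rater design under different numbers of stations and raters. The variance components are invented for illustration; real values would come from the G-study itself (as summarized in table 3 for the vignette).

```python
def g_coefficient(var_p, var_ps, var_pr, var_res, n_stations, n_raters):
    """Relative (norm-referenced) G coefficient for a fully crossed
    trainee x station x rater random design."""
    rel_error = (var_ps / n_stations                    # trainee x station interaction
                 + var_pr / n_raters                    # trainee x rater interaction
                 + var_res / (n_stations * n_raters))   # residual
    return var_p / (var_p + rel_error)

# Hypothetical variance components (illustrative only)
vc = dict(var_p=0.50, var_ps=0.30, var_pr=0.05, var_res=0.40)

for n_s in (4, 6, 10):        # candidate numbers of stations
    for n_r in (1, 2):        # candidate raters per station
        g = g_coefficient(**vc, n_stations=n_s, n_raters=n_r)
        print(f"{n_s} stations, {n_r} rater(s): G = {g:.2f}")
```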
When considering the reliability of a measurement tool, we can start with a basic formula to describe how different sources of variance or error relate.3,6
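The formula is not reproduced here; a standard form consistent with the surrounding discussion (an assumption on our part) is the classical decomposition of observed-score variance into true-score and error variance, with reliability expressed as the proportion of variance attributable to the true score:

```latex
% Assumed reconstruction of the "basic formula": observed-score variance is the sum
% of true-score (universe-score) variance and error variance, and reliability is the
% proportion of observed variance attributable to the true score.
\[
\sigma^{2}_{\text{observed}} = \sigma^{2}_{\text{true}} + \sigma^{2}_{\text{error}},
\qquad
\text{Reliability} = \frac{\sigma^{2}_{\text{true}}}{\sigma^{2}_{\text{true}} + \sigma^{2}_{\text{error}}}
\]
```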
Using the pie analogy, consider that the error variance (the recipe) is composed of multiple components (individual ingredients). The variance may be altered by various error-inducing facets (eg, the precision of the measuring cup, the purity of the water, the altitude of the bakery), each of which introduces some element of error into the final composition of the resulting pie.
In our vignette, assuming that the resuscitation OSCE is conducted on a single occasion, the facets would be the trainees, the gender of the trainees, and the stations (which include raters). The facet of trainee is nested in the facet of gender. Stations and raters (nested in stations) are facets of generalization, as we hope to generalize from a score observed at one station, or recorded by one rater, to a score from a different station or rater. Trainees are the facet of differentiation, as we wish to differentiate between individual trainees based on their skill level as measured by the OSCE. These facets describe the known universe of scores in this study. If the OSCE stations or scenarios will never change in future administrations, we can consider the station facet to be fixed. Similarly, if you foresee the same clinical faculty acting as raters for this OSCE every time, the rater facet may also be considered fixed. Note that in high-stakes OSCEs both of these facets are random, as stations change for test security reasons; in program assessments, faculty also may change, as they are typically volunteers rather than dedicated assessment staff. Whether a facet is fixed or random changes which variance components are included in the calculation of the G-coefficient (see table 4).3–5
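As a sketch of how the fixed-versus-random decision changes the calculation, consider a design in which the station facet is treated as fixed (the same 6 scenarios every time) while raters remain random. Under the standard mixed-design formula, the trainee-by-station component then counts toward universe-score variance rather than error. The variance components below are illustrative assumptions, not results from the vignette.

```python
def g_coefficient_stations_fixed(var_p, var_ps, var_pr, var_res, n_stations, n_raters):
    """Relative G coefficient for a trainee x station x rater design in which
    stations are fixed and raters are random: the trainee x station component
    is added to universe-score variance instead of error."""
    universe = var_p + var_ps / n_stations
    rel_error = var_pr / n_raters + var_res / (n_stations * n_raters)
    return universe / (universe + rel_error)

# Same illustrative components as before; fixing the station facet raises G here
vc = dict(var_p=0.50, var_ps=0.30, var_pr=0.05, var_res=0.40)
print(round(g_coefficient_stations_fixed(**vc, n_stations=6, n_raters=1), 2))
```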
For any administration of the resuscitation OSCE, the average OSCE score acts as the universe score against which all individual scores are evaluated. G-studies have the same starting point as an analysis of variance (ANOVA) for determining standard deviations, mean square errors, and variances.3–5 We start with the factors of a standard ANOVA but then continue on to estimate the variance components associated with each facet. The difference from an ANOVA analysis is that in G-theory we are less concerned with establishing a significant difference between groups and more concerned with determining how the variance is distributed among the various facets. The goal is to extend classical reliability coefficients to describe how much variance is due to the object of measurement, in this case the trainees. Ideally, the greatest source of variance is the trainees themselves, which would indicate individual differences in ability. Large amounts of variance attributed to raters or to other facets such as gender are undesirable, as these factors should not influence decisions about clinical competence.
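The sketch below shows this ANOVA-to-variance-components step for the simplest crossed design (trainees by stations, one score per cell) and then forms the relative G-coefficient from the estimated components. The simulated data and the reduced design are assumptions for illustration; the vignette's full design would also include gender and raters.

```python
import numpy as np

def variance_components(scores: np.ndarray):
    """Estimate variance components for a crossed trainee x station design
    (one score per cell) from two-way ANOVA mean squares."""
    n_p, n_s = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)                     # trainee means
    s_means = scores.mean(axis=0)                     # station means

    ms_p = n_s * ((p_means - grand) ** 2).sum() / (n_p - 1)
    ms_s = n_p * ((s_means - grand) ** 2).sum() / (n_s - 1)
    resid = scores - p_means[:, None] - s_means[None, :] + grand
    ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_s - 1))

    var_p = max((ms_p - ms_res) / n_s, 0.0)   # trainee (object of measurement)
    var_s = max((ms_s - ms_res) / n_p, 0.0)   # station difficulty
    var_res = ms_res                          # trainee x station + residual error
    return var_p, var_s, var_res

# Simulated scores with known ingredients: trainee ability, station difficulty, noise
rng = np.random.default_rng(2)
n_p, n_s = 20, 6
scores = (7
          + rng.normal(0, 0.7, size=(n_p, 1))     # trainee ability (SD 0.7)
          + rng.normal(0, 0.3, size=(1, n_s))     # station difficulty (SD 0.3)
          + rng.normal(0, 0.5, size=(n_p, n_s)))  # residual noise (SD 0.5)

var_p, var_s, var_res = variance_components(scores)
g_rel = var_p / (var_p + var_res / n_s)  # relative G coefficient for 6 stations
print(f"trainee={var_p:.2f} station={var_s:.2f} residual={var_res:.2f} G={g_rel:.2f}")
```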
Ideally, G-studies would assess all sources of variance, or error. The limitation of any evaluation of an assessment is the inability to estimate contributions from unknown sources of variance. However, by shining a light on as many known important variables as possible, we can begin to understand what may be missing.
Next time you read an article that includes a G-study, remember that this strategy will help determine whether the largest source of variance was the subjects being tested (which we would expect in an assessment designed to differentiate trainee competence) or other factors, such as the person rating the trainee, the time of day, or the number of test situations. For reference, we list 3 articles that use G-theory to examine measurement error (box). When creating your own assessment programs, consider using G-theory to understand the role of sources of variance, not only to enhance the reliability of your own measurements, but also to add value when disseminating your work to others. We look forward to your questions and comments about G-theory and reliability studies.
Box. Articles Using Generalizability Study Design to Examine Test Properties
Lang VJ, Berman NB, Bronander K, Harrell H, Hingle S, Holthouser A, et al. Validity evidence for a brief online key features examination in the internal medicine clerkship. Acad Med. 2019;94(2):259–266. doi:10.1097/ACM.0000000000002506.
Monteiro S, Sibbald D, Coetzee K. i-Assess: evaluating the impact of electronic data capture for OSCE. Perspect Med Educ. 2018;7(2):110–119. doi:10.1007/s40037-018-0410-4.
Lord JA, Zuege DJ, Mackay P, Roze des Ordons A, Jocelyn L. Picking the right tool for the job: a reliability study of 4 assessment tools for central venous catheter insertion. J Grad Med Educ. 2019;11(4):422–429.