ABSTRACT
The objective structured clinical examination (OSCE) is a commonly used assessment of clinical skill. Ensuring the quality and reliability of OSCEs is a complex and ongoing process. This paper discusses scoring schemas and reviews checklists and global rating scales (GRS) for marking. Also detailed are post-examination quality assurance metrics tailored to smaller cohorts, with an illustrative dataset.
A deidentified OSCE dataset of 24 examinees from a 2021 cohort, drawn from stations that used both a checklist and a GRS, was assessed using the following metrics: cut scores or pass rates, number of failures, R2, intergrade discrimination, and between-group variation. The results were used to inform a set of implementable recommendations to improve future OSCEs.
For most stations, the calculated cut score was higher than the traditional pass mark of 50% (58.9.8%–68.4%). The number of failures was low under both the traditional pass mark and the cut scores (0.00%–16.7%), except for Laboratory analysis, where the number of failures was 50.0%. R2 values ranged from 0.67 to 0.97, but the proportion of total variance was high (67.3–95.9). These data suggest that some teaching concepts may have been missed, that station marking was open to examiner interpretation, and that examiner marking was inconsistent. Recommendations included increasing examiner training, using GRSs specific to each station, and reviewing all future OSCEs with the metrics described to guide refinements.
The analysis used revealed several potential issues with the OSCE assessment. These findings informed recommendations to improve the quality of our future examinations.
INTRODUCTION
The Objective Structured Clinical Examination (OSCE) is a high-stakes, performance-based summative assessment of clinical skills.1 Since the OSCE format was first used by Harden in the 1970s,2 it has been thoroughly studied and widely adopted by medical and complementary and alternative medicine educational institutions, including chiropractic programs.3–5
With assessments such as the OSCE, it is important to ensure the quality and rigor of examinations.6 But how the quality of an OSCE is measured, and what mechanisms are available to ensure improvements in the quality of assessments over time, is not always clear. Moreover, given that chiropractic programs often have class sizes under 100,7,8 statistical analyses appropriate for smaller cohorts are needed, as many analyses require large sample sizes, which chiropractic programs cannot always provide. While many analyses currently exist, there is no recommended battery of tests for small samples of OSCE scores. This paper provides an evidence-based pathway for educators to analyze, review, and improve small-scale OSCEs.
The purpose of this paper was to review the scoring of OSCEs and discuss post-examination statistical analyses. Its aim was to demonstrate a battery of analyses appropriate for small sample sizes using an actual OSCE data set from a European chiropractic program. A secondary aim was to illustrate how these analyses could be used to inform quality improvement for future OSCEs.
Scoring Issues in OSCE-Style Examinations
The OSCE assesses specific healthcare competencies in a mock environment, as a substitute for clinical competence, using a checklist and/or global rating scale.9 Many areas of student performance can be assessed with an OSCE,9 and assessments should aim to maximize the validity, reliability, objectivity, and feasibility of the OSCE.10
Scoring Checklists
Checklists are often used to score student performance, but assessing multiple areas within a single examination can lead to increasingly intricate checklists, trivializing the task.9,11 Complex checklists can also lead to observer overload, negatively impact scoring behavior, reduce interobserver reliability, and increase the risk of inaccurate assessment, decreasing the exam validity.9 The number of checklist items depends on the station and time allotted—about 8–25 customized checklist items are acceptable.12,13 When used correctly, checklists can improve interexaminer reliability and support novice examiners.14
Each item on the checklist should be discrete, objective, and represent only 1 concept. If several points are combined in 1 item, specific instructions should be provided regarding scoring said item.9 Table 1 presents an example of a checklist with nondiscrete concepts that may be too open to examiner interpretation.
Dichotomous checklist items (ie, competent or not yet competent) are easier to score, but may be narrow and outdated.12 Table 2 presents some examples. A more modern “key features” approach focuses on essential elements of the task, such as eliciting history, seeking critical physical findings, or planning investigations that confirm or refute differential diagnoses.15 A review of the marking checklist in Table 1 suggests it may be too open to examiner interpretation, as each point encompasses many concepts. This contrasts with the checklist in Table 2, which may be too prescriptive.
Weighting of checklist marking is favored (such as 0, 1, 4 for not done/done/done well), as it increases rubric accuracy,12 but more complex scoring does not enhance validity.12,15 Negative scoring, where points are removed for incorrect answers, does not increase validity; instead, it measures risk-taking behavior.16
Global Rating Scales
A student who has memorized the marking guide or used a nonspecific approach may pass, yet the examiner may not be confident in their competence or clinical decision-making skills. Global rating scales (GRS) can address these issues as they are sensitive to different levels of expertise in examinees.14 A GRS can increase interitem and interstation reliability, can be used across multiple tasks, and may better describe the many facets of student expertise.14
Because a GRS requires examiners to use their judgment, it is crucial to minimize its subjectivity, so clear examiner instructions, adequate training, and behavioral anchors are needed. Behavioral anchors are descriptions of the range of performance for each station and can improve inter-rater reliability.17 An example is shown in Table 3. GRSs should be recorded directly at the end of the station, after the checklist is scored.12,18
Statistical Analyses for Post-Examination Quality Metrics
Once an examination is complete and student scores have been acquired, there are several ways to objectively review the exam quality. The following section describes possible analyses that investigate OSCE quality, and their usefulness is discussed in relation to what they show, when they can be used, and when to exclude them from an analysis package.
Number of Failures
The number of failures provides a quick overview of the examination. A high failure rate does not necessarily mean a poorly designed station; expert judgment should be used to determine whether the station was inappropriate for examinee skill level and whether a review of the course content is needed.19
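As a simple illustration, the failure count for a station can be tallied against any chosen pass mark in base R. The sketch below uses hypothetical scores; the data and the second pass mark are assumptions for illustration only.

checklist_pct <- c(72, 45, 61, 38, 80, 55, 49, 66)  # hypothetical station scores (%)

fail_rate <- function(scores, pass_mark) {
  mean(scores < pass_mark) * 100  # percentage of examinees below the pass mark
}

fail_rate(checklist_pct, 50)    # traditional 50% pass mark
fail_rate(checklist_pct, 62.5)  # e.g., a borderline regression cut score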
Cut Scores
Traditional testing has a predetermined pass mark (50%–70%) that a student must attain to pass their examination, but this approach may not be useful when determining how to handle a borderline student or a particularly lenient or stringent examiner.20 A borderline regression (BLR) analysis provides a defensible and feasible method of identifying the cut score (or pass mark) and is reliable in small samples.21 In BLR, checklist scores (Y-axis) are plotted against GRS levels (X-axis) and fitted with a regression line. The checklist score at which the regression line crosses the borderline GRS grade, read from the Y-axis, is taken as the cut score.21,22 This is shown in Figure 1.
Figure 1. Plot of mock OSCE scores against their associated mock GRS (1 = Fail, 2 = Pass, 3 = Excellent pass) indicating the difference between a 50% pass (black dashed line) and the cut score (red dashed line), or passing level suggested by a borderline regression analysis.
Borderline regression requires more statistical skill to compute and can be sensitive to outliers. Such outliers may include a poorly performing student who receives a near-zero checklist score or an examiner who assigns the wrong overall grade.21 Because of these potential limitations, other metrics should also be used.
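A minimal sketch of a BLR cut score in base R is shown below. The data are simulated for illustration, and the choice of which GRS grade is treated as borderline (here grade 2) is an assumption; it should be replaced with the grade defined as borderline in the actual examination.

set.seed(1)
grs       <- sample(1:3, 24, replace = TRUE)    # GRS grade per examinee (simulated)
checklist <- 40 + 12 * grs + rnorm(24, sd = 6)  # checklist score in %, simulated

# Regress checklist score (dependent) on GRS grade (independent).
fit <- lm(checklist ~ grs)

borderline_grade <- 2  # assumption: GRS grade regarded as borderline
cut_score <- predict(fit, newdata = data.frame(grs = borderline_grade))
cut_score  # BLR-derived pass mark for the station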
G Coefficient
Generalizability theory, or G-theory, is an alternative assessment of reliability that assumes the reliability of a score or observation depends on the population that is observed and the environment where the testing is performed.23 G-theory is most robust and unbiased with sample sizes over 300,24 making it an inappropriate statistical choice for smaller OSCE cohorts. Thus, it will not be discussed further in this paper.
Cronbach’s Alpha
Cronbach’s alpha is a measure of internal consistency, that is, how well a test actually measures what it is intended to measure.25 The higher the internal consistency, the more confidence one can have that the examination is reliable. A Cronbach’s alpha ≥ 0.70 is considered acceptable.26 The degree to which multiple measures (or sections within the station) agree with each other is usually presented as “alpha if item deleted.”6 The “alpha if item deleted” scores should be lower than the overall alpha score. If this is not the case, it suggests the station was not measuring what it was supposed to, the station was poorly designed, the topic was poorly taught, or the assessors were inconsistent.23 An alpha over 0.90 can also be instructive, as it may indicate that the station is too easy or redundant in nature. Cronbach’s alpha is most reliable with large sample sizes and scales with a larger number of items,26 so this metric should not be viewed in isolation when assessing an OSCE for quality.
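When individual item or section marks are available, alpha and “alpha if item deleted” can be computed directly in base R. The sketch below uses a simulated item-score matrix; the data and the number of items are assumptions for illustration.

set.seed(2)
items <- matrix(rbinom(24 * 10, size = 2, prob = 0.7), nrow = 24)  # 24 examinees x 10 items (simulated)

cronbach_alpha <- function(x) {
  k <- ncol(x)
  k / (k - 1) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
}

overall_alpha <- cronbach_alpha(items)

# "Alpha if item deleted": recompute alpha with each item left out in turn.
alpha_if_deleted <- sapply(seq_len(ncol(items)), function(i) cronbach_alpha(items[, -i]))

overall_alpha
alpha_if_deleted  # values above overall_alpha flag items worth reviewing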
R Squared
The R2 coefficient is the proportion of variation in the dependent variable (the checklist score) that is explained by the independent variable (the GRS). Generally speaking, a higher R2 is preferable.27 An adjusted R2 of 0.759 implies that 75.9% of the variation in the students’ checklist scores is accounted for by variation in their GRS. If the R2 is low, it suggests a review of the station checklist or station design is necessary.6
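In base R, the adjusted R2 for a station can be taken from the same checklist-on-GRS regression used for the BLR analysis; the data below are simulated for illustration.

set.seed(3)
grs       <- sample(1:3, 24, replace = TRUE)    # simulated GRS grades
checklist <- 40 + 12 * grs + rnorm(24, sd = 6)  # simulated checklist scores (%)

fit <- lm(checklist ~ grs)
summary(fit)$adj.r.squared  # e.g., ~0.76 means ~76% of checklist-score variation
                            # is accounted for by the GRS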
Intergrade Discrimination
Intergrade discrimination is the slope of the checklist-on-GRS regression line, that is, the average increase in checklist score for each step up the GRS. A value of around 10% of the total checklist grade is expected.6
Between-Group Variation
The proportion of total variance is an estimation of the variance in checklist scores that is not due to student performance29 and indicates the consistency of the examination process. It reflects other factors, such as differences in room setup, environment, or examiners.20 This metric should be under 30%, with values over 40% being problematic.6 This metric should be viewed in concert with R2. A high proportion of variance and low R2 suggests a poorly designed checklist, whereas a high proportion of variance and high R2 suggests inconsistent marking.6
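The sketch below illustrates both metrics in base R with simulated data. The intergrade discrimination is simply the regression slope; for the between-group variation, the grouping factor used here (examiner) is an assumption for illustration, as any available grouping (examiner, circuit, or room) could be substituted.

set.seed(4)
grs       <- sample(1:3, 24, replace = TRUE)      # simulated GRS grades
examiner  <- factor(rep(c("A", "B"), each = 12))  # hypothetical examiner groups
checklist <- 40 + 12 * grs + rnorm(24, sd = 6)    # simulated checklist scores (%)

# Intergrade discrimination: increase in checklist score per GRS grade.
coef(lm(checklist ~ grs))["grs"]

# Between-group variation: percentage of total variance attributable to the
# grouping factor (between-group sum of squares over total sum of squares).
ss <- anova(lm(checklist ~ examiner))[["Sum Sq"]]
100 * ss[1] / sum(ss)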
Using the information presented above as review, we conducted an analysis of OSCE scores from an actual exam. The aim of this project was to demonstrate a battery of analyses for OSCE quality that are appropriate for smaller samples. A secondary aim was to illustrate how these analyses could be used to inform post-examination changes to improve the quality of the OSCE in future iterations.
METHODS
A deidentified OSCE data set for the 2021 cohort (n = 24) from a European chiropractic program and their marking guides were supplied for review. OSCE stations were included in this analysis if they used a checklist score (numerical) and a global rating scale. This study was reviewed and approved by the Barcelona College of Chiropractic Research Committee and the Health and Disabilities Ethics Committee (HDEC) of New Zealand (2021 EXP 11582).
The OSCE data set of 7 stations and 24 examinees was evaluated. Descriptive statistics (unadjusted means, standard deviations, and counts) were used to describe the characteristics of the study sample. Adjusted R2, regression slopes, and the proportion of total variance were calculated using the stats package from base R software (R Foundation for Statistical Computing, Vienna, Austria). Borderline regression was performed via simple linear regression of OSCE checklist scores (dependent variable) on GRS scores (independent variable), also using the stats package. Model assumptions were visually assessed via quantile–quantile plots, fitted value and residual plots, and scatter plots. Statistical tests for OSCE quality were included if they were appropriate for use in small cohorts. Cronbach’s alpha was not calculated for the stations, as this required the individual section marks from each station’s marking checklist, which were not provided in the data set.26 Statistical significance was defined as p values less than .05. All data are presented to 1 or 2 decimal places for ease of reading, but all calculations were performed with unrounded data.
RESULTS
The 7 stations analyzed, each with an identical GRS, were Technique 1 and 2, Professionalism, Neurological examination, Laboratory analysis, Systems review, and Care plans. The highest and lowest scoring stations were Systems review (mean: 77.1 ± 14.9) and Laboratory analysis (mean: 56.4 ± 17.7), and the percentage of fails based on a traditional 50% pass rate ranged from 0 (Systems review) to 50% (Laboratory analysis). Summary statistics are shown in Table 4. The raw data for the stations can be found in Appendix 1.
An overview of the data, including cut scores and 50% pass scores, is given in Figure 2. Both R2 (0.67–0.96) and between-group variation (67.31–95.85) were high, with intergrade discrimination falling between 10.5 and 16.5.
Figure 2. Plots of all OSCE stations with borderline regression analysis. Panel A shows the Technique 1 (Tech 1) station, panel B the Technique 2 (Tech 2), panel C the Professionalism station, panel D shows Neurology, panel E the Lab analysis station, panel F shows the Systems review (Systems) station, and panel G shows the Care plan station. BLR = borderline regression.
For all stations, the cut score suggested by the borderline regression analysis was higher than the traditional pass rate of 50% (Table 5).
DISCUSSION
This study reviewed statistical analyses appropriate for assessing smaller OSCE cohorts and suggests the use of multiple metrics to analyze, review, and improve future exams. These metrics included the number of fails informed by traditional and borderline regression pass marks, Cronbach’s alpha, R2, intergrade discrimination, and between-group variation. Overall, these metrics suggest that the examination processes that generated the dataset may not be performing optimally, reducing the OSCE reliability and quality. In terms of which metrics were most informative, the R2 in combination with the proportion of variance suggested the need for a review of the scoring scales and indicated that inconsistent marking was a problem. At present, it is challenging to compare these metrics to other quality assurance processes, as no analyses specific to small cohorts have been found.
The Laboratory analysis station also stands out as it had the largest number of fails of all the stations, with 50% of the examinees failing whether the traditional 50% pass rate or the cut score was used. This suggests the Laboratory analysis station may be beyond the current capabilities of the examinees or that there were missed concepts in teaching its content. A visual analysis of the plots in Figure 2 suggested that factors such as the absence of clear fails in the Care plan or Systems stations may have affected the BLR analysis. Additionally, while these analyses have been recommended for smaller sample sizes, there is a caveat, especially for BLR analysis: stations must be of high quality, with an R2 of 0.50 or higher, and have an even spread of candidates over the GRS.30,31 If both of these criteria are not met, as in the Care plan and Systems stations, then a previously identified pass rate should be used instead of BLR cut scores.31
Overall, the R2 values were high, implying that most of the variation in the students’ checklist scores is accounted for by variation in their global ratings.27 There were also no apparent outliers in the intergrade discrimination scores, which are expected to be around 10% of the total checklist grade.6 The Neurology station’s intergrade discrimination was low, suggesting a large variance in marking, which may have been due to differences in examiners, as the broad nature of the marking guide (Table 1) required a great deal of examiner interpretation.
The most concerning trend noted is the proportion of total variance (which should be under 30%6), which estimates how much of the variance in checklist scores is due to factors other than student performance.29 The proportion of variance in this dataset was much higher, ranging from 67.31 to 95.85. Furthermore, a high proportion of variance with a high R2, as in the Systems station, suggests inconsistency in examiner marking rather than a problem with the scoring checklist.6
Based on a review of these data, several recommendations are suggested to increase the quality and reliability of OSCE processes for this specific chiropractic program.
Review the stations with few or many fails for missed concepts or checklist utility.
Create a GRS specific to each station.
Use a borderline regression analysis to decide the cut score for each station.
Increase examiner training to reduce inconsistent marking.
Use all metrics discussed above to perform an overall check of station quality and reliability.
In terms of limitations, the sample size was small; although this was congruent with the study’s aim of demonstrating statistical assessments valid for small sample sizes,30,31 a larger sample may provide more robust findings. Additionally, it should be noted that these data and recommendations are specific to the program that provided the data and should not be generalized to all programs without further research.
Taking a step back from the specifics of these data, this study provides a novel method for assessing OSCE quality in smaller programs by describing a number of statistical tests, why they would be used, and how they are interpreted, to show how such an analysis could be used to improve small-cohort OSCEs. These metrics provide an objective, evidence-based method to uncover potential problems, whether they be missed concepts, biased examiners, or examinee performance issues. This battery of tests may provide the first step in serially improving clinical examinations in successive years, as each set of metrics could be compared with those of previous years. Future studies could also compare and contrast OSCE results from other chiropractic programs to determine whether the issues identified in this study were unique to this program or common across programs. Further studies could also review quality changes before and after implementation of the recommendations informed by this study to illustrate the test battery’s use and detail any improvements over time. Additionally, the software (R) used for analysis is free and relatively simple to use, further reducing barriers for faculty seeking to improve their own examinations.
CONCLUSION
This study identified statistical analyses useful for measuring the quality of small-scale OSCEs and used real-life data to illustrate how these analyses could be used to identify examination issues. It also generated recommendations to correct problems specific to the dataset and may delineate a pathway to help anticipate future challenges and improve the quality of future examinations.
REFERENCES
FUNDING SOURCES AND CONFLICTS OF INTEREST
There were no funding sources or identified conflicts of interest in this study.
Author notes
Alice Cade (corresponding author) is a senior lecturer and research fellow in the Department of Basic Science at the New Zealand College of Chiropractic (6 Harrison Road, Mt Wellington, Auckland, New Zealand; [email protected]).
Nimrod Mueller is the co-head of the Clinic Unit at Barcelona College of Chiropractic (Carrer dels Caponata, 13, 08034 Barcelona, Spain; [email protected]).
Concept development: AC, NM. Design: AC, NM. Supervision: N/A. Data collection/processing: AC. Analysis/interpretation: AC. Literature search: AC. Writing: AC. Critical review: AC, NM.
This paper is the first prize winner of the Chiropractic Educators Research Forum/World Federation of Chiropractic Alan Adams Education Research Award, presented at the World Federation of Chiropractic/Association of Chiropractic Colleges Global Education Conference, November 2-5, 2022. The award is funded in part by sponsorships from NCMIC, ChiroHealth USA, Activator Methods, Clinical Compass, World Federation of Chiropractic, and Brighthall. The contents are those of the author(s) and do not necessarily represent the official views of, nor an endorsement by, these sponsors.