Validity is a fundamental consideration in developing and evaluating assessment programs, processes, and tools.1 In response to ongoing validity challenges with workplace-based assessment (WBA), Gofton et al developed the Ottawa Surgical Competency Operating Room Evaluation (O-SCORE).2 In its development, the authors argued that the ultimate goal of postgraduate medical education is to produce trainees who are competent to practice independently, and that it may be helpful to structure assessments based on this concept. Using colloquial faculty language, they created a supervision-aligned set of anchors that range from 1 (“I had to do…”) to 5 (“I did not need to be there…”). The O-SCORE and other WBA tools that used similar types of scales demonstrated evidence of validity and seemed to be performing better than previous rating scales.3 Given the alignment the O-SCORE scale language had with the way faculty understood the goals of training, and the conceptual link that was being made between supervision and entrustment, Rekman et al chose to describe these scales as “entrustability scales.”3 The alignment with competency-based medical education, intuitive appeal, ease of use, and early validity evidence led the O-SCORE anchors and entrustability scales in general to become widely adopted. However, they have also become a threat to assessment validity based on potential interpretation issues, in part from naming (and framing) supervision-type scales as meaning “entrustability.” Indeed, the use of entrustability scales in practice has been met with confusion, concern, and criticism.4 Since then, ten Cate et al and others have suggested a number of refinements and corrections, including the need to disentangle “retrospective” and “prospective” assessments.5–7
In this Perspectives article we pick up on this distinction between retrospective and prospective assessment using assessment validity as our framework and the O-SCORE as our example. Our first aim is to reframe the O-SCORE's rating scale anchors as point-in-time retrospective assessments of the faculty's experience with a trainee, and not as prospective indicators of readiness, or as immediate claims about the level of entrustability. We also aim to elaborate on how and why retrospective and prospective distinctions serve as a meaningful correction to potential misunderstandings about the O-SCORE anchors in graduate medical education. We selected the O-SCORE anchors because they may have perpetuated some interpretation issues yet continue to be widely adopted in graduate medical education.8–11 Using Cizek's conceptualization of validity, which separates score meaning from score use,12,13 we begin by describing the O-SCORE and how its meaning can be misinterpreted, to ultimately speak to its use.
Score Meaning
We propose that the O-SCORE scale language itself could pose a validity threat if there is confusion about the meaning of its anchors. In its original development, the use of colloquial faculty language suggested a construct that exists with the faculty (eg, “I had to be there”) rather than the trainee, even if informed by the trainee's behaviors. In assessment contexts, this can create tension if faculty confuse rater-level and trainee-level constructs. That is, faculty may struggle with positioning the construct in the moment, trying to resolve whether the language is about “me” (the faculty) or “them” (the trainee). The distinction is a subtle but important point in extrapolating what the scores mean and how they should be used.
What then do scores generated using the O-SCORE mean? The emphasis on “I” (as in “I had to do”) may reflect several faculty-owned influences and interpretations of the trainee's performance in complex contexts, and what the interaction means for them (ie, the faculty). Using “I” speaks to more than asking faculty to report behaviors exhibited by trainees or even matching behaviors to predefined performance expectations. The past tense language should be taken to represent a retrospective or reflective opportunity on the part of the faculty based on their experience with the learner for the encounter being assessed. On their own, these anchors make no claims about the entrustability of a learner, or about how that learner will perform in the future. They are simply a record of the faculty's rating of how much supervision or assistance they provided in that encounter.14
Suggesting that the O-SCORE scale in WBA permits an inferential claim about a characteristic of the trainee (in this case entrustability) may be less accurate than saying the observations reflect what the trainee's performance and several other contextual factors meant to the faculty member. For some, it is this subjective information that then becomes valuable in the collective.15 Completing these supervision-type scales speaks only to how faculty have “made sense” of the observed performance or their experience with the trainee. Previous research exploring rater cognition has revealed this active, subjective, and personal translation process even for the same performance.15–17 When viewed in this way, issues like error or reliability recede in favor of the richness of the interaction and of making use of those faculty experiences and differences, which is a philosophical issue with practical considerations.18 Reformulating supervision anchor language from being only about the trainee, to being more about how faculty experience the assessment activity (ie, interaction between clinical stimuli, learner, and faculty), promotes better construct alignment for faculty and for those interpreting and using the data.
Score Use
The second way to rethink the O-SCORE rating scale language is to consider its use. Here, now that we have clarity on what the scores mean, we would ask, Can or should the O-SCORE and other supervision-type scales be used for decisions related to presumptive trust? For example, How will a trainee perform in the future? Here the descriptions of retrospective and prospective assessment scales from ten Cate et al are helpful.6 We argued above that retrospective supervision-type scales in WBA reflect faculty experiences in the moment that shape what they did, informed by their own experiences and comfort, and what they understood about the trainee in a particular context. Our task then is to align that meaning with use. As opposed to claims about presumptive trust, retrospective scales can be used for formative purposes, where all the behavioral, historical, social, personal, and contextual features that led to the faculty's action or reflection can be excavated through “learning conversations”19 (eg, debriefing, feedback), since these are present in the assessment activity. This would be an example of alignment between score meaning and use. When this alignment is present, interpretation issues are corrected, and validity can be optimized.
Prospective assessments represent a different intended use, one that involves decisions related to the trainee's ability to assume future responsibilities and care activities.6 Entrustment decisions are prospective. These types of assessments can be made using a blend of retrospective assessment and other data, and are typically categorical rather than ordinal. Competence committees serve as examples of where prospective assessments can take place. Here the intended use can include, for example, progression to higher levels of responsibility, access to unsupervised activities, or graduation, decisions that point-in-time retrospective assessments alone are unable to support. Prospective assessments are more complex, are structured and enacted differently than retrospective assessments, and often do (or should) include more than just individual documented observations.6 Retrospective supervision scales (like the O-SCORE) have therefore been described as having value for prospective purposes by providing data to support decisions about entrustment,6 but they are themselves limited for that use. On their own, individual retrospective supervision scales do not provide information about the complex construct of entrustment and therefore should not be misinterpreted by faculty or trainees as trying to do so.
In summary, the O-SCORE scale and other retrospective supervision-type scales may be misinterpreted, leading to confusion and tensions in practice. While tools using the O-SCORE anchors have demonstrated strong psychometric properties, we are raising issues of interpretation, specifically related to score meaning and use, and the alignment between these as important validity considerations. Rather than describing these scales as entrustability scales, which suggests a prospective type of assessment, we support the recommendation that these types of scales are best thought of and used as retrospective scales.6,7 We would add that the reflective and faculty-aligned language and construct support this suggestion. Inferential claims or meaning must therefore not be about the entrustability of learners, but rather about faculty reflections on their actions based on their experience with trainees for a given context and time. Whether the same assessment can and should be used for formative and summative purposes, and what the implications are of doing so, should also be examined carefully, but these purposes do represent 2 distinct intended uses. This calls for careful action on the part of educators to avoid unintended meaning and use.
To provide further clarity, we note that ten Cate et al suggested these types of scales be described as “retrospective entrustment supervision scales,”6 but our concern is that this may not go far enough. Using the word entrustment in the name of the scale may perpetuate ongoing confusion for front-line faculty who are providing these assessments about what construct is in play and when, in the same way that calling the O-SCORE scale an entrustability scale did. Instead, in an effort to mitigate validity threats by clarifying intended meaning and use, we suggest the assessment community consider describing scales that draw on faculty reflections on their supervision choices as just that: retrospective supervision scales.