Abstract
Professionalism, which encompasses behavioral, ethical, and related domains, is a core competency of medical practice. While observer-based instruments to assess medical professionalism are available, information on their psychometric properties and utility is limited.
We systematically reviewed the psychometric properties and utility of existing observer-based instruments for assessing professionalism in medical trainees.
After selecting eligible studies, we employed the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) criteria to score study methodological quality. We identified eligible instruments and performed quality assessment of psychometric properties for each selected instrument. We scored the utility of each instrument based on the ability to distinguish performance levels over time, availability of objective scoring criteria, validity evidence in medical students and residents, and instrument length.
Ten instruments from 16 studies met criteria for consideration, with studies having acceptable methodological quality. Psychometric properties were variably assessed. Among 10 instruments, the Education Outcomes Service (EOS) group questionnaire and Professionalism Mini-Evaluation Exercise (P-MEX) possessed the best psychometric properties, with the P-MEX scoring higher on utility than the EOS group questionnaire.
We identified 2 instruments with the best psychometric properties, with 1 also showing acceptable utility for assessing professionalism in trainees. The P-MEX may be an option for program directors to adopt as an observer-based instrument for formative assessment of medical professionalism. Further studies of the 2 instruments to aggregate additional validity evidence are recommended, particularly in the domain of content validity, before they are used in specific cultural settings and in summative assessments.
Introduction
Medical professionalism is defined as “the habitual and judicious use of communication, knowledge, technical skills, clinical reasoning, emotions, values, and reflection in daily practice for the benefit of the individual and community being served.”1 Professionalism is critical to trust between physicians and patients as well as the medical community and the public.2 Assessing professionalism is essential to medical education because professionalism in practice is central to a physician's social contract with society.3,4 Despite growing recognition of its importance, the lack of a consensus definition of professionalism limits its effective operationalization.5 While approaches such as critical incident reporting have been used to recognize when professional breaches occur, the need for trainee assessment and program evaluation necessitates quantitative and objective positive measures of professionalism to track the demonstration of competence and assess curricular effectiveness.6 Valid and reliable instruments that can discriminate levels of professionalism and identify lapses to facilitate remediation and further training are needed.
Many instruments have been developed to assess medical professionalism as a comprehensive stand-alone construct or as a facet of clinical competence.7 There is a tendency for programs to use multiple instruments, and selecting the most suitable instrument for a given program can be challenging for educators.5,8 Workplace- and observer-based assessments allow for the systematic assessment of professionalism by different assessors in various clinical contexts,8 which may complement other assessment modes such as self- and peer assessments. Observer-based instruments are in keeping with the current trend of adopting entrustable professional activities.9
Previous systematic reviews of professionalism measures have focused on different assessment methods, including direct observation, self-administered rating forms, patient surveys, and paper-based ratings.10–12 The most recent review concluded that studies were of limited methodological quality and recommended only 3 of 74 existing instruments as psychometrically sound; of note, 2 of these were from studies involving nurses.10 No current systematic review focuses on observer-based instruments for assessing medical professionalism13 or on the utility of these instruments. The primary aim of this study was to identify observer-based instruments for use by program directors and to examine their psychometric properties and utility for practical application.
Methods
We performed a systematic review in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist (provided as online supplemental material).
Search Strategies
We searched the PubMed, Scopus, ERIC, and PsycINFO databases from their inception to July 2018. The search strategy was adapted and revised from a previous systematic review14 in consultation with a medical librarian, and the full search strategy is provided as online supplemental material. Our focus was on observer-based instruments that measured professionalism.
Study Selection
Inclusion criteria were English-language, full-text original studies on the validation of observer-based instruments or questionnaires assessing or measuring the medical professionalism of residents and medical students. Instruments had to be applied to the evaluation of professionalism in an actual clinical setting or context (see figure). We excluded articles not in English, studies of professionalism in other health disciplines, and review articles. Duplicate studies were removed using EndNote X8 (Clarivate Analytics, Philadelphia, PA) and cross-checked by the researchers. Studies that met inclusion criteria were independently screened by 2 researchers (J.K.P. and H.G.) based on titles and abstracts. Selected full-text studies were independently read and assessed for eligibility, and their reference lists were hand-searched for additional eligible studies. Disagreements in the selection process were resolved by discussion with a third researcher (Y.H.K.).
The study did not involve human subjects and did not require Institutional Review Board approval.
Data Extraction
For studies deemed eligible, data were extracted independently by 2 researchers (H.G. and Y.S.) using a standardized data extraction form. The following data were extracted: general characteristics of each instrument (name of instrument, author, language, number of domains, number of items, and response categories) and characteristics of study samples (sample size, age, settings, and country).
Study Methodological Quality and Instrument Psychometric Properties
We performed 3 levels of quality assessment. First, 2 researchers (K.P. and H.G.) independently assessed each study using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist (figure). Disagreements were resolved by a third reviewer (Y.H.K.). We selected the COSMIN checklist because it is a consensus-based tool for appraising studies of measurement instruments.15,16 The checklist addresses 9 criteria: content validity, structural validity, internal consistency, cross-cultural validity/measurement invariance, reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness. The checklist is presented in boxes, with each box comprising items to assess the study methodological quality for that criterion. Items are rated on a 4-point scale: inadequate, doubtful, adequate, or very good.17 As there is no accepted “gold standard” for assessing professionalism, we did not assess the criterion validity of the studies. Second, we assessed the psychometric quality of each instrument using an adapted version of the Prinsen et al criteria18 to synthesize evidence supporting the measurement properties of the instruments (see figure). Third, we assessed the utility of each instrument for real-world practicality using prespecified criteria: the ability to distinguish performance over time, objective scoring criteria, validity for use in medical students and residents, and number of items.
The quality of evidence was graded for psychometric properties, taking into account the number of studies, the methodological quality of the studies, the consistency of the results of the measurement properties, and the total sample size.18 The ratings for the level of evidence for the psychometric properties were as follows:
Unknown: No study
Very low: Only studies of inadequate quality or a total sample size < 30 subjects
Low: Conflicting findings in multiple studies of at least doubtful quality or 1 study of doubtful quality and a total sample size ≥ 30 subjects
Moderate: Conflicting findings in multiple studies of at least adequate quality or consistent findings in multiple studies of at least doubtful quality or 1 study of adequate quality and a total sample size ≥ 50 subjects
High: Consistent findings in multiple studies of at least adequate quality or 1 study of very good quality and a total sample size ≥ 100 subjects.18
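To make the precedence of these rules concrete, the following is a minimal sketch in Python of how the evidence grade for a single psychometric property could be derived from the criteria above. It is an illustration of our reading of the list, not software used in this review; the function names are our own, and cases not explicitly covered by the list (for example, a single adequate-quality study with fewer than 50 subjects) default to "very low" as an assumption.

```python
# Sketch of the evidence-grading rules listed above. ASSUMPTIONS: rules are
# checked from "high" down to "very low", and uncovered cases fall through
# to "very low".

QUALITY_ORDER = ["inadequate", "doubtful", "adequate", "very good"]

def at_least(quality, floor):
    """True if a study's COSMIN quality rating meets or exceeds a floor."""
    return QUALITY_ORDER.index(quality) >= QUALITY_ORDER.index(floor)

def grade_evidence(qualities, consistent, total_n):
    """Grade the level of evidence for one psychometric property.

    qualities  -- list of COSMIN quality ratings, one per study
    consistent -- True if findings agree across studies
    total_n    -- pooled sample size across the studies
    """
    if not qualities:
        return "unknown"  # no study assessed this property
    multiple = len(qualities) > 1
    if multiple and consistent and all(at_least(q, "adequate") for q in qualities):
        return "high"
    if not multiple and qualities[0] == "very good" and total_n >= 100:
        return "high"
    if multiple and all(at_least(q, "adequate") for q in qualities):
        return "moderate"  # conflicting findings, at least adequate quality
    if multiple and consistent and all(at_least(q, "doubtful") for q in qualities):
        return "moderate"
    if not multiple and qualities[0] == "adequate" and total_n >= 50:
        return "moderate"
    if multiple and all(at_least(q, "doubtful") for q in qualities):
        return "low"  # conflicting findings, at least doubtful quality
    if not multiple and qualities[0] == "doubtful" and total_n >= 30:
        return "low"
    return "very low"  # only inadequate quality, or total n below threshold

# Example: one adequate-quality study with 60 subjects grades as "moderate".
assert grade_evidence(["adequate"], consistent=True, total_n=60) == "moderate"
```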
Instrument Utility and Scoring
We developed a utility scale using criteria from other studies.19–21 The 4 criteria chosen were (1) the ability to distinguish performance levels over time; (2) the availability of objective scoring criteria; (3) the utility for medical students and residents; and (4) the number of items. The maximum score was 8 points (see table 1), with a higher utility score indicating greater feasibility of implementation.
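As a toy illustration of the arithmetic, the sketch below assumes each of the 4 criteria contributes 0 to 2 points so that the total reaches the stated maximum of 8; the actual per-criterion anchors are defined in table 1, and the criterion labels here are paraphrases.

```python
# Toy version of the utility score. ASSUMPTION: each of the 4 criteria is
# scored 0-2 points; the real scoring anchors are given in table 1.

UTILITY_CRITERIA = (
    "distinguishes performance levels over time",
    "objective scoring criteria available",
    "usable for both medical students and residents",
    "number of items",
)

def utility_score(points_per_criterion):
    """Sum per-criterion points (each assumed 0-2) into a 0-8 utility score."""
    if len(points_per_criterion) != len(UTILITY_CRITERIA):
        raise ValueError("expected one rating per criterion")
    if any(p not in (0, 1, 2) for p in points_per_criterion):
        raise ValueError("each criterion is assumed to score 0, 1, or 2")
    return sum(points_per_criterion)

# Example: an instrument rated (1, 1, 2, 0) receives a utility score of 4,
# consistent with the 2-to-4-point range reported in the results.
print(utility_score((1, 1, 2, 0)))
```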
Results
Search Results
The electronic search yielded 20 676 article titles after removal of duplicates. Articles were reviewed by title and abstract, and 17 971 articles that did not meet inclusion criteria were removed. A second review of 92 full-text articles resulted in the selection of 15 articles, after the removal of articles that examined other constructs, such as empathy, rather than professionalism. One article was added after hand-searching published systematic reviews. Sixteen articles assessing 10 observer-based instruments were included in this review and quality assessment (see the figure).
The 16 studies examined 10 instruments: the Amsterdam Attitude and Communication Scale (AACS),22 the Emergency Medicine Humanism Scale (EM-HS),23 the Education Outcomes Service (EOS) group questionnaire,24–27 the Evaluation of Professional Behavior in General Practice (EPRO-GP),28 the German Professionalism Scale (Pro-D),29 the modified Physician Achievement Review (PAR),30 the multisource feedback (MSF) questionnaire,31 the Nijmegen Professionalism Scale,32 the Professionalism Assessment Instrument (PAI),33 and the Professionalism Mini-Evaluation Exercise (P-MEX).34–37 Each instrument was assessed in 1 study, except for the P-MEX and the EOS group questionnaire, which were each assessed in 4 studies.
All 10 instruments measured professionalism as a single construct with multiple domains (see table 2 and online supplemental material). The number of items ranged from 9 (the AACS and EM-HS) to 127 (the EPRO-GP), and study sample sizes ranged from 9 to 442 participants. All instruments used a Likert scale (ranging from 3 to 9 points) to measure professionalism. Four instruments (the P-MEX, EPRO-GP, Nijmegen Professionalism Scale, and Pro-D) were tested in both medical students and residents.13,35–40
COSMIN Methodological Quality Assessment
Methodological quality was generally adequate for 9 studies (provided as online supplemental material). Structural validity was the most commonly assessed psychometric property, being the focus of 9 studies (56%). Eight studies assessed internal consistency, with 5 (63%) rated adequate or very good. All 8 studies that assessed content validity were rated doubtful. The single study that assessed reliability was of inadequate methodological quality. Only 1 study assessed measurement error, and its methodological quality was questionable.
Although translations were performed in 5 studies,27,36,37,39,42 no studies assessed cross-cultural validity. The lack of effective interventions was the main reason responsiveness could not be adequately evaluated, as validating responsiveness requires that an assessment tool be able to detect change over time after an intervention.
Psychometric Properties
The quality of evidence for psychometric properties varied across the 10 instruments (table 3). Internal consistency scored better than the other criteria, with low or better levels of evidence (low, moderate, high) observed for 4 of 6 instruments (the EOS group questionnaire, Pro-D, Nijmegen Professionalism Scale, and PAI). For structural validity, the EOS group questionnaire and the P-MEX scored high. Content validity had low levels of evidence overall, with the P-MEX scoring highest at moderate quality.
Utility Scores
Utility scores for the 10 instruments ranged from 2 to 4 points (table 4). Only the Pro-D showed good correlation coefficients between level of training and sum score; the ability to distinguish performance levels over time was not examined for the other instruments. Only the PAI provided behavioral descriptors/anchors for extreme and selected intermediate scale points. Based on the 4 utility criteria, the Pro-D and PAI had the highest score at 4 points.
Discussion
We identified 16 studies assessing 10 instruments for measuring medical professionalism, with instruments showing varying quality. Among the available instruments, the P-MEX had the best evidence for its measurement properties and an adequate utility score; considering both psychometric properties and utility, it may be the most suitable instrument for assessing professionalism in medical trainees.
For many instruments, the level of evidence for measurement properties, synthesized from studies appraised with the COSMIN checklist, was very low to low. Our findings are similar to those reported in a systematic review of instruments for measuring communication skills in students and residents using an objective structured clinical examination,41 in which the authors identified 8 psychometrically tested scales from 12 studies, often of poor methodological and psychometric quality. Compared with the 32 instruments available to measure technical surgical skills among residents42 and the 55 instruments for assessing clinical competencies in medical students and residents,43 the number of professionalism assessment instruments meeting quality criteria was lower. This may reflect the challenges educators face in defining and assessing this competency.
Our study has limitations. First, the number of studies available for evidence synthesis was limited, and we may have missed studies published in languages other than English. Second, the utility assessment tool was developed by the authors based on previous reports19–21 but was not itself evaluated for validity evidence.
Our review showed inadequate investigation of the content validity of assessment tools for medical professionalism, and future studies are needed to identify the relevant domains of medical professionalism. It is also important for future studies to assess the validity of instruments across different cultural contexts, as definitions of professionalism may differ across nations and cultures.
Conclusion
Our review found that the P-MEX has the best evidence for measurement properties and an adequate utility score among available instruments for assessing medical professionalism. It may be an option for program directors to adopt as an observer-based instrument for the formative assessment of professionalism in trainees. Further aggregation of validity evidence for these instruments is recommended, particularly in the domain of content validity, before they are implemented in a specific cultural setting or used for summative assessments.
References
Author notes
Editor's Note: The online version of this article contains tables of PRISMA 2009 checklist, study search strategy, domains measured by each instrument, and methodological quality assessment.
Yu Heng Kwan, BSc (Pharm) (Hons)*, is an MD-PhD candidate, Program in Health Services and Systems Research, Duke–NUS Medical School, Singapore; Kelly Png, BSc (Pharm) (Hons)*, is a Pharmacist, National Heart Centre Singapore; Jie Kie Phang, BSc (Life Science) (Hons)*, is Research Coordinator, Department of Rheumatology and Immunology, Singapore General Hospital, Singapore; Ying Ying Leung, MBChB, MD, is Senior Consultant, Department of Rheumatology and Immunology, Singapore General Hospital, and Associate Professor, Duke–NUS Medical School, Singapore; Hendra Goh is a BSc (Life Science) Candidate, Faculty of Science, National University of Singapore, Singapore; Yi Seah is a BDS Candidate, Faculty of Dentistry, National University of Singapore, Singapore; Julian Thumboo, MBBS, MRCP, FRCP, is Senior Consultant, Department of Rheumatology and Immunology, Singapore General Hospital, Professor, Program in Health Services and Systems Research, Duke–NUS Medical School Singapore, and Adjunct Professor, Yong Loo Lin School of Medicine, National University of Singapore, Singapore; A/P Swee Cheng Ng, MBBS, MRCP, is Senior Consultant, Department of Rheumatology and Immunology, Singapore General Hospital, Singapore, and Adjunct Associate Professor, Duke-NUS Medical School, Singapore; Warren Fong, MBBS, MRCP, is a Consultant and Program Director of Rheumatology Senior Residency, Department of Rheumatology and Immunology, Singapore General Hospital, Singapore; and Desiree Lie, MD, MSEd, is Professor, Duke–NUS Medical School, Singapore.
Competing Interests
*These authors are considered co–first authors.
Funding: The authors report no external funding source for this study.
Conflict of interest: The authors declare they have no competing interests.
The authors would like to thank Dr Arpana Vidyarti, Head and Senior Consultant, Division of Advanced Internal Medicine, and Associate Professor, Duke–NUS Medical School, for reviewing the manuscript, and Ms Min Li Toon, MBBS candidate, National University of Singapore, for reviewing the manuscript and scoring the instruments.