The objective of this paper is to describe changes made to chiropractic national board examinations in the United States, including methodologies in test scoring, and to discuss future directions in test development and administration being considered by the National Board of Chiropractic Examiners (NBCE). Additionally, this paper serves as an introduction to the articles written by the NBCE staff and published in this issue of the journal. A statistical perspective on the properties of a test is presented, and reasons for the NBCE moving to item response theory for test scoring are described. NBCE consideration of on-demand testing and changes implemented in the Part IV practical examination are also discussed.
The objective of this paper is to communicate changes initiated by the National Board of Chiropractic Examiners (NBCE) testing programs and to serve as an introduction to the articles written by NBCE staff that are published in this issue of the journal. In the past 5 years, the NBCE gradually modified statistical models used for scoring, changed the mode of administration for the written examinations, and revised the Part IV exam.
The NBCE engages in constant review of its practices and compares its operational procedures of test development, administration, scoring, and reporting to the practices of sister fields in health care. In the last 5 years, the NBCE evaluated all of its products in terms of validity, reliability, fairness, and alignment with the best practices accepted in the health care industry and the field of educational measurement. The first change was that the NBCE implemented computer-based testing (CBT) for the Part I, II, III, and Physiotherapy exams. The CBT administration was previously adopted by the National Board of Medical Examiners,1 the National Board of Dental Examiners,2 and the National Board of Osteopathic Medical Examiners.3
The second change was to introduce item response theory (IRT) scoring to all NBCE exams. Previously, not all NBCE exams were scored using IRT, even though IRT scoring is the practice in the fields of medical,4–7 dental,8,9 and osteopathic10,11 examinations. As a result, a decision was made to gradually adopt new IRT-based scoring methodologies while closely monitoring classification consistency and pass/fail rates.
The third change was to revise the Part IV exam (chiropractic practical exam). The exam needed revisions of the postencounter probe stations in order to better assess the examinees' case management skills. In particular, diagnostic images and laboratory results were excluded from all postencounter probe stations. This was implemented for better evaluation of entry-level competence. Furthermore, electronic presentation of the images was introduced to the diagnostic imaging portion of the exam.
What Is a Test?
As the articles published in this issue address NBCE exams and scoring methodologies, we would like to start by explaining the essential parts of a test. A test (or exam) is an assessment intended to measure examinees' competency in a subject, aptitude, or skill.12 However, the relationship between a test score and actual knowledge is often misinterpreted or even misunderstood. Because this can be confusing, we provide a definition of a test from a statistical perspective. The relationship between a test score and actual competency is similar to the relationship between an estimate and a true value (parameter). Test scores are sample-based estimates of the real competency. The parameter, if we could measure it directly, is a population-based true value.13 For example, it may be very hard to measure exactly how many fish are in a lake, but we can estimate it. One may count fish in a part of the lake and then rescale this estimate to the size of the lake. The precision of such an estimation would depend on the quality of the procedure. Testing is merely a technique that samples the true competency of examinees, while the relationship between the score and reality depends on the quality of a test.
Statistical estimates calculated on a sample are valid only when the sample is representative of the corresponding population. Going back to the fish example, the area of the lake where we collect measurements should be very similar to the rest of the lake. We should not collect our measures next to a bridge where people throw food into the water, thereby attracting the fish, nor should we count the fish in an empty part of the lake.
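The lake analogy can be sketched numerically. The following is a minimal illustration only, assuming a hypothetical lake divided into 100 equal cells, 5 of which sit next to the bridge and hold far more fish; the cell counts and the helper name `estimate_total` are invented for the example and do not come from NBCE data.

```python
import random

def estimate_total(sampled_counts, n_cells_total):
    """Rescale the mean count per sampled cell to the whole lake."""
    mean_per_cell = sum(sampled_counts) / len(sampled_counts)
    return mean_per_cell * n_cells_total

# Hypothetical lake: 95 ordinary cells with 10 fish each,
# plus 5 cells next to the bridge with 100 fish each.
lake = [10] * 95 + [100] * 5
true_total = sum(lake)  # the "parameter" we normally cannot observe

# Representative sample: 20 cells drawn at random from the whole lake.
rng = random.Random(0)
random_cells = rng.sample(range(len(lake)), 20)
representative = estimate_total([lake[i] for i in random_cells], len(lake))

# Biased sample: only the cells next to the bridge.
biased = estimate_total(lake[-5:], len(lake))

print("true total:", true_total)
print("estimate from random cells:", representative)
print("estimate from bridge cells:", biased)
```

The estimate from the bridge cells overstates the true total many times over, whereas the estimate from randomly chosen cells differs from it only by sampling error.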
The same is true for test scores. To make valid inferences, the scores need to be representative of the competency of the test takers. However, some degree of error between the sample-based estimate and the true value is always expected. This error is termed sampling error, and our task is to minimize it. The precision of an estimate is inversely related to the degree of sampling error.14 The topics of score precision, validity, and reliability, as well as the efforts NBCE makes to ensure them, are discussed in Himelfarb15 and Himelfarb et al.16
Classical Test Theory vs Item Response Theory
Classical test theory (CTT) and IRT are 2 major measurement theories employed to model item responses. For many years, the NBCE's scoring procedures were based on CTT, which is also known as true score theory.17 In recent years, innovative scoring methodologies have been developed and adopted by large-scale assessment programs. IRT-based methods are considered contemporary, more informative, and more effective. While the CTT and IRT frameworks are similar in many ways, CTT has received substantial criticism over the past several decades.18
In accordance with CTT, every test taker's observed score is the sum of a true score (the score that would be obtained if there were no errors in measurement) and an additive, unsystematic error term (the difference between the observed and true scores). The implication is that, for individual test takers, the test is an imprecise tool.18 Furthermore, since the true score is defined using a specific set of items on a test, it is entirely dependent on that set of items. In IRT, on the other hand, individual ability for each test taker is estimated and accounted for when the final score is derived. Another difference between the theories exists in the estimation of the standard error of measurement (the imprecision in the test). With CTT, this error is a constant for all examinees, while in IRT, it can be estimated separately for different levels of ability. The IRT-based estimation of standard error of measurement is more realistic.19(p482)
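The contrast between a constant and an ability-dependent standard error can be illustrated with a short sketch. It assumes the textbook CTT formula (score standard deviation times the square root of 1 minus reliability) and a 2-parameter logistic IRT model; the item parameters, score standard deviation, and reliability below are hypothetical, and the specific IRT models used by the NBCE are not stated in this paper.

```python
import math

def ctt_sem(score_sd, reliability):
    """CTT standard error of measurement: one value for every examinee."""
    return score_sd * math.sqrt(1.0 - reliability)

def p_correct(theta, a, b):
    """2PL probability of a correct response (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information contributed by one 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def irt_sem(theta, items):
    """IRT standard error at theta: 1 / sqrt(total test information)."""
    total_info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(total_info)

# Hypothetical 5-item test with difficulties centered on average ability.
items = [(1.0, -1.0), (1.0, -0.5), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]

print("CTT SEM (same for everyone):", ctt_sem(score_sd=10.0, reliability=0.91))
for theta in (-3.0, 0.0, 3.0):
    print("IRT SEM at theta =", theta, ":", irt_sem(theta, items))
```

In this sketch the IRT standard error is smallest near the middle of the ability scale, where the item difficulties are concentrated, and grows toward the extremes, which a single CTT value cannot show.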
In 2014, the NBCE began transitioning its exams to IRT scoring. Today, all NBCE prelicensure exams are scored using IRT models. Thus, the numbers provided to test takers, chiropractic institutions, and state licensing boards are calculated based on more realistic assumptions and are more precise. Himelfarb15 and Himelfarb et al.16 provide an in-depth discussion of the differences between the 2 theories. In addition, Himelfarb et al.20 use the diagnostic imaging portion of the exam to explain how Part IV is scored with IRT models.
Future Directions: Is On-Demand Testing Right for Chiropractic?
Beginning in 2019, the NBCE fully implemented CBT for the Part I, II, III, and Physiotherapy examinations, which allows for more test innovation, more convenient scheduling, and a shorter scoring window. In adapting Parts I and II for CBT administration, we reduced the exams to 300 items each (50 items per domain). Preliminary validity studies have been conducted,21 and now a validity argument is being built while developing and using the assessments.
Recently, however, there have been several inquiries concerning on-demand testing (a testing service that is available anytime) and its feasibility for the chiropractic profession. Although on-demand testing is being increasingly used in many areas of assessment, it has not been easily adopted by high-stakes testing22 such as the NBCE licensure exam programs. One of the major issues with on-demand testing is that some of the psychometric methods used in conventional testing are no longer available when tests are administered on demand. While new methodologies have been developed, today they require per-administration sample sizes far larger than the chiropractic profession is currently able to produce.23 Currently, per-administration sample sizes for written exams are between 1,000 and 1,200 test takers, which includes first-time examinees and repeat test takers. However, the NBCE uses only item responses from a norming group (first-time, nonaccommodated test takers) to fit the IRT models, which further reduces the available sample sizes. Excluding repeaters helps to control for the possible effect of repeaters on equating.24,25
If NBCE exams were given on demand to our current number of examinees, we would not have sufficient data to perform psychometric analyses properly, as specified by the best practices detailed in the Standards for Educational and Psychological Testing.26
Yet, in the context of competency assessment, a computer adaptive testing (CAT) approach may be a conceivable alternative to on-demand tests.27 CAT is a form of assessment that adapts to the ability level of each examinee. Based on the examinee's previous responses, CAT selects each subsequent question from the test items that maximize the precision of the exam. Consequently, test takers with different ability levels will receive different tests. IRT methodology is used to select optimal items for the test, which are chosen based on the statistical estimates of their information and difficulty. The advantage of using CAT is uniform precision for all test takers, whereas traditional testing provides the best precision for examinees in the middle of the ability range. Matching the difficulty of items on the test with the ability of the test taker allows for obtaining maximum information from each item, so the length of the test could be reduced without loss of reliability. Furthermore, by transitioning to CAT, the NBCE will be able to increase the number of testing windows.
The basic principle behind adaptive testing is simple: avoid asking questions that are much too difficult or much too easy for a particular examinee. It is likely that able examinees will answer easy items correctly and struggling test takers will stumble on hard questions; thus, their responses are not particularly informative. Much more is learned by administering questions that challenge, but don't overwhelm, the examinee. Properly identifying and then presenting these questions is the goal of CAT.
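The item-selection principle described above can be sketched as a maximum-information rule. This is an illustrative sketch under a 2-parameter logistic model, not a description of any NBCE system; the item bank, the function name `select_next_item`, and the ability values are hypothetical, and the re-estimation of ability after each response is omitted for brevity.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of one item at the current ability estimate."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, item_bank, administered):
    """Return the index of the unused item that is most informative at theta."""
    candidates = [i for i in range(len(item_bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *item_bank[i]))

# Hypothetical bank of (discrimination, difficulty) pairs: easy, medium, hard.
item_bank = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]

# An examinee currently estimated at average ability gets the medium item.
print(select_next_item(0.0, item_bank, administered=set()))
# An examinee currently estimated at high ability gets the hard item.
print(select_next_item(1.8, item_bank, administered=set()))
```

With equal discriminations, the rule simply picks the unused item whose difficulty is closest to the current ability estimate, which is why able examinees are not given easy items and struggling examinees are not given very hard ones.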
CAT is designed to maximize measurement efficiency, or the precision of test scores in relation to test length. This means that an adaptive test can either save time by being shorter than a conventional test of equal precision or improve score quality by being more precise than a conventional test of equal length.27 Implementing CAT is a laborious process; however, once CAT is implemented, the majority of examinees will take shorter tests with items more closely matched to their ability levels. For this reason, the NBCE is committed to exploring a transition to CAT.
Content validity studies using the Delphi method28 for the Part I and II exams are planned for 2020. During the course of study, the NBCE will provide current test plans and weights to each doctor of chiropractic program, and each will have an opportunity to indicate areas in the test plans that require additions or deletions. The NBCE will summarize the results received from the colleges. The test plans will be modified according to study results.
Future Directions for the Part IV Practical Examination
In January 1996, the NBCE introduced the Part IV exam, the large-scale assessment for chiropractors, which was made up of 3 domains: diagnostic imaging, chiropractic technique, and case management.29 The main component of Part IV is the objective structured clinical examination (OSCE), an assessment designed to test clinical skill performance and competency by simulating real-world procedures.30 The OSCE is a standard mode of assessment of medical competency and clinical skills in the United States, Canada, and the United Kingdom.31–33
Several changes were made to the Part IV exam. First, diagnostic images and laboratory findings were removed from postencounter probe stations. This was implemented for better evaluation of entry-level competence. In accordance with best practices in imaging and patient care, initial case management of some conditions would not require imaging. Second, electronic presentation of the images (on a computer) was introduced to the diagnostic imaging part of the exam. Finally, all domains of the Part IV exam are now scored using IRT methodology. In 2020, an investigation will begin to determine if there are options to further modernize the Part IV exam and further align the chiropractic OSCE with the standard practices currently accepted in health care.
We see testing as a dynamic process that occasionally requires an update. To stay relevant, the NBCE needs to keep up with the modern practices in measurement. Transitioning to IRT was a necessary step in that direction. The implementation of CBT challenged us to construct a fair, valid, and reliable assessment system, to minimize examinees' frustration, and to limit sources of test anxiety. CBT also prompted us to shorten the test and increase the number of testing windows. We hope that the rest of the chiropractic community will share our perception of successful testing on computers. For our part, we will augment the effectiveness of this new mode of assessment through better orientation, easier registration, and possibly, even more testing windows.
Ignoring the evolution in assessment of skills and changes in testing technology may result in a mismatch between the professional skills and testing instruments. The current efforts to modernize the Part IV exam will align the chiropractic OSCE with the standard practices currently accepted in health care. Certainly, a critical part of this transition is to ensure that there will be no disadvantage to our examinees.
FUNDING AND CONFLICTS OF INTEREST
This work was funded internally. The authors have no conflicts of interest to declare relevant to this work.
Norman Ouzts is the chief executive officer of the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; firstname.lastname@example.org). Igor Himelfarb is the director of psychometrics and research of the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; email@example.com). Bruce Shotts is the director of written exams of the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; firstname.lastname@example.org). Andrew R. Gow is the director of practical examinations of the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; email@example.com). Address correspondence to Igor Himelfarb, 901 54th Avenue, Greeley, CO 80634; firstname.lastname@example.org. This article was received May 22, 2019, revised May 31, 2019, and accepted November 3, 2019.