This article introduces changes made to the diagnostic imaging (DIM) domain of Part IV of the National Board of Chiropractic Examiners examination and evaluates the effects of these changes in terms of item functioning and examinee performance.
To evaluate item function, classical test theory and item response theory (IRT) methods were employed. Classical statistics were used for the assessment of item difficulty and the relation to the total test score. Item difficulties along with item discrimination were calculated using IRT. We also studied the decision accuracy of the redesigned DIM domain.
The diagnostic item analysis revealed similarity in item functioning across test forms and across administrations. The IRT models showed a reasonable fit to the data. The averages of the IRT parameters were similar across test forms and across administrations. The classification of test takers into ability (theta) categories was consistent across groups (both norming and all examinees), across all test forms, and across administrations.
This research signifies a first step in the evaluation of the transition to digital DIM high-stakes assessments. We hope that this study will spur further research into evaluations of the ability to interpret radiographic images. In addition, we hope that the results prove to be useful for chiropractic faculty, chiropractic students, and the users of Part IV scores.
INTRODUCTION
All jurisdictions in the United States require proficiency in radiography within a chiropractor's scope of practice. Most also permit licensed chiropractors to order and evaluate the reported results of advanced imaging procedures, such as spinal computed tomography scans and magnetic resonance images. The diagnostic imaging (DIM) component of the National Board of Chiropractic Examiners (NBCE) Part IV exam was originally developed to assure chiropractic licensing boards that applicants for licensure possessed the requisite skills and ability to perform these functions in a safe and effective manner, thereby protecting the public's health. Technological advances in DIM (primarily the transition from film-based to digital images) have mandated significant changes in the methods of testing current examinees.
In data collected in 2014 for the NBCE's 2015 practice analysis, 28.1% of chiropractors who took x-ray images in their offices used digital equipment to obtain images of their patients.1 Since the radiography industry was rapidly undergoing technological change, chiropractors without radiographic equipment frequently referred their patients to imaging facilities with digital equipment and then reviewed the resultant digital images. Considering these developments, the NBCE directed its staff to redevelop the DIM component of the Part IV exam to use digital images.
In 2016 and 2017, the NBCE pilot tested a modified digital version of the DIM exam at 5 chiropractic colleges with promising results.2,3 The 2018 Part IV Test Committee then selected and approved digital images for the modified DIM component, which was administered at the subsequent Part IV examination in November 2018. The objective of this article is to introduce the changes in the exam and to evaluate the possible effects of these changes.
BACKGROUND
Digital Radiographs in Testing
In response to the technological advancements in health care, boards and organizations responsible for pre-licensure and certification testing started making headway in adopting digital technology for testing purposes. To construct items with digital images, testing organizations had to change the platform of exam delivery—tests needed to be delivered on computers. The American Board of Radiology began developing computer-based exams in 1997, when a flexible examination platform adapted to the graphical needs of image-based items was built.4 Since then, the medical specialty boards of pathology, pediatrics, family practice, internal medicine, neurology, and obstetrics and gynecology have all moved their exams to computer-based testing.5 Today, it is well recognized across health care that the future of radiology is digital. The transition to digital imaging in DIM is supported by the industry,6,7 and best practices in digital radiography have been developed and followed.8
The NBCE Practice Analysis survey1 inquired whether students in chiropractic training programs have access to digital x-ray imaging. The responses were 73.3% “yes” and 26.7% “no” in 2008 and 100% “yes” in 2014. In 2014, the NBCE surveyed the radiology faculty in chiropractic institutions to determine the extent of usage of digital imaging in chiropractic colleges. Sixty-nine percent of the respondents indicated that digital images were used in patient clinics in 100% of the cases, 23% indicated that digital images were used in 75% of the cases, and 8% indicated that digital images were used in 0% of the cases.
With evidence of increased implementation of digital radiography in both chiropractic practice and chiropractic education, the NBCE made the decision to move away from the use of conventional radiographic images in the Part IV exam. In early 2015, the NBCE began a feasibility study for the transition to digital imaging, examining various modes of delivery and methods of building an image library and identifying chiropractic campuses on which to begin pilot studies. Five chiropractic educational institutions were identified for the pilot examination, and in the period between July 2016 and June 2017, 10- and 20-station/image exam forms were administered. Following the pilot exams, a 2-way univariate analysis of variance was conducted to investigate the effects of test form and test site (institution) on the test scores. The analysis showed that, when test forms and test sites were considered, the variability in scores could be explained only by differences in performance among test sites while controlling for the effect of test form and the interaction effect.2 Next, the challenge was to decide which psychometric models to use for scoring the newly refined exam.
Models for Polytomously Scored Items
Each of the DIM digital presentations asks for 2 interpretation responses from the examinee. Since these 2 questions are linked to the same stimulus (the station's digital images), they cannot be treated as independent items. A wide range of item response theory (IRT) models have been proposed to handle polytomous items with possible local dependency when testlets are formed, such as the Rating Scale Model,9 Partial Credit Model (PCM10), Generalized Partial Credit Model (GPCM11), and Graded Response Model (GRM12).
Samejima12 developed a logistic model for graded responses in which the probability that a student i with a particular ability level θi will provide a response to an item j in category k is the difference between the cumulative probability of a response in that category or higher and the cumulative probability of a response in the next higher category (k + 1) or higher. Let's consider the following:

P*jk(θi) = 1 / (1 + exp[−Daj(θi − bjk)])

Pjk(θi) = P*jk(θi) − P*j,k+1(θi)

where bjk is the difficulty parameter for category k of item j, aj is the discrimination parameter for item j, and D is the scaling constant 1.702.12,13 Samejima's GRM is classified as a difference model because the category probability is calculated by the difference of 2 cumulative probabilities.10
Another model for ordered-categorical responses was developed by Masters.10 In this PCM, the probability that a student i will provide a response x on item j with Mj thresholds is a function of the student's ability and the difficulties of the Mj thresholds in item j; it is given by the following:

Pjx(θi) = exp[Σ(k=1 to x)(θi − bjk)] / Σ(h=0 to Mj) exp[Σ(k=1 to h)(θi − bjk)]

where x = 1, 2, …, Mj is the count of successfully completed thresholds and, by convention, the empty sum Σ(k=1 to 0)(θi − bjk) appearing in the denominator for h = 0 is defined to be 0.
In the PCM, bjk is referred to as the step difficulty parameter for category x; the higher the value of bjk, the more difficult that category is relative to the other categories within the item. The bjk can also be interpreted as the intersection point of 2 adjacent category response curves.14 The PCM is a divide-by-total model, as it is written as an exponential divided by a total of exponentials.15
Muraki11 developed a generalization of the PCM that allows the items within an instrument to have different discrimination parameters, as in the case of the GRM12; this is referred to as the GPCM. The GPCM substitutes a discriminant parameter aj into Masters's PCM:

Pjx(θi) = exp[Σ(k=1 to x) Daj(θi − bjk)] / Σ(h=0 to Mj) exp[Σ(k=1 to h) Daj(θi − bjk)]

where D is the scaling constant and, as in the PCM, the empty sum for h = 0 is defined to be 0.
Muraki11 described the discriminant parameter aj as “the degree to which categorical responses vary among items as θ level changes.”
Ostini and Nering16 concluded that the theoretical choice between the GPCM and GRM is somewhat arbitrary and that the difference between the 2 models is purely mathematical. They suggest that the best method to select a model should begin with a consideration of the data characteristics.
In the area of cognitive testing, Cook et al17 systematically compared the GPCM and the GRM in the context of testlet scoring for the fall 1994 administration of the Scholastic Assessment Test I. They found that the correlation between theta estimates was very high (r = .987) and that both models exhibited a good fit across examinees' ability ranges, as assessed by plots of empirical and theoretical probabilities. The only analysis that produced notable differences between the 2 models was the information function: the GRM provided greater information than the GPCM across most of the examinees' ability range. However, they showed that the greater information obtained from the GRM was explained partly by the higher values of the discrimination parameters estimated by that model.
Naumenko18 compared the GRM and GPCM for testlet scores from a task-based simulation certified public accountant exam. The results showed a close relationship between the GRM and GPCM ability estimates. The examination of item fit statistics revealed an essentially equivalent fit of both models to the exam data. When comparing the information functions, the GRM provided greater information over a wider range of ability estimates than the GPCM. The author noted that comparing information functions between 2 different models can be theoretically challenging because the calculation of the discrimination parameters differs between the models.
In the area of noncognitive tests, Baker et al13 compared the performance of Samejima's GRM with Masters's PCM in a questionnaire about subjective well-being with a 5-point Likert scale. They found that Samejima's GRM outperformed Masters's PCM by being more robust to violations of the unidimensionality assumption and better fitting the data. Further, they noted that Masters's10 PCM may be more applicable in situations where the items meet the assumption of an equal slope parameter across items.13
Maydeu-Olivares19 compared the fit of both the GRM and the GPCM to the Social Problem Solving Inventory, which uses a 5-point Likert scale. The results showed that the GRM consistently outperformed all divide-by-total models, including the GPCM, having the smallest mean χ2/df ratio for item pairs and triplets across all scales.
Research Questions
Given the above, the following research questions were addressed by this study:
RQ1: Are the classical test theory (CTT)–based parameter estimates comparable across test forms and across May and November DIM administrations?
RQ2: Are the IRT-based parameters comparable across test forms and across May and November DIM testing administrations?
RQ3: Is there evidence of decision consistency/accuracy across the May and November DIM testing administrations?
METHODS
Study Objectives
The goals of this research were to introduce the changes made in the DIM portion of NBCE's Part IV exam and to study the effects of these changes on the test items and on examinees' performance. We employed methodologies based on CTT and IRT.20,21 We also studied the decision accuracy of the redesigned DIM exam. One of the changes made to the exam was to substitute x-ray films on view boxes with digital images. Therefore, it is natural to expect that this change may account for some variability in item parameters and/or in test scores. Nevertheless, this was not an investigation of the effect of administration mode. Based on our pilot test data, we assumed that if such an effect existed, it would have minimal impact on the item difficulty parameters and even less on test scores.
Participants
This study used data from operational administrations of the NBCE's Part IV exams; therefore, all participants were chiropractic students within 6 months of graduation from, or graduates of, an eligible chiropractic college who had passed the Part I exam. The study was approved by the institutional review board of the NBCE. The number of test takers in May 2018 was n = 1424; of them, n = 1152 were nonaccommodated, first-time examinees (norming group). There were n = 796 examinees who were administered Form 1 of the exam and n = 628 who were administered Form 2. There were n = 1298 examinees who attempted the redesigned Part IV exam in November 2018; of them, n = 1169 were in the norming group. The number of examinees who were administered Form 1 in November was n = 867, while n = 431 were administered Form 2.
The average DIM raw scores in May were mean (M) = 28.13, SD = 4.86, for Form 1 and M = 27.78, SD = 4.94, for Form 2. The average DIM raw scores for November were M = 29.95, SD = 4.17, for Form 1 and M = 28.31, SD = 4.61, for Form 2.
Measures
In May 2018, the DIM portion of the exam consisted of 10 stations with 2 items per station. Each item had 10 response choices, and 2 correct responses were required to obtain full credit. The examination in May was administered on film (on lighted view boxes). In November, the DIM portion consisted of 20 stations with 2 items per station. Each item had 4 response choices, and the correct response was required to obtain full credit. The exam in November used digital images displayed on computer monitors.
The time allowed per station was 4 minutes in May and 2 minutes in November. In May, the first item in each station inquired about the x-ray findings present on the film, while the second item was developed to address the impression/diagnosis, case management, or sequela of the condition. In November, the first item probed the impression/diagnosis, and the second item addressed case management or sequela.
Scoring Rubrics
The test items in May were scored polytomously from 0 to 4. The following algorithm was employed to convert raw responses to scores in May: 0 of 4 correct responses resulted in a score of 0, 1 of 4 resulted in 1, 2 of 4 resulted in 2, 3 of 4 resulted in 3, and 4 of 4 resulted in 4. The test items in November were scored polytomously from 0 to 2. The scoring rubric implemented in November was as follows: 0 of 2 correct responses resulted in a score of 0, 1 of 2 resulted in 1, and 2 of 2 resulted in 2.
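A minimal R sketch of these conversion rules is shown below; the input format is an assumption made for the example, with each element representing the number of correct responses an examinee gave.

# Hypothetical helpers: convert counts of correct responses to rubric scores.
# `n_correct_may` can range from 0 to 4; `n_correct_nov` from 0 to 2.
score_may <- function(n_correct_may) pmin(pmax(n_correct_may, 0L), 4L)
score_nov <- function(n_correct_nov) pmin(pmax(n_correct_nov, 0L), 2L)

score_may(c(0, 2, 4))   # returns 0 2 4
score_nov(c(1, 2))      # returns 1 2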
Classical Item Analysis
After receiving all examinee response data, implementing scoring rules, validating the responses in data files, and applying agreed-on valid case criteria to the data, classical item analysis to evaluate item difficulties and item discriminations was performed. The classical item analysis was conducted on the operational and field test items to collect information about item performance. The analysis is called “classical” because all statistical estimates calculated during this procedure are based on CTT. The basic assumption of CTT is that the overall (observed) score has 2 components—the true score, which represents the actual ability of a test taker,22 and an error component, which by itself is a combination of systematic and random errors.23–26 CTT assumes that the true score is relatively stable, while the random error, on the other hand, is not. Therefore, supposing that all systematic variability is accounted for by true score(s), an average of a measurement that was taken a reasonable number of times will approximate the true score.
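This decomposition can be restated compactly in standard CTT notation (the article itself does not display the formula): with X the observed score, T the true score, and E the error term,

X = T + E,

where E has a mean of 0 and is assumed to be uncorrelated with T, so that Var(X) = Var(T) + Var(E).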
Classical item analysis provides information about 2 concepts of interest to test developers and psychometricians—the estimates of item difficulty and the estimates of item discrimination. Item difficulty is a relative concept, as the same item may be difficult for 1 population while easy for another. Psychometricians make difficulty judgments by taking into account the content on the test, the purpose of the test, and the population to which the test is to be administered. For a dichotomously scored test item, the estimate of item difficulty (p value) in the framework of CTT is computed as the proportion of examinees who answered that item correctly. Items with difficulty values closer to 1.0 are considered easier, while items with difficulty estimates closer to 0 are considered harder. Desired p values generally fall within the range of .25–.95.
For polytomous items, the difficulty is measured by taking the mean of the item score. The average item score could range from 0 to the maximum possible score points for the item. To help with the interpretation, the item average is usually expressed as a percentage of the maximum possible score, which is equivalent to the p value in dichotomous items. Numerous factors may cause items to fall outside the desired difficulty ranges. Items may not perform as expected due to the lack of familiarity with the content or the item type. Test administrators may wish to consider these items for future use based on the importance of the item content or the need to measure with more precision the performance of examinees with very high or very low ability levels.
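As an illustration, a minimal base-R sketch of these difficulty estimates follows; the object name `scores` and the maximum of 4 points are assumptions for the example (a May-style form), not code from the study.

# `scores`: data frame of polytomous station scores (0-4),
# one row per examinee and one column per station.
max_points <- 4

item_means <- colMeans(scores, na.rm = TRUE)   # average item score (polytomous difficulty)
p_values   <- item_means / max_points          # expressed as a proportion of the maximum score

# For a dichotomously scored (0/1) item, colMeans() alone gives the classical
# p value: the proportion of examinees answering the item correctly.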
The purpose of most tests is to classify the examinee population into mastery/nonmastery groups according to the construct measured by the test.24 Therefore, it is important that a test item effectively discriminate between these 2 populations. The statistic that relates the performance on an item to the total score obtained on the test is called the item-total correlation. In the CTT framework, this statistic serves as the index of discrimination. For polytomously scored categorical items, the polyserial correlation is computed as an estimate of the relation between a continuous variable and an ordinal, categorical variable.27
Item-total correlation can range from −1.0 to +1.0. Desired values are positive and greater than .20. A negative item-total correlation indicates that low-ability examinees outperformed the high-ability examinees, which may suggest a range of problems, from a mis-key during the scoring process to serious problems with content development.
In this study, the classical item analysis was conducted in R28 using the “psychometric” package.29 Frequency distributions were constructed to identify items with few or no observations at any score point. In addition, the average item scores and the estimates of the correlation with criterion (total score) were calculated. The correlations were derived without excluding the item under consideration from the total score.
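A sketch of the discrimination estimates described above follows. The polyserial correlation here is computed with the polycor package, which is an assumption for the example (the article names only the "psychometric" package); as in the article, the item is not excluded from the total score.

library(polycor)  # assumed here for the polyserial correlation

total <- rowSums(scores, na.rm = TRUE)  # total test score used as the criterion

# Polyserial correlation between each ordinal station score and the total score
item_total <- sapply(scores, function(item) polyserial(total, item))

# Flag stations that fall below the desired threshold of .20
weak_items <- names(item_total)[item_total < 0.20]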
Overview of Statistical Modeling
The choice of statistical model used to fit data depends on the type of the test item, the design of the response options, and the rubrics used to convert raw responses to scores. For items scored dichotomously, when each answer is scored as correct or incorrect, a plethora of logistic IRT models is available.20,21 However, when the response is scored on an ordinal scale,30 more general models are available for use.31 Various polytomous IRT models have been developed to fit ordered categorical responses. These models can be classified into 3 categories: adjacent category models, or generalized partial credit models; cumulative probability models, or graded response models; and sequential models, also known as continuation ratio models. Typically, in these models, higher levels of responses are associated with higher student ability levels.
Polytomous IRT models were developed to describe the probability that a response falls into a particular category conditional on an examinee's ability level and item parameters.32 Similar to the framework of generalized linear models, where models for ordinal response data are extended from generalized linear models for binary responses,33 most IRT models for polytomous data are extended from IRT models for binary responses. The item characteristic curve, a function that connects a test taker's ability to the probability of a response, is usually constructed only for the correct response in IRT models for binary data because the probability of the incorrect response is simply 1 minus the probability of the correct response.31 This is not the case for polytomous models—a curve is constructed for each scored response. For example, an item scored 0 through 3 will have 4 curves corresponding to the probability of each response conditional on the ability level.
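Stated in standard notation (a restatement of the point above, not a formula reproduced from the article), with Xj denoting the scored response to item j:

P(Xj = 0 | θ) = 1 − P(Xj = 1 | θ) for a dichotomous item, and

Σ(k=0 to mj) P(Xj = k | θ) = 1 for a polytomous item with categories 0, 1, …, mj,

so the 4 category response curves of an item scored 0 through 3 sum to 1 at every ability level θ.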
IRT Calibration
The term calibration in psychometric context corresponds to the fit of statistical models to the item response data. The purpose of IRT calibration and scaling is to place all operational and field test items onto a common scale. There were 2 operational DIM forms used in May and 2 in November. The calibration was performed within forms for the May and the November administrations.
IRT is a collection of measurement models that connect a test taker's standing on the latent trait to the probability of a response based on the observed item responses.31 Traditionally, IRT models assume unidimensionality and local independence.32 Under the assumption of unidimensionality, a single latent trait should account for all systematic variability among item responses.34 The assumption of local independence states that, conditional on ability, the probability of a correct response on 1 item is not influenced by the performance on any of the other items.35 We tested the assumption of unidimensionality with factor analysis. To diagnose the assumption of local independence, Q3 statistics were estimated for Rasch models in each form,36 resulting in an estimate for every item pair, namely the correlation between the item residuals after fitting the Rasch model. Yen37 provides guidance for the assessment of local independence: “The expected value of Q3, when local independence holds, is approximately −1/(n − 1).” There was no evidence of violation of the local independence assumption among the DIM items.
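A minimal sketch of these checks follows, using the mirt package rather than the software named in the article (an assumption made for the example); `responses` is assumed to be a data frame of polytomous station scores for one form.

library(mirt)

# Rough unidimensionality screen: dominance of the first eigenvalue
eigen(cor(responses, use = "pairwise.complete.obs"))$values

# Fit a unidimensional Rasch-family (partial credit) model to the form
fit_rasch <- mirt(responses, model = 1, itemtype = "Rasch", verbose = FALSE)

# Yen's Q3: correlations of item residuals for every item pair
q3 <- residuals(fit_rasch, type = "Q3")

# Compare observed Q3 values with Yen's expected value -1/(n - 1)
q3_offdiag  <- q3[upper.tri(q3)]
expected_q3 <- -1 / (ncol(responses) - 1)
summary(q3_offdiag - expected_q3)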
Graded Response Model
The GRM extends from the logistic positive exponent family of models for dichotomous responses.12 The GRM calculates the probabilities based on the 2PL model specification while estimating difficulty parameters for every step of the item and a single discrimination parameter.18
As noted above, Baker et al13 compared Samejima's12 logistic model with Masters's10 model using psychological data with multiple response categories and found that Samejima's model was more robust to violations of the unidimensionality assumption and fit the data better, while Masters's PCM may be more applicable where the items meet the assumption of an equal slope parameter across items. The following section describes the GRM in statistical terms.
Consider an ordinal item j with K response categories, and let θ be the latent variable representing the trait being measured. The GRM specifies the trace function for each category of the item as

Pj(category k | θ) = P*j(k | θ) − P*j(k + 1 | θ)

where Pj(category k | θ) is the probability of a response in category k given the latent trait θ, P*j(k | θ) is the probability of an observed response in category k or higher, and P*j(k + 1 | θ) is the probability of an observed response in category k + 1 or higher. For k = 1, 2, …, K − 1, the cumulative probabilities may be written in the following way:

P*j(k | θ) = 1 / (1 + exp[−αj(θ − βjk)])

where αj is the discrimination parameter for item j and βjk is the difficulty parameter for a response in category k within item j. That is, if an item is scored 0, 1, 2, 3, then, given that a response is provided, the probability that the response is in category 0 or higher is unity, and the probability that the response is in category 4 or higher is 0. Finally, the probability that the response falls in any particular category between 0 and 3 is given by the difference of the 2 adjacent cumulative probabilities above.12,38
Decision Consistency
We approached the decision consistency study from the perspective of reliability. Synonyms for reliability include dependability, stability, consistency, reproducibility, predictability, and lack of distortion. One of the approaches to establishing reliability is to inquire how much error of measurement is present in a measuring instrument.26 Since the object of our attention in this study was a classification instrument (a test), we believed it would be appropriate to measure the extent of error made in classification. Furthermore, since the test is scored using IRT, we decided to investigate the reliability of classification into ability categories.
According to Baker,39 each examinee responding to a test item possesses some amount of the underlying ability. Thus, one can consider each examinee to have a numerical value, a score that places the examinee on the ability scale. The basic premise of IRT is that the probability of correct response on an item is a function of ability, denoted by θ. The ability estimates are derived during the process of test calibration. Next, the examinees are categorized according to their estimated ability based on their item responses. In this study, we compared the frequency distributions of examinees in different ability categories.
To ensure longitudinal comparison of scores, decision consistency is required when changes are made to an assessment instrument. However, the desired levels of decision consistency do not guarantee that testing results necessarily reflect examinees' true ability.40 While decision consistency suggests reliability, whether the instrument tests what it is supposed to test is the most critical aspect of validity.23 This article does not attempt a content validity study; therefore, our results should be interpreted with caution.
RESULTS
Overview
The results were analyzed using rigorous, widely accepted statistical methodologies for evaluating validity and reliability; the statistical fit of the models used for item calibration, equating, and scaling; and longitudinal consistency of the exam. Classical statistics were used for the assessment of item difficulty and the relation to the total test score. Item difficulties along with item discrimination were calculated using IRT.
Data Cleaning and Classical Item Analysis
The classical item analysis was conducted in R using the “psychometric” package.29 In preparation for item analysis, all item responses were reviewed to verify that the data were free of errors. We looked for data entry errors, data merge errors, missing values, and out-of-range responses. The purpose of data cleaning procedures was to ensure that the psychometric analyses were conducted on a valid set of examinee responses.
Tables 1 through 4 present the results of item analysis for DIM administered in May (Tables 1 and 2) and in November (Tables 3 and 4). The statistics used in this step (p values, polyserial correlations, and reliability coefficients) were derived from CTT. Item analyses were conducted by form and by administration. The primary purpose of these analyses was to evaluate the quality of test items; therefore, in addition to operational items, field-test items were included.
The stations on Form 1 administered in May showed consistent performance in terms of difficulty and discrimination. The difficulty estimates ranged from M = 1.9, SD = 1.34, to M = 3.69, SD = .61. The scoring rubric for DIM in May ranged from 0 to 4; lower numbers correspond to more difficult items, whereas higher numbers indicate easier items. The average form difficulty was M = 2.85, SE = .6. The average correlation with the criterion was r̄ = .42, which is within the acceptable range. Item-total correlations can range from −1.0 to +1.0, and desired values are positive and larger than .20. There were no stations with extremely low or negative correlations on Form 1 administered in May.
The difficulty and discrimination parameter estimates on Form 2 administered in May were within expected ranges. The difficulties ranged from M = 1.87, SD = 1.23, to M = 3.66, SD = .7. For the 10 stations on the form, the average difficulty was M = 2.81, SE = .6. The average correlation with the criterion was r̄ = .45. The 2 DIM forms administered in May revealed consistency in terms of overall difficulty and discrimination calculated using classical methodology.
The DIM exam in November was scored using a rubric ranging from 0 to 2. The average difficulty on Form 1 administered in November was M = 1.51, SE = .33. The average discrimination (correlation with the criterion) was r̄ = .3. The average difficulty for Form 2 administered in November was M = 1.43, SE = .33. The average discrimination was r̄ = .31. Similar to the forms administered in May, the November forms revealed consistency in difficulty and discrimination.
Calibration
The purpose of IRT calibration and scaling is to place the scores of different testing administrations on a common difficulty scale. For calibration of DIM items, the GRM12 was employed. The operational calibration was performed using IRTPRO 4.2 for Windows41—a computer program that provides the ability to calibrate item responses using a plethora of IRT models for dichotomous and polytomous data. For this study, however, in addition to the operational results, the calibration was replicated in R using the “ltm” package,42 achieving a perfect match between the 2 sets of results. The calibration was performed within form and within administration; therefore, 4 sets of results are presented: Form 1 and Form 2 administered in May 2018 and Form 1 and Form 2 administered in November 2018.
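A minimal sketch of such a replication with the ltm package follows; the object name `form1_nov` is an assumption for the example and stands for a data frame of station scores (0, 1, 2) for one November form, with one row per examinee.

library(ltm)

# Fit the graded response model with a separate discrimination parameter per station
fit_grm <- grm(form1_nov, constrained = FALSE)

# Discrimination (a) and category difficulty (b) estimates on the IRT scale
coef(fit_grm)

# Ability (theta) estimates for the observed response patterns
theta_hat <- factor.scores(fit_grm, resp.patterns = form1_nov)$score.dat$z1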
Table 5 presents the discrimination (a) and category difficulties (b's) along with standard errors associated with these estimates for Form 1 administered in May. The last column gives the S − X2 generalized to polytomous models.43 Two aspects of the results are notable. First, no items showed statistically significant misfit; however, several items had b1 estimates outside of expected ranges, indicating that the category may be too easy for the population of examinees. The difficulty averages for the form were b̄1 = −5.55, b̄2 = −2.85, b̄3 = −1.26, and b̄4 = .91. Second, the standard error estimates were reasonable. Station 1 failed to attract responses in the fourth category, resulting in a missing estimate for b4 and its associated standard error.
Table 6 shows calibration results for the Form 2 administered in May. All items but 1 (station 4) revealed good fit of the model. The fit index calculated for station 4 revealed statistical significance (p < .01), indicating a misfit. Furthermore, the difficulty estimate associated with the item was out of range (b1 = −10.65); therefore, the item was removed from the second round of operational calibration and was not included in the calculation of the total score. The difficulty averages for Form 2 were consistent with the estimates calculated for Form 1: b̄1 = −5.38, b̄2 = −2.58, b̄3 = −1.01, and b̄4 = .91. The discrimination estimates for both forms were reasonable.
Table 7 gives estimates of the discrimination and difficulty for the 20 items (stations) on Form 1 administered in November. For the majority of the items on the form, the difficulty estimates were reasonable. Two stations revealed out-of-range estimates for b1 (stations 4 and 9). These items were removed from the second round of operational calibration and calculation of the total score. Station 14 indicated a poor fit of the model; however, the difficulty estimates associated with the item were reasonable. Thus, the station was not removed from further consideration. The averages of the difficulty estimates on the form were b̄1 = −5.52 and b̄2 = −2.71.
Table 8 provides results for Form 2 administered in November. Analogously to Form 1, the estimates were reasonable for the majority of the items. While there were no items with poor statistical fit, stations 16 and 20 revealed out-of-range difficulty estimates. These stations were removed from the second round of operational calibration and calculation of the total score. The difficulty averages for Form 2 were b̄1 = −5.77, b̄2 = −1.23.
The calibration results showed consistency of the test before and after the change. The averages of the parameter estimates are comparable across the forms and administrations. The fit of the IRT model to the data is acceptable, and the number of items excluded from operational procedures is minimal. Further, the marginal reliability estimates for thetas were high and ranged from .78 to .85 for May and from .82 to .88 in November.
Figures 1 through 4 present item characteristic curves for DIM test stations on both forms administered in May and November. The rubrics used for the forms administered in May had 5 response options, while rubrics used for forms administered in November had 3. Each plot represents an item (station); the x-axis is the ability scale (theta) depicted in standardized (z score) units. The y-axis is the probability of the response conditional on the ability level. Each curve in the plot represents a response option.
For polytomously scored items, we expect the probability of the response associated with full credit to increase with the ability level. With minor exceptions (eg, item 9 on May's Form 1), all plots show a monotonic increase of the curve associated with the full-credit response as a function of ability level.
Decision Consistency
The consistency with which a testing instrument classifies members of the same group into the same category is very important for longitudinal comparison of scores. In criterion-referenced testing, a common approach to studying decision consistency is relative to the cut scores—when the same examinees are tested 2 or more times, they should be classified into the same categories, while controlling for type I and type II errors. However, the test takers in May and November were different; therefore, we took a different approach to the consistency study. We created 12 ability intervals of width .5, ranging from −3.0 to 3.0, and we compared the count of examinees classified into each ability interval between testing administrations.
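A minimal sketch of this tabulation in base R follows; the vectors `theta_may` and `theta_nov` are assumptions for the example and stand for the estimated abilities (z-score metric) of the May and November groups.

breaks <- seq(-3, 3, by = 0.5)   # 12 ability intervals of width .5

bin_thetas <- function(theta) {
  # Counts per interval; estimates outside [-3, 3] fall outside these bins
  table(cut(theta, breaks = breaks, include.lowest = TRUE))
}

# Counts of examinees per ability interval, compared across administrations
counts <- rbind(May = bin_thetas(theta_may), November = bin_thetas(theta_nov))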
Table 9 presents the results of classification consistency for the norming group (nonaccommodated, first-time test takers) and the overall sample for the May and November administrations. It is evident from the table and the plots (Fig. 5) that the distributions are approximately normal, with the majority of test takers clustered around the mean, as expected. The minor between-category deviations in the counts of examinees are attributable to random error.
DISCUSSION
Technological advances in the field of medical imaging required educational programs and testing organizations in all health care fields to reexamine their curricula and testing content. A 2014 survey of the chiropractic colleges in the United States revealed that only 8% of the doctor of chiropractic programs did not use digital images in their outpatient clinics and 15% did not utilize digital radiographs for educational or examination purposes. Although a small percentage of doctor of chiropractic programs were not utilizing digital images, private practice usage of digital imaging increased from 11.6% in 2009 to 28.1% in 2014.1 With the increase in private practice use along with the movement of all chiropractic educational institutions in the direction of digital radiography, the NBCE began the transition of the DIM component of the Part IV exam from plain film to digital radiography.
Following the pilot examination, it was determined that a 20-station/image format was preferable in order to decrease item bias and content underrepresentation. It was also determined that, in order to standardize the examination experience, the NBCE would provide identical monitors to the college campuses for the inaugural digital DIM examination that occurred in November 2018. While in the pilot planning stage, the NBCE also initiated contact with practicing chiropractors to begin building a substantial digital library, an effort that continues today.
Limitations
The findings of this research were derived from examining only 2 test administrations—May 2018, prior to the change in DIM, and November 2018, after the change. The Part IV exam started the transition to IRT scoring in 2017, with May 2018 being the first administration in which IRT scoring was fully implemented. Therefore, it is reasonable to assume that the May and November test takers may not be representative of all examinees who take the Part IV exam. Further, due to the nature of the chiropractic profession and chiropractic education, the cohorts of examinees are not very large, yet the statistical models used in this study are large-sample techniques. While the stability of the model estimates was reasonable, larger sample sizes may have been beneficial.
CONCLUSION
Aside from introducing chiropractic practitioners, chiropractic faculty, and chiropractic students to the changes in the DIM component of the Part IV exam, the goal of this study was to examine the performance of the revised exam. From a psychometric perspective, we examined the performance of the items on the exam as well as the performance of the 2 cohorts—examinees who took the test in May and November 2018.
The diagnostic item analysis revealed similarity in item functioning across test forms and across administrations. The distributions of responses across response options were reasonable, and the correlations with the criterion were in the expected ranges. The IRT models displayed reasonable fit to the data; fit indices indicated misfit at the item level on only 2 occasions, and the few items with out-of-range parameter estimates were removed from the scoring process. The averages of the IRT parameters were similar across test forms and across administrations. The classification of test takers into ability (theta) categories was consistent across norming/all groups, across test forms, and across administrations.
The transition to digital DIM was well thought through, relied on empirical findings, and was aligned with the current chiropractic curricula in the United States. This research signifies a first step in the evaluation of the transition to digital DIM. We hope that this study will spur further DIM research. In addition, we hope that the results prove to be useful for chiropractic faculty, chiropractic students, and the users of Part IV scores.
FUNDING AND CONFLICTS OF INTEREST
This work was funded internally. The authors are employees of the NBCE.
REFERENCES
Author notes
Igor Himelfarb is the director of psychometrics and research at the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; [email protected]). Margaret Seron is a consultant for the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; [email protected]). John Hyland is a consultant for the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; [email protected]). Andrew Gow is the director of practical testing, research, and development at the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; [email protected]). Nai-En Tang is a data analyst at the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; [email protected]). Meghan Dukes is a chiropractic specialist at the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; [email protected]). Margaret Smith is senior data analyst at the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; [email protected]). Address correspondence to Igor Himelfarb, 901 54th Avenue, Greeley, CO 80634; [email protected]. This article was received February 18, 2019; revised May 16 and June 24, 2019; and accepted July 20, 2019.
Concept development: IH, JH, MS, MD, MF. Design: IH, NT. Supervision: IH, AG, JH. Data collection/processing: MS, MF, NT. Analysis/interpretation: IH. Literature search: JH, MF, NT, IH. Writing: IH, JH, NT, AG. Critical review: JH, AG, MD, MF, MS, NT.