The National Board of Chiropractic Examiners (NBCE) uses a robust system for data analysis. The aim of this work is to introduce the reader to the process of score production and the quantitative methods used by the psychometrician and data analysts of the NBCE.

The NBCE employs data validation, diagnostic analyses, and item response theory–based modeling of responses to estimate test takers' abilities and item-related parameters. For this article, the authors generated 1303 synthetic item responses to 20 multiple-choice items with 4 response options to each item. These data were used to illustrate and explain the processes of data validation, diagnostic item analysis, and item calibration based on item response theory.

The diagnostic item analysis is presented for items 1 and 5 of the data set. The 3-parameter logistic item response theory model was used for calibration. Numerical and graphical results are presented and discussed.

Demands for data-driven decision making and evidence-based effectiveness create a need for objective measures to be used in educational program reviews and evaluations. Standardized test scores are often included in that array of objective measures. With this article, we offer transparency of score production used for NBCE testing.

## INTRODUCTION

According to the Standards for Educational and Psychological Testing, assessment is among the most important contributions of cognitive and behavioral sciences to our society.^{1 } Decision processes in health care professional testing are often complex and ongoing and pose additional challenges due to numerous regulations and their enforcement by the agencies responsible for safety of the general public.^{2 }

The National Board of Chiropractic Examiners (NBCE) adheres to the Standards for Educational and Psychological Testing.^{1 } These principles dictate that the NBCE provide accurate, fair, valid, and reliable assessment results to the intended score recipients, follow specific guidance in assessment development, obey psychometric standards, and respect the rights and responsibilities of the test takers. The goal of the NBCE is to produce scores that are valid and reliable for all test takers and are comparable over time and across test forms. For that purpose, the NBCE follows an established protocol that ensures that these goals are met.

Therefore, our objective for this article is to introduce chiropractic educators to the field of measurement and to demystify the complex and laborious process of scoring NBCE exams. This article provides an overview of principal concepts and modern-day best practices accepted in testing. This is followed by a description and illustration of the multiphase process of score development used by the NBCE, based on a generated data set that mimics the data structure of NBCE Part I and Part II exams.

## OVERVIEW OF TESTING CONCEPTS

### Data Validity

Tabachnick and Fidell^{3 } supplied a checklist for screening data prior to statistical analysis: (1) inspect univariate descriptive statistics for accuracy of input, (2) evaluate the amount and distribution of missing data and deal with the problem, (3) check pairwise plots for nonlinearity and heteroscedasticity, (4) identify and deal with nonnormal variables and univariate outliers, (5) identify and deal with multivariate outliers, and (6) evaluate variables for multicollinearity and singularity. The NBCE follows their suggestions in our data screening procedures.

Furthermore, the NBCE understands the importance of ensuring the accuracy of data used in the process of test score production. After receiving the data from test sites, psychometric data analysts closely examine the match of the data set to the data map, ensuring that the responses for both paper-based and computer-based forms are within the expected ranges.
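A range check of this kind can be sketched in a few lines of Python. This is only an illustration, not the NBCE's procedure: the data map (four lettered options, with an empty string marking an omitted response) and the function name `find_out_of_range` are assumptions made for the example.

```python
# Hypothetical data map: four lettered options per item; "" marks an omitted response.
VALID_OPTIONS = {"A", "B", "C", "D"}
OMIT = ""


def find_out_of_range(responses, valid=VALID_OPTIONS, omit=OMIT):
    """Return (examinee index, item index, value) for each response outside the data map."""
    problems = []
    for j, row in enumerate(responses):
        for i, value in enumerate(row):
            if value != omit and value not in valid:
                problems.append((j, i, value))
    return problems


raw = [
    ["A", "C", "D"],
    ["B", "E", ""],   # "E" is not a valid option; "" is an omit and is allowed
    ["a", "D", "A"],  # lowercase "a" does not match the data map
]
flagged = find_out_of_range(raw)
print(flagged)  # [(1, 1, 'E'), (2, 0, 'a')]
```

Flagged cells would then be traced back to the source records rather than silently recoded.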

As part of the data validation procedure, the NBCE psychometric team, in collaboration with the Written Examinations and Part IV (Practical Testing) staff, established criteria to determine whether examinees exhibit a valid attempt to respond to the items on tests. For paper-based testing, we expect the test takers to mark at least some questions in the answer document. For the computer-based testing, we inspect the examinee response data, along with timing information, and determine whether a test taker made a valid attempt on the test. In addition, a score for a test or part of a test may be invalidated if an examinee completes a section in a very short time and/or receives a very low score. Although we flag item responses that do not meet valid attempt criteria, score invalidations are extremely rare.

Recent research^{4,5 } suggested the use of item response theory (IRT) to identify the cases that behave inconsistently with model assumptions. IRT model-based fit indices may serve as validity estimates for cases with particular response patterns.^{6 } The NBCE uses IRT-based modeling to establish data validity and conduct forensic data analyses.

### Missing Data

Statistical analyses involving inference and prediction become problematic in the presence of missing data.^{7,8 } Several methods of dealing with missing responses in psychological and educational research have been identified and developed.^{9 } However, further consideration should be given to this issue in operational psychometrics, as the effects of missing data on the estimation accuracy of IRT parameters are well documented.^{10 }

Test takers may omit responses when a page layout is complicated or when they have to follow a long passage or task and do not respond due to fatigue or intimidation. Missing data can also occur when test takers run out of time or are unmotivated, overly anxious, fatigued, or overwhelmed.^{11 } The NBCE is closely monitoring cases with missing data as well as investigating the possible causes of missing data from each testing administration.
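One simple way to monitor missing data is to tabulate the omit rate per item and the number of trailing omits per examinee, since a run of omits at the end of a form is often read as running out of time. The sketch below assumes an empty string marks an omitted response; the function names are hypothetical, and the approach is an illustration rather than the NBCE's monitoring procedure.

```python
def omit_rates(responses, omit=""):
    """Proportion of examinees who omitted each item."""
    n = len(responses)
    n_items = len(responses[0])
    return [sum(1 for row in responses if row[i] == omit) / n for i in range(n_items)]


def trailing_omits(row, omit=""):
    """Number of consecutive omitted responses at the end of one examinee's record."""
    count = 0
    for value in reversed(row):
        if value != omit:
            break
        count += 1
    return count


rows = [
    ["A", "B", "C", "D"],
    ["A", "", "C", ""],
    ["B", "C", "", ""],
    ["", "D", "A", "B"],
]
rates = omit_rates(rows)
not_reached = [trailing_omits(r) for r in rows]
print(rates)        # [0.25, 0.25, 0.25, 0.5]
print(not_reached)  # [0, 1, 2, 0]
```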

### Scoring

Scoring could be defined as converting raw item responses to scored responses according to a rubric that makes this process a function of the item type. Items on a test are usually classified as selected-response (SR) or constructed-response (CR) items. In this study, we only review the scoring process for SR items, where the examinees select a correct answer from a limited number of choices. Examples of SR items include multiple-choice items, true-or-false items, or matching items. Scoring for SR items is called objective scoring, which means that no judgment is required for raters to score an item; thus, scored items will have the same score regardless of who scores them. The majority of SR items are scored using the “number correct method,” where the overall test score is the total number of correct responses.^{12 } For example, consider the following test item:

What is the capital of France?

London

San Francisco

Madrid

Paris

The item is a multiple-choice item that contains a single correct answer: Paris. This item is scored dichotomously in the following way: 1 = if the response is Paris, 0 = if otherwise. Other scoring methods are available for test items with more complex designs.^{13,14 }
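Dichotomous number-correct scoring can be sketched as follows. The example assumes that options are lettered A to D in the order shown, so the key for the item above is D; the other two items and all of the responses are invented for illustration.

```python
def score_responses(raw, key):
    """Convert raw option choices to 1 (matches the key) or 0 (anything else, including omits)."""
    return [[1 if resp == k else 0 for resp, k in zip(row, key)] for row in raw]


key = ["D", "A", "C"]  # first entry: Paris is the fourth option
raw = [
    ["D", "A", "B"],
    ["A", "", "C"],   # an omitted response is scored 0
    ["D", "A", "C"],
]
scored = score_responses(raw, key)
totals = [sum(row) for row in scored]  # number-correct score per examinee
print(scored)  # [[1, 1, 0], [0, 0, 1], [1, 1, 1]]
print(totals)  # [2, 1, 3]
```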

### Diagnostic Item Analysis

The operational psychometric procedures include an evaluation of examinees' performance as well as the performance of items on the test. The first step in this evaluation is to conduct a diagnostic item analysis (DIA), which is a statistical analysis based on classical test theory (CTT).^{15 } The DIA provides measurement and bias information about items. This information is used for item reviews, test construction and revisions, technical reports, and other psychometric documentation. The DIA shows the number and percent of test takers responding to each answer choice, the *p* values, point-biserial correlations, and other useful statistics. Further descriptions of these statistics follow.

### Item Difficulty

Item difficulty is defined as the proportion of examinees who answered the item correctly, also known as the *p* value. The formula for the *p* value is

$$p_i = \frac{N_{ic}}{N_i}$$

where *N*_{ic} is the number of examinees who answered item *i* correctly and *N*_{i} is the total number of examinees who attempted the item.

Commonly, for dichotomously scored items, the difficulty of an item is measured by the proportion of test takers who answered the question correctly. The range of proportion correct is 0 to 1, with 0 indicating that all examinees responded to an item incorrectly and 1 indicating that all examinees responded to an item correctly. Higher *p* values indicate easier items and/or more able populations. Desired *p* values generally fall within the range of .25 to .95. For multiple choice items, Thompson and Levitow^{16 } suggested that the ideal difficulty for an item is slightly higher than the middle point between the percentage of answering correctly by guessing (25% for the 4-option multiple-choice items) and all examinees answering correctly (100%). For polytomously scored items, the *p* value represents the average item score or the proportion of the maximum obtainable score.^{17 } Desired values generally fall within the range of 30% to 80% of the maximum obtainable score.
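As a minimal sketch, the *p* value for each dichotomously scored item is the number of correct responses divided by the number of examinees who attempted the item. The example assumes that `None` marks an item an examinee did not attempt; the data are invented.

```python
def item_p_values(scored):
    """Proportion correct per item; None entries (not attempted) are left out of the denominator."""
    n_items = len(scored[0])
    p_values = []
    for i in range(n_items):
        attempted = [row[i] for row in scored if row[i] is not None]
        p_values.append(sum(attempted) / len(attempted) if attempted else None)
    return p_values


scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, None],  # third item not attempted by this examinee
    [0, 1, 1],
]
p_values = item_p_values(scored)
print(p_values)  # item 1: 3/4, item 2: 3/4, item 3: 1/3
```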

### Item Discrimination

Item discrimination refers to the extent that a test item distinguishes between examinees with different levels of ability. The index of item discrimination is derived using correlation. The foundation of the correlation-based approach is the Pearson product-moment correlation used to measure the strength of linear relationship between 2 normally distributed variables.^{18 } The formula for the Pearson product-moment correlation is

$$r_{xy} = \frac{\mathit{COV}(x, y)}{\mathit{SD}_x \, \mathit{SD}_y}$$

where *COV*(*x*, *y*) is the covariance between variables *x* and *y*, *SD*_{x} is the standard deviation for *x*, and *SD*_{y} is the standard deviation for *y*.

When measuring item discrimination for dichotomously scored items, a special case of the Pearson product-moment correlation, called point-biserial correlation, is used.^{11 } The formula for the point-biserial correlation is

$$r_{pbis} = \frac{\bar{X}_s - \bar{X}_\mu}{S_Y} \sqrt{pq}$$

where *X̄*_{s} is the mean test score for examinees who provide a correct response to the item, *X̄*_{μ} is the mean test score for examinees who responded to the item incorrectly, *S*_{Y} is the standard deviation of the test score, *p* is the proportion of examinees who respond to the item correctly, and *q* is the proportion of examinees who responded to the item incorrectly.

An item is considered to perform well if high-ability test takers tend to answer correctly and low-ability test takers tend to answer incorrectly. An item with negative or extremely low correlations indicates serious problems and should be reviewed.
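The point-biserial correlation can be computed either from the mean-difference form or as an ordinary Pearson correlation between the 0/1 item score and the total score. The sketch below does both on invented data and assumes the population standard deviation is used, in which case the two results agree.

```python
import math
from statistics import mean, pstdev


def point_biserial(item_scores, total_scores):
    """Mean-difference form: (mean of correct group - mean of incorrect group) / SD * sqrt(p * q)."""
    right = [t for x, t in zip(item_scores, total_scores) if x == 1]
    wrong = [t for x, t in zip(item_scores, total_scores) if x == 0]
    p = len(right) / len(item_scores)
    q = 1 - p
    return (mean(right) - mean(wrong)) / pstdev(total_scores) * math.sqrt(p * q)


def pearson(x, y):
    """Pearson correlation using population covariance and standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))


item = [1, 1, 1, 0, 0, 1, 0, 1]            # scored responses to one item
totals = [18, 15, 17, 9, 11, 14, 12, 16]   # total test scores for the same examinees
r_pb = point_biserial(item, totals)
r_xy = pearson(item, totals)
print(round(r_pb, 4), round(r_xy, 4))
```

In this toy example the examinees who answered correctly have clearly higher totals, so the index is strongly positive.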

### Distractor Analysis

Analysis of distractors is required to determine how attractive, and therefore how useful, each response option is. A distractor should be a plausible choice, reflecting a common misconception. If a distractor fails to attract examinees with lower ability levels, the response option should be modified. A discrimination index (eg, point-biserial correlation coefficient) should also be calculated for each distractor to determine whether it is performing correctly.^{19 } We expect the discrimination index for distractors to be zero or negative.
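A distractor table of this kind can be sketched by treating each option as its own 0/1 variable (chosen or not chosen) and correlating it with the total score. The responses, total scores, and key below are invented, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns None when either variable has no variance."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return None  # e.g. an option chosen by nobody or by everybody
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (sx * sy)


def distractor_table(choices, totals, options="ABCD"):
    """Count, proportion, and option-total correlation for every response option."""
    counts = Counter(choices)
    n = len(choices)
    table = {}
    for opt in options:
        chose = [1 if c == opt else 0 for c in choices]
        table[opt] = {"n": counts[opt], "prop": counts[opt] / n, "r": pearson(chose, totals)}
    return table


choices = ["A", "A", "B", "A", "C", "A", "D", "B", "A", "A"]  # key is A (assumed)
totals = [18, 17, 9, 16, 8, 15, 10, 11, 19, 14]
table = distractor_table(choices, totals)
for opt, row in table.items():
    print(opt, row["n"], round(row["prop"], 2), round(row["r"], 2))
```

Here the key correlates positively with the total score and every distractor correlates negatively, which is the pattern described above.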

### IRT

The logic behind testing is to develop an instrument that, with a number of items, will reliably measure the ability of interest. Then, using the pattern of item responses, with reasonable precision, the place on the ability continuum for each examinee can be determined. In the IRT literature, the ability parameter is denoted *θ*_{j}. Then, using mathematical models, a probability of responding correctly on a test item conditional on ability could be calculated. For a properly functioning test item, this probability will be near 0 for examinees who are low on the ability continuum and near 1 for examinees who are high on the continuum. The S-shape curve connecting the ability (x-axis) and the probability of the correct response (y-axis) for a particular item is known as the item characteristic curve.^{20 }

CTT is built around the framework of linear models: it uses a linear decomposition to separate the true score from error. IRT models relate the student's ability to item scores using a nonlinear framework.^{21 } The IRT models for dichotomous data differ in the number of parameters included in the models. The 1-parameter logistic model (1PL) estimates the probability of the correct response to an item as a function of the difference between the examinee's ability and item difficulty. This estimation becomes possible because the examinee's ability and item difficulty are on the same logit scale. The logit, or the log odds, is the logarithm of the odds *p*/(1 − *p*), where *p* is the probability for the event of interest. Thus, logit(*p*) = ln[*p*/(1 − *p*)]. A unit increase on the logit scale multiplies the odds of the correct response to the item by *e* (approximately 2.72). The logit scale is a perfect representation of the interval scale,^{22 } which is considered to be one of the key advantages of using logits. Furthermore, the logarithmic transformation creates a continuum from the dichotomously scored responses.
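The relationship between probabilities, odds, and logits can be checked numerically. The sketch below defines the logit and its inverse and confirms that moving 1 unit up the logit scale multiplies the odds by *e*. The function names are illustrative, and `p_1pl` is the standard 1PL form with no scaling constant, which is an assumption.

```python
import math


def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Probability corresponding to a value x on the logit scale."""
    return 1.0 / (1.0 + math.exp(-x))


def p_1pl(theta, b):
    """1PL probability of a correct response: depends only on ability minus difficulty."""
    return inv_logit(theta - b)


def odds(p):
    return p / (1.0 - p)


x = 0.4
ratio = odds(inv_logit(x + 1.0)) / odds(inv_logit(x))
print(logit(0.5))          # 0.0: even odds sit at the origin of the logit scale
print(round(ratio, 4))     # 2.7183: one logit multiplies the odds by e
print(p_1pl(1.0, 1.0))     # 0.5: ability equal to difficulty
```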

The 2-parameter logistic model (2PL) introduces a discrimination parameter that is the slope of the curve; the steeper the slope, the better the discrimination. The model is advantageous when compared to 1PL due to a closer fit to the response data resulting in more precise parameter estimates. Finally, the 3-parameter logistic model (3PL) adds a guessing parameter, the probability of providing a correct response to an item by chance, to the difficulty and discrimination parameters estimated by the 2PL. The value of the guessing parameter does not vary as the function of ability level.^{20 }
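For reference, the three models can be written side by side. The forms below are the standard textbook parameterizations without a scaling constant, which is an assumption, and the symbols follow those used elsewhere in this article: ability *θ*_{j}, discrimination *a*_{i}, difficulty *β*_{i}, and guessing *γ*_{i}. The response variable *X*_{ij} (1 for a correct response of examinee *j* to item *i*) is notation introduced here.

$$
\begin{aligned}
\text{1PL:}\quad & P(X_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - \beta_i)}{1 + \exp(\theta_j - \beta_i)} \\
\text{2PL:}\quad & P(X_{ij} = 1 \mid \theta_j) = \frac{\exp[a_i(\theta_j - \beta_i)]}{1 + \exp[a_i(\theta_j - \beta_i)]} \\
\text{3PL:}\quad & P(X_{ij} = 1 \mid \theta_j) = \gamma_i + (1 - \gamma_i)\,\frac{\exp[a_i(\theta_j - \beta_i)]}{1 + \exp[a_i(\theta_j - \beta_i)]}
\end{aligned}
$$

Setting *γ*_{i} = 0 in the 3PL gives the 2PL, and setting *a*_{i} = 1 in the 2PL gives the 1PL, so each model is a special case of the next.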

### Calibration

Calibration is a process of fitting IRT models to scored item responses. The purpose of item calibration is to obtain IRT parameters (difficulty, discrimination, and guessing) for each item on a test. Himelfarb^{23 } detailed the history, assumptions, and models of IRT. The NBCE uses the 3PL IRT model^{24,25 } for dichotomously scored items and the graded response model^{26 } or the generalized partial credit model^{27 } for items scored using more than 1 correct answer. The calibration of item responses provides us with many features that are helpful in the decision-making process about test takers and test items. In a later section, we will discuss the item-related information that the NBCE obtains from calibration procedures. The following illustrates the operational psychometric procedures that the NBCE employs.

## SIMULATED DATA SET AND SCORE DEVELOPMENT

The purpose of this study is to introduce the reader to the course of test score production. To illustrate the scoring, 1303 synthetic item responses to 20 multiple-choice (MC) items with 4 response options to each item were generated. The “psych” package^{28 } within R programming language^{29 } version 3.3.1 was used to generate unidimensional item responses. It was assumed that each item has only 1 correct answer. Throughout the remainder of the article, we will refer to these generated data as “the Test.”
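To give a sense of what generating unidimensional item responses involves, the sketch below simulates scored responses for 1303 examinees and 20 items in Python. It is not the authors' R procedure: the 3PL generating model, the parameter ranges, the guessing value of .25, and the seed are all assumptions chosen for illustration.

```python
import math
import random


def p_correct(theta, a, b, c):
    """3PL probability of a correct response (no scaling constant assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate(n_examinees=1303, n_items=20, seed=1):
    """Draw abilities from a standard normal and scored responses from the 3PL."""
    rng = random.Random(seed)
    # Hypothetical item parameters: (discrimination, difficulty, guessing).
    items = [(rng.uniform(0.8, 2.0), rng.uniform(-2.0, 2.0), 0.25) for _ in range(n_items)]
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
    data = [
        [1 if rng.random() < p_correct(t, a, b, c) else 0 for (a, b, c) in items]
        for t in thetas
    ]
    return items, thetas, data


items, thetas, data = simulate()
print(len(data), len(data[0]))  # 1303 20
```

Because a single ability value drives every response of an examinee, the resulting data are unidimensional by construction.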

A popular scoring schema for MC items is dichotomous scoring. When scored dichotomously, an examinee receives 1 scored point for selecting the correct response and 0 scored points for choosing a distractor.^{30 } Tables 1 and 2 present raw and scored item responses for the Test, respectively.

The DIA based on CTT methods was performed using raw data responses on the Test. We used the Structured Query Language^{31 } to generate the DIA. The estimated statistics included the number of examinees choosing a particular response category, a *p* value, and an estimate of point-biserial correlation for each response category on an item.

The IRT calibration was performed with the 3PL IRT model.^{32 } The 3PL is a model that estimates the parameters of item discrimination (*a*_{i}) and item difficulty (*β*_{i}) with an additional parameter, *γ*_{i}, the lower asymptote of the item characteristic curve, representing the probability of a test taker with a low ability providing a correct answer to item *i*. The inclusion of this parameter suggests that test takers who score low on the latent trait may still provide a correct response by chance. This parameter is referred to as “guessing.” The following is the mathematical representation of the 3PL IRT model:

$$P(X_{ij} = 1 \mid \theta_j) = \gamma_i + (1 - \gamma_i)\,\frac{\exp[a_i(\theta_j - \beta_i)]}{1 + \exp[a_i(\theta_j - \beta_i)]}$$

where *X*_{ij} is the scored response of examinee *j* to item *i* and *θ*_{j} is the ability of examinee *j*.
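In code form, the 3PL adds the lower asymptote to a logistic curve that is rescaled to the remaining probability range. The sketch below assumes the standard parameterization without a scaling constant; the parameter values are invented for illustration.

```python
import math


def prob_3pl(theta, a, beta, gamma):
    """P(correct) = gamma + (1 - gamma) * logistic(a * (theta - beta))."""
    return gamma + (1.0 - gamma) / (1.0 + math.exp(-a * (theta - beta)))


a, beta, gamma = 1.2, 0.5, 0.25  # hypothetical item: 4 options, so chance level is about .25
low = prob_3pl(-8.0, a, beta, gamma)   # very low ability: approaches the lower asymptote
mid = prob_3pl(beta, a, beta, gamma)   # ability equal to difficulty: (1 + gamma) / 2
high = prob_3pl(8.0, a, beta, gamma)   # very high ability: approaches 1
print(round(low, 3), mid, round(high, 3))  # 0.25 0.625 1.0
```

The first value shows why the parameter is called guessing: even an examinee far down the ability scale retains roughly a 1 in 4 chance of a correct response on this hypothetical item.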

Table 3 presents the DIA results for item 1 on the Test. The key, A, is evidently preferred by the majority of test takers (n = 851). The *p* value is .78, which indicates that this item is in the acceptable difficulty range. The item-total correlation for the key is positive, *r* = .29, while the correlations for distractors are all negative. Figure 1 presents smoothed plots of the key and each of the distractors on item 1. The x-axis represents 4 categories of scores: lower, middle 50%, middle 75%, and upper. The lower category is the proportion of test takers from the lowest score group choosing the response, while upper is the proportion from the highest score group. The y-axis represents the percentage of test takers in each category who chose that response. The graph shows that test takers in the lower category are not able to differentiate between the key and the distractors. For the middle 50%, distractor B is preferred over the key. However, for the middle 75% and upper score categories, the key clearly prevails over the distractors. Based on the numerical and graphical analyses, we conclude that item 1 is performing appropriately.

Table 4 presents the DIA for item 5 on the Test. Similar to item 1, the key is preferred by the majority of the test takers (n = 844). The item-total correlation for the key is positive and within the accepted range, while the correlations for the distractors are all negative. The numerical analysis indicates that distractor A attracts almost 3 times as many test takers as option B or D. Figure 2 shows that the key is visibly prevalent for examinees in the middle 75% and upper score categories.

For calibration, Table 5 displays the item-parameter estimates for *a* (discrimination), *b* (difficulty), and *c* (guessing) and the standard errors associated with these estimations obtained via calibration of the Test using the 3PL IRT model. The item difficulty represents the point on the ability scale where the item characteristic curve is steepest. Under the 1PL and 2PL models, a test taker at this point has a 50% probability (point of median probability) of providing a correct response to the item; under the 3PL model, this probability is (1 + *c*)/2, halfway between the guessing parameter and 1. The accepted range for difficulty is between −4.0 and 4.0; however, items with values above +2.0 are considered hard, and items with values below −2.0 are considered easy. The discrimination parameters are not limited in range; however, negatively discriminating items are discarded from a test.^{32 } The estimates of guessing signify pseudochance; these are the values of the lower asymptote for the curve representing an item.
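The screening rules in this paragraph can be expressed as a small function. The thresholds come from the text; the label "moderate" for items between −2.0 and +2.0, the function name, and the parameter estimates are assumptions made for the example and are not the values in Table 5.

```python
def review_item(a, b):
    """Label an item from its discrimination (a) and difficulty (b) estimates."""
    if a < 0:
        return "discard: negative discrimination"
    if not -4.0 <= b <= 4.0:
        return "review: difficulty outside accepted range"
    if b > 2.0:
        return "hard"
    if b < -2.0:
        return "easy"
    return "moderate"


# Hypothetical (a, b) estimates keyed by item number.
estimates = {1: (1.1, -2.4), 2: (0.9, 0.3), 3: (1.4, 2.6), 4: (-0.2, 0.1), 5: (0.8, 4.5)}
labels = {item: review_item(a, b) for item, (a, b) in estimates.items()}
for item, label in labels.items():
    print(item, label)
```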

Figure 3a shows the item characteristic curves for the 20 items on the Test, and Figure 3b shows the item information functions. The x-axis on both graphs represents the test takers' ability or the latent trait, while the y-axis on the left graph shows the probability of the correct response conditional on ability level. On the right graph, the y-axis represents the amount of information each item provides for a specific ability level.

The items to the right on both graphs are more difficult, while items on the left are easier. On the left graph, the items with the steeper slopes are better-discriminating items. From the numerical information and the graphs, item 1 appears to be the easiest on the Test, while item 13 is the hardest. On the right graph, the items with higher curves provide more information regarding ability. This plot helps to see which item is more informative for each ability segment.
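The item information function for the 3PL has a closed form, so the question of which item is most informative at a given ability can be answered directly. The sketch below uses the standard 3PL information formula without a scaling constant, which is an assumption; the three items and the function names are invented for illustration.

```python
import math


def prob_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def info_3pl(theta, a, b, c):
    """3PL item information: a^2 * (Q / P) * ((P - c) / (1 - c))^2."""
    p = prob_3pl(theta, a, b, c)
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2


def most_informative(theta, items):
    """Index of the item with the highest information at ability theta."""
    return max(range(len(items)), key=lambda i: info_3pl(theta, *items[i]))


# Hypothetical items (a, b, c): same slope and guessing, increasing difficulty.
items = [(1.5, -1.0, 0.2), (1.5, 0.0, 0.2), (1.5, 1.0, 0.2)]
best = [most_informative(theta, items) for theta in (-1.0, 0.0, 1.0)]
print(best)  # [0, 1, 2]
```

With equal slopes, each item is most informative for examinees whose ability is close to its difficulty, which matches the left-to-right ordering of the curves described above. Guessing lowers the information: at *θ* = *b* with *a* = 1.5, the value drops from 0.5625 when *c* = 0 to 0.3375 when *c* = .25.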

## DISCUSSION

Professional assessment is a key component for any evidence-based practice.^{33 } According to the Standards for Educational and Psychological Testing,^{1 } assessment is intended to provide the public, including employers and governing agencies, with a dependable mechanism for identifying practitioners who have met particular requirements and are ready to practice according to established standards. To be able to provide stakeholders with that mechanism, professional testing programs must ensure a close connection between the occupation and the content of the test. The process of gathering such evidence is called validation and is never ending; the use of test scores may be valid for one purpose and not valid for another. Validity would not be possible without reliability,^{34 } which is a quantification of measurement precision for test scores.^{35 } The theory of reliability assumes that every examinee possesses a latent true score, the parameter indicating the degree of knowledge he or she truly has. A test score is then an estimate of that parameter.

Error is an inherent factor of measurement. The process of classification is always susceptible to type I error (the probability of rejecting the null hypothesis when null is true) and type II error (the probability of not rejecting the null hypothesis when null is false). The sources for type I and type II errors are countless and may include test taker–related factors such as fatigue or anxiety and psychological or situational factors. However, the goal of a testing program is to minimize the errors related to the instrument (test) by striving to increase the validity and reliability.

To be able to produce a test score, a scoring model needs to be assumed. Every model is a simplification of reality. For example, when a child learns that 1 apple plus another apple equals 2 apples, this models the higher mathematical concept of addition. Thus, often, the simplification of reality is quite useful, as it provides explanation of a phenomenon and allows prediction. As George Edward Pelham Box, a great British statistician, once said, “All models are wrong, but some are useful.”^{36 }

## CONCLUSION

Today the faculty and educational administrators in all sectors of American higher education follow high-stakes accountability policies. The demands for data-driven decision making and evidence-based effectiveness create a need for objective measures to be used in educational program reviews and evaluations. Standardized test scores are often included in that array of objective measures. Thus, educational institutions may base faculty members' evaluations on how well their students do on standardized tests, which leads to implications for professional development, compensation, benefits, and tenure.^{37 } In turn, faculty members express their frustration criticizing the validity and reliability of scores and the legitimacy of agencies that produce these scores. With this article, we offer transparency of score production and hope that faculty members will take time to understand the seriousness of the work involved.

## ACKNOWLEDGMENT

The authors would like to thank Alison Day for her review and editing contributions to this article.

## FUNDING AND CONFLICT OF INTEREST

This work was funded internally. The authors have no conflicts of interest to declare relevant to this work.

## REFERENCES

## Author notes

Igor Himelfarb is the director of the Department of Psychometrics and Research at the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; ihimelfarb@nbce.org). Bruce L. Shotts is the director of written examinations at the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; bshotts@nbce.org). Nai-En Tang is a psychometric data analyst in the Department of Psychometrics and Research at the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; ntang@nbce.org). Margaret Smith is a senior data analyst in the Department of Psychometrics and Research at the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; msmith@nbce.org). Address correspondence to Igor Himelfarb, 901 54th Avenue, Greeley, CO 80634; ihimelfarb@nbce.org. This article was received September 16, 2018; revised December 20, 2018; and accepted January 18, 2019.

Concept development: IH. Design: IH. Supervision: BS. Data collection/processing: IH, NET, MS. Analysis/interpretation: IH, MS. Literature search: NET. Writing: IH, BS. Critical review: IH, BS.