This article presents health science educators and researchers with an overview of standardized testing in educational measurement. The history, theoretical frameworks of classical test theory, item response theory (IRT), and the most common IRT models used in modern testing are presented.
A narrative overview of the history, theoretical concepts, test theory, and IRT is provided to familiarize the reader with these concepts of modern testing. Examples of data analyses using different models are shown using 2 simulated data sets. One set consisted of a sample of 2000 item responses to 40 multiple-choice, dichotomously scored items. This set was used to fit the 1-parameter logistic (1PL), 2PL, and 3PL IRT models. Another data set was a sample of 1500 item responses to 10 polytomously scored items. The second data set was used to fit a graded response model.
Model-based item parameter estimates for 1PL, 2PL, 3PL, and graded response are presented, evaluated, and explained.
This study provides health science educators and education researchers with an introduction to educational measurement. The history of standardized testing, the frameworks of classical test theory and IRT, and the logic of scaling and equating are presented. This introductory article will aid readers in understanding these concepts.
In the 20th century, the concept of public protection dictated implementation of licensing laws to those professions having a direct relationship to public health and safety.1 A plethora of discipline-specific prelicensure standardized assessment instruments (tests) exists to ensure compliance with the disciplinary standards. In the chiropractic profession, every year thousands of students take the prelicensure Part I, II, III, and IV examinations of the National Board of Chiropractic Examiners. As with any examination, some students feel that these standardized tests are unfair and have little relevance to clinical practice. Even faculty members often understand little about the boards. This article aims to provide an introduction to the world of standardized assessment not only for chiropractic educators but also for any health sciences educator or educational researcher.
OVERVIEW AND SIMULATED ANALYSES
History of Standardized Testing
The early history of standardized testing goes back more than 2 millennia. In the 3rd century BCE in imperial China, to qualify for civil service, Chinese aristocrats were examined for their proficiency in music, archery, horsemanship, calligraphy, arithmetic, and ceremonial knowledge. Later, the examinations tested knowledge of civil law, military affairs, agriculture, geography, composition, and poetry.2,3 Those who passed these exams were qualified to serve the Chinese emperor and his family. The exams were accompanied by an atmosphere of solemnity and attention to the young nobles who dared to be scrutinized for the prestigious positions. The topics of the exams were frequently provided by the emperor, and he often examined the applicants during the final stage of the competition.
In the late 1880s, Francis Galton was inspired by the work of his cousin, Charles Darwin, regarding the origin of species and became interested in the hereditary basis of intelligence and the measurement of human ability. Galton developed the theoretical bases of testing—the application of a series of identical tests to a large number of individuals and the statistical processing of the results.4 In 1904, Alfred Binet, a Parisian with a doctorate in experimental psychology, was commissioned by the French ministry of education to study schoolchildren who were developmentally behind their peers. His task was to develop a method to identify children who were not benefiting from inclusion in regular classrooms and required special education.5 For this purpose, Binet and his associate, Theodore Simon, designed and administered a 30-item instrument arranged by difficulty that tested ability for judgment, understanding, and reasoning.1
The field of testing developed rapidly during World War I (1914–1918), when the problem of professional selection for the needs of the army and military production became a priority. During that time, leading psychologists organized the Army Alpha Examination to test army recruits.6 Their success further inspired psychologists to advocate for civilian testing. During the 20th century, large-scale assessment in the United States became a necessity for college admissions and school accountability. The reliance on standardized tests for college admission was a response to the increasing number of students applying to colleges, and it became a tool to tighten the gates in the face of limited resources.7
In the 21st century, standardized tests constitute an inseparable part of American culture. Assessment instruments are administered in a wide range of settings: K–12, college admission, academic progression, professional licensure, clinical credentialing, industrial, forensic, and many more. “Gatekeepers of America's meritocracy—educators, academic institutions, and employers—have used test scores to label people as bright or not bright, as worthy academically or not worthy.”8 The study of measurement processes and the methods used to produce scores in testing evolved into a specialized discipline—psychometrics, a combination of education, psychology, and statistics.9
Critique of Standardized Tests
As the use of standardized tests for high-stakes exams increased, so did the critique of their use.10 Counsell11 conducted a case study exploring the effect of the high-stakes accountability system on the lives of students and teachers. The findings revealed that the culture of testing introduces a continuum of fear and ethical and moral dilemmas related to the pressure experienced by instructors when schools use test scores as a measure of accountability. Often, instructors decontextualize the material to the students with an intention to artificially inflate the test scores.12 Such a phenomenon is known to researchers as “teaching to the test” and is often controlled for by psychometric procedures.13
Kohn14 claimed that admission tests (such as the SAT and ACT) are “not very effective as predictors of future academic performance, even in the freshman year of college, much less as predictors of professional success.” Zwick and Himelfarb15 predicted 1st-year undergraduate grade-point average (FYGPA) in 34 colleges from high school GPA (HSGPA) and SAT scores using linear regression models. The average R2 for these regression models was .226 (this coefficient indicates the amount of variance in the regression outcome explained by the linear combination of the predictors). However, in most of the models, the HSGPA was the predictor that accounted for the majority of variance. Zwick and Himelfarb stated, “The only substantial increase in R2 values occurred when SAT scores are added to a prediction equation that included self-reported HSGPA.”
Furthermore, the study highlighted the overprediction (the predicted outcomes were higher than actual) of FYGPA for African American and Latino students and the underprediction (the predicted outcomes were lower than actual) for Caucasian and Asian students when high school grades and SAT scores were used. Zwick and Himelfarb concluded that these errors in prediction were partially attributed to high school socioeconomic status—African American and Latino students are more likely than Caucasian students to attend high schools with fewer resources.
Measurement and Classification
Two processes are involved when a test is administered—measurement and classification. Measurement is the process of assigning numerical values to a phenomenon. This is a thorny process because numbers are used to categorize the phenomenon, and numerical scales hold qualities such as differentiation (1 is different from 2), order (2 is higher than 1), equality of intervals (the interval between 1 and 2 is equal to the interval between 2 and 3), and a 0 point, which is not always a true absence of value. By assigning numerical values to categories, the rules associated with numbers are carried over to the properties of the measured phenomenon and may not always correspond to the actual properties of the measured objects.
Stevens16 developed a hierarchy of measurement scales: nominal, ordinal, interval, and ratio. The nominal scale is a system of measurement where numbers are used for the purpose of differentiation only. For example, the numerical part of a street address or apartment number is numbered on the nominal scale. The number on the jersey of a football player is used to differentiate the player from others, and it too is on the nominal scale. The categorical coding of most demographic variables, such as gender, ethnicity, and political party affiliation, constitutes nominal measures.17 Since nominal enumeration is used only to distinguish categories, the numbers assigned to the categories do not follow any order or presume interval equality. The nominal scale is the most rudimentary form of measurement.
The ordinal scale is a measurement scheme where, in addition to simple differentiation (the attribute specified by the nominal scale), the numbers represent a rank order of the measured phenomenon. Examples of ordinal measures are rankings in the Olympic Games, progressions of the spiciness of a dish in a restaurant (mild, spicy, and very spicy), military rank, birth order, and class rank. Another example of an ordinal measure is the emoji-face pain scale commonly used in health care. An ordinal scale establishes the order of categories but lacks the ability of comparison between the categories' intervals.
The subsequent scale in Stevens's hierarchy is the interval scale, which, in addition to differentiation and rank order, establishes the property of interval equality. On this scale, the intervals between adjacent points are presumed to be equal. One example of the interval scale is a number line, where, going from left to right, each subsequent number is higher in rank, and the intervals between adjacent numbers are equal across the entire domain of the line. Another example is a temperature scale measured in Celsius or Fahrenheit. In the social sciences, items commonly measured on the Likert scale, ranging from “strongly disagree” to “strongly agree,” for the purposes of statistical analysis of opinions, are assumed to be on the interval scale.
The highest measurement scale in the hierarchy is the ratio scale. In addition to the properties established by the nominal, ordinal, and interval scales, a ratio scale has a true 0 point (complete absence of value). Neither the number line nor the Celsius or Fahrenheit temperature scales have an absolute 0 point. The 0 on the number line is nothing more than a separation between the negative and positive numbers and can be rescaled with a simple linear transformation. The 0 on the temperature scale (in Celsius) is also not an absence of value but rather a point at which water becomes ice. An example of a ratio scale is the Kelvin temperature scale, where 0 (absolute zero) is a true 0 point rather than an arbitrary reference point.
Every assessment is designed to measure and classify the test takers' performance in a specific domain. Depending on the assessment design, the scores can be on the ordinal, interval, or even ratio scale. Then, depending on the score obtained on the test, a test taker can be classified into the mastery or nonmastery categories (in the case of professional testing) or into basic, proficient, or advanced levels of performance in the case of K–12.18
When test takers present themselves at the test site for an exam administration, they arrive as members of a single population. The goal of the test designer and test administrator is to separate the test takers into subpopulations according to the intended users' objectives for the scores. Thus, each item on the test is a classification tool that helps make the categorization decision regarding each individual test taker. With each item that is answered correctly, a test taker is more likely to be classified into the higher category, while each incorrect response increases the likelihood of classification into a lower category.
Reliability and Validity
The quality of a measurement instrument is expressed in terms of the reliability and validity of the scores collected by this instrument. Reliability is the consistency with which a measure, scale, or instrument assesses a given construct, while validity refers to the degree of relationship, or the “overlap” between an instrument and the construct it is intended to measure.13 The traditional meaning of reliability is the degree to which respondents' scores on a given administration of a measure resemble their scores on the same instrument administered later within a reasonable time frame. Kerlinger and Lee19 suggested 3 approaches to reliability: stability, lack of distortion, and being free of measurement error. The first 2 definitions are addressed in this section; the third definition requires an introduction to classical test theory20,21 and is addressed later.
If a measurement instrument or a comparable form is administered multiple times to the same or a similar group of people, we should expect similar scores. This is called temporal stability—the degree to which data obtained in a given test administration resemble those obtained in following administrations. When an assessment is conducted, a score user expects assurance that scores are replicable if the same individuals are tested repeatedly under the same circumstances.9 There are 2 techniques to assess temporal stability: the test–retest method and the parallel forms method.
In the test–retest method, a set of items is administered to a group of subjects, then the test is readministered later to the same group. The correlation of the 2 sets of scores is then measured. A higher correlation between the scores indicates higher reliability.
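The correlation step of the test–retest method can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `pearson_r` and the 2 sets of scores are hypothetical, and the Pearson correlation is assumed as the measure of relationship.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical scores for 6 examinees on the first and second administrations.
first_administration = [62, 70, 75, 81, 88, 93]
second_administration = [65, 68, 78, 80, 85, 95]

test_retest_r = pearson_r(first_administration, second_administration)
print(round(test_retest_r, 3))
```

A value close to 1.0 would be read as evidence of temporal stability; a value near 0 would suggest that scores do not replicate across administrations.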
In the parallel forms method, 2 different forms of the same test are constructed, both measuring the same critical trait (knowledge base). Next, both forms are administered to the same group of test takers at the same test session. A higher relationship between the 2 sets of scores indicates higher reliability. However, it is very difficult to correctly construct equivalent test forms, and a weak relationship between the 2 sets of scores may actually reflect a lack of equivalence.
Another component of reliability is a scale's internal consistency. The lack of distortion or internal consistency of an instrument refers to the extent to which the individual components of a test are interrelated and thus produce the same or similar results. Items on the test should “hang together.” One of the earlier techniques to establish the internal consistency of a scale is known as the split-half reliability.22 The test is randomly split in half, and the 2 sets of test scores are compared to each other. Once again, a closer relationship between the 2 sets of scores indicates a higher test reliability.
Cronbach6,23 developed the coefficient alpha, an alternative to the once common split-half technique, which has become the most universal technique for estimating internal consistency reliability. His coefficient alpha assesses reliability as a ratio of the summed variances of individual items and the total variance for the instrument, subtracted from 1 and adjusted for the number of items in the instrument. Cronbach's alpha coefficient is computed as follows:
$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right) \tag{1}$$
where α is the estimate of the instrument's internal consistency reliability; k is the number of items on the instrument; i is the item indicator, i = 1, 2, …, k; $\sigma_i^2$ is the variance of item i; and $\sigma_X^2$ is the total variance of the scale.
Cronbach's alpha typically ranges from 0 to 1.0, with values closer to 1.0 indicating higher reliability. The internal consistency of a test is considered acceptable if the alpha coefficient is above .70.24,25 Cronbach's alpha can also be interpreted in terms of the mean of all interitem correlations: for a fixed number of items, a higher mean interitem correlation yields a higher alpha. If a correlation coefficient is squared, it becomes a coefficient of determination, which indicates the proportion of variability shared between 2 variables.19 Thus, when .70 is squared, it becomes .49. This means that approximately half of the variability in the responses collected by the instrument is explained by the instrument's internal consistency.
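The verbal definition of coefficient alpha (1 minus the ratio of the summed item variances to the total score variance, scaled by k/(k − 1)) can be illustrated with a short Python sketch. The function name `cronbach_alpha` and the small response matrix are hypothetical; population variances are used throughout, which is an assumption that does not affect the ratio as long as it is applied consistently.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Coefficient alpha for a list of examinee rows, each holding k item scores."""
    k = len(scores[0])
    # Variance of each item across examinees.
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total (summed) score across examinees.
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses of 5 examinees to 4 dichotomously scored items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))  # approximately 0.8 for this toy matrix
```

In this toy matrix the summed item variances equal 0.8 and the total score variance equals 2.0, so alpha is (4/3)(1 − 0.4) = 0.8.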
Reliability alone is not sufficient to establish the quality of a test. A good test must also measure what it was designed to measure, which is often referred to as validity. The validity of a scale refers to the extent of correspondence between variations in the scores on the test and the variation among respondents on the underlying construct being tested.13 The process of validation is closely related to the intended use of the scores. For example, scores collected on a test of general anatomy given in English ideally depict the knowledge of anatomy possessed by a test taker. Yet, if a test is given to a sample of English-language learners, a part of the variability in scores can be explained by English proficiency (or lack thereof). Therefore, the scores collected by the same test in an English-first population of test takers may have higher validity than scores collected from English-language learners.
Importantly, the validity of a test is a matter of degree, not all or none. Further, the existing evidence of validity may be challenged by new findings or by new circumstances. Unavoidably, validity becomes an evolving property, and test validation is a continuous process.26 This process of validation requires ongoing empirical research efforts outside of those used for reliability. The methods employed for establishing validity of a test include a thorough analysis of the content of the test during the phase of its scale development and quantitative assessment of the relationship between the test scores and the criterion that has been tested.2 The degree of accuracy with which test scores relate to their intended use may be established by studying the predictive validity.
Test scores with low validity can still be reliable, while reliability is a prerequisite for validity. Establishing reliability is more of a technical matter, whereas validity requires much deeper thinking and consideration; it is much more than a statistical procedure. Continuous vigilant consideration of each item in terms of content representation and its statistical performance as well as the reflection on the populations of test takers are all essential for confirming a test score's validity.
Classical Test Theory
Any measurement is an inference, and any statistical inference is subject to error. All measurements are susceptible to random error and, if repeated, may vary. To comprehend the size and the origin of the error, ideally, the measurement should be repeated several times, as the average of a series of measurements is more precise than any individual measurement by a factor equal to the square root of the number of measurements.27 Classical test theory (CTT) postulates that any observation is a linear combination of the true score and error. The fundamental equation of CTT states the following:
$$O_i = T_i + E_i \tag{2}$$
where Oi is the observed score for an examinee i, Ti is the true score for that examinee, and Ei is the error in the measurement. Thus, every test could be seen as a combination of 2 hypothetical components: the true score (true knowledge of the material tested) and the deviations from the true score due to random or systematic factors. Any systematic errors in measurement become part of an individual's true score and affect the validity since the score is no longer an estimate only of the latent trait but also of the systematic variability. The random errors, on the other hand, affect the reliability of the score and create a distortion in the observed score's precision over repeated administrations of the test.
Test scores can be described as random variables.9 A random variable X is an outcome of a process that is determined by a probability distribution. The term “expectation” or “expected value,” denoted as E(X), is used to signify the mean of the probability distribution. Assuming that all systematic variability in the observed score is accounted for by the true score and the error component consists of only random error, we can specify the distribution of the errors as follows:
$$E(E_i) = 0 \tag{3}$$
which means that if examinee i takes the exam an infinite number of times, by definition of random, the same amount of error will be distributed above and below the true score. Thus, the error will average at 0. The relationship between the observed score and the true score can be clarified by taking the expectation of the observed score:
Meanwhile, if the expectation of error is 0 (see equation 3) and the expected value of the observed score is the true score,
Then it follows from equations 2 and 5 that
There are 3 other fundamental assumptions made by CTT: it is assumed that the correlation between true score and error is 0, that the correlation between error score on test 1 and error score on test 2 is 0, and that the correlation between the true score on test 1 and the error score on test 2 is 0.
The definition of reliability can be formulated in the framework of CTT if the following extension is made to the equation 2:
$$\mathrm{Var}(O_i) = \mathrm{Var}(T_i) + \mathrm{Var}(E_i) \tag{7}$$
where Var(Oi), the observed score variability, is partitioned into the true score variability, Var(Ti), and the variability of error, Var(Ei). Reliability is the proportion of the true score variability to the observed score variability or the proportion of the error variability to the observed score variability subtracted from 1.0:
$$\rho_{O_1,O_2} = \frac{\mathrm{Var}(T_i)}{\mathrm{Var}(O_i)} = 1 - \frac{\mathrm{Var}(E_i)}{\mathrm{Var}(O_i)} \tag{8}$$
with ρO1,O2 being the reliability coefficient.
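The CTT definition of reliability can be checked with a small simulation. The following Python sketch is an illustration under stated assumptions, not an analysis from this article: true scores are drawn from a normal distribution with variance 100, random errors are independent with variance 25, and all names (`administer`, `variance_ratio`, `parallel_forms_r`) are hypothetical. Under these assumptions the theoretical reliability is 100/125 = .80.

```python
import math
import random
from statistics import pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def administer(true_scores, error_sd=5.0):
    """Observed score = true score + independent random error with mean 0."""
    return [t + random.gauss(0, error_sd) for t in true_scores]

random.seed(0)
true_scores = [random.gauss(50, 10) for _ in range(20000)]

observed_1 = administer(true_scores)  # first administration
observed_2 = administer(true_scores)  # second, parallel administration

# Reliability as the share of observed score variance due to true scores.
variance_ratio = pvariance(true_scores) / pvariance(observed_1)
# Reliability as the correlation between two parallel administrations.
parallel_forms_r = pearson_r(observed_1, observed_2)

print(round(variance_ratio, 2), round(parallel_forms_r, 2))
```

Both quantities should land close to .80, which illustrates why the reliability coefficient is written as a correlation between 2 sets of observed scores even though it is defined as a ratio of variances.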
The variability of the scores, as viewed by CTT, provides the explanation for score stability. Test takers who are not satisfied with their exam scores may choose to repeat the test. While an examinee repeating a test is interested in the increase of the observed score, psychometricians consider any increase in the true score separately from the increase in the error component. If a test is reliable, it is very hard to increase the true score component when the assessment is repeated over a short period of time. Only long-term learning is associated with an increase in the true score component.28,29 At the same time, the scores for a repeat test taker will vary from 1 administration to another, and, usually, improved performance may be seen on a second measurement occasion, even if different questions are used.12 This is due to the known phenomenon called the practice effect,30 which is defined as an increase in an examinee's test score from 1 administration of the same assessment to the next in the absence of learning, coaching, or other factors that are known to increase the score.31
Other sources of measurement error may include temporary or momentary fatigue, fluctuations of memory or mood, or fortuitous conditions at a particular time that temporarily affect the outcomes measured by the test.19 Test scores may also be influenced by the content of the material that appeared on the test, guessing, state of alertness, and even scoring errors.
Another likely explanation of the differences in scores from 1 measurement occasion to another is the phenomenon known as regression to the mean.32 Each form of a test will tend to favor certain students but not others in a nonsystematic way. Students may get a test with items representing the material they are most familiar with or have studied the most. However, students who were favored by 1 form of the test are not likely to be favored by another when they retake the test. Therefore, the scores obtained on the second or third testing occasions will tend to be closer to the mean than the scores obtained on the first testing occasion.33
Even though it is never possible to measure exactly how much an increase in the observed score is influenced by the error component, CTT allows for estimation of the standard error of measurement (SEM), which is a function of the standard deviation of the set of observed scores and the reliability of the test:
$$\mathrm{SEM} = SD_O\sqrt{1 - \hat{\rho}_{O_1,O_2}} \tag{9}$$
where SDO is the standard deviation of the set of observed scores and ρ^O1,O2 is an estimate of reliability. Estimates of the SEM can be helpful in interpreting increases in individual test scores.
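As a brief illustration, the SEM can be computed from the standard deviation of the observed scores and a reliability estimate. The Python sketch below uses hypothetical values (a standard deviation of 10 and a reliability of .84), and the band of ±1.96 SEM around an observed score is a common convention assumed here rather than a procedure described in this article.

```python
import math

def standard_error_of_measurement(sd_observed, reliability):
    """SEM = SD of observed scores times the square root of (1 - reliability)."""
    return sd_observed * math.sqrt(1.0 - reliability)

def score_band(observed_score, sem, z=1.96):
    """Approximate band around an observed score (assumes normally distributed error)."""
    return observed_score - z * sem, observed_score + z * sem

sem = standard_error_of_measurement(sd_observed=10.0, reliability=0.84)
lower, upper = score_band(observed_score=70.0, sem=sem)
print(round(sem, 2), round(lower, 2), round(upper, 2))
```

With these values the SEM is about 4 score points, so a retake score that rises by 2 or 3 points falls well inside the band and would not, by itself, indicate a change in the true score.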
Item Response Theory
Item response theory (IRT) is a collection of statistical and psychometric methods used to model test takers' item responses.34 The initial development of IRT models took place in the second half of the 20th century. First, Rasch35 developed a model for analyzing categorical data. Next, Lord and Novick21 wrote chapters on the theory of latent trait estimation, which gave birth to a new way of data analysis in testing. Prior to the development of IRT, the testing industry relied on CTT methods for modeling test item responses. Since then, IRT has made its way into every aspect of the testing industry. IRT methods are used today in test development, item banking, data analysis, analysis of differential item functioning, adaptive testing, test equating, and test scaling.36
The early IRT models were first developed for dichotomously scored item responses (eg, 0 = wrong, 1 = right). These models included the 1-parameter logistic model (1PL), the 2-parameter logistic model (2PL), and the 3-parameter logistic model (3PL). Common assumptions for the early IRT models include unidimensionality—only 1 latent trait is necessary to explain the pattern of item-level responses37—and local independence—after accounting for the latent trait, there is no dependency among the items.36 Later, models for polytomous responses were developed: the partial credit model38 and the generalized partial credit model.35
In the early 1990s, significant efforts were made to develop multidimensional IRT models39,40 and models that were able to account for item dependency over and above the dependency explained by the common trait.41,42 Due to the introductory nature of this article, I will present the mathematical logic and graphical examples of the 1PL, 2PL, and 3PL models only.
One advantage of IRT over traditional testing theories is that IRT defines a scale for the underlying latent variable that is being measured by the test items.43 IRT assumes that responses on a unidimensional test are underlain by a single latent trait (θ), often called the test taker's “ability.” This latent trait cannot be observed directly; however, it can be estimated from the observed responses to the items on a test. Under IRT, the probability of a response to an item on a test is conditional on θ:
$$f_i(u_i \mid \theta) = P_i(\theta)^{u_i}\,Q_i(\theta)^{1-u_i} \tag{10}$$
where fi(ui|θ) is the probability function of providing response u on an item i conditional on ability θ, Pi(θ) is the probability of a correct response (ui = 1), and Qi(θ) = 1 – Pi(θ) is the probability of an incorrect response (ui = 0). Consequently, if ui = 1, then fi(ui|θ) = Pi(θ), and if ui = 0, then fi(ui|θ) = Qi(θ). The function connecting the means of conditional distributions (equation 10) is the regression of the item score on ability and is referred to as the “item characteristic curve” (ICC). The ICC relates the probability of providing a correct response on an item to the ability measured by the entire test.37
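The conditional response function for a dichotomous item can be written compactly as P raised to the power u times Q raised to the power 1 − u, which returns P for a correct response and Q for an incorrect one. The Python sketch below illustrates this; the function name `response_probability` and the value .70 are hypothetical.

```python
def response_probability(u, p_correct):
    """Probability of response u (1 = correct, 0 = incorrect) given P(correct | theta)."""
    q_incorrect = 1.0 - p_correct
    return (p_correct ** u) * (q_incorrect ** (1 - u))

# Hypothetical item: an examinee at some ability level has a .70 chance of success.
p = 0.70
print(response_probability(1, p))            # P, the probability of a correct response
print(round(response_probability(0, p), 2))  # Q, the probability of an incorrect response
```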
The student's ability and the item difficulty are on the same scale. Therefore, θj = βi corresponds to θ – β = 0, meaning that there is an exact match between the examinee's ability and the item's difficulty; θj > βi corresponds to θ – β > 0, meaning that the item is easy for the examinee's ability level; and θj < βi corresponds to θ – β < 0, meaning that the item is difficult for the test taker. Thus, the probability of a correct response by examinee j to item i is a function of the difference between θj and βi; formulaically,
$$P(u_{ij} = 1 \mid \theta_j, \beta_i) = f(\theta_j - \beta_i) \tag{11}$$
where f is a function that relates the ability and the probability (ICC).
In the 1PL model, the probability of the response to an item is a function of the difference between the test taker's ability and the item's difficulty. The following is the equation for 1PL:
$$P_i(\theta_j) = \frac{e^{D(\theta_j - \beta_i)}}{1 + e^{D(\theta_j - \beta_i)}} \tag{12}$$
where D is a scaling factor, set to D = 1.7, so that the values of P(θ) for the 2-parameter normal ogive model and the values for the 2PL model differ by less than 0.01.
The computing language R (an open-source environment for statistical computing and graphics) is often used to fit IRT models to data and estimate item parameters. Presented here is an example by means of the “irtoys” package44 to fit various IRT models using a set of simulated responses (n = 2000) to a 40-item test. The items were scored dichotomously. Table 1 presents estimates of model parameters and associated standard errors for the 1PL model. The item difficulty is the only parameter that was estimated, while the item discrimination was fixed at 1. Figure 1a presents the ICC curves for the 40 items. The curves differ by their location in relation to the x-axis, which is a reference scale for the test takers' ability and item difficulty—more difficult items are to the right, while less difficult items are to the left. The 1PL model assumes that all items relate to the latent trait (ability) equally and differ only in the amount of difficulty.
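To make the location of the 1PL curves concrete, the following Python sketch evaluates the logistic ICC for an easy and a hard item. It is not the R code used for the analyses reported here; the function name `icc_1pl`, the item difficulties, and the use of the scaling factor D = 1.7 with discrimination fixed at 1 are assumptions made for illustration.

```python
import math

D = 1.7  # scaling factor assumed for the logistic metric

def icc_1pl(theta, beta, d=D):
    """Probability of a correct response under a 1PL model with discrimination fixed at 1."""
    return 1.0 / (1.0 + math.exp(-d * (theta - beta)))

# Hypothetical easy (beta = -1) and hard (beta = +1) items.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
for theta in abilities:
    easy = icc_1pl(theta, beta=-1.0)
    hard = icc_1pl(theta, beta=1.0)
    print(f"theta={theta:+.1f}  easy item={easy:.3f}  hard item={hard:.3f}")
```

At every ability level the easy item has the higher probability of success, and each curve passes through .50 where ability equals the item's difficulty, which is why the curves differ only in their horizontal location.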
Figure 1b presents the item information functions (IIF) for the 40 items. The IIF shows the point on the ability scale for which the item provides maximum information. Assuming that these curves are Gaussian, the ranges of ability for which an item provides the most information can be estimated using the 3-sigma empirical rule.45 The IIF depends on the slope of the item response function as well as the conditional variance at each ability level. The greater the slope and the smaller the variance, the greater the information and the smaller the standard error of measurement (SEM).32 In 1PL, the slopes are held constant; therefore, there is no variability in the height of the curves.
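For logistic models, a standard closed form for item information is I(θ) = D²a²P(θ)Q(θ), where a is the item's slope (fixed at 1 in the 1PL). This formula is not given in the text above and is stated here as a known result; the Python sketch below, with the hypothetical function `item_information`, shows that information peaks where ability equals difficulty and that the peak height depends only on the slope.

```python
import math

def item_information(theta, beta, a=1.0, d=1.7):
    """Item information for a logistic item: (d*a)^2 * P * Q."""
    p = 1.0 / (1.0 + math.exp(-d * a * (theta - beta)))
    return (d * a) ** 2 * p * (1.0 - p)

beta = 0.5  # hypothetical item difficulty
peak = item_information(theta=beta, beta=beta)         # information at theta = beta
off_peak = item_information(theta=beta + 1.0, beta=beta)
print(round(peak, 4), round(off_peak, 4))
```

With the slope fixed at 1, every item reaches the same maximum, (1.7)² × .25 = .7225, so the 1PL information curves differ only in where they peak, not in how high.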
The 2PL model estimates another parameter—the discrimination of an item, seen as the slope of the ICC. The discrimination is between those test takers who know the right answer and the population of test takers who do not demonstrate that knowledge. The items with better discriminating qualities have steeper slopes. The following equation represents the 2PL model:
$$P_i(\theta_j) = \frac{e^{D a_i(\theta_j - \beta_i)}}{1 + e^{D a_i(\theta_j - \beta_i)}} \tag{13}$$
where ai is the discrimination parameter for item i. Table 2 presents the model parameter estimates and related standard errors for the 2PL model. Figure 2a presents the ICCs for the same 40 items as Figure 1a; it is now obvious that some items are better at discriminating between the 2 populations (have steeper slopes) than others.
The estimation of the slope relaxes the assumption of an invariant relationship between the items and the latent trait. This relationship can now be estimated, and it is similar to the factor loadings in factor analysis.46 The items with higher discrimination coefficients are more responsive to small changes in the latent trait, whereas the items with low discrimination coefficients require large changes in the latent trait to reflect a change in the probability. Figure 2b presents the items' information curves, which now show variability in the amount of information they provide.
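The effect of the discrimination parameter can be illustrated by comparing how much the probability of a correct response changes over a small interval of ability. This Python sketch assumes the logistic 2PL form with D = 1.7; the function names and the 2 discrimination values are hypothetical.

```python
import math

def icc_2pl(theta, beta, a, d=1.7):
    """Probability of a correct response under a 2PL model."""
    return 1.0 / (1.0 + math.exp(-d * a * (theta - beta)))

def change_in_probability(a, beta=0.0, step=0.25):
    """Change in P(correct) when ability moves from beta - step to beta + step."""
    return icc_2pl(beta + step, beta, a) - icc_2pl(beta - step, beta, a)

low_discrimination = change_in_probability(a=0.5)
high_discrimination = change_in_probability(a=2.0)
print(round(low_discrimination, 3), round(high_discrimination, 3))
```

The same half-unit shift in ability moves the probability far more for the item with a = 2.0 than for the item with a = 0.5, which is what a steeper ICC slope means in practice.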
The 3PL model is a 2PL model with an additional parameter, γi, which is the lower asymptote of the ICC and represents the probability of a test taker with a low ability providing a correct answer to an item i. The inclusion of this parameter suggests that test takers who score low on the latent trait may still provide a correct response by chance. This parameter is referred to as “guessing.” The following is the mathematical representation of the 3PL model:
$$P_i(\theta_j) = \gamma_i + (1 - \gamma_i)\frac{e^{D a_i(\theta_j - \beta_i)}}{1 + e^{D a_i(\theta_j - \beta_i)}} \tag{14}$$
where γi is the guessing parameter. Referring back to equation 14, if a test taker guessed (γi = 1), then the probability of the correct response is entirely explained by guessing (the term after the plus sign disappears). However, if the test taker did not guess (γi = 0), the model defaults to the 2PL. Table 3 presents model parameter estimates for the 3PL, while Figure 3a and b presents ICCs and IIFs, respectively, for the 40 items.
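The role of the lower asymptote can be seen by evaluating the 3PL curve at its limiting cases. The Python sketch below assumes the form γ + (1 − γ) × 2PL with D = 1.7; the function name `icc_3pl` and the parameter values (for example, γ = .25 for a hypothetical 4-option multiple-choice item) are illustrative assumptions.

```python
import math

def icc_3pl(theta, beta, a, gamma, d=1.7):
    """Probability of a correct response under a 3PL model with lower asymptote gamma."""
    p_2pl = 1.0 / (1.0 + math.exp(-d * a * (theta - beta)))
    return gamma + (1.0 - gamma) * p_2pl

# Hypothetical item with a guessing parameter of .25.
very_low_ability = icc_3pl(theta=-10.0, beta=0.0, a=1.0, gamma=0.25)
matched_ability = icc_3pl(theta=0.0, beta=0.0, a=1.0, gamma=0.25)
no_guessing = icc_3pl(theta=0.5, beta=-0.3, a=1.2, gamma=0.0)  # reduces to the 2PL
print(round(very_low_ability, 3), round(matched_ability, 3), round(no_guessing, 3))
```

For a very low ability the probability approaches γ rather than 0, and at θ = β the probability is γ + (1 − γ)/2 instead of .50.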
Polytomous IRT Models
Various polytomous IRT models have been developed to account for ordered categorical responses. Samejima47 developed a logistic model for graded responses in which the probability that examinee j, with a particular level of ability, will respond to item i in category k is the difference between the cumulative probability of a response in that category or higher and the cumulative probability of a response in the next higher category or higher. Consider the following:
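In commonly used notation, which is assumed here for illustration, the graded response model can be written as follows. P*_{ik} denotes the cumulative probability of a response in category k or higher, with P*_{i0} = 1 for the lowest category and P*_{i,K+1} = 0 beyond the highest category K.

```latex
P(X_{ij} = k \mid \theta_j) = P^{*}_{ik}(\theta_j) - P^{*}_{i,k+1}(\theta_j),
\qquad
P^{*}_{ik}(\theta_j) = \frac{\exp\left[a_i(\theta_j - b_{ik})\right]}{1 + \exp\left[a_i(\theta_j - b_{ik})\right]}
```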
where bik is the difficulty parameter for category k of item i and ai is the discrimination parameter for item i.47
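The difference-of-cumulative-probabilities logic of the graded response model can be sketched in a few lines of Python. The function names and the item parameters below are hypothetical. The sketch assumes ordered category difficulties and the usual boundary conventions (cumulative probability 1 for the lowest category and 0 beyond the highest).

```python
import math


def cumulative_prob(theta, a, b):
    """Probability of responding in a given category or higher."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    thresholds holds the ordered category difficulties b_i1 < b_i2 < ...;
    the result has one probability per category 0..len(thresholds).
    """
    # Boundary conventions: P*(lowest) = 1 and P*(beyond highest) = 0.
    cum = [1.0] + [cumulative_prob(theta, a, b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical item scored 0, 1, 2, 3 (three category difficulties).
probs = grm_category_probs(theta=0.3, a=1.5, thresholds=[-1.0, 0.0, 1.0])
for category, p in enumerate(probs):
    print(f"category {category}: {p:.3f}")
print("sum:", round(sum(probs), 6))
```

Because the cumulative probabilities telescope, the category probabilities sum to 1 for any ability level.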
A different model for ordered categorical responses was developed by Masters.33 In this partial credit model, the probability that examinee j will provide response x on item i with Mi thresholds is a function of the examinee's ability and the difficulties of the Mi thresholds in item i. It is given by the following:
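In its commonly published form, the partial credit model can be written as shown below. The notation is assumed here for illustration: θ_j is the ability of examinee j and δ_ik is the difficulty of threshold k of item i. The expression also covers x = 0 (no threshold completed) under the usual convention that the sum over k = 0 alone equals zero.

```latex
P(X_{ij} = x \mid \theta_j) =
\frac{\exp\left[\sum_{k=0}^{x}(\theta_j - \delta_{ik})\right]}
     {\sum_{h=0}^{M_i}\exp\left[\sum_{k=0}^{h}(\theta_j - \delta_{ik})\right]},
\qquad
\sum_{k=0}^{0}(\theta_j - \delta_{ik}) \equiv 0
```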
where x = 1,2, …, Mi is the count of successfully completed thresholds, and .33
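A minimal Python sketch of the partial credit model follows. It assumes the commonly published form of the model, in which the numerator for a score of x is the exponential of the summed differences between ability and the first x threshold difficulties, and a score of 0 corresponds to an empty sum of zero. The function name and threshold values are hypothetical.

```python
import math


def pcm_category_probs(theta, deltas):
    """Score probabilities for one item under the partial credit model.

    deltas holds the threshold difficulties of the item; the result has one
    probability per score 0..len(deltas).
    """
    # Running sums of (theta - delta_k); the empty sum for score 0 is 0.
    running = [0.0]
    for delta in deltas:
        running.append(running[-1] + (theta - delta))
    numerators = [math.exp(s) for s in running]
    total = sum(numerators)
    return [n / total for n in numerators]


# Hypothetical item with three thresholds (scores 0, 1, 2, 3).
probs = pcm_category_probs(theta=0.5, deltas=[-0.8, 0.2, 1.1])
for score, p in enumerate(probs):
    print(f"score {score}: {p:.3f}")
```

With a single threshold, the model reduces to the dichotomous Rasch (1PL) model, which offers a quick sanity check of the implementation.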
Samejima's graded response model was fitted to a simulated data set of n = 1500 responses to 10 polytomous items scored using the following categories: 0, 1, 2, and 3. Table 4 presents model-based parameter estimates; Figure 4a presents ICCs for items 1–4 of the 10 polytomous items. Figures 4b and c present ICCs for items 5–8 and for items 9 and 10, respectively.
Measurements of the same construct collected at different times or by different forms must be brought to the same scale to be comparable. In the field of testing, when tests are used to make high-stakes decisions, the scores for examinees who took the test on 1 occasion using 1 test form should be comparable to the scores of examinees who took the test on another occasion using a different test form. To protect test security, it is common practice to administer different forms of the test on different testing occasions. However, it is hard to construct 2 truly parallel forms, and often these test forms differ in difficulty. Yet it is important to avoid a situation where 1 group of test takers has an unfair advantage because they were administered an easier form of the exam.48 Therefore, the test scores must be equated to account for the possible differences in difficulty between the test forms or differences in ability between the groups of test takers.
Equating is a statistical process used to adjust scores on test forms so that scores on the forms can be used interchangeably.36 After equating, alternate forms of the same test yield scaled scores that can be used interchangeably even though they are based on different sets of items.49 It is important to point out that statistical adjustment is not possible for differences in content. The responsibility for the content equivalence between 2 forms of a test lies entirely with test developers.
For the past 30 years, equating has received much deserved attention and research. Many new equating methods have been proposed and tested in both research and operational testing programs. I will introduce only general principles related to equating here, as my goal is to make the reader aware of the procedure. Those who wish to expand their knowledge of equating should turn to the literature published in the field of educational measurement.
The first step in the process of equating is to decide on an equating design. Test scores can be equated using either the same populations or the same items. Single-group design assumes that 2 test forms can be equated if they are given to the same population of examinees. Since the same examinees take both tests, the difficulty levels are not confounded by the ability of the examinees.37 Equivalent-group design assumes that 2 test forms are given to similar but not the same populations of examinees. Reasonable group equivalence may be achieved through random assignment.13
Common-item design requires that both forms of the test contain a set of the same items, usually called “anchor” items; the forms are then administered to different populations of examinees. Subsequently, a function that relates the statistics computed for each anchor set will account for the differences in difficulty. This mathematical function is then used to equate the nonanchor items on both forms.36,37
An appropriate equating methodology must be chosen, depending on which theoretical framework is preferred by the testing program, to obtain the test-taker statistics and the item-level statistics. Equating methods have been developed based on both CTT and IRT. When pairs of statistical values for 2 forms have been obtained, a decision is made regarding the methods to be used to relate these exams. Several methods can be selected from the framework of linear models for this; they include regression methods, mean and sigma procedures, and characteristic curve methods.
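As an illustration of one of these approaches, the mean and sigma procedure can be sketched in Python for a common-item design under IRT. The sketch assumes the standard formulation, in which the slope is the ratio of the standard deviations of the anchor-item difficulty estimates on the 2 forms and the intercept aligns their means. The function names and the anchor estimates are hypothetical and serve only to show the arithmetic.

```python
import statistics


def mean_sigma_constants(b_new, b_ref):
    """Slope and intercept that place new-form estimates on the reference scale.

    b_new and b_ref are difficulty estimates of the same anchor items,
    obtained from the new form and the reference form, respectively.
    """
    slope = statistics.pstdev(b_ref) / statistics.pstdev(b_new)
    intercept = statistics.mean(b_ref) - slope * statistics.mean(b_new)
    return slope, intercept


def rescale_item(a, b, slope, intercept):
    """Transform one item's discrimination and difficulty to the reference scale."""
    return a / slope, slope * b + intercept


def rescale_ability(theta, slope, intercept):
    """Transform an ability estimate to the reference scale."""
    return slope * theta + intercept


# Hypothetical anchor-item difficulty estimates from the 2 forms.
b_new = [-1.2, -0.4, 0.1, 0.7, 1.5]
b_ref = [-0.9, -0.2, 0.3, 0.8, 1.6]

slope, intercept = mean_sigma_constants(b_new, b_ref)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))

# A nonanchor item from the new form, expressed on the reference scale.
print(rescale_item(a=1.3, b=0.4, slope=slope, intercept=intercept))
```

Because ability and item difficulty receive the same linear transformation while discrimination is divided by the slope, the quantity a(θ − b), and therefore the modeled probability of a correct response, is unchanged by the rescaling.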
Equating is the strongest form of linking. The tests can be similar or even equivalent in content and different in difficulty, or they can be different in both content and difficulty. When tests are different in content, the scores obtained on these exams may still need to be put on the same scale. In this case, the statistical process of adjusting the scores for difficulty is called linking. When linking is used for equating, the relationship is invariant across different populations.36 The term equating is reserved for the situation when scores from 2 tests of the same content are linked. The statistical procedures used for linking may be the same as those used for equating; however, no linking procedures can adjust for differences in content.
This article presents researchers and clinicians in the health sciences with an introduction to educational measurement—the history, theoretical frameworks of the CTT and IRT, and the most common IRT models used in modern testing.
This article is dedicated to Dr Howard B. Lee, a mentor and friend.
FUNDING AND CONFLICTS OF INTEREST
No funding was received for this work, and the author has no conflicts of interest to declare relevant to this work.
Igor Himelfarb is the director of the Department of Psychometrics and Research for the National Board of Chiropractic Examiners (901 54th Avenue, Greeley, CO 80634; firstname.lastname@example.org). Address correspondence to Igor Himelfarb, National Board of Chiropractic Examiners, 901 54th Avenue, Greeley, CO 80634; email@example.com. This article was received July 2, 2018; revised September 16, 2018, and December 20, 2018; and accepted December 27, 2018.
Concept development: IH. Design: IH. Supervision: IH. Data collection/processing: IH. Analysis/interpretation: IH. Literature search: IH. Writing: IH. Critical review: IH.