Objectives

To map the statistical methods applied to assess reliability in orthodontic publications and to identify possible trends over time.

Materials and Methods

Original research articles published in 2009 and 2019 in a subset of orthodontic journals were downloaded. Publication characteristics, including publication year, number of authors, single vs multicenter study, geographic origin of the study, statistician involvement, study category, subject category, types of reliability assessment, and statistical methods applied to assess reliability, were recorded. Descriptive statistics, Chi-square tests, and logistic regression analyses were performed to investigate associations between reliability analysis and study characteristics.

Results

A total of 768 original research articles were analyzed. The most prevalent study category was observational (60%) with a statistician involved in 16% of studies. Overall, reliability was assessed in 47% of studies, and the most frequent methods applied to assess reliability were intraclass correlation coefficients or kappa statistics (60.4%). The odds of applying appropriate methods were greater in 2019 than in 2009 (odds ratio [OR]: 2.43; 95% confidence interval [CI]: 1.75, 3.37; P < .001). Involvement of a statistician resulted in greater odds of applying appropriate methods compared to no statistician involvement (OR: 1.88; 95% CI: 1.23, 2.87; P < .01).

Conclusions

Over the past decade (2009 vs 2019), reliability assessment became more common in the orthodontic literature, and studies applying correct statistical methods to assess reliability significantly increased. This trend was more apparent in studies that involved a statistician, which may highlight the role of the statistician.

Reliability in clinical measurements pertains to reproducibility over time with minimal measurement errors. In a clinical study, it is important to assess the reliability of the applied measurements to exclude imprecision and biases due to use of inappropriate measures. Reliability should be assessed even for methods reported to be reliable in past studies because there is no guarantee that the same method implemented by a different investigator in a different setting will also be reliable.1  Though some uncertainties are inevitable during an experiment or survey, interobserver variability also needs to be addressed.2 

In 2000, when BeGole examined common statistical procedures in 203 articles published in three leading orthodontic journals, the proportion of studies that had included reliability assessment was relatively small, with only 33 using reliability statistics out of 407 statistical procedures (8%).3 In the past, reliability assessment in healthcare commonly relied on the root mean squared error (Dahlberg procedures),4 the intraclass correlation coefficient (ICC),5 and the kappa statistic,6 with no studies applying graphical methods such as the Bland-Altman plot.7 More recently, reliability assessment has become routine in the orthodontic literature, and rigorous and advanced statistical methods have been increasingly used. However, several studies still used suboptimal methods, such as a t-test or the correlation coefficient, for reliability assessment.8 The t-test was designed to compare two means, not to measure the concordance between two measurements. In addition, the use of different statistical tests for reliability assessment can create significant interstudy variation and can prevent comparison of repeatability across studies. Using appropriate methods gives a better assessment of reliability, and implementing such tests as standard practice would allow reliability to be compared across studies.
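As an illustration of two of the methods named above, the following minimal sketch computes Dahlberg's error and the Bland-Altman limits of agreement for double measurements. The measurement values, function names, and variable names are hypothetical and are not taken from any of the surveyed articles; the limits use the conventional bias ± 1.96 SD of the paired differences.

```python
import math
import statistics


def paired_differences(first, second):
    """Differences between first and second measurements of the same subjects."""
    if len(first) != len(second):
        raise ValueError("both sessions must cover the same subjects")
    return [a - b for a, b in zip(first, second)]


def dahlberg_error(first, second):
    """Dahlberg's formula: sqrt(sum(d_i^2) / (2 * n)) for n double measurements."""
    diffs = paired_differences(first, second)
    return math.sqrt(sum(d * d for d in diffs) / (2 * len(diffs)))


def bland_altman_limits(first, second):
    """Mean difference (bias) and 95% limits of agreement (bias +/- 1.96 * SD)."""
    diffs = paired_differences(first, second)
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical repeated measurements (mm) taken by one examiner on two occasions
session_1 = [10.1, 12.4, 9.8, 11.0, 13.2, 10.7]
session_2 = [10.3, 12.1, 9.9, 11.4, 13.0, 10.6]

print("Dahlberg error (mm):", round(dahlberg_error(session_1, session_2), 3))
print("Bias, lower limit, upper limit:", bland_altman_limits(session_1, session_2))
```

Both quantities are expressed in the units of the measurement itself, which is what allows them to be compared across studies in a way that a P value cannot.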

There has been no recent study assessing reliability analysis in the orthodontic literature. Therefore, the aim of the present study was to map the statistical methods that were applied to assess reliability in orthodontic publications. In addition, changes in reliability assessment over time were examined as well as possible associations between the use of optimal approaches and study characteristics.

Data Sources, Search, and Selection

Original research articles published in 2009 and 2019 in five orthodontic journals with an average Journal Citation Reports (JCR) score > 1 over the past decade were selected. JCR has often been used to select leading journals9,10  with the highest impact factors in each specialty. The selected journals were American Journal of Orthodontics and Dentofacial Orthopedics (AJODO), Angle Orthodontist (AO), European Journal of Orthodontics (EJO), Orthodontics and Craniofacial Research (OCR), and Korean Journal of Orthodontics (KJO).

In the present study, all original research articles were downloaded and electronically searched by one investigator (Y.M.A.A.). Editorials, case reports, and systematic reviews with or without meta-analyses were excluded. To extract reliable data, multiple calibration sessions were conducted by two investigators (J.A.P. and S.J.L.) under detailed written instructions (Figure 1).

Figure 1.

Flow diagram of study selection.


Data Extraction, Collation, and Collection Process

From each publication, the following variables were extracted: publication year, number of authors, single- vs multicenter study, geographic origin of the study, statistician involvement, study category, subject category, types of reliability assessment (intra- and interexaminer reliability), and statistical methods applied to assess reliability. Among the article characteristics, geographic origin (continent), study category, subject category, and types of statistical tests were consistent with previous categories reported by Koletsi et al.10

Statistician involvement (yes/no) was determined using the title page and acknowledgment statement. When statistician involvement was not clearly stated in these sections, statistician involvement was coded as “no.” Data extraction followed an iterative process until all disagreements were eliminated.

Statistical Analysis

Descriptive statistics were calculated, and Chi-square tests and logistic regression analyses were performed to examine associations between study characteristics over time. Study characteristics that were associated with the primary outcome (appropriate/inappropriate reliability analysis) in the univariable logistic regression were added to the multivariable model. All statistical analyses were performed using R version 4.1.1 (R Foundation for Statistical Computing, Vienna, Austria).11
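For a single binary characteristic, the odds ratio from a univariable logistic regression equals the cross-product ratio of the corresponding 2 × 2 table, so the simplest case of the analysis described above can be sketched by hand. The counts below are hypothetical and chosen only for illustration; the study itself fitted its models in R and adjusted for several characteristics at once, which this sketch does not attempt.

```python
import math


def odds_ratio_wald(a, b, c, d, z_crit=1.96):
    """Odds ratio and Wald 95% CI from a 2 x 2 table.

    a: characteristic present, appropriate method
    b: characteristic present, inappropriate method
    c: characteristic absent, appropriate method
    d: characteristic absent, inappropriate method
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z_crit * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z_crit * se_log_or)
    return odds_ratio, lower, upper


def chi_square_2x2(a, b, c, d):
    """Pearson Chi-square statistic (1 df, no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts, not taken from the study
a, b, c, d = 20, 10, 10, 20

print("OR and 95% CI:", odds_ratio_wald(a, b, c, d))
print("Chi-square:", round(chi_square_2x2(a, b, c, d), 3))
```

The confidence interval is built on the log-odds scale and then exponentiated, which is why reported intervals for odds ratios are not symmetric around the point estimate.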

A total of 768 original research articles published in five selected orthodontic specialty journals were analyzed (Figure 1).

First, a pilot calibration was conducted on 10 articles, which produced disagreement on seven items between the two investigators. After 4 days, the second calibration conducted on another 10 articles resulted in disagreement on four items. After 1 week, a third calibration conducted on another 60 articles resulted in disagreement on only two items. Not all the disagreements between the two investigators could be resolved by multiple calibration sessions. Overall, there were disagreements on 56 items out of a total of 6912 items, which were later corrected by unanimous consent.
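The reported totals can be put in proportion with a short calculation. The sketch below assumes that the 6912 items correspond to nine coded variables for each of the 768 articles; this breakdown is an inference from the list of extracted variables and is not stated explicitly in the text.

```python
# Counts reported in the text
articles = 768
disagreements = 56
reported_items = 6912

# Assumption: nine coded variables per article (inferred, not stated in the text)
variables_per_article = 9
total_items = articles * variables_per_article

disagreement_rate = disagreements / total_items
percent_agreement = 100 * (1 - disagreement_rate)

print("Item count matches the reported total:", total_items == reported_items)
print("Initial percent agreement:", round(percent_agreement, 2))
```

Under that assumption, the two investigators initially disagreed on fewer than 1% of the coded items. Percent agreement is only a descriptive summary here and does not correct for agreement expected by chance.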

AO included the highest proportion of original research articles, followed by AJODO. Fewer original research articles were published in 2019 (n = 335) compared to 2009 (n = 433) (Table 1).

Table 1.

Assessment of 768 Original Research Articles According to Article Characteristicsa


Specifically, the number of original research articles published by authors in Europe and Asia decreased, whereas the number of articles from America was similar between the two periods. The number of studies with more than five coauthors and the number of multicenter studies significantly increased in 2019 relative to 2009. A statistician was involved in 14% and 18% of the studies in 2009 and 2019, respectively. Across both years, the most prevalent study category was observational (n = 463, 60.3%), and most studies concerned human subjects (n = 559, 72.8%). The numbers of in vitro and animal studies significantly decreased in 2019 relative to 2009 (Table 1).

Overall, 47% of studies assessed reliability. Among studies that assessed reliability, only 22% reported both intra- and interexaminer reliability. The proportion of studies that assessed both intra- and interexaminer reliability significantly increased in 2019 relative to 2009. The most frequent method applied to assess reliability was ICC or the kappa statistic (60.4%). In reporting reliability, the proportion of studies using a graphical method to assess reliability, for example, the Bland-Altman plot, increased in 2019 relative to 2009. The proportion of studies incorporating optimal reliability statistics increased over the past decade, and incorrect use of inferential tests as a reliability measure, such as t-tests, analysis of variance (ANOVA), and correlation statistics, decreased from 2009 to 2019 (Table 2).
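For categorical ratings, the kappa statistic mentioned above corrects the observed agreement for the agreement expected by chance. The following is a minimal sketch of unweighted Cohen's kappa for two examiners; the ratings and names are hypothetical, and weighted kappa or ICC would need a different calculation.

```python
from collections import Counter


def cohen_kappa(ratings_1, ratings_2):
    """Unweighted Cohen's kappa for two raters classifying the same subjects."""
    if len(ratings_1) != len(ratings_2):
        raise ValueError("both raters must classify the same subjects")
    n = len(ratings_1)
    # Proportion of subjects on which the two raters agree
    observed = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    # Agreement expected by chance from each rater's marginal frequencies
    counts_1, counts_2 = Counter(ratings_1), Counter(ratings_2)
    categories = counts_1.keys() | counts_2.keys()
    expected = sum(counts_1[k] * counts_2[k] for k in categories) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one category only")
    return (observed - expected) / (1 - expected)


# Hypothetical molar classifications by two examiners
examiner_1 = ["I", "I", "II", "II"]
examiner_2 = ["I", "I", "II", "I"]

print("Cohen's kappa:", cohen_kappa(examiner_1, examiner_2))
```

In this toy example the examiners agree on three of four subjects (75%), but half of that agreement is expected by chance, so kappa is noticeably lower than the raw percent agreement.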

Table 2.

Type of Reliability Assessment and Statistical Analysis Method to Measure Reliability


Figure 2 depicts the counts of appropriate/inappropriate analyses per year and statistician involvement. In the adjusted multivariable analysis, the odds of applying appropriate methods to assess reliability were greater in 2019 than in 2009 (odds ratio [OR]: 2.43; 95% confidence interval [CI]: 1.75, 3.37; P < .001). For in vitro, interventional, and animal studies, the odds ratios of applying correct methods were < 1 (P < .001). Involvement of a statistician resulted in greater odds of applying appropriate reliability assessments compared to no statistician involvement (OR: 1.88; 95% CI: 1.23, 2.87; P < .01) (Table 3).
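The reported odds ratios, confidence intervals, and P values can be checked against one another. The sketch below assumes that the intervals are Wald intervals, symmetric on the log-odds scale with a critical value of 1.96; that assumption is not stated in the text, and the results are approximate because the published figures are rounded.

```python
import math


def wald_from_ci(odds_ratio, lower, upper, z_crit=1.96):
    """Back-calculate SE(log OR), the z statistic, and the two-sided P value."""
    se_log_or = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(odds_ratio) / se_log_or
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return se_log_or, z, p_two_sided


# Values reported in the text
se_year, z_year, p_year = wald_from_ci(2.43, 1.75, 3.37)   # 2019 vs 2009
se_stat, z_stat, p_stat = wald_from_ci(1.88, 1.23, 2.87)   # statistician involved

print("Year: z = %.2f, P = %.2g" % (z_year, p_year))
print("Statistician: z = %.2f, P = %.2g" % (z_stat, p_stat))
```

Under these assumptions, the back-calculated P values fall below .001 for publication year and below .01 for statistician involvement, consistent with the reported thresholds.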

Figure 2.

Stacked bar plot demonstrating that the proportion of studies using correct methods to assess reliability increased over time and was higher in studies that involved a statistician.

Table 3.

Result of Univariable and Multivariable Logistic Regression Analyses to Examine Potential Associations Between Study Characteristics and Use of Appropriate Methods to Assess Reliability (n = 768)a


The present study was performed to record the statistical methods used to assess reliability in orthodontic publications and to examine possible trends over time. Significant improvements were observed when the data extracted from 2019 publications were compared to data extracted from 2009 publications. For example, the percentage of studies that included reliability assessment increased from 39% in 2009 to 58% in 2019. The use of appropriate methods to assess reliability increased greatly in 2019. It was also notable that involvement of a statistician increased the proportion of reliability assessments that were applied correctly. It could be conjectured that consultation with a statistician helps investigators choose an appropriate method.

The quantity and quality of reliability statistics changed over time. According to a report published in 2000,3 reliability measures were applied in <10% of the articles and were limited to two kinds of procedures: the Dahlberg procedure (56%) and ICC (43%).3 More recently, the Dahlberg procedure was used less frequently (n = 107) than ICC statistics (n = 217). The proportion of incorrect methods to assess reliability decreased dramatically, and application of the Bland-Altman plot increased noticeably in 2019 relative to 2009. This trend likely indicates increased awareness and improvements in orthodontic research methodology.

The number of coauthors per study and the proportion of multicenter studies increased in 2019 relative to 2009, in agreement with the increased number of multidisciplinary research efforts in recent years.9,10 However, despite the increased use of more advanced statistical tests, statistician involvement did not increase in 2019 relative to 2009 in the present study samples. This could have been due to the versatility and easy accessibility of commercial statistical software now available to investigators. Although much of this software is readily available, it often requires guidance from a trained statistician for the analysis to be credible and reliable.

Use of inappropriate methods such as t-tests, ANOVAs, and correlation analyses to assess reliability decreased over time. However, those three methods for reporting reliability are still common and were used in 117/469 (25%) of the selected articles.

Mean comparison methods, such as t-tests or ANOVA, should be used to find differences between means of groups and not to measure reliability. No significant differences between groups after the use of mean comparing tests implies no difference in the means but provides no information regarding the range of deviations at the individual level. Large disagreements between pairs of individual reliability measurements can still result in small or even non-existent mean differences, completely masking the lack of agreement.1,2 
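The masking effect described above can be reproduced with a small numerical example. In the hypothetical data below, every pair of measurements differs by 2 units, yet the differences cancel, so the paired t statistic is exactly zero. The values and names are illustrative only.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Paired t statistic: mean difference divided by its standard error."""
    diffs = [a - b for a, b in zip(first, second)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def mean_absolute_difference(first, second):
    """Average size of the disagreement between paired measurements."""
    return statistics.mean(abs(a - b) for a, b in zip(first, second))


# Hypothetical repeated measurements: each pair disagrees by 2 units,
# but the signs alternate so the mean difference is zero
first = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
second = [12.0, 10.0, 16.0, 14.0, 20.0, 18.0]

print("Paired t statistic:", paired_t_statistic(first, second))
print("Mean absolute difference:", mean_absolute_difference(first, second))
```

A t statistic of zero would be read as "no significant difference," even though no single pair of measurements agrees.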

The correlation coefficient was the second most prevalent among incorrect methods to assess reliability. However, the correlation coefficient should not be used as a reliability measure because it indicates the linear association between two variables, not their agreement. Two highly correlated sets of measurements may never agree. For example, the pairs (1,2) (2,4) (3,6) (4,8) (5,10) have a correlation coefficient of 1, yet the two values in every pair clearly disagree. In addition, the null hypothesis of a correlation test is that the correlation coefficient is zero; thus, the magnitude of the P value from a correlation test is not very meaningful. A common misconception is that a small P value indicates strong correlation; however, a low P value after a correlation test says nothing about the strength of the correlation.1,2 In this regard, it was encouraging to see that studies applying incorrect methods to assess reliability decreased over time.
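The five pairs given in the example above can be checked directly. The sketch below computes the Pearson correlation coefficient by hand and lists the differences within each pair; the function and variable names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from centered sums."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# The pairs from the text: (1,2) (2,4) (3,6) (4,8) (5,10)
first = [1, 2, 3, 4, 5]
second = [2, 4, 6, 8, 10]

differences = [b - a for a, b in zip(first, second)]

print("Pearson r:", pearson_r(first, second))
print("Differences within pairs:", differences)
print("Mean difference:", sum(differences) / len(differences))
```

The correlation is perfect because the second value is always exactly twice the first, but the difference within each pair grows from 1 to 5 and averages 3, so the two sets of measurements do not agree.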

This study was not without limitations because it included articles from 2009 and 2019 only, omitting the years in between. Including all years would have resulted in an intractable number of publications. However, the sample might be adequate to map the area and provide evidence on reliability analysis practices over time.

  • Based on this cross-sectional survey of original research articles published in five orthodontic journals in 2009 and in 2019, the results demonstrated that studies applying correct methods to assess reliability significantly increased over time.

  • Involvement of a statistician increased the odds of applying correct statistical methods to assess reliability, which may highlight the meaningful role of the statistician in orthodontic research.

1. Donatelli RE, Lee SJ. How to report reliability in orthodontic research: Part 1. Am J Orthod Dentofacial Orthop. 2013;144:156–161.
2. Donatelli RE, Lee SJ. How to report reliability in orthodontic research: Part 2. Am J Orthod Dentofacial Orthop. 2013;144:315–318.
3. BeGole EA. Statistics for the orthodontist. In: Graber TM, Vanarsdall RL, eds. Orthodontics: Current Principles and Techniques. St. Louis: Mosby; 2000:339–352.
4. Dahlberg G. Statistical Methods for Medical and Biological Students. London: George Allen & Unwin Ltd.; 1940.
5. Fleiss JL. The Design and Analysis of Clinical Experiments. New York: John Wiley & Sons; 1985.
6. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
7. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307–310.
8. Donatelli RE, Lee SJ. How to test validity in orthodontic research: a mixed dentition analysis example. Am J Orthod Dentofacial Orthop. 2015;147:272–279.
9. Aura-Tormos JI, Garcia-Sanz V, Estrela F, Bellot-Arcis C, Paredes-Gallardo V. Current trends in orthodontic journals listed in Journal Citation Reports. A bibliometric study. Am J Orthod Dentofacial Orthop. 2019;156:663–674.e661.
10. Koletsi D, Madahar A, Fleming PS, Pandis N. Statistical testing against baseline was common in dental research. J Clin Epidemiol. 2015;68:776–781.
11. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2021.

Author notes

The first two authors contributed equally to this work.

a Assistant Professor and Program Director, Department of Orthodontics, College of Dentistry, University of Florida, Gainesville, Florida, USA.

b Postgraduate Student, Department of Orthodontics, Graduate School, Seoul National University, Seoul, Korea.

c Resident, Ministry of Health, Kingdom of Saudi Arabia.

d Professor, Department of Orthodontics and Dentofacial Orthopedics, School of Dental Medicine, University of Bern, Bern, Switzerland.

e Professor, Department of Orthodontics and Dental Research Institute, Seoul National University School of Dentistry, Seoul, Korea.