To compare detection patterns of 80 cephalometric landmarks identified by an automated identification system (AI) based on a recently proposed deep-learning method, the You-Only-Look-Once version 3 (YOLOv3), with those identified by human examiners.
The YOLOv3 algorithm was implemented with custom modifications and trained on 1028 cephalograms. A total of 80 landmarks comprising two vertical reference points and 46 hard tissue and 32 soft tissue landmarks were identified. On the 283 test images, the same 80 landmarks were identified by AI and human examiners twice. Statistical analyses were conducted to detect whether any significant differences between AI and human examiners existed. Influence of image factors on those differences was also investigated.
Upon repeated trials, AI always detected identical positions on each landmark, while the human intraexaminer variability of repeated manual detections demonstrated a detection error of 0.97 ± 1.03 mm. The mean detection error between AI and human was 1.46 ± 2.97 mm. The mean difference between human examiners was 1.50 ± 1.48 mm. In general, the differences in detection errors between AI and human examiners were less than 0.9 mm, which did not seem to be clinically significant.
AI showed as accurate an identification of cephalometric landmarks as did human examiners. AI might be a viable option for repeatedly identifying multiple cephalometric landmarks.
Recently, in the field of automated identification of cephalometric landmarks, the latest deep learning method based on the You-Only-Look-Once version 3 algorithm (YOLOv3)1,2 detected 80 landmarks and achieved not only more accurate but also faster detection performance.3 The performance of an automated identification system (AI) has traditionally been evaluated by the successful detection rate of 19 skeletal landmarks within a 2-mm range, which has conventionally been accepted as a clinical error range at AI performance competitions.4–6 Rather than again comparing certain AI techniques to other AI techniques to determine which were more accurate, the present study proposed a new automatic identification method and tested whether this new AI method was better and more reliable than clinically experienced human experts. Such a comparison could be more interesting and more directly applicable to clinicians. However, when it comes to measuring reliability in identifying a certain cephalometric landmark, there is no firm “ground truth” or gold standard that can validate where the true location of the landmark is.7–9 Consequently, a study design to answer questions as to (1) whether differences between AI and human examiners would be smaller than those between human examiners (better accuracy in finding landmarks) and (2) whether AI might result in smaller differences upon repeated detection trials than those produced by humans (better reproducibility of landmarks) would be helpful and appropriate. These results could indicate whether AI could be safely proposed for use in clinical practice.
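The successful detection rate mentioned above can be illustrated with a minimal sketch. The function name, the sample error values, and the choice to count an error of exactly 2 mm as a success are assumptions made for illustration; they are not taken from the competition rules or from the present study.

```python
def successful_detection_rate(errors_mm, threshold_mm=2.0):
    """Fraction of landmark detections whose error is within threshold_mm.

    errors_mm: radial detection errors in millimeters (hypothetical input).
    An error equal to the threshold is counted as a success (assumption).
    """
    if not errors_mm:
        raise ValueError("errors_mm must contain at least one value")
    hits = sum(1 for e in errors_mm if e <= threshold_mm)
    return hits / len(errors_mm)


# Hypothetical errors for four detections of one landmark
example_errors = [0.5, 1.9, 2.0, 3.1]
rate = successful_detection_rate(example_errors)
print(f"Successful detection rate within 2 mm: {rate:.2f}")
```

A rate of this kind summarizes how often a detection falls inside a fixed tolerance, whereas the present study reports mean distances between AI and human examiners instead.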
The purpose of this study was to compare detection patterns of 80 cephalometric landmarks identified by a recently proposed deep-learning method, YOLOv3, with those identified by human examiners. The pattern of differences according to image quality and metallic artifacts on images was also investigated. The null hypothesis was that there would be no significant difference between AI and human examiners regarding (1) accuracy in finding landmarks and (2) reproducibility of landmarks.
MATERIALS AND METHODS
The institutional review board for the protection of human subjects reviewed and approved the research protocol (S-D 2018010 and ERI 19007).
Figure 1 summarizes the experimental design used in the present study. The x-ray image characteristics of data are listed in Table 1. As a result of ethical concerns, the authors' institution has not permitted researchers to use a high-quality electronic medical image in the DICOM format. All of the learning data images were downloaded in the .jpg format with a resolution of 150 and 300 DPI. When digitizing the test data images, a minimum resolution of 150 DPI was maintained.
The same 1311 lateral cephalometric radiograph images that consisted of 1028 learning and 283 test data were applied in the development stage of the YOLOv3-based AI.3 A total of 80 landmarks comprising two vertical reference points and 46 hard tissue and 32 soft tissue landmarks3 were manually identified by a single examiner (examiner 1) who has 28 years of clinical orthodontic practice experience.
The test set of 283 images was also manually identified for the same 80 landmarks by examiner 2, twice within a 3-month interval. Examiner 2 was a third-year resident at the same institution as examiner 1.
The test data were film images. Test images were classified according to gender, skeletal classification, and presence of metallic artifacts. These x-ray images displayed varying degrees of image quality, which were subjectively classified as “good,” “fair,” or “poor” (Table 1).
The AI method based on the YOLOv3 algorithm applied in the present study was described in detail in part 1 of this AI project.3 The deep-learning was processed by a workstation running Ubuntu 18.04.1 LTS with NVIDIA Tesla V100 GPU (NVIDIA Corporation, Santa Clara, Calif). After the deep-learning procedure on 80 landmark locations of 1028 images was conducted, the trained AI automatically found each landmark on the 283 test images.
Differences between AI and examiner 1, differences between examiners 1 and 2, and differences between examiner 2's first and second trials were calculated as distances measured in millimeters. To compare the detection accuracy between AI and human examiners while controlling for multiplicity problems, t-tests with the Bonferroni correction of alpha errors were performed. To investigate which image factors might influence significant differences in the landmark identification, multiple linear regression analyses were conducted.
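The distance and multiplicity calculations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names, the sample values, the pixel-to-millimeter conversion from DPI (with no correction for radiographic magnification), the use of a paired t statistic, and the choice of 80 tests for the Bonferroni correction are all hypothetical.

```python
import math
from statistics import mean, stdev

MM_PER_INCH = 25.4


def distance_mm(p, q, dpi):
    """Euclidean distance between two pixel coordinates, in millimeters.

    Assumes square pixels and ignores radiographic magnification.
    """
    mm_per_pixel = MM_PER_INCH / dpi
    return math.hypot(p[0] - q[0], p[1] - q[1]) * mm_per_pixel


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (degrees of freedom: n - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after the Bonferroni correction."""
    return alpha / n_tests


# Hypothetical per-image errors (mm) for one landmark
ai_vs_examiner1 = [1.2, 0.8, 1.5, 1.1, 0.9]
examiner2_vs_examiner1 = [1.0, 1.1, 1.3, 1.4, 1.2]

t_value = paired_t_statistic(ai_vs_examiner1, examiner2_vs_examiner1)
alpha_per_test = bonferroni_alpha(0.05, 80)  # assumes one test per landmark
print(f"t = {t_value:.3f}, per-test alpha = {alpha_per_test:.6f}")
```

The p-value for each t statistic would come from a statistics package and be compared against the per-test alpha; that step is omitted here to keep the sketch free of external dependencies.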
Accuracy in Finding Landmarks
When identifying 46 skeletal landmarks, AI showed better accuracy in 14 out of 46 landmarks, the human examiner did better in 14 out of 46 landmarks, and the remaining 18 out of 46 did not show statistically significant differences. Regarding the 32 soft tissue landmarks, AI showed a better accuracy in 5 out of 32, the human examiner did better in 7 out of 32, and the remaining 20 out of 32 did not show statistically significant differences (Figure 2; Table 2).
The mean detection error between AI and the human was 1.46 ± 2.97 mm. The mean difference between human examiners was 1.50 ± 1.48 mm.
Figure 3 illustrates representative cases in which there was no statistically significant difference between AI and the human examiners (ie, Articulare) and in which the human demonstrated a more accurate detection (ie, upper incisal edge). In either case, however, the differences in the mean detection errors between AI and the human were less than 0.9 mm. The only exception was the landmark for the lower incisor root tip, which showed a 1.2-mm greater error from AI than from the human examiner (Table 2).
Reproducibility of Landmarks Upon Repeated Trials
Upon repeated trials, AI always detected identical positions on each landmark, while the human intraexaminer variability from repeated detection trials was 0.97 ± 1.03 mm.
Comparisons According to Image Variables
The results of the multiple linear regression analysis indicated that AI's accuracy in finding landmarks was not meaningfully affected by image variables such as gender, skeletal classification, image quality, and presence of metallic artifacts.
The present study was formulated to investigate whether AI might be a viable option for the repetitive and arduous task of identifying multiple cephalometric landmarks for use in clinical orthodontic practice. The null hypothesis that there would be no difference between AI and human examiners regarding accuracy in finding landmarks could not be rejected. The differences in mean detection errors between AI and the human did not exceed 0.9 mm, except at one landmark, the lower incisor root tip, which showed a 1.2-mm difference. In all landmarks, AI demonstrated as accurate identification as did trained orthodontists. In general, all of those mean differences, being less than 2 mm, would not seem to be clinically significant errors. However, since AI always detected identical positions, the reproducibility by AI upon repeated detection trials was definitely better than that associated with human examiners.
Among the machine learning methods, deep-learning methods have demonstrated superiority in automatically recognizing anatomical landmarks on diagnostic images. Studies on related topics in various fields have also been gaining more popularity.3,11–14 Although three-dimensional images have gained popularity these days,15–19 two-dimensional cephalometric analysis is still a vital tool in orthodontic diagnosis and treatment planning since it provides information regarding a patient's skeletal and soft tissue. Currently, computer-assisted cephalometric analysis eliminates human-induced mechanical errors. Fully automatic cephalometric analysis has long been attempted with the intention of reducing the time required to obtain a cephalometric analysis, improving the accuracy of landmark identification, and reducing the errors caused by a clinician's subjectivity. However, previous studies detected a limited number of landmarks, fewer than 20, and the accuracy results were not satisfactory for use in clinical orthodontic practice. For example, in 2009, 10 landmarks on 41 digital images were identified.20 In 2013, 16 landmarks were identified on 40 cephalometric radiographs, and the mean error from automatically identified landmarks was 2.59 mm.21 The accuracy of those automated methods was not as good as that associated with manual identification. In addition, cephalometric landmarks need not be limited to simply obtaining the skeletal characteristics of patients but could also be applied to plan treatment and to predict treatment outcomes, including soft tissue drape changes. For those purposes, an expanded number, even hundreds, of variables of anatomic landmarks is necessary.14,22–25
In the present study, unlike the learning data that included images from a variety of malocclusion patients, the test images were selected from patients who had a severe type of mandibular deficiency, prognathism, or facial asymmetry. They eventually had orthognathic surgeries performed. From the first formulation of the current study, the selection of these types of patients was intended to test the performance of AI in a more difficult condition, rather than identifying landmarks on images from good-looking subjects. The descriptive summary in Table 1 reflects and matches well with the current trend of patients seeking a university-affiliated dental healthcare institution that has a high proportion of orthodontic patients with severe skeletal discrepancies.26
The cephalometric landmarks identified could potentially result in errors on both the x and y axes. There are several advantages to visualizing results with scattergrams and the 95% confidence ellipse, which is a two-dimensional extension of the Bland-Altman plot.7,8 One of them is that the correlation between the x- and y-axis errors can be observed in the shape of the ellipse. An ellipse closer to an isometric circle indicates more independence between the x- and y-axis errors. The greater the degree of deformation of the ellipse, the greater the indication of correlation between the x- and y-axis errors.7,8
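One way to compute such an ellipse from paired x- and y-axis errors is sketched below. It is an illustration only, assuming a large-sample normal approximation in which the ellipse is scaled by the chi-square quantile with 2 degrees of freedom; the function name and sample data are hypothetical, and the cited references may construct the ellipse differently.

```python
import math
from statistics import mean


def confidence_ellipse(dx, dy, level=0.95):
    """Confidence ellipse for paired x/y errors (normal approximation).

    Returns the center, semi-axis lengths, major-axis angle in degrees,
    and the Pearson correlation between the x and y errors.
    """
    n = len(dx)
    mx, my = mean(dx), mean(dy)
    # Sample variances and covariance
    sxx = sum((x - mx) ** 2 for x in dx) / (n - 1)
    syy = sum((y - my) ** 2 for y in dy) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dx, dy)) / (n - 1)
    # Closed-form eigenvalues of the 2x2 covariance matrix
    half_trace = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)
    # Chi-square quantile with 2 degrees of freedom: -2 * ln(1 - level)
    scale = -2.0 * math.log(1.0 - level)
    denom = math.sqrt(sxx * syy)
    return {
        "center": (mx, my),
        "semi_major": math.sqrt(scale * lam_major),
        "semi_minor": math.sqrt(scale * lam_minor),
        "angle_deg": math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy)),
        "correlation": sxy / denom if denom > 0 else 0.0,
    }


# Hypothetical x/y errors (mm) for one landmark
ellipse = confidence_ellipse([-1.0, 0.0, 1.0, 0.0], [0.0, -1.0, 0.0, 1.0])
print(ellipse)
```

In this sketch, equal semi-axes with a correlation near zero correspond to the near-circular case, while a large ratio between the semi-axes combined with a nonzero correlation corresponds to the deformed ellipse described above.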
In general, the pattern of differences between AI and human examiners demonstrated that AI acted like a human examiner. For example, when human examiners had difficulties in identifying landmarks on poor-quality images, so did AI. This might be the reason why image factors did not meaningfully affect the accuracy of AI in finding landmarks. In those subjects with fixed orthodontic appliances, massive prostheses, and/or surgical bone plates, it was initially anticipated that there would be difficulties in identifying the landmarks because of the multiple metallic artifacts. However, metal artifacts did not appear to have a clinically significant impact on the identification of landmarks either.
One strength of the present study might be that it included the largest number of both learning and test data sets when compared to previous studies. The number of cephalometric landmarks was also the greatest: 80 landmarks including soft tissue glabella to the terminal point on the neck. Conventional key landmarks that have previously been required for cephalometric analysis as well as a large number of other landmarks are essential for accurately predicting posttreatment changes.22–25
As a limitation of the present study, the way AI learned during the training session and how it identified landmarks later in the test step are not explainable without resorting to computer science jargon. Although some technical details have been necessary, the present study intended to focus on showcasing the results from AI. Further details of the modification algorithms appear elsewhere.1,2 Upon repeated trials, AI always found identical positions. However, during preceding pilot studies, when the quantity of learning data was less than 500 images, AI did not identify an identical point. In this regard, how much learning data might be sufficient to teach AI is currently unknown. Furthermore, it could be conjectured that the number of target landmarks might also be a contributing factor in deciding a sufficient amount of learning data. A study to elucidate the sufficient quantity of data for deep-learning of AI might be necessary in the future.
From the clinical perspective, however, AI would never replace trained specialists in orthodontics, nor is AI intended to replace a comprehensive orthodontic training program. Rather, it could supplement, augment, and amplify diagnostic performance by objectively evaluating each patient seeking orthodontic treatment. The AI proposed in the present study can be compatible with the current clinical environment and would retain its validity under the constant supervision of experts in orthodontics.
In general, the pattern of differences between AI and human examiners demonstrated that AI acted like human examiners. AI showed as accurate an identification of cephalometric landmarks as did human examiners.
Upon repeated trials, AI always detected identical positions, which implies that AI might be a more reliable option for repeatedly identifying multiple cephalometric landmarks.
This study was partly supported by grant 05-2018-0018 from the Seoul National University Dental Hospital Research Fund and the Seoul R&BD program (grant number IC 170010) funded by the Seoul Metropolitan Government.
The final form of the machine learning system was developed by DDH Inc (Seoul, Korea), which is expected to own the patent in the future. Among the coauthors, Hansuk Kim and Soo-Bok Her are shareholders of DDH Inc. Youngsung Yu and Girish Srinivasan are employees there. Other authors do not have a conflict of interest.
The first two authors contributed equally to this study.
Resident, Department of Orthodontics, Seoul National University Dental Hospital, Seoul, Korea.
Clinical Lecturer, Department of Orthodontics, Seoul National University Dental Hospital, Seoul, Korea.
Research Assistant, DDH Inc, Seoul, Korea.
Staff Scientist, DDH Inc, Seoul, Korea.
Research Scientist, DDH Inc, Seoul, Korea.
Courtesy Resident, Ministry of Health, Damman, Kingdom of Saudi Arabia.
Assistant Professor, Assistant Program Director, Department of Orthodontics, University of Florida College of Dentistry, Gainesville, Fla.
Professor, Department of Orthodontics, Seoul National University School of Dentistry and Dental Research Institute, Seoul, Korea.