ABSTRACT
To compare an automated cephalometric analysis based on the latest deep learning method for automatically identifying cephalometric landmarks (AI) with previously published AIs, following the test protocol of the worldwide AI challenges at the International Symposium on Biomedical Imaging conferences held by the Institute of Electrical and Electronics Engineers (IEEE ISBI).
This latest AI was developed using a total of 1983 cephalograms as training data. In the training procedures, a modification of a contemporary deep learning method, the YOLO version 3 algorithm, was applied. The test data consisted of 200 cephalograms. Following the test style of the AI challenges at IEEE ISBI, a human examiner manually identified the 19 IEEE ISBI-designated cephalometric landmarks in both the training and test data sets; these served as references for comparison. The latest AI and another human examiner then independently detected the same landmarks in the test data set. The test results were compared using the measures reported at IEEE ISBI: the success detection rate (SDR) and the success classification rate (SCR).
The SDR of the latest AI in the 2-mm range was 75.5%, and the SCR was 81.5%. These rates were higher than those of any previously published AI. Compared with the human examiners, the AI showed a superior success classification rate in some cephalometric analysis measures.
This latest AI seems to have superior performance compared to previous AI methods. It also seems to demonstrate cephalometric analysis comparable to human examiners.
INTRODUCTION
With the recent advancement in computer technology, machine learning has played a significant role in the detection and classification of certain diseases identified in medical images.1,2 In orthodontics, there have also been efforts to utilize machine learning techniques in a variety of ways, one of which was the automated identification of cephalometric landmarks via artificial intelligence (AI).3–10 Although research using three-dimensional images has attracted attention,11–13 the two-dimensional cephalometric image is still important and is the most commonly utilized tool in orthodontics for diagnosis, treatment planning, and outcome prediction.3,14–18
As more attempts were made to incorporate AI into cephalometric analysis, worldwide AI challenges began in 2014 at the International Symposium on Biomedical Imaging conferences under the support of the Institute of Electrical and Electronics Engineers (IEEE ISBI). The challenges used 19 cephalometric landmarks (Table 1) as the basis for their accuracy measure, the success detection rate, calculated over 2-mm to 4-mm error ranges (Table 2). Since 2015, the challenge became more clinically oriented and began reporting success classification rates for the eight cephalometric measurements shown in Tables 3, 4, and 5.5,7–10
In line with the trend of comparing the performance of evolving AI, the purpose of this study was to compare and evaluate an automated cephalometric analysis, based on the latest deep learning method of automatically identifying cephalometric landmarks,4,6,19 with a number of previously published AIs.7–9,20,21 In addition, the study assessed how accurately the latest AI performed cephalometric analysis compared to human examiners.
MATERIALS AND METHODS
The institutional review board for the protection of human subjects at Seoul National University Dental Hospital reviewed and approved the research protocol (ERI 19007).
Figure 1 summarizes the experimental design. The new AI applied in this study was based on a modification of the latest deep learning method, the YOLO version 3 algorithm.4,6,19 Deep learning was performed on a workstation running Ubuntu 18.04.1 LTS with an NVIDIA Tesla V100 GPU (NVIDIA Corporation, Santa Clara, CA, USA). During the learning process of the latest AI, a total of 1983 cephalograms were used as the training data. An examiner (examiner 1, SJL), with 30 years of clinical orthodontic practice experience, manually identified the cephalometric landmarks in the training data.
The test data consisted of 200 new images. Examiner 1 manually identified landmarks in the test data. These acted as reference points. Then, another examiner (examiner 2, HWH), a board-certified orthodontist, manually identified the same landmarks in the test data. Lastly, the trained new AI detected landmarks in the test data set. The comparison of the results from the test data set used the conventional 19 landmarks that were identical to those used at the IEEE ISBI challenges (Table 1).7,9,10
The performance of the cephalometric analyses was evaluated according to the same measurement values and test formats suggested by the IEEE ISBI challenges: (1) the success detection rates (SDR) at 2-, 2.5-, 3-, and 4-mm error ranges for the 19 landmarks (Table 2), and (2) the success classification rates (SCR) for eight cephalometric analysis measures (Tables 3, 4, and 5).7,9,10
When calculating the SDR in the 2-mm error range, a landmark was considered successfully detected if the AI's detection fell within 2 mm of the reference. Most AI studies have reported that errors within 2 mm are clinically acceptable.4,6,8–10
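As a minimal sketch of the SDR rule described above (the function name and coordinate values here are illustrative, not the study's implementation), the computation amounts to counting landmarks whose Euclidean error falls within the threshold:

```python
import math

def success_detection_rate(predicted, reference, threshold_mm=2.0):
    """Fraction of landmarks whose Euclidean distance from the
    reference position falls within the threshold (in mm)."""
    hits = 0
    for (px, py), (rx, ry) in zip(predicted, reference):
        error = math.hypot(px - rx, py - ry)  # Euclidean error in mm
        if error <= threshold_mm:
            hits += 1
    return hits / len(reference)

# Illustrative data: three landmarks, one outside the 2-mm range.
pred = [(10.0, 20.0), (31.5, 40.0), (55.0, 63.0)]
ref = [(10.5, 20.5), (30.0, 40.0), (55.0, 60.0)]
print(success_detection_rate(pred, ref, 2.0))  # → 0.6666666666666666
```

In the study's format, this rate would be computed per landmark across the 200 test images, then averaged over the 19 landmarks for each error range.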
For the classification of anatomical types, the same eight clinical measurements set in the IEEE ISBI 2015 challenge were analyzed; the criteria are shown in Table 3. The measurement values and clinical classifications derived by examiner 1 were set as the reference, and the classifications by the latest AI and examiner 2 were compared through the SCR. For instance, if examiner 1's analysis identified a patient's ANB value as Class II, and the AI and examiner 2 each independently classified the same patient's ANB as Class II, both classifications were counted as successful.
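The ANB example above can be sketched as follows. This is an illustrative fragment only: the Class I range used here is an assumed placeholder, and the study's actual classification criteria are those in its Table 3.

```python
def classify_anb(anb_deg, class_i_range=(3.2, 5.7)):
    """Map an ANB angle (degrees) to a skeletal class.
    The Class I range is an assumed example, not the study's criteria."""
    lo, hi = class_i_range
    if anb_deg > hi:
        return "Class II"
    if anb_deg < lo:
        return "Class III"
    return "Class I"

def success_classification_rate(test_classes, reference_classes):
    """Fraction of cases where the classification matches the reference."""
    matches = sum(t == r for t, r in zip(test_classes, reference_classes))
    return matches / len(reference_classes)

# Illustrative ANB values for four patients.
reference = [classify_anb(v) for v in [4.0, 7.1, 2.0, 5.0]]
ai = [classify_anb(v) for v in [4.3, 6.9, 3.5, 4.8]]
print(success_classification_rate(ai, reference))  # → 0.75
```

The same matching logic would apply to each of the eight measurements, with the AI's and examiner 2's classifications each scored against examiner 1's reference.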
RESULTS
The SDRs of the new AI for each of the 19 landmarks are shown in Table 2. Eleven of the 19 landmarks showed an SDR above 80% within the 2-mm error range. The average SDR in the 2-mm error range was 75.45%. Figure 2 shows the detection error patterns for the cephalometric landmarks.
Figure 2. Scatter plots with 95% confidence ellipses for the landmark detection errors.
Comparisons between the new AI and previous AIs in terms of the SCR are summarized in Table 4. The average SCR of the new AI was 81.53%, the highest among the AIs compared. In each of the eight measurements, the SCR of the new AI was superior to those of previous AIs.
Comparisons of the SCR between the human examiner and the latest AI are summarized in Table 5. The latest AI showed a superior SCR to the human examiner (examiner 2) in three cephalometric measurements; in the remaining measurements, examiner 2 showed better results.
DISCUSSION
The present study demonstrated how well the latest automated cephalometric analysis performed when compared with previously reported AIs. The latest AI showed better performance than previous AIs, and it demonstrated accuracy similar to that of human examiners. Park et al. suggested that deep learning-based techniques could provide improved accuracy compared with other popularly applied AI algorithms such as random forest.6 Later, Hwang et al. evaluated whether AI in the automatic identification of cephalometric landmarks could actually exceed manual markings performed by clinical experts; in general, the answer was “yes” in terms of reproducibility.4 Although those previous studies focused on the development of AI and its reliability in terms of the Euclidean distance measures of landmark detection errors, the present study focused on its clinical relevance, i.e., how accurately the new AI would perform cephalometric analyses useful for patient treatment. In addition, compared with previous AIs, the quantity of study data was almost doubled, from N = 1311 to N = 2183, which improved the explanatory power.4,6,22
The SDR results showed that the latest AI successfully detected most landmarks within 2 mm at rates close to 90%, although some landmarks were less accurate than others. Several factors might explain this. Some landmarks, such as Porion, Orbitale, and PNS, are difficult to detect because of overlapping cranial base structures. Porion, Orbitale, and Gonion exist bilaterally and might introduce errors in the process of determining the midpoint of those bilateral structures. Other landmarks, such as Point A, Point B, Gonion, and soft tissue Pogonion, were difficult to pinpoint exactly even in a magnified view. Orbitale and PNS showed greater standard deviations when identified by the AI (Table 2). As Figure 2 shows, errors during landmark identification appeared on both the x- and y-axes. Since Pogonion is defined as the most anterior point of the chin region, it was easy to identify on the x-axis; consequently, its y-axis error was greater than its x-axis error. The error patterns of Pogonion, Gnathion, and Menton, which represent the anterior, anteroinferior, and inferior points of the chin, respectively, could be interpreted in the same way. These error patterns also appeared between the human examiners (Figure 2).
For an AI method to be useful in clinical orthodontic practice, analyses of angles or linear measurements derived from AI-detected landmarks are necessary. The results of SCRs demonstrated that the clinical classification performance of the latest AI was better than previously published AI technologies. In addition, the current study utilized approximately 2000 images in the training procedures for the new AI. This likely contributed to the improvement in the SCR over previous AIs.22 As would be expected, measurements consisting of landmarks with more accurate SDR values resulted in more accurate SCR values. Although not all measurements showed that the new AI was performing better than humans, taking into account the overall results, the AI performance could be considered comparable to a human's analysis. In clinical practice, identifying anatomical landmarks, tracing anatomical structures, and diagnosing patients' problems used to be a significant, time-consuming task. Consequently, saving labor, time, and effort with the help of this improved AI would be advantageous to clinicians.
Previously published algorithms can be divided into two categories: random forest23 and convolutional neural network.1,24 Random forest is a machine learning algorithm developed in 2001 and has a relatively long history. Until fairly recently, it seemed to be the mainstream approach in the development of automated cephalometric analysis systems; for example, the majority of teams participating in the IEEE ISBI challenges in 2014 and 2015 applied the random forest algorithm. A major drawback of random forest is its complexity: each decision tree in the forest is a logical structure, but with a large collection of trees it becomes difficult to intuitively identify relationships that may exist in the input data. In addition, prediction with a random forest may take more time and require more computational resources than other algorithms. In comparison, the convolutional neural network is a type of deep learning that has come into the spotlight relatively recently.25 It was inspired by the natural visual recognition mechanisms of living things and is known to be better suited to image processing. The YOLO version 3 algorithm that was modified and applied in the present study was based on the convolutional neural network.19
Medical image analysis using deep learning technology could help clinicians make their diagnosis and treatment planning more efficient. Despite the ability to deliver better performance through deep learning methods, some limitations require careful attention when applying deep learning in clinical practice. First, deep learning structures require a huge amount of training data and computing power. Second, in some deep learning procedures, even though the input and output values are known, the internal structure cannot be well explained. Finally, deep learning might be affected by the noise inherent in medical imaging: deep learning algorithms can misread objects if even a small amount of noise, invisible to humans, is added to the image.2
Despite the limitations, it seems that the performance of the latest deep learning algorithms continues to improve over the years. Deep learning techniques have the potential to be a great help in the development of medical image analysis, including orthodontic cephalometric analysis. In addition, if a clinician applies the deep learning AI cephalometric analysis with some manual modification, the results could be even better.
CONCLUSIONS
The latest AI seems to have superior performance to previous AIs. It demonstrated cephalometric analysis comparable to that of human examiners.
It is envisioned that AI will maintain, and even improve, its effectiveness under supervision by orthodontists.
ACKNOWLEDGMENT
The data presented in the present study were part of a doctoral dissertation (HWH). This study was partly supported by grant 05-2020-0021 from the SNUDH Research Fund.
REFERENCES
Author notes
Clinical Lecturer, Department of Orthodontics, Seoul National University Dental Hospital, Seoul, Korea.
Graduate Student, Department of Orthodontics, Graduate School, Seoul National University, Seoul, Korea.
Assistant Professor and Program Director, Department of Orthodontics, University of Florida College of Dentistry, Gainesville, Fla., USA.
Professor, Department of Orthodontics and Dental Research Institute, Seoul National University School of Dentistry, Seoul, Korea.