ABSTRACT
To assess the accuracy of ChatGPT answers concerning orthodontic clear aligners.
A cross-sectional content analysis of ChatGPT-generated responses to queries related to clear aligner treatment (CAT) was undertaken. A total of 111 questions were generated by three orthodontists based on a set of predefined domains and subdomains. The artificial intelligence (AI)-generated (ChatGPT) answers were extracted and their accuracy was determined independently by five orthodontists. The accuracy of answers was assessed using a prepiloted four-point scale scoring rubric. Descriptive statistics were performed.
The total mean accuracy score for the entire set was 2.6 ± 1.1. It was noted that 58% of the AI-generated answers were scored as objectively true, 18% were selected facts, 9% were minimal facts, and 15% were false. False claims included the ability of CAT to reduce the need for orthognathic surgery (4.0 ± 0.0), improve airway function (3.8 ± 0.5), achieve root parallelism (3.6 ± 0.5), alleviate sleep apnea (3.8 ± 0.5), and produce more stable results compared to fixed appliances (3.8 ± 0.5).
The overall level of accuracy of ChatGPT responses to questions concerning CAT was suboptimal, and the responses lacked citations to relevant literature. The ability of the software to offer current and precise information was limited. Therefore, clinicians and patients must be mindful of false claims and relevant facts omitted in the answers generated by ChatGPT.
INTRODUCTION
With advancement in the field of artificial intelligence (AI), ChatGPT (Chat Generative Pre-trained Transformer) was launched in November 2022 by OpenAI (OpenAI LLC, San Francisco, CA, USA) as an AI-based large language model (LLM) capable of creating human-like responses to any question input.1 It has surprised scientists and the general public with its comprehensive and detailed responses to various text inputs.1 ChatGPT is based on a generative pretrained transformer (GPT) architecture that processes natural language text with the use of neural networks. Thus, the generated responses are often based on the context of the input wording.1 Its superiority lies in its ability to create accurate and highly refined answers due to its wide range of training data from internet sources.2
Several features make ChatGPT a powerful tool, including its ability to generate responses in different styles and languages and to understand the context of the question. Its benefits can be summarized as increased efficiency in automating conversations and answering questions, which saves time and resources.2 Because it is trained on large data sets of information and conversations, it is characterized by improved accuracy compared to manual responses, outperforming other natural language processing (NLP) systems.2 Despite those benefits, there are challenges associated with ChatGPT, including security concerns related to the risk of adversarial attacks that aim to manipulate the model and produce incorrect output. Additionally, it has limited capability to provide up-to-date and accurate information and/or answers to complex questions across a wide variety of topics due to its limited ability to browse the internet and access external information.2 Therefore, uncertainty remains with regard to accessing high-quality and reliable information via ChatGPT.
Technological advancements have led healthcare professionals and patients to increasingly refer to AI chatbots and online search engines as a convenient source of medical and dental information.3 The conversational interactions and the seemingly correct responses to many medical and dental inquiries have increased reliance on these platforms to answer many questions and reduced reliance on other professional and accurate resources for medical and dental information.3 The risks of this phenomenon have been well illustrated in the healthcare literature, and precautionary warnings are thereby warranted.4
Clear aligner treatment (CAT) is one of the most popular and debated developments among contemporary orthodontic appliances due to increased demand for esthetic orthodontic treatment.4 In a recent cross-sectional study concerning marketing claims related to clear aligners (Invisalign) made by professional accounts on social media, a total of 92 claims were identified from 50 Instagram posts.5 These claims were mainly related to the acceptability of these appliances and less associated pain; only 5% of these claims were objectively true.5 ChatGPT is a convenient source for patients to access information with regard to CAT. Its accuracy, reliability, and content validity with regard to information associated with CAT in orthodontics have not been evaluated, particularly in connection with questions that dentists, orthodontists, and patients are likely to ask. Therefore, this study was designed to perform a content analysis of ChatGPT responses providing information about outcomes related to CAT. The hypothesis was that ChatGPT AI-generated responses about CAT would be accurate and reliable.
MATERIALS AND METHODS
A cross-sectional content analysis of ChatGPT-generated responses to queries related to CAT was undertaken. An initial dataset of 150 questions was generated by two authors (AAS, AD) and reviewed by a third author (YS). These questions were initially categorized based on 14 key treatment outcome domains and their 54 associated subdomains, as described previously by Tsichlaki et al. using a standardized methodology that encompassed clinician- and patient-focused outcomes.6 The initial dataset of questions was refined through joint discussion, resulting in a final selection of 111 questions. Additional questions that did not fit into any of the key domains were categorized into “other.”
To ensure consistency, one author (YS) collected the AI-generated answers for the 111 questions (chat.openai.com). A customized Excel sheet was created for data collection and scoring. The accuracy of the collected answers was scored independently by five orthodontists and CAT experts (AAS, AD, UM, AV, and VN). The accuracy of the answers was assessed based on the best available evidence and the raters' clinical expertise in CAT using a modified four-point scale as follows: 1: Objectively true; 2: Selected facts; 3: Minimal facts; 4: False (Table 1).7 Prior to scoring, a meeting was held to ensure common understanding of the scoring system among raters. Accuracy assessment was piloted by the five raters on a subset of answers (n = 10).
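The scoring sheet described above can be pictured as one record per question and rater. The following is a minimal sketch only, not the authors' spreadsheet: the `Rating` record, the `question_id` numbering, and the `validate` helper are hypothetical names introduced for illustration, while the four rubric levels and the rater initials are taken from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class Accuracy(IntEnum):
    """Four-point rubric from the Methods (1 = best, 4 = worst)."""
    OBJECTIVELY_TRUE = 1
    SELECTED_FACTS = 2
    MINIMAL_FACTS = 3
    FALSE = 4


# Rater initials as listed in the text.
RATERS = ("AAS", "AD", "UM", "AV", "VN")


@dataclass(frozen=True)
class Rating:
    question_id: int   # hypothetical identifier, assumed to run 1..n_questions
    rater: str
    score: Accuracy


def validate(ratings, n_questions):
    """Return a list of problems: unknown raters, duplicate or missing ratings."""
    problems = []
    seen = set()
    for r in ratings:
        if r.rater not in RATERS:
            problems.append(f"unknown rater {r.rater!r} on question {r.question_id}")
        key = (r.question_id, r.rater)
        if key in seen:
            problems.append(f"duplicate rating {key}")
        seen.add(key)
    # Every question should carry exactly one score from each rater.
    for q in range(1, n_questions + 1):
        for rater in RATERS:
            if (q, rater) not in seen:
                problems.append(f"missing rating ({q}, {rater!r})")
    return problems
```

Because `Accuracy` is an enumeration, a score outside the 1 to 4 range (for example `Accuracy(5)`) raises a `ValueError` when the record is built, so out-of-range entries are caught before any statistics are computed.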
Statistical Analysis
Simple descriptive statistics were used to summarize the data. Outcomes for each question and each domain were examined using GraphPad Prism version 10 (GraphPad Software Inc, La Jolla, CA). Average accuracy score results were summarized as mean and standard deviation for each question separately, per domain and subdomain, and for the entire dataset. Distribution of median accuracy scores was also calculated.
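The descriptive summary described above can be sketched in a few lines of Python. This is an illustrative sketch, not the GraphPad Prism workflow used in the study: the `ratings` values are invented, the function names are hypothetical, and the use of the sample standard deviation is an assumption because the text does not state which formula was applied.

```python
from collections import Counter
from statistics import mean, median, stdev

# Rubric labels from the four-point scale described in the Methods.
RUBRIC = {1: "Objectively true", 2: "Selected facts", 3: "Minimal facts", 4: "False"}

# Hypothetical ratings: question id -> scores from five raters (illustrative only).
ratings = {
    "Q1": [1, 1, 1, 2, 1],
    "Q2": [4, 4, 4, 4, 4],
    "Q3": [3, 4, 4, 4, 4],
    "Q4": [2, 2, 3, 2, 1],
}


def summarize_question(scores):
    """Return (mean, sample SD) for one question's rater scores."""
    return mean(scores), stdev(scores)


def overall_summary(ratings):
    """Return (mean, sample SD) across every individual rating in the dataset."""
    all_scores = [s for scores in ratings.values() for s in scores]
    return mean(all_scores), stdev(all_scores)


def median_category_distribution(ratings):
    """Percentage of questions whose median score falls in each rubric category.

    Assumes an odd number of raters, so each median is one of the rubric values.
    """
    counts = Counter(RUBRIC[int(median(s))] for s in ratings.values())
    total = len(ratings)
    return {label: 100 * counts.get(label, 0) / total for label in RUBRIC.values()}


if __name__ == "__main__":
    for qid, scores in ratings.items():
        m, sd = summarize_question(scores)
        print(f"{qid}: {m:.1f} ± {sd:.1f}")
    print(median_category_distribution(ratings))
```

Per-domain and per-subdomain summaries would follow the same pattern after grouping the questions by domain. Note that the mean of all ratings and the distribution of per-question medians are separate summaries and need not move together.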
RESULTS
Among the 111 scored questions, the total mean accuracy score for the entire set was 2.6 ± 1.1 (Tables 2–5). Of note, 58% of the generated answers fell into the objectively true category, 18% were scored as selected facts, 9% as minimal facts, and 15% were considered false (Figure 1).
Inaccurate Information
The mean accuracy score concerning harms associated with CAT (domain 1) was 2.2 ± 0.5 (Table 2). Inaccurate information was provided by ChatGPT concerning pain associated with CAT compared to fixed appliances (3.4 ± 1.3) (Table 2).
Claims related to the effect of CAT on jaw movements (3.2 ± 1.3) and associated condylar changes were inaccurate (3.0 ± 0.0) (Table 3). Additionally, there was inaccurate information identified in soft tissue changes associated with CAT (2.6 ± 1.5), as well as occlusal and alignment changes (2.9 ± 0.3), especially those concerning the effectiveness of CAT in achieving root parallelism (3.6 ± 0.9) and torque control (3.4 ± 0.9) (Table 3). Inaccurate information was noted in ChatGPT responses to questions related to the ability of CAT to reduce need for orthognathic surgery (4.0 ± 0.0), affect airway volume (3.8 ± 0.5), alleviate sleep apnea (3.8 ± 0.5), and provide more stable outcomes compared to fixed appliances (3.8 ± 0.5) (Table 5).
Accurate Information
Responses associated with periodontal health (1.3 ± 1.6), and microbial and physiological changes (1.3 ± 0.5) in CAT were considered accurate (Table 3). Additionally, ChatGPT provided accurate answers concerning patient satisfaction with CAT (1.0 ± 1.0) and its effect on sleeping, eating, and oral health-related quality of life (1.1 ± 3.1) (Table 4), aligner breakage and loss, hygiene and wear requirements, and frequency and number of appointments (1.4 ± 0.5) (Table 4). Accurate information was also noted regarding treatment duration (1.0 ± 1.2), rate of tooth movement (1.0 ± 2.0), and cost-effectiveness (1.0 ± 1.8) (Table 4).
DISCUSSION
Based on the results of this cross-sectional study conducted 5 months into the existence of ChatGPT, the hypothesis was rejected. Responses provided concerning CAT were short of being completely reliable. The ability to generate evidence-based answers about CAT was suboptimal, with only 58% of the answers being objectively true. False claims primarily included the ability of CAT to reduce the need for orthognathic surgery, improve airway function, achieve root parallelism, alleviate sleep apnea, and produce more stable results compared to fixed appliances. Therefore, the adequacy of ChatGPT in its current form and its use in academia and research is disputable. Many of the generated answers lacked expert-level, evidence-based opinions, and the language used in the answers was simple (Table 6). Overall, the answers for all questions were generally lengthy and referred to trials and reviews without in-text citations (Supplementary 1). However, some ChatGPT answers to CAT queries were useful in the areas of periodontal health (1.3 ± 1.6), patient satisfaction (1.0 ± 1.0), impact on daily activities (1.1 ± 3.1), in addition to oral/aligner hygiene (1.4 ± 1.3) and wear (1.3 ± 2) requirements.
ChatGPT indicated that CAT was associated with overall less pain compared to fixed appliances. However, a recent systematic review indicated that patients treated with clear aligners reported less pain in the first couple of days only and there was no difference after a week; therefore, these findings were controversial among the included studies and the certainty of evidence was low.8 In terms of function, ChatGPT generated answers indicating that CAT could affect jaw movement and cause jaw pain and difficulty in movement. Likewise, with regard to responses related to condylar changes, ChatGPT indicated the presence of a potential relationship between jaw joint health, condylar adaptation, and use of clear aligners. This was in disagreement with the available evidence that orthodontic treatment might not increase or be related to the prevalence of temporomandibular disorders.9
The AI-generated responses indicated that root parallelism and torque can be achieved with aligners. Likewise, ChatGPT stated that CAT was highly effective in correcting rotations. Unfortunately, achieving pure root movement and complete derotation of teeth with clear aligners are onerous tasks, and attempts to achieve these have been unsuccessful.10 Evidence has also shown that torquing or root movement is the least predictable of all movements with CAT.11 Large language-based algorithms perform relatively well on knowledge-based tests, but their performance is subpar on medical/dental concepts and literature. To deliver excellent performance, AI-based large language models require high-quality data but, currently, they are trained on biased data sets, which may be the reason for inaccurate answers to queries in this specific research field.
The responses related to the effects of CAT on facial and smile esthetics were flawed. The ChatGPT response explained that, if too much space was created between the upper and lower teeth, it could result in a “gummy” smile or an overbite that may detract from the overall appearance of the face. A recent study showed that smile esthetics after the use of Invisalign and fixed appliances was superior (buccal corridors, smile cant, gingival display, maxillary midline, and smile index) to fixed appliances and there was no effect on lip position.12
ChatGPT indicated that the use of clear aligners could reduce the need for orthognathic surgery, thus failing to provide clinician/orthodontist-level insight. Also, the ChatGPT-generated response did not address the outcomes achieved in orthognathic surgery cases treated with clear aligners or fixed appliances.13 Additionally, ChatGPT lacked sources (references) for its answers. In its current format, an AI-generated large language-based model needs further algorithmic training to implement current dental/orthodontic knowledge, principles, and concepts in real-world settings.
ChatGPT seemed to be equally unreliable when it was questioned on the effects of CAT on airway changes and sleep apnea, suggesting positive effects on airway volume and relief of sleep apnea. This was erroneous and overlooked the available evidence on the lack of association between airway improvement and use of CAT in treating any malocclusion.14
The version of ChatGPT used in this research was not a useful tool for generating answers to scientific queries. ChatGPT lacked the knowledge and expertise necessary to accurately convey simple and complex orthodontic concepts. Other major problems with ChatGPT responses were redundancy and plausible-sounding false information. Therefore, although it might provide patients with some useful insights and general information with regard to their treatment, its use for teaching and research purposes is currently limited and should be avoided. Orthodontic professionals should also be aware of how patients may use these tools and provide them with appropriate guidance. An additional concern was that ChatGPT provided varied responses for the same questions when assessed at different timepoints, which might raise ethical concerns and lead to discrepancies in accessing the desired information. It is possible that, with enough repetition, more accurate answers could be generated due to the machine learning capabilities of ChatGPT. Additionally, it seems that the algorithms need to be trained and tested using high-quality published journal articles as evidence to improve the accuracy and capabilities of ChatGPT in answering questions from an evidence-based perspective. Relying exclusively on the currently available version of ChatGPT as a source of valid and reliable information with regard to CAT is not recommended. The delivery of false information poses a risk to the profession and patients. Societies need to debunk misleading information and increase awareness among the public and professionals who use this AI tool.
On the other hand, acceptable accuracy levels were observed for answers to questions concerning knowledge, satisfaction, compliance, and cost-effectiveness. Therefore, depending on regular advances in the model’s algorithm and the influence of reinforcement learning from human feedback on the system, ChatGPT might in the future be a useful source for orthodontic patients seeking to comprehend aspects related to CAT, since it provides an interactive interface for treatment-related information. However, at the moment, use of these advances cannot be an alternative to conventional means of communicating information related to orthodontic treatment to the patient.
Strengths, Limitations, and Future Directions
A strength of the study was the inclusion of 111 questions developed by three orthodontists based on outcomes relevant to clinicians and patients. However, the selection of a cohort of five orthodontists from the academic sector to rank the generated answers may have introduced response bias. It is important to note that these orthodontists are well-versed in CAT and evidence-based orthodontics and have published extensively. One limitation was that the validation of ChatGPT may not necessarily apply to other AI models. Additionally, ethical and privacy concerns related to the quality of information shared by ChatGPT should be taken into consideration. It is highly recommended that high-quality orthodontic information from sources such as peer-reviewed journals be integrated into these models. Additionally, incorporation of in-text citations into these AI algorithms can help ensure the delivery of high-quality information. Establishing a regulatory framework for orthodontic websites is an area on which the specialty should consider focusing.15,16 Future research should perform content analyses of other large language-based AI models.
CONCLUSIONS
The AI-generated answers to questions related to CAT in orthodontics displayed suboptimal accuracy and lacked reference to the current evidence.
False claims were identified, especially on topics related to orthognathic surgery, airway, efficacy of tooth movement, and root control. In the future, ChatGPT could be a useful adjunct tool to improve knowledge and answer questions regarding orthodontic treatment.
However, patients and orthodontists should be aware of the limitations and ethical concerns associated with ChatGPT and actively check available evidence from trusted sources.
Attempts should be made to improve the robustness of these AI models prior to their integration in the healthcare profession.
ACKNOWLEDGMENTS
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R88), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The authors report no conflicts of interest.
SUPPLEMENTAL DATA
Supplemental Table #1 is available online.
REFERENCES
Author notes
Clinical Assistant Professor, Department of Orthodontics, University of Florida, Gainesville, FL, USA.
Assistant Professor, Department of Preventive Dental Sciences, College of Dentistry, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Professor and Head, Brodie Craniofacial Endowed Chair, Department of Orthodontics, University of Illinois, Chicago, IL, USA.
Associate Clinical Professor, Division of Orthodontics, UConn Health, Farmington, CT, USA.
Adjunct Professor, Department of Orthodontics, Saveetha Dental College, Saveetha Institute of Medical and Technical Sciences, Chennai, India.
Professor and Chair, Department of Growth and Development, University of Nebraska Medical Center College of Dentistry; and Children’s Hospital, Omaha, NE, USA.