Artificial intelligence algorithms hold the potential to fundamentally change many aspects of society. Applications of these tools, including the publicly available ChatGPT, have demonstrated impressive domain-specific knowledge in many areas, including medicine.
The objective of this study was to assess the level of pathology domain-specific knowledge of ChatGPT using 2 different underlying large language models, GPT-3.5 and the updated GPT-4.
An international group of pathologists (n = 15) was recruited to generate pathology-specific questions at a similar level to those that could be seen on licensing (board) examinations. The questions (n = 15) were answered by GPT-3.5, GPT-4, and a staff pathologist who recently passed their Canadian pathology licensing exams. Participants were instructed to score answers on a 5-point scale and to predict which answer was written by ChatGPT.
GPT-3.5 performed at a similar level to the staff pathologist, while GPT-4 outperformed both. The overall score for both GPT-3.5 and GPT-4 was within the range of meeting expectations for a trainee writing licensing examinations. In all but one question, the reviewers were able to correctly identify the answers generated by GPT-3.5.
By demonstrating the ability of ChatGPT to answer pathology-specific questions at a level similar to (GPT-3.5) or exceeding (GPT-4) a trained pathologist, this study highlights the potential of large language models to be transformative in this space. In the future, more advanced iterations of these algorithms with increased domain-specific knowledge may have the potential to assist pathologists and enhance pathology resident training.
Rapid technological advancements in molecular pathology and a continuously growing body of knowledge have resulted in an unprecedented increase in diagnostic complexity for pathologists.1 This presents a challenge for both pathologists in training and practicing pathologists to stay up to date on the latest guidelines and to provide the current standard of care in a variety of subspecialty areas. In recent years, there has been a surge of interest in developing artificial intelligence (AI)–based tools to manage large volumes of data, automate routine tasks, and enhance diagnostic accuracy, especially with the rise of digital pathology.2–5 Despite existing efforts to translate these AI-based applications into clinical practice, their widespread adoption has been limited to date.6 The development and extensive validation of AI-based applications are essential to improving the efficiency and accuracy of the diagnostic decision-making process and to increasing trust in AI-supported decision making in pathology.
The recent release of large language models (LLMs) with the ability to generate humanlike responses to prompts has garnered major attention.7,8 The release of Chat Generative Pretrained Transformer (ChatGPT) as a public tool by OpenAI generated significant attention in the media and from both the educational and scientific communities.9 ChatGPT is trained on a massive number of textual sources to learn complex language patterns and to generate natural-sounding, context-specific text based on the entered prompt.10 A growing body of evidence shows that LLMs such as ChatGPT perform extremely well on standardized tests. GPT-3.5 achieved a passing score on the United States Medical Licensing Examination,11,12 while the GPT-4 model scored in the top 10% of test takers on a simulated bar exam and on Graduate Record Examinations.9
However, there are several limitations in evaluating the performance of LLMs on standardized tests. LLMs are trained on extensive volumes of data, enabling them to acquire a broad “understanding” of various topics. While LLMs excel at pattern recognition and at predicting the most probable sequence of words from previous natural-language training sets, they occasionally make incorrect predictions and are unable to self-correct incorrect outputs.13 Currently, there is limited information on how well this method of text generation can articulate intricate concepts in complex domains such as pathology. Additionally, ChatGPT may be trained on identical or similar questions from the internet,14 and therefore it is plausible that performance on a standardized test may instead reflect contaminated training data.15
This study assessed how LLMs perform when applied to short-answer pathology questions similar to those used on standardized pathology examinations. The questions were novel, generated by an international group of pathologists to avoid contamination with the training sets used to develop the LLMs. The aim of this study was to evaluate the performance of GPT-3.5 and GPT-4 in the pathology domain and their possible application to pathology education. As a comparison, other LLMs, namely Bard (Google LLC), Perplexity, and Claude Instant (Anthropic), were also run on the test set. We also evaluated the ability of human observers to recognize AI-generated outputs.
MATERIALS AND METHODS
This study was announced on X (formerly known as Twitter) by the senior author (M.J.C.), inviting practicing pathologists from around the world to participate as coauthors on the study.16 Interested pathologists were provided with instructions on how to generate and score questions. An initial survey (form 1) was sent out as a Google Form to 15 participants, who were each asked to write 1 short-answer pathology examination question with a difficulty level suited to a senior pathology resident trainee.
These questions were each used as “prompts” for ChatGPT (GPT-3.5 and GPT-4) and were unaltered as submitted by the participating pathologists. The same questions were also answered by a staff pathologist who had recently completed their residency training. The study was conducted in 2 phases: phase 1 involved evaluating the responses generated by ChatGPT utilizing the GPT-3.5 LLM and by the answering pathologist. The second phase, after a 2-month washout period, involved evaluating responses generated by ChatGPT utilizing the GPT-4 LLM, which had not been publicly available during phase 1 of the study.
Data in this study were collected using Google Forms. Form 1 had previously been sent to the participants to collect pathology examination questions/prompts. After receiving answers from GPT-3.5 and the answering staff pathologist, form 2 was sent to the participants to evaluate each answer. For each question, the 2 answers (GPT-3.5 versus pathologist) were arranged randomly for blinding purposes. Participants were asked to score the 2 answers, rate the difficulty of the question out of 5, and identify which answer was written by ChatGPT. Participants were also given the option to skip a question if it was outside of their area of expertise. At the end of the survey, participants were asked to provide qualitative feedback on the performance of GPT-3.5. After the 2-month washout period, the same prompts were input into GPT-4, and a third Google Form (form 3) was sent out to the participants to score the quality of the GPT-4-generated answers. The pathologist-generated answers were not reevaluated.
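To illustrate the blinding step described above, the following is a minimal sketch (in Python, with hypothetical placeholder questions and answers; the study itself used Google Forms rather than code) of how the 2 answers for each question could be placed in random order before being presented to reviewers:

```python
import random

# Hypothetical placeholder data; the real questions and answers appear in the Supplemental Table.
questions = ["Question 1 text", "Question 2 text"]
gpt_answers = ["GPT-generated answer to question 1", "GPT-generated answer to question 2"]
pathologist_answers = ["Pathologist answer to question 1", "Pathologist answer to question 2"]

random.seed(42)  # fixed seed so the blinded ordering can be reproduced

answer_key = {}  # records which source occupies position A/B, for unblinding after scoring
for i, (question, gpt, path) in enumerate(
        zip(questions, gpt_answers, pathologist_answers), start=1):
    pair = [("ChatGPT", gpt), ("Pathologist", path)]
    random.shuffle(pair)  # randomize presentation order for blinding
    answer_key[i] = [source for source, _ in pair]
    print(f"{question}\n  Answer A: {pair[0][1]}\n  Answer B: {pair[1][1]}\n")
```

The saved answer key would allow unblinding after scoring, analogous to how form 2 responses were later matched to their sources.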
The difficulty of each question was assessed from 1 (“too easy for a resident-level exam”) to 5 (“too difficult for a resident-level exam”). The responses were scored from 1 to 5, with 5 representing a perfect answer; overall performance data were also collected. Difficulty-adjusted scores for each individual question were calculated by multiplying the mean score by the mean perceived difficulty level; these per-question values were then averaged to obtain an overall difficulty-adjusted score. Data were collected on Google Forms and compiled in Microsoft Excel. Data were analyzed using GraphPad Prism 7 software, with the appropriate t test, 1-way analysis of variance (ANOVA), or 2-way ANOVA, depending on the dataset. A copy of the questions and their answers from GPT-4, GPT-3.5, and the staff pathologist is provided (Supplemental Table, see supplemental digital content at https://meridian.allenpress.com/aplm in the October 2024 table of contents). As a comparison, other LLMs, namely Bard (Google), Perplexity, and Claude Instant (Anthropic), were also run on the test set and are presented in the Supplemental Table.
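As a worked sketch of the scoring arithmetic and statistical comparisons described above (using hypothetical per-question mean scores and difficulty ratings; the actual analysis was performed in GraphPad Prism 7, and scipy is assumed to be available here):

```python
from statistics import mean

from scipy import stats  # used here in place of Prism for illustration

# Hypothetical per-question mean scores (1-5) and mean perceived difficulty ratings (1-5)
mean_scores_gpt4 = [4.5, 4.0, 4.8, 3.9, 4.2]
mean_scores_pathologist = [3.8, 4.1, 3.5, 4.0, 3.6]
mean_difficulty = [2.5, 3.0, 4.0, 3.5, 3.0]

# Difficulty-adjusted score: mean score x mean perceived difficulty, averaged across questions
adjusted_gpt4 = mean(s * d for s, d in zip(mean_scores_gpt4, mean_difficulty))
adjusted_pathologist = mean(s * d for s, d in zip(mean_scores_pathologist, mean_difficulty))
print(f"Difficulty-adjusted: GPT-4 {adjusted_gpt4:.2f}, pathologist {adjusted_pathologist:.2f}")

# Per-question comparison of 2 answer sources (analogous to a paired t test in Prism)
t_statistic, p_value = stats.ttest_rel(mean_scores_gpt4, mean_scores_pathologist)
print(f"Paired t test: t = {t_statistic:.2f}, P = {p_value:.3f}")

# A 1-way ANOVA across all 3 answer sources could use stats.f_oneway(a, b, c)
```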
RESULTS
Novel Domain-Specific Pathology Questions
A group of 15 international staff pathologists from 9 countries contributed pathology-specific questions. The question topics, which can be divided into the categories outlined in Figure 1, A, align with topics commonly covered in pathology examinations. The questions also covered a number of subspecialty areas in pathology that aligned with the varied experience of the coauthors, as shown in Figure 1, B. Questions had a mean length of 89 characters (range, 51–171 characters).
Performance of GPT-3.5 and GPT-4
GPT-4 produced the longest answers (mean: 1451 characters; range: 383–2652 characters), followed by GPT-3.5 (mean: 880 characters; range: 547–1609 characters). The pathologist had the shortest answers (mean: 109 characters; range: 5–372 characters).
GPT-3.5 received an average score higher than the pathologist in 9 of 15 questions, while GPT-4 scored higher than both GPT-3.5 and the pathologist in 12 of 15 questions (Figure 2, A). Upon aggregation of all scores, GPT-4 performed significantly better than GPT-3.5 (P = .03) and the pathologist (P = .003), with no statistically significant difference between GPT-3.5 and the pathologist (Figure 2, B). To assess whether GPT-3.5 and GPT-4 performed better on more difficult questions, scores were adjusted by perceived difficulty level; after this adjustment, GPT-4 still performed significantly better than the pathologist (P = .002), but not significantly better than GPT-3.5 (Figure 2, C). Answers were also grouped by medical discipline to demonstrate performance across distinct areas of pathology (Figure 2, D).
Identification of “GPT versus Human” Responses by Pathologists
To determine the ability of participants to identify AI-generated pathology answers, participants were asked during the initial survey with GPT-3.5 to predict which of the 2 randomized answers was written by ChatGPT versus the staff pathologist. Of the 142 total entries, the ChatGPT-generated answer was correctly identified 136 times (95.8%) (Figure 3). The 6 incorrect responses were distributed across 5 separate questions.
Overall Assessment of GPT-3.5– and GPT-4–Generated Answers
To compare overall performance of GPT-3.5 and GPT-4, participants were asked to provide their overall impression of GPT-3.5 and GPT-4’s answers using a framework often utilized in residency training programs. Of 13 responses, most pathologists scored GPT-3.5’s answers as meeting expectations (Figure 4). This was repeated with GPT-4–generated answers, and an equal number of pathologists indicated that GPT-4 met or exceeded expectations. A sample of responses from GPT-4, GPT-3.5, and a staff pathologist for question 3 is shown in Table 1.
Participants were also invited to provide qualitative comments on their overall impressions of ChatGPT’s performance. Overall, while GPT-3.5 had an impressive knowledge base, participants felt that its answers were overly detailed and sometimes confidently incorrect (Table 2). Collectively, these data indicate that GPT-3.5 and GPT-4 left overall favorable impressions on participants, with GPT-4 showing a clear and noticeable improvement in performance.
DISCUSSION
In this study, we investigated the performance of GPT-3.5 and GPT-4 on novel pathology questions, measured against a newly practicing staff pathologist. These questions were written at the level of a licensing examination that a senior pathology trainee would be expected to answer. We also assessed whether text generated by GPT-3.5 could be identified as AI-generated. The goal of this work was to assess LLMs in a pathology-specific context and to evaluate their performance on domain-specific pathology questions.
We found that GPT-3.5 performed similarly to a staff pathologist on short-answer pathology examination questions. Although GPT-3.5 scored higher than the staff pathologist on 9 of 15 questions, there was no statistically significant difference in aggregated mean scores, and the scores remained similar after adjusting for difficulty and discipline. Although OpenAI has not publicly released the size or number of parameters of GPT-4 in comparison with GPT-3.5, GPT-4 appears to offer measurable improvements over GPT-3.5 in answering pathology domain-specific questions. A similar trend was observed in a study demonstrating that GPT-4 scored in the top 10% of test takers on a simulated bar examination, compared with the bottom 10% for GPT-3.5.9 This highlights the rapid pace of improvement in LLM performance and underscores the advances GPT-4 achieved over its predecessor, GPT-3.5, within a span of several months.
In our evaluation, GPT-3.5 performed at a level of expertise comparable to that of a practicing pathologist who had recently written their board examination, and GPT-4 consistently performed at a level that matched or exceeded the performance of a resident who has completed pathology training (Figure 2). Although GPT-4 provided longer answers than GPT-3.5, a comprehensive and high-scoring answer extends beyond length alone. A good answer is accurate and relevant to the prompt, concise and free of irrelevant information, clearly organized, and written using appropriate pathology terminology.
With the rise of ChatGPT and other LLMs, there have been concerns about the misuse of AI tools for cheating on academic tests and beyond, and about the potential for misdiagnosis when AI tools are implemented in clinical practice. In our study, we focused on whether responses generated from pathology prompts by AI tools such as ChatGPT are distinguishable from human entries by participants. Our findings suggest that ChatGPT-generated answers are reliably identifiable by participants (when provided with a human answer for comparison). The clues that enabled participants to identify an answer as ChatGPT-generated included unnecessarily lengthy responses containing overly detailed yet irrelevant material, awkwardness of language and readability, and incorrect information in the output (Table 2). As ChatGPT and other LLMs continue to gain popularity, the ability to identify the sources of seemingly natural and coherent text is crucial in safeguarding against the spread of misinformation in academic learning environments.
Scores on professional and academic examinations are widely used as benchmarks to assess the performance of ChatGPT and other LLMs.9 We speculate that the parameters and training material used to train these models are optimized, in part, to maximize performance on standardized tests, including those used in medical examinations. For ChatGPT’s standardized test evaluations, cross-contamination between the evaluation datasets and the pretraining data was measured internally using substring matching; although overall contamination was found to be minimal, this approach has limitations, and residual contamination may have affected the reported results.9 In this study, we generated a novel dataset of pathology-specific questions for analysis from an international group of pathologists. This international collaboration was developed using X (formerly known as Twitter), which, along with other social media platforms, can be a powerful tool for collaborating with international groups for research.17 This approach allowed us to increase the representation of pathologists from around the world.
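As an illustration of the substring-matching idea, the following simplified sketch (with a hypothetical question and corpus; it is not the methodology used by OpenAI, which is described only at a high level in its technical report) flags a question when a long substring of it appears verbatim in a training corpus:

```python
def flag_possible_contamination(question: str, corpus: list[str], n_chars: int = 50) -> bool:
    """Flag a question if any substring of length n_chars appears verbatim in the corpus.

    Simplified sketch of a substring-matching contamination check; production pipelines
    typically normalize text more carefully and sample a few substrings per item rather
    than scanning every position.
    """
    text = " ".join(question.lower().split())  # crude whitespace/case normalization
    corpus_text = " ".join(" ".join(doc.lower().split()) for doc in corpus)
    if len(text) <= n_chars:
        return text in corpus_text
    return any(text[i:i + n_chars] in corpus_text
               for i in range(len(text) - n_chars + 1))


# Hypothetical usage: questions written specifically for this study should not be flagged
training_snippets = [
    "The United States Medical Licensing Examination assesses clinical knowledge and skills."
]
print(flag_possible_contamination(
    "A question newly written by a pathologist for this study.", training_snippets))  # False
```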
This study provides support for LLMs such as ChatGPT as a transformative educational tool in the pathology space. AI applications offer tremendous utility for medical education, such as providing instant access to information, simplifying complex pathology concepts, and providing an interactive and personalized educational experience. Further, medical knowledge, particularly in the pathology field, is proliferating at an exponential rate, which makes it challenging to attain the proficiency required to practice as an independent pathologist. Generative language models such as ChatGPT can provide quick and easily accessible information about various pathology topics, and their accuracy can be improved by training on well-validated medical sources. Newer-generation LLMs also have the capability to provide references for the data used in the response.18 These references can be used to further fact-check data before use in research or clinical applications. With safeguards to ensure accuracy, these LLMs may help provide more detailed or varied descriptions of complex pathology content for trainees that can be effectively integrated into educational curricula. They can also provide narrative feedback and guidance in an intuitive manner. As an example, the question “Why does vimentin have limited use in pathology?” produces an answer highlighting its limited specificity, owing to its wide and variable expression across tissue and tumor types.
Our study has several limitations. Minor variations in the wording and grammar of the questions may have led to different interpretations and responses from the AI applications and the pathologist. Likewise, some terminology may have been pathology-specific, resulting in misinterpretation by the AI program. Despite these potential pitfalls, to maintain uniformity and avoid introducing bias, we decided against additional prompting or retraining of the model, which would likely have enhanced the quality of the output. Moreover, as the answers were evaluated as freeform responses, evaluators may have favored well-written, organized, and more detailed answers, even if there was minimal difference in the quality and relevance of the answers provided. There were also subjective differences in scoring, as our participants were a group of international pathologists who were trained at different institutions, each with their own set of expectations. Further, the accurate identification of ChatGPT-generated responses compared with human responses may have been influenced by response length, as ChatGPT-generated responses were generally lengthier than those from the pathologist. In future work, we will ask the pathologist to provide more detailed, “summary-like” answers to address this potential bias and to enhance the evaluation of machine versus human response identification. Finally, to maintain uniformity, we did not evaluate the ability of ChatGPT to answer multiple-choice questions or to evaluate visual inputs, a feature added in the latest iteration.9 As visual examination of histology slides plays a critical role in pathologic diagnosis, future work will undoubtedly assess the performance of ChatGPT and similar generative language models on a diverse range of pathology examination questions that include visual inputs.
Many LLMs are currently being developed by numerous competitors in this rapidly evolving space. We chose ChatGPT because it has been the most widely studied system to date. However, we explored some of these other platforms and submitted the questions to them (Supplemental Table). A direct comparison of the various LLMs was beyond the scope of this study; given the rapid improvement of these models, it is likely premature to compare the efficacy of platforms at different levels of maturity. As these platforms continue to evolve, LLMs dedicated to medical information may greatly surpass these first generations of generic LLMs.
The field of natural language processing and LLMs has made considerable progress and is well poised to reshape the way pathology is learned and practiced. Our study leveraged a professional network of pathologists on X (formerly Twitter) to evaluate AI applications such as ChatGPT against a human observer, and the models performed impressively. We anticipate that as LLMs continue to mature, there will be rich opportunities for medical AI applications to alleviate the burden of increasing complexity in medicine and to offer tremendous educational opportunities for pathology trainees.
References
Author notes
Supplemental digital content is available for this article at https://meridian.allenpress.com/aplm in the October 2024 table of contents.
Wang and Lin are both considered first authors.
Competing Interests
Chen is an employee of Need Inc and owns Need Inc equity. The other authors have no relevant financial interest in the products or companies described in this article.