Artificial intelligence algorithms hold the potential to fundamentally change many aspects of society. Application of these tools, including the publicly available ChatGPT, has demonstrated impressive domain-specific knowledge in many areas, including medicine.
To understand the level of pathology domain-specific knowledge for ChatGPT using different underlying large language models, GPT-3.5 and the updated GPT-4.
An international group of pathologists (n = 15) was recruited to generate pathology-specific questions at a similar level to those that could be seen on licensing (board) examinations. The questions (n = 15) were answered by GPT-3.5, GPT-4, and a staff pathologist that recently passed their Canadian pathology licensing exams. Participants were instructed to score answers on a 5-point scale and to predict which answer was written by ChatGPT.
GPT-3.5 performed at a similar level to the staff pathologist, while GPT-4 outperformed both. The overall score for both GPT-3.5 and GPT-4 was within the range of meeting expectations for a trainee writing licensing examinations. In all but one question, the reviewers were able to correctly identify the answers generated by GPT-3.5.
By demonstrating the ability of ChatGPT to answer pathology-specific questions at a level similar to (GPT-3.5) or exceeding (GPT-4) a trained pathologist, this study highlights the potential of large language models to be transformative in this space. In the future, more advanced iterations of these algorithms with increased domain-specific knowledge may have the potential to assist pathologists and enhance pathology resident training.
Supplemental digital content is available for this article. See text for hyperlink.
Wang and Lin are both considered first authors.
Chen is an employee of Need and owns Need equity. The other authors have no relevant financial interest in the products or companies described in this article.