Context: Generative artificial intelligence (GAI) is a promising new technology with the potential to transform communication and workflows in health care and pathology. Although new technologies offer advantages, they also come with risks that users, particularly early adopters, must recognize. Given the fast pace of GAI developments, pathologists may find it challenging to stay current with the terminology, technical underpinnings, and latest advancements. Building this knowledge base will enable pathologists to grasp the potential risks and impacts that GAI may have on the future practice of pathology.
Objective: To present key elements of GAI development, evaluation, and implementation in a way that is accessible to pathologists and relevant to laboratory applications.
Data Sources: Information was gathered from recent studies and reviews from PubMed and arXiv.
Conclusions: GAI offers many potential benefits for practicing pathologists. However, the use of GAI in clinical practice requires rigorous oversight and continuous refinement to fully realize its potential and mitigate inherent risks. The performance of GAI is highly dependent on the quality and diversity of the training and fine-tuning data, which can also propagate biases if not carefully managed. Ethical concerns, particularly regarding patient privacy and autonomy, must be addressed to ensure responsible use. By harnessing these emergent technologies, pathologists will be well positioned to continue as leaders in diagnostic medicine.
“It is clear to me that AI will never replace physicians—but physicians who use AI will replace those who don’t.”
—Jesse Ehrenfeld, MD, MPH, former president of the American Medical Association1
Generative artificial intelligence (GAI) represents a promising new frontier in artificial intelligence (AI), poised to enhance productivity and revolutionize communication and workflows in health care and science. As the name implies, GAI is a type of AI that generates new content, spanning text, speech, images, videos, code, and even music. OpenAI’s ChatGPT is arguably the most well-known GAI-powered application for text generation. There is an increasing number of other readily available GAI-powered applications and tools, such as Adobe’s Firefly2 for images, Meta’s Voicebox3 for audio clips, and Beautiful.ai’s DesignerBot4 for slide presentations. The rapid pace of GAI advancement can pose a challenge for pathologists striving to stay abreast of the latest advances and understand the implications for their work. To address this, experts from the College of American Pathologists, the Digital Pathology Association, and the Association for Pathology Informatics have collaborated on this special collection of articles in the Archives of Pathology & Laboratory Medicine. These articles aim to elucidate the benefits and limitations of these tools and their practical relevance for practicing pathologists.
This introductory manuscript provides an overview of GAI as it pertains to pathology, encompassing a discussion of pertinent technologies and historical milestones that have shaped the field’s current landscape. We provide strategies for evaluating emerging GAI models, discuss the implications and potential risks inherent in these technologies, and propose measures to mitigate risks associated with their deployment. Related articles in this series explore GAI’s applications within the realms of anatomic pathology, clinical pathology, and pathology education. Recognizing the pivotal role of pathologists as custodians of pathology and laboratory data, a dedicated manuscript addresses the ethical dimensions of GAI use in clinical practice and scholarly activities.
GAI has the transformative potential to reshape operational workflows across multiple industries, including the practice of medicine. However, it is crucial to ensure that these models are properly supervised and regulated to safeguard patient well-being, privacy, and autonomy, and to minimize biases and errors. Pathologists must guide the responsible integration of these technologies into our specialty. This requires a solid understanding of their capabilities and limitations, coupled with a pragmatic, common-sense approach to ensure continued thoughtful and responsible progress. By harnessing these emergent technologies, pathologists will be well positioned to maintain their leadership in diagnostic medicine.
METHODS AND DATA SOURCES
Information was gathered from recent studies and reviews from PubMed, along with scholarly articles from arXiv, an open-access research archive that covers computer science and other related fields. Although arXiv lacks a peer-review process, it offers companies such as Google and Meta a platform for documenting internal research and sharing it with the broader academic and technologic communities. OpenAI's GPT-4 (via ChatGPT) and Google's Gemini were used to ensure a consistent tone across the manuscript and improve readability for the general pathologist audience. We instructed the GAIs to suggest edits in the context of the paper and not change content or ideas. The authors take full responsibility for the content of the manuscript, including the parts produced by GAI.
KEY DEFINITIONS AND CONCEPTS
Before delving into the history and applications of GAI, it is important to define several fundamental terms, beginning with AI, machine learning (ML), deep learning (DL), natural language processing (NLP), and computer vision (CV). These definitions will help the reader understand how GAI represents a unique subset of these broader technologies. The relationship among these entities is shown in Figure 1. Additional technical terms and key algorithm names will be explained in relation to the history of GAI and are summarized in the Supplemental Table (see supplemental digital content, containing 1 table and 3 figures, at https://meridian.allenpress.com/aplm in the February 2025 table of contents).
Figure 1. Relationship of generative artificial intelligence (AI) to machine learning, deep learning, natural language processing, and computer vision under the broad umbrella of AI.
Artificial Intelligence
AI is a broad field within computer science focused on creating software or machines able to perform tasks typically associated with human intelligence, such as understanding natural language text, recognizing patterns in images, and making predictions from data.5 The complexity of AI tools ranges from prewritten “if-then-else” rules to advanced programs that can optimize their own performance. The diverse array of AI applications covers areas such as robotics, chatbots, translation services, facial recognition, pattern detection, and medical image analysis. Today’s AI is categorized as artificial narrow intelligence, also referred to as weak AI, capable of handling specific tasks independently but without the ability to learn outside a predefined scope (ie, without exhibiting true humanlike intelligence). In contrast, artificial general intelligence, or strong AI, is the theoretical concept of a system capable of replicating human-level intelligence across diverse activities and fields of study. General AI systems are envisioned to understand, learn, and use knowledge across multiple domains similarly to humans. However, achieving general AI remains a significant challenge, involving complex issues related to reasoning, common sense, and contextual comprehension.
Machine Learning
ML is a subset of AI that develops machines capable of learning autonomously and directly from data rather than by explicit programming. An ML algorithm consists of rules or instructions to be applied to data to accomplish a specific function. ML developers influence the speed and quality of ML by modifying external configuration settings called hyperparameters before an algorithm’s training process begins. In training, data are fed to the algorithm, allowing the computer to discover patterns, correlations, and boundaries within that data set. Parameters are internal variables whose weights the algorithm self-adjusts to enhance the accuracy of its predictions. An ML model is the resultant program, with parameter values derived from a data set, that can make decisions or predictions on new data (eg, spam detector). ML models can subsequently be taken offline to undergo further training with additional data sets for related and/or domain-specific use cases, a process referred to as fine-tuning. Harrison and colleagues6 and Shafi and Parwani7 recently provided thorough analyses of the broad applications of AI and ML for pathology.
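To make these terms concrete, the brief Python sketch below (an illustrative example using the open-source scikit-learn library and a synthetic data set, not a workflow referenced by this article) shows hyperparameters being set before training and parameters, here the model's coefficients, being learned from the data:

```python
# Minimal sketch: hyperparameters are chosen before training; parameters are learned from data.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled laboratory data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameters (external configuration settings set by the developer).
model = LogisticRegression(C=1.0, max_iter=200)

# Training: the algorithm self-adjusts its internal parameters (weights).
model.fit(X_train, y_train)

print("Learned parameters (weights):", model.coef_.round(2))
print("Accuracy on new, unseen data:", round(model.score(X_test, y_test), 3))
```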
ML algorithms can be further classified by their learning methods: supervised, unsupervised, semisupervised, and reinforcement learning, each suitable for different tasks and data types. Supervised learning uses prelabeled data sets to train algorithms to classify data or predict results with known inputs and outputs, allowing ML models to achieve high accuracy given enough data. Yet creating and labeling large data sets is labor-intensive, and there is a risk that the model might become too focused on particular aspects of the training data and fail to generalize well to new, unseen data. Unsupervised learning algorithms organize and interpret data without predefined guidance, discovering patterns, trends, or anomalies that may not be immediately evident to human observers. They can split unlabeled data into similar groups (clustering), identify relationships between variables in a data set (association), and eliminate random or redundant variables (dimensionality reduction). Although easier to obtain, unlabeled data can make the interpretation of outputs more challenging. Semisupervised learning blends a limited amount of labeled data with a larger set of unlabeled data, improving accuracy and generalization without the need for extensive labeling. In reinforcement learning, the algorithm learns by trial and error with rewards and/or penalization to reach the best possible result.
An example of a supervised learning model in pathology would use annotated data to recognize cancer areas versus benign areas. The model associates the features in the image with the associated label, allowing it to make accurate predictions on new, unlabeled data. An unsupervised model in pathology might create hierarchical clusters of digital pathology images based solely on their morphologic patterns, using features such as mitoses, necrosis, and staining artifacts. The resulting arrangement could provide insights on tumor classification schemata.
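As a simplified illustration of the unsupervised approach (a sketch with fabricated feature values standing in for measurements an image analysis pipeline might produce), a clustering algorithm can group cases by morphologic features without any diagnostic labels:

```python
# Minimal sketch (synthetic numbers, not real pathology features): unsupervised
# clustering groups cases by similarity without any labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Pretend each row summarizes one slide: [mitoses per mm2, % necrosis, stain intensity]
group_1 = rng.normal(loc=[1.0, 2.0, 0.5], scale=0.3, size=(30, 3))
group_2 = rng.normal(loc=[8.0, 25.0, 0.9], scale=0.5, size=(30, 3))
features = np.vstack([group_1, group_2])

# The algorithm discovers 2 groups on its own; a pathologist then interprets them.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print("Cluster assignments:", clusters)
```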
A foundation model is a more recent type of ML model, trained broadly on a vast range of unlabeled data with potential applications to many domains. Foundation models are trained on terabytes of data and contain millions to billions of optimized parameters. The foundation model serves as a reusable base upon which other models can be developed and specifically tailored for distinct tasks, eliminating the necessity of creating new models from the ground up.
Deep Learning
DL is a subset of ML algorithms that uses multilayered (ie, “deep”) neural networks to analyze and learn from data. The inner layers between the input and output, called hidden layers, contain nodes acting in an analogous way to biological neurons. Each node, or perceptron, processes inputs from the previous layer and sends outputs to the next, independently adjusting its parameters to minimize prediction errors. A greater number of hidden layers and parameters helps the network recognize more intricate patterns, albeit at higher costs for training time and energy use.
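The notion of hidden layers and nodes can be sketched in a few lines of Python (illustrative only; the weights below are random rather than learned, and real DL models contain many layers and millions of parameters):

```python
# Toy forward pass through one hidden layer (random, not learned, weights).
import numpy as np

def relu(x):
    return np.maximum(0, x)  # a simple activation function applied at each node

rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, 0.1])            # 3 input features (eg, pixel intensities)
W_hidden = rng.normal(size=(3, 4))       # weights into 4 hidden nodes
W_out = rng.normal(size=(4, 1))          # weights into 1 output node

hidden = relu(x @ W_hidden)              # each hidden node combines all inputs
output = 1 / (1 + np.exp(-(hidden @ W_out)))  # sigmoid squashes the score to 0-1
print("Predicted probability:", round(output.item(), 3))
```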
DL models have been used successfully in pathology to analyze images, detect and differentiate cell and tissue types, assist in tumor grading, and even predict disease progression and patient outcomes.8,9 Early versions of neural networks were used for the detection of abnormal cells in cervical smears. In 1998, the US Food and Drug Administration (FDA) approved the FocalPoint Slide Profiler (Becton Dickinson, Inc) as a primary automated screener for cervical smears, followed by approval in 2002 for use with SurePath slides. In 2003, the FDA approved the ThinPrep Imaging System (Hologic, Inc) as a primary screener for ThinPrep Papanicolaou slides.10 Both systems used early neural networks to detect a significantly higher proportion of cervical abnormalities than was possible with unassisted, conventional screening. Paige Prostate was the first AI-based surgical pathology product to receive de novo approval from the FDA in 2021.11 It uses a DL model to assist pathologists in detecting, grading, and quantifying prostate cancer in digitally scanned slide images of prostate biopsies for assistance and quality assurance purposes (not for screening or primary diagnosis).
Generative AI
Compared with ML and DL models that identify patterns from data sets to make predictions or classifications, GAI uses neural networks to create entirely original and diverse data that are representative of the initial training data set. It is important to remember that the primary objective of many GAI systems is to generate plausible responses, and not necessarily to be factually accurate. Approaches for producing GAI content depend on the type and complexity of the data, desired objectives, and available computational and human resources. Model developers may use various architectures such as variational autoencoders (VAEs), generative adversarial networks (GANs), and transformers (see Supplemental Table). Descriptions of these model types are provided in the Historical Evolution of GAI section below. Although VAEs and GANs laid the groundwork for modern GAI, transformers have become the dominant architecture in many state-of-the-art GAI models, including the generative pretrained transformer (GPT) series and BERT (Bidirectional Encoder Representations From Transformers). These varying approaches all share a common desirable trait in that they can generate content from user guidance that is readily provided (eg, from natural human language or submitted images), thus making them accessible to most users, including pathologists without a background in informatics.
Multimodal GAI refers to systems that combine models handling different data types (eg, text, images, audio, and video) to perform complex tasks. These systems aim to understand and generate responses across different types of data, mimicking humanlike comprehension and interaction capabilities.
Natural Language Processing
NLP is a branch of AI that enables computers to understand and manipulate natural human language by text or speech, along with its intent and sentiment. A key task in NLP is tokenization, where text is broken down into smaller units called tokens (ie, phrases, words, or subwords) that can be analyzed for their frequency, sequence, and relationships with other tokens. Another is positional encoding, which helps models understand the order of tokens in a sequence. NLP models, such as autoregressive models, create text by leveraging probabilities derived from a text corpus to predict the most likely subsequent word based on what preceded it. This method is useful for tasks like text completion, language translation, and summarization. Prompts are user-provided cues given to an NLP model to guide output generation. The art of crafting effective prompts, known as prompt engineering, has emerged as a critical skill in eliciting optimal responses from these systems.
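The two core ideas in this paragraph, tokenization and next-word prediction from corpus probabilities, can be illustrated with a toy Python sketch (naive whitespace tokenization and simple word-pair counts, far simpler than any production language model):

```python
# Toy sketch: tokenization plus next-word prediction from word-pair (bigram) counts.
from collections import Counter, defaultdict

corpus = "the tumor is malignant . the tumor is benign . the margin is negative ."
tokens = corpus.split()  # naive whitespace tokenization; real systems use subword tokens

# Count how often each word follows another.
following = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    following[current][nxt] += 1

# Autoregressive prediction: estimate the probability of each next token given the last one.
prompt = ["the", "tumor", "is"]
counts = following[prompt[-1]]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word} | 'is') = {count / total:.2f}")
```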
Computer Vision
CV is a field of AI that uses computers to extract meaningful information from digital images, videos, and other visual inputs. Convolutional neural networks (CNNs) are a type of neural network architecture used in CV that hierarchically extract image features to recognize objects, classes, and categories. CNNs sweep filters (convolutions) across an input image to identify spatial patterns like edges, textures, and shapes and create a feature map representing them. CV can be used for detecting faces, optical characters, objects, scenes, and actions and for augmenting or enhancing visual data when defects or issues are present. Importantly, GAI can be applied to both NLP and CV together in tasks such as captioning images or visual question tasks. For pathologists, this might allow future GAI-enabled digital pathology systems to respond to prompts to identify or describe microscopic features of interest, such as tumor cells or infectious organisms.
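A toy Python sketch of the convolution operation (a fabricated 6 x 6 image and a classic edge-detection filter, not a production CV pipeline) illustrates how a filter swept across an image yields a feature map:

```python
# Toy sketch: sweeping a 3x3 filter (convolution) across an image to build a feature map.
import numpy as np

image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # a simple vertical edge between dark and bright halves
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])          # a classic vertical edge detector

h, w = image.shape
feature_map = np.zeros((h - 2, w - 2))
for i in range(h - 2):
    for j in range(w - 2):
        patch = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(patch * kernel)  # filter response at this position

print(feature_map)                        # strong responses mark where the edge lies
```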
HISTORICAL EVOLUTION OF GAI
The field of GAI has undergone a remarkable evolution during the past decades, progressing from simple pattern recognition to the generation of humanlike content (Figure 2). The AI journey traces its roots back to 1950, when the British mathematician and logician Alan Turing12 published his influential work “Computing Machinery and Intelligence,” wherein he pondered the question, “Can machines think?” From this seminal paper emerged the concept known as the Turing test, a benchmark for evaluating a machine’s capacity to demonstrate intelligence indistinguishable from that of a human.13 Turing’s work paved the way for subsequent milestones in AI,6,7,14 including significant advancements in language models (LMs).
Figure 2. Historical timeline of notable events in the development of generative artificial intelligence (GAI). 1950: Turing test introduced for demonstrating an early form of artificial intelligence. Artificial intelligence becomes established as a scientific discipline. 1960s: Creation of ELIZA, one of the first functioning GAI chatbots. 1982: Recurrent neural networks (RNNs), which consider prior information to generate sentences, are developed. 1997: Long short-term memory (LSTM), a type of RNN with a more complex architecture that efficiently processes long data sequences and identifies patterns, is created. 2013: Introduction of variational autoencoders (VAEs), a new generative model. 2014: Development of generative adversarial networks (GANs), marking a significant breakthrough in GAI by being one of the first models to generate high-quality images. 2017: The transformer deep learning architecture is proposed. 2018: OpenAI introduces generative pretrained transformers (GPTs), a significant advancement in large language models. 2021: Google introduces vision transformers (ViTs) for image recognition. 2023: Release of GPT-4 in March 2023, which is capable of generating extensive texts up to 25 000 words.
In the late 1950s, GAI models relied on probabilistic methods such as hidden Markov models and Gaussian mixture models to generate outputs, albeit in a limited way. One of the earliest functioning examples of GAI, the ELIZA chatbot, was developed in the 1960s by Massachusetts Institute of Technology scientist Joseph Weizenbaum.15 This early example of NLP could mimic human conversations, providing scripted answers based on pattern matching and substitution of keywords it was programmed to recognize.16 Though it was a simple rule-based system, it laid the groundwork for future advancements in NLP.
Research gained considerable momentum in the 1980s with improved computational abilities and the introduction of artificial neural networks. Recurrent neural networks (RNNs) became popular in the 1980s as a way to process data sequentially in combination with some previously saved information from prior inputs via a feedback loop, making them well suited for contextual tasks. However, their capacity to handle long sentences was limited by constraints in their long-term memory. This limitation was addressed in 1997 with the introduction of long short-term memory networks, capable of retaining relevant information during extended sequences.17
By the 2000s and 2010s, advancements in data availability and computational power made DL more feasible. Among key developments in neural network architectures were autoencoders. Autoencoders consist of 2 main components: an encoder and a decoder. Encoders convert inputs to mathematical representations of the contained concepts, and decoders use these representations to generate new data that resemble the original input. VAEs, a specific type of autoencoder introduced in 2013,18 added a probabilistic element to this process, enabling the generation of diverse and varied data samples.
In 2014, an influential paper on GANs19 positioned 2 neural networks against each other in a zero-sum game, one creating synthetic data and the other evaluating those samples against real data, to construct new images that closely resembled the training data without being identical. GANs and other generative models continued to evolve using techniques for better feature extraction and more realistic outputs, with potential uses in creating synthetic tissue images for pathology. Advances in NLP also arrived with gated recurrent units, a type of RNN using a gating mechanism to control the flow of information (ie, input or forgetting certain features).20 Gated recurrent units exhibited superior long-term memory and contextual understanding (ie, relevance of certain words in sentences), enabling the neural network to learn grammatical structures autonomously.
Just a few years later, the introduction of the transformer architecture based on a parallel multi-head “attention mechanism” precipitated a paradigm shift in NLP. As described in the groundbreaking 2017 publication “Attention Is All You Need” by researchers at Google, attention refers to the ability of this model to simultaneously determine the relative importance of all sequence components without regard to their recency, in contrast to RNNs and long short-term memory networks21 (Supplemental Figure 1). This helps the model assign different weights to various elements based on their relevance to a task and draw fine distinctions and nuances between words or concepts (eg, questions related to cancer registry reporting versus diagnostic standardization). Self-attention is a specialized form of attention used within transformers to understand the relationships between all elements of an input sequence, improving the model’s ability to capture context and meaning. Transformers excel at handling distant but related contextual words, mirroring human language processing and outperforming previous models in tasks such as language translation.
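For readers curious how attention is computed, the following Python sketch implements scaled dot-product self-attention on random toy vectors (illustrative only; a trained transformer learns the projection weights rather than drawing them at random):

```python
# Toy sketch of scaled dot-product self-attention (random vectors, not a trained model).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 4, 8                           # 4 tokens, each represented by 8 numbers
X = rng.normal(size=(seq_len, d))           # token embeddings for, eg, a short sentence

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v         # queries, keys, values

scores = Q @ K.T / np.sqrt(d)               # every token scored against every other token
weights = softmax(scores, axis=-1)          # relative importance, regardless of position
output = weights @ V                        # each token becomes a relevance-weighted mix

print(weights.round(2))                     # each row sums to 1: how much a token attends to the others
```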
Transformers ushered in the era of large language models (LLMs), trained on vast and diverse data sets from the internet, including books, articles, websites, and other text-rich sources. Such data sets provide LLMs with exposure to a wide range of topics, writing styles, and linguistic nuances, the details of which are captured by millions to billions of parameters within the networks to produce coherent and contextually appropriate responses. Notable examples include Google’s BERT, with an encoder-only transformer architecture,22 and OpenAI’s GPT series, which uses a decoder-only transformer architecture.23 Recently, the landscape has expanded to include open-source foundation models like Meta AI’s LLaMA (Large Language Model Meta AI) that can be fine-tuned by researchers for domain-specific applications.24
In 2021, Google researchers achieved another milestone by introducing the concept of vision transformers (ViTs).25 ViTs extend the transformer architecture from words to image recognition tasks, breaking down images into small rectangular pieces, called patches, and projecting each patch into a numerical representation that the network can process like a token. The introduction of ViTs has great implications for visual fields like pathology and radiology, as this allows the extraction of information from images without any manual labeling. This might include the ability to automatically identify the spatial relationship of the tumor to the tissue edge or detect the presence of tumor cells within a lymphovascular structure. ViTs, often combined with textual models to be multimodal systems, have largely supplanted CNNs and GANs in numerous image analysis applications.
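The patch idea can be illustrated with a short Python sketch (a random array standing in for an image tile, and a random projection standing in for the learned patch embedding of a real ViT):

```python
# Toy sketch: cutting an image into patches and embedding each patch as a "token."
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))                    # stand-in for one grayscale image tile
patch = 16                                      # 16 x 16 pixel patches

patches = [
    image[i:i + patch, j:j + patch].ravel()     # flatten each patch to a vector
    for i in range(0, 64, patch)
    for j in range(0, 64, patch)
]
patches = np.stack(patches)                     # 16 patches x 256 pixel values each

embed = rng.normal(size=(patch * patch, 128))   # learned in a real ViT; random here
tokens = patches @ embed                        # each patch becomes a 128-number "word"
print(tokens.shape)                             # (16, 128): ready for the transformer layers
```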
OVERVIEW OF GAI MODEL LIFE CYCLE STAGES
Model Development
Although a deep understanding of the technical intricacies of GAI model development may not be necessary for all pathologists, a general awareness of the process can empower them to critically evaluate the capabilities and limitations of these models, fostering informed decision-making in their adoption and use. Pathologists may also participate with model developers in some components of the training and/or fine-tuning process.
The first step in training a model involves compiling a comprehensive and sufficiently large data set, with enough real-world variability pertinent to the task at hand. This could encompass a large assortment of digital pathology images collected from multiple institutions at several time points, or a sizable archive of laboratory test results with patient demographics. Data cleaning, which could involve removing irrelevant or erroneous data or performing image normalization, is crucial for maintaining data set consistency and quality. Choosing the right model architecture and determining the appropriate hyperparameters26,27 are essential next steps.
Training comprises 2 phases: pretraining and fine-tuning (Supplemental Figure 2). In the initial training phase, the model learns to generate new data based on a large input data set. There is typically an iterative process where the model parameters are adjusted to reduce the errors between the model’s predicted output and the actual output. This foundational training with unsupervised or self-supervised learning techniques helps the model understand patterns and assign relevance to different features, which then allows it to generate plausible and novel outputs based on probabilities in the input data set.
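The iterative adjustment described above can be illustrated with a toy Python sketch (a single-parameter model and fabricated data; real pretraining adjusts billions of parameters with far more sophisticated optimizers):

```python
# Toy sketch: one parameter is nudged repeatedly to shrink the gap between
# predicted and actual outputs (illustrative data only).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (input, actual output) pairs; true slope ~ 2
weight = 0.0                                   # the model's single parameter, initially wrong
learning_rate = 0.05

for step in range(200):
    # Gradient of the mean squared error with respect to the weight.
    gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
    weight -= learning_rate * gradient         # adjust the parameter to reduce the error

print("Learned weight:", round(weight, 2))     # close to 2, matching the pattern in the data
```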
Fine-tuning builds upon the foundation established during pretraining by using supervised learning techniques and high-quality, curated data sets. These data sets are often meticulously labeled by human reviewers following specific instructions to generate desired outputs. The model undergoes fine-tuning using these instruction-output pairs, running many evaluations and receiving continuous feedback in the form of corrected desired responses. In this process, developers must ensure that they do not inadvertently compromise the performance of pretrained models when training on new data overwrites previously learned knowledge, a danger referred to as “catastrophic forgetting.”28
To further optimize the model’s output and align it with human preferences (intent alignment), reinforcement learning with human feedback can be used.29 This technique involves generating multiple outputs for a given input, which are then ranked by human reviewers. The model receives a “reward” for producing the most favored output. Some models use a 2-reward system, one based on helpfulness and the other based on safety, and a linear combination of the 2 scores is used to decide the final output. This iterative process is repeated multiple times to refine the model’s performance. It is worth noting that the quality of data used for fine-tuning is paramount, often outweighing the sheer quantity of data.
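The 2-reward idea can be sketched as follows (fabricated helpfulness and safety scores and arbitrary weights; real systems learn reward models from human rankings):

```python
# Toy sketch: combining helpfulness and safety rewards to rank candidate outputs.
candidates = {
    "response_a": {"helpfulness": 0.9, "safety": 0.4},   # detailed but risky wording
    "response_b": {"helpfulness": 0.7, "safety": 0.95},  # slightly less detailed, safer
}

w_help, w_safe = 0.5, 0.5   # weights of the linear combination (illustrative values)

def combined_reward(scores):
    return w_help * scores["helpfulness"] + w_safe * scores["safety"]

best = max(candidates, key=lambda name: combined_reward(candidates[name]))
for name, scores in candidates.items():
    print(name, round(combined_reward(scores), 3))
print("Preferred output:", best)
```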
Once a model has been trained and fine-tuned, it may be released to the public or to specific domains, allowing users to further customize it for their particular needs using their own data. This adaptability is one of the key strengths of GAI, enabling it to be applied to a wide range of tasks and domains.
Testing and Evaluation
Unlike traditional AI models used for classification tasks where the output is well-defined and limited (eg, malignant or benign), GAI models generate a wide array of open-ended outputs such as text, images, or even audio. Evaluating GAI outputs is challenging because of subjectivity, as multiple valid responses depend on context and user preferences. It is crucial to assess meaning and context beyond grammar, ensuring outputs accurately reflect intent. Balancing creativity and control varies by application—diversity is desired in some cases but not in critical tasks for clinical care. As such, human evaluators are vital for nuanced feedback on the coherence, relevance, fluency, suitability, bias, and fairness of GAI model outputs.
Evaluating GAI outputs typically involves a combination of automated metrics, human evaluation, and domain-specific strategies. Evaluation frameworks should be tailored to the unique features of a GAI model and the needs of each application. As researchers continue to develop new measurement techniques and benchmarks for GAI models, this section discusses some current approaches in addressing these complexities.
Precision-Based Metrics
These metrics, like BLEU (bilingual evaluation understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation), quantify the similarity between machine-generated text and human-provided reference text. Although useful, they have limitations and do not always capture the full essence of what makes a response “good.”
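As a simplified illustration of how such overlap metrics work (a bare-bones unigram precision, much simpler than full BLEU or ROUGE implementations), the sketch below counts how many generated words also appear in a reference text:

```python
# Simplified sketch of a precision-style overlap score (much simpler than real BLEU/ROUGE).
from collections import Counter

reference = "negative for malignancy in all three biopsy specimens".split()
generated = "all three biopsy specimens are negative for tumor".split()

ref_counts = Counter(reference)
overlap = sum(min(count, ref_counts[word]) for word, count in Counter(generated).items())
precision = overlap / len(generated)
print(f"Unigram precision: {precision:.2f}")   # 1.0 would mean every generated word appears in the reference
```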
Evaluating Structured Outputs
When GAI models produce structured outputs like code or database queries, traditional metrics can be leveraged to assess their correctness and efficiency.
Human Evaluation
Human evaluators provide qualitative feedback on various aspects of the generated outputs, offering a more nuanced perspective than automated metrics alone.
Measuring Human Correction Effort
This approach quantifies how much effort it takes for a human to correct or improve the model’s output, providing insights into its overall quality.
Using Another LLM as a Judge
In this strategy, another LM acts as an evaluator, comparing the outputs of the model being tested with those of a reference model.
Perplexity
Perplexity is a performance metric that can evaluate how well an LM predicts the next word in a sentence. A lower perplexity score means the model is better at making predictions. In simple terms, it measures how “confused” the model is when trying to predict what comes next in a sequence of words.
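Mathematically, perplexity is the exponential of the average negative log-probability that the model assigns to each observed token; the toy Python sketch below (fabricated probabilities) shows the calculation and why a “confused” model scores higher:

```python
# Toy sketch: perplexity from the probabilities a model assigns to each observed token.
import math

# Hypothetical probabilities assigned by two models to the actual next words in a sentence.
confident_model = [0.60, 0.50, 0.70, 0.80]
confused_model = [0.10, 0.05, 0.20, 0.15]

def perplexity(token_probs):
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

print("Confident model perplexity:", round(perplexity(confident_model), 2))  # lower = better
print("Confused model perplexity:", round(perplexity(confused_model), 2))    # higher = more 'confused'
```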
Red Teaming and Blue Teaming
These complementary approaches involve proactively identifying vulnerabilities and attempting to induce failures in the GAI model (red teaming) and then implementing measures to make it more robust and secure (blue teaming).30
By using a combination of these strategies, researchers and practitioners can gain a deeper understanding of the strengths and weaknesses of GAI models, paving the way for the development of more reliable and trustworthy AI systems in the future.
Validation and Verification
It is essential to confirm that any GAI model deployed in a clinical setting is fit for purpose and fit for use, ensuring that it meets the specific needs and requirements of the health care environment. This involves careful validation and verification processes. The Clinical Laboratory Improvement Amendments (CLIA) define verification as the process of confirming that a laboratory can achieve the performance specifications claimed by the manufacturer of an FDA-cleared or FDA-approved test in its clinical testing environment. Validation, on the other hand, is a more extensive process used when a laboratory develops a new test (laboratory-developed test) or significantly modifies an existing FDA-cleared or FDA-approved test. In validation, the laboratory is responsible for establishing and documenting the performance characteristics of the test. This includes defining the test’s accuracy, precision, analytical sensitivity, analytical specificity, and other key metrics to ensure that the test performs as intended in the clinical setting.
However, the FDA’s experience with generative AI, particularly in determining the correct sample size for validating models, remains limited. As of October 19, 2024, the FDA has not yet authorized a device that uses generative AI or artificial general intelligence or is powered by LLMs.31 GAI’s “black box” nature and the complexity of its algorithms pose significant challenges in establishing transparent and reliable validation protocols. The agency’s traditional methodologies for sample size determination may not fully align with the dynamic and evolving needs of GAI systems. Therefore, the FDA is still in the early stages of developing robust frameworks and guidelines to effectively assess and validate these advanced models, ensuring they meet the necessary standards for clinical use.
That said, validation of GAI models may require a much larger representative sample set than what is currently being used for medical devices not using GAI. Specifically, the data set will need to be of adequate size to display real-world variation, artifacts, and edge cases. For comparison, in 2022, Ebrahimian et al32 found that among the 66 identified clinical AI (but non-GAI) algorithms reviewed by the FDA, 45 (68%) had a total sample size of fewer than 500 patients, 10 (15%) had between 500 and 1000 patients, 9 (14%) had between 1000 and 4500 patients, and 2 (3%) had more than 4500 patients.
Any future GAI-affected workflow will likely need a justified definition of an adequate sample size of unknowns advised by a qualified statistician to demonstrate the system’s performance with local data. Diversity within the evaluation data should reflect the local laboratory setting, with potential enrichment for rare entities. Interinstitutional collaboration, synthetic data, and Bayesian or frequentist methods may be useful to address the paucity of specific samples.33 Consideration will need to be given to how ground truth is assigned for evaluation purposes, including the use of outcomes data. For accurate evaluation of the input data, local expressions of confidence may need to be examined and well understood.34–38 Strategies for dealing with disagreements among experts should be outlined as well for GAI validation or verification.39,40
Ongoing Use and Monitoring
Once a GAI application is deployed in a clinical setting, it must be carefully monitored using quality management principles already in use by clinical laboratories, including those required by regulatory and accreditation requirements (eg, International Organization for Standardization [ISO] 15189). Quality management in the laboratory is a set of coordinated activities to support and make improvements in the accuracy and reliability of test results that meet the clinical needs of patients and providers. A similar program is needed to ensure the reliability, validity, and safety of GAI outputs, as well as to monitor the trust and confidence of users in these models after deployment.
ISO/International Electrotechnical Commission [IEC] 42001 was published in 2023 as the first certifiable standard for implementation of AI management systems.41 ISO/IEC 42001 is an industry-agnostic standard that places all AI-related projects under the responsibility of organizations for governance and life cycle management. It requires organizations to demonstrate an understanding of their intended purpose for the GAI system, including the domain, application context, and needs of interested parties, and commit to the necessary leadership, plans, resources, and communications to achieve stated results, continuously improve, and mitigate undesired effects. Organizations should create policies for responsible and ethical use of the GAI system, perform impact assessments, understand the GAI system life cycle stages and relevant tasks (Table), determine data needs, and set up auditing and incident reporting tools.
Table. Responsible Development and Implementation of Generative Artificial Intelligence (GAI) Systems by Organizations
Among the requirements from ISO/IEC 42001, laboratories and health care organizations are likely to be challenged with having the necessary expert technical and organizational resources for managing the GAI system life cycle and preparing internal data sets to assess the performance of GAI systems locally for their intended behavior.42 Because of these considerations, health care organizations may need to step into a big data custodian role and have in place sound data governance policies and reliable methods for data acquisition, data quality assessment, and data cleaning or preparation for the requirements of GAI and other AI systems. In addition, GAI introduces additional complexities because of the possibility of poor prompting, false confidence in outputs, or misinterpretation of outputs. End-user training and ongoing oversight of participant interactions are critically important.
In 2023, the National Institute of Standards and Technology (NIST) released its voluntary AI risk management framework, which refers to “govern-map-measure-manage” as a process to mitigate risks and promote responsible use of AI systems.43 Governance is designed as a cross-cutting function that informs the other 3 functions, establishing the culture, requirements, accountability, workforce, policies, and procedures for oversight of GAI systems. The map function identifies the domain-specific context to frame the risks related to the use of a GAI or AI system and categorizes the potential positive and negative impacts from expected and unexpected use along with their metrics. The information obtained from performing the map function can inform decisions about the appropriateness or need for a GAI solution. Outcomes from the map function are the basis for the measure and manage functions, which, together with policies and procedures in the govern function, assist in AI risk management activities.
The NIST risk management framework for GAI published in July 2024 described the heightened socio-technical risks unique to GAI, particularly around misuse and unsafe interactions of humans and GAI systems and the generally immature state of AI safety science today.44 The long-term performance of GAI models is less well understood than that of non-GAI systems, and people’s perceptions, emotions, and behaviors in response to GAI outputs may vary greatly. Actions for organizations to address GAI-related risks are outlined in the NIST profile and include structured public feedback, field testing, red teaming, provenance data tracking, content moderation, and incident disclosure reporting, among others. Key concepts for ethical and responsible use of AI emphasize human centricity, social responsibility, and sustainability, directed through existing and updated governance tools and protocols. Both ISO/IEC 42001 and NIST AI risk management frameworks are available resources for health care organizations that are considering how to implement GAI systems with appropriate safeguards.
UNDERSTANDING RISKS OF GAI
As socio-technical systems mimicking human-crafted content such as language, GAI outputs carry features of complexity, autonomy, opacity, and unpredictability. Understanding the specific risks associated with GAI can help teams and organizations to identify suitable applications for use.
Hallucinations and the Need for Accuracy in Clinical Applications
Inherent to all GAI systems is a tendency to produce confabulations, colloquially known as hallucinations, in which false or even nonsensical information is invented. For this reason, GAI may be a good-enough solution for taking notes in a meeting, distributing cases within a pathology department, or automating certain clerical or drafting tasks (Figure 3; Supplemental Figure 3), but may be inadequate when perfect factual accuracy is required, such as in generating a patient report or making treatment recommendations.
Figure 3. Generative artificial intelligence (AI) offers a broad array of applications that can significantly streamline administrative and creative tasks.
One approach to optimizing the output of GAI is through a process called retrieval-augmented generation, which connects GAI models to an authoritative knowledge base outside of their training data sources; the retrieved sources can then be cited. This empowers users to cross-check the model’s claims, fostering greater trust and confidence in its outputs. In essence, retrieval-augmented generation transforms GAI into an intelligent interface for searching and identifying relevant pieces of information from vast repositories of data, opening doors for applications like augmented customer support, employee training, and improved operational productivity. Importantly, organizations can choose to prioritize vendors who demonstrate a commitment to digital content traceability and explainability in their GAI model development process.
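A simplified sketch of the retrieval-augmented generation pattern is shown below (a toy keyword search over a small in-memory knowledge base and a placeholder generation step; production systems typically use vector databases and a real LLM):

```python
# Toy sketch of retrieval-augmented generation: retrieve relevant passages, then cite them.
knowledge_base = {
    "doc1": "HER2 testing should be repeated when results are equivocal.",
    "doc2": "Frozen section turnaround time targets are commonly 20 minutes.",
    "doc3": "Lynch syndrome screening uses mismatch repair immunohistochemistry.",
}

def retrieve(question, top_k=1):
    """Rank documents by naive keyword overlap with the question (stand-in for vector search)."""
    q_words = set(question.lower().split())
    scored = sorted(
        knowledge_base.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer(question):
    sources = retrieve(question)
    context = " ".join(text for _, text in sources)
    cited = ", ".join(doc_id for doc_id, _ in sources)
    # A real system would pass `context` and the question to an LLM; here we simply echo it.
    return f"Based on the retrieved guidance ({cited}): {context}"

print(answer("What is the frozen section turnaround time target"))
```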
Challenge of Bias
A significant risk associated with GAI is its propensity to learn and perpetuate biases present in its training data, particularly when trained on massive amounts of unlabeled data from the internet. For example, if the majority of information on the internet associates men with careers like engineering and women with careers like nursing, the GAI model will likely reflect these stereotypes in its responses. This happens because GAI does not comprehend ideal future states; it simply identifies and reproduces patterns found in its training data. GAI’s fundamentally retrospective analysis prevents the inclusion of desired deviations. Another potential risk is that if the training data lack diversity and representation, it can lead to homogenized and narrow institutional or culture-normative responses. In the context of pathology, this could manifest as the model overlooking certain diagnoses in children, simply because pediatric cases of specific diseases are less frequently reported online compared with adult cases. Organizations might seek detailed documentation from model developers concerning the provenance and history of data used in training and fine-tuning GAI applications. In addition, organizations can establish clear guidelines to help with third-party transparency and risk management regarding the collection and use of data for GAI model inputs.
Deployment Environment
As with all AI systems, it is important to consider the deployment environment—whether cloud-based or locally installed models—and its implications for user control, data integrity, and operational reliability. The primary downside of a cloud-based AI deployment environment is having to rely on a third-party provider and any related external suppliers for their security practices and compliance with regulations. In addition, there may be issues with network latency and connectivity. A locally installed AI environment provides more direct control over security measures, but it may be vulnerable to physical attacks and requires a higher level of internal technical expertise for cybersecurity and configuration management. Both types of environments require ongoing monitoring for data breaches and may experience scalability limitations, misconfiguration challenges, and inadequate or unsecure access controls.
Malicious Attacks
During their construction, GAI models are susceptible to malicious cyberattacks like data poisoning, where adversaries deliberately introduce corrupted or misleading data into the training set to manipulate the model’s behavior and purposefully bias the outputs. After deployment, GAI models are vulnerable to prompt injection attacks, where malicious actors manipulate input prompts to gain unauthorized access to sensitive data, execute harmful code, or disrupt the model’s intended function. For instance, an attacker could engineer a prompt that causes the model to reveal its internal programming or other confidential information. Robust input validation and prompt engineering strategies can mitigate prompt injection risks. The risk of ransomware attacks on GAI models used for critical systems must also be recognized, given that the health care sector is one of the most affected by such threats. Models might be unavailable for weeks or months, potentially causing major operational issues depending on their deployment context.
User Interactions and Unintended Consequences
The interaction between users and GAI models can result in unintended consequences. Users may overestimate the objectivity and accuracy of these systems, even in the face of conflicting evidence, which can lead to incorrect decision-making (automation bias). Furthermore, variations in user interpretation of GAI outputs may cause misunderstandings or misinterpretations. Another concern is that users may manipulate prompts to circumvent safety measures intended to block the creation of harmful or unsuitable content.
Privacy and security risks are inherent to GAI systems. If personally identifiable information is included in training or fine-tuning data sets, there is a chance that a model may memorize and inadvertently regenerate a small fraction of this information, risking confidentiality breaches.45 Models may also infer private information not disclosed by the user or included in training data by combining information from disparate sources.46 Even if the inferences are not correct, this could have a negative impact on an individual. Additionally, sensitive data may be inadvertently collected during user interactions with deployed GAI systems, resulting in unintended exposure or misuse. An organization’s potential exposure for privacy violations can be reduced by using data minimization principles to collect and store only data necessary for completing business tasks.47
Organizations should prioritize education and training of users to familiarize them with the limitations of GAI and how to use it safely and effectively. Organizations should establish clear policies governing the appropriate use of GAI tools within the organization and conduct periodic audits to ensure compliance and identify potential issues.
Trust and Ethical Considerations
Transparency and explainability, along with trustworthiness and accountability, are fundamental to the adoption of AI models.48,49 Trust could be lost if patients realize that AI tools have been used without their explicit consent or after their explicit request that they not be used.50,51 If the values of the patient are not specifically incorporated into physician decision-making, the use of an AI algorithm may lead to decisions counter to those values.52 Physicians may lose trust if they find that the “black boxes” contain biases that affect specific subgroups of patients.53 There are also ethical considerations related to overreliance on systems that may be flawed and that can lead to unintended maleficent patient outcomes.54
The scope of these GAI-associated ethical considerations is being discussed by governments, tech companies, and organizations at all levels; at this point of GAI emergence, it is challenging to define when, where, and for what cases GAI algorithms should be applied.55 Many are attempting to create a set of guiding principles that will bolster the development of trustworthy implementations of AI systems in health care and laboratory medicine, some at the highest levels.56–61 The need for ethics to be at the forefront of our minds in all of our diagnostic decisions has never been greater.44
APPLICATIONS OF GAI IN PATHOLOGY
Whatever the shortcomings, GAI is here to stay. It is up to physicians to create the underpinnings of a trustworthy framework in which GAI models are implemented. If trained correctly and implemented safely, GAI models may have the potential to reduce human errors, biases, and variability in clinical diagnoses and disease management. In the United States, we know from the seminal Institute of Medicine report To Err Is Human that between 44 000 and 98 000 Americans die every year because of preventable medical errors.62 It is estimated that diagnostic errors affect more than 12 million Americans, or at least 1 in 20 US adults, each year.63 Most diagnostic errors have the potential for serious harm, are not related to rare or unusual diseases, and frequently involve problems with history taking and ordering the right diagnostic tests,64 issues that may be amenable to improvement with GAI tools. These tools may also allow for meaningful increases in the scale and scope of health care services provided to marginalized, poor, and/or remote populations that have difficulty accessing traditional hospital or clinic settings.
Even with their limitations, GAI models already have the potential to streamline a wide range of time-consuming tasks, such as creating and editing images and presentations, managing schedules and email, writing and analyzing code, and automating administrative tasks (Figure 3). Supplemental Figure 3 provides examples of the numerous commercially available applications that can be used for these purposes. Some possible applications for the field of pathology are described below.
Automated Annotation and Image Analysis
GAI models may be able to automatically annotate and analyze digital pathology images, identifying and classifying cells, tissues, and morphologic features of interest. This can expedite the diagnostic process, reduce interobserver variability, and potentially uncover subtle patterns that might be missed by human observers.
Simplified and Translated Pathology Reports
GAI models may be able to generate clear and concise summaries of pathology reports in a patient’s native language and/or make diagnostic features easier to understand for nonspecialists. They may assist pathologists in their report preparation, improving report clarity and consistency and facilitating communication with clinicians and patients.
Drug Discovery and Development
GAI tools can support drug discovery by generating novel molecular structures with desired properties. This can accelerate the identification of potential drug candidates and streamline the drug development process. A notable example is Google DeepMind’s AlphaFold (https://alphafold.com), which predicts highly accurate 3D protein structures from amino acid sequences.
Summaries and Key Findings of Patient Charts
Medical charts contain a vast amount of information, including patient histories, diagnoses, laboratory results, treatment plans, and physician notes. Managing and interpreting these data can be both time-consuming and prone to human error. GAI models may be able to analyze the content and highlight information based on its relevance and significance to a prompt from a pathologist. They may also be able to generate a coherent and concise summary that encapsulates the essential information, making it easier to quickly grasp the patient’s current status and needs.
Specific applications of GAI in pathology will be explored in greater depth in the associated articles of this special section.
CONCLUDING REMARKS
The field of GAI is moving at a rapid pace and is poised to reshape many aspects of health care. Data generated within the realm of anatomic and clinical pathology are increasingly being used to develop models for diagnostics that predict patient outcomes, treatment plans, and disease prognoses, extending its impact far beyond the confines of pathology itself. Patients, too, are likely to embrace GAI tools to gain a better understanding of their health care journey.
The swift evolution of novel GAI capabilities has made it difficult to apply existing regulatory and safety frameworks, presenting challenges for agencies tasked with overseeing the evaluation of these tools and models in health care. The rapid pace also makes it difficult to gain consensus on best practices that can be formalized as safety regulations. It is therefore imperative that pathologists proactively engage with this technology, fostering a collective understanding of its implementation and regulation within their own working environments. To ensure the accurate and ethical use of GAI, pathologists should prioritize the use of interpretable and explainable AI methods. They should insist on functionalities that acknowledge uncertainty, such as withholding answers when the correct output is unclear (called unsolvable problem detection65) and/or providing confidence scores alongside AI-generated results, thus empowering health care providers to make well-informed decisions.
It is crucial to remember that although GAI models can be powerful tools, they are not infallible. They are built on mathematical algorithms and historical data, lacking the empathy and nuanced understanding that characterize human clinicians. As we’ve explored, these models can inherit biases from their training data and might struggle with rare diseases or atypical presentations. The expertise of pathologists and other clinicians remains essential, and maintaining a “human-in-the-loop” approach is paramount to ensure proper oversight of GAI-generated insights, ultimately safeguarding optimal patient care.
As the field evolves, there will be many opportunities for pathologists to actively participate in the integration of GAI applications into teaching, research, and clinical care. Pathologists who are interested in GAI are encouraged to work closely with the information technology and data analytics staff at their institution to allow for an open exchange of ideas and facilitate patient-centered innovation, and consider getting involved in the Association for Pathology Informatics, the College of American Pathologists, and the Digital Pathology Association. Interested pathologists should also educate their students, residents, and fellows about the safe and responsible use of AI. Lastly, pathologists should share their experiences through training materials, presentations, abstracts, and published research, and consider supporting collaborative AI research.
We would like to thank the staff from the Association for Pathology Informatics, the College of American Pathologists, and the Digital Pathology Association for their assistance in coordinating this cross-organizational, multiauthored effort. OpenAI’s GPT-4 and Google’s Gemini were used to improve the flow and wording of this manuscript. The authors take full responsibility for the content of the manuscript, including the parts produced by GAI.
References
Author notes
Supplemental digital content is available for this article at https://meridian.allenpress.com/aplm in the February 2025 table of contents.
Singh and Kim are co–first authors. Gu is now located at Computational Pathology & AI Center of Excellence, School of Medicine, University of Pittsburgh, Pennsylvania, and Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania.
Competing Interests
Hoekstra is an employee of the College of American Pathologists. The authors have no relevant financial interest in the products or companies described in this article.