Machine learning applications in the pathology clinical domain are emerging rapidly. As decision support systems continue to mature, laboratories will increasingly need guidance to evaluate their performance in clinical practice. Currently there are no formal guidelines to assist pathology laboratories in verification and/or validation of such systems. These recommendations are being proposed for the evaluation of machine learning systems in the clinical practice of pathology.
The objective of this white paper is to propose recommendations for performance evaluation of in vitro diagnostic tests on patient samples that incorporate machine learning as part of the preanalytical, analytical, or postanalytical phases of the laboratory workflow. Topics described include considerations for machine learning model evaluation, such as risk assessment, predeployment requirements, data sourcing and curation, verification and validation, change control management, human-computer interaction, practitioner training, and competency evaluation.
An expert panel performed a review of the literature, Clinical and Laboratory Standards Institute guidance, and laboratory and government regulatory frameworks.
Review of the literature and existing documents enabled the development of proposed recommendations. This white paper pertains to performance evaluation of machine learning systems intended to be implemented for clinical patient testing. Further studies with real-world clinical data are encouraged to support these proposed recommendations. Performance evaluation of machine learning models is critical to verification and/or validation of in vitro diagnostic tests using machine learning intended for clinical practice.
Adoption of machine learning (ML) in the pathology clinical domain has gained momentum and is rapidly advancing. As ML-based clinical decision support (CDS) systems continue to mature, laboratories will increasingly need support to evaluate their performance in clinical practice. The medical domain of pathology has traditionally contributed vast nonimaging data to clinical information systems (eg, chemistry analyzers, hematology results, microbiology species, bioinformatic molecular pipelines, mass spectrometry). With pathology continuing to undergo digital transformation, a myriad of digital imaging data (eg, whole slide imaging, digital peripheral blood smears, digital plate reading, imaging mass spectrometry) are being generated by devices in pathology laboratories and used for clinical decision making.1 Image-based CDS systems in pathology have emerged for a variety of use cases, including white blood cell classification in peripheral blood smears, screening of gynecologic liquid-based cytology specimens, serum protein electrophoresis interpretation, microbiology automated culture plate evaluation, parasite classification in stool samples, and detection of prostatic carcinoma in whole slide images of tissue section specimens.2–7 However, ML-based CDS systems are not limited to imaging; they also include the analysis of nonimaging data. Use cases include evaluation of preanalytic testing errors,8,9 patient misidentification via wrong blood in tube errors,10,11 redundant analyte testing,12,13 establishing reference intervals,14 amino acid profiles/mass spectrometry,15,16 and autoverification.17,18 Additionally, electronic health records can leverage laboratory instrument data and other patient information, including multimodal data,19 to train ML models for prediction of specific patient outcomes (eg, sepsis prediction,20 acute kidney injury,21 patient length of stay22). The vision for clinical use of ML-based CDS systems is apparent; however, evidence that these systems are safe and efficacious for widespread clinical use is lacking. Patient care is a safety-critical application, where errors may have significant negative consequences. ML-based systems intended to support or guide clinical decisions require a level of assurance beyond that needed for research or other nonclinical applications.23 The use of CDS in pathology must be assured with sufficient evidence that it is fit for purpose (ie, diagnostic testing), has defined performance characteristics, is integrated into clinical workflows (eg, with verification and/or validation), and undergoes vigilant ongoing monitoring.
In principle, CDS using ML-based systems can either be designed for use with pathologists in the loop as a human-managed workflow, or autonomously to directly influence an intended use. Autonomous ML-based systems in pathology can be defined as any artificial intelligence (AI)–based methods that perform any critical part of a pathologist's clinical duties (eg, diagnostic reporting) without the agreement and oversight of a qualified pathology healthcare professional. Evidence to support clinical use of autonomous ML-based systems is lacking. Furthermore, professional societies in the field of radiology have recommended the US Food and Drug Administration (FDA) refrain from authorizing autonomous ML-based software as a medical device (SaMD) until sufficient data are available to demonstrate that these devices are safe for patient care.24 As pathologists practice medicine using ML-based systems, the combination of human and computational competencies has given rise to the term "augmented intelligence."25,26 The American Medical Association (AMA) has promoted this term to represent the use of ML-based systems that supplement, rather than substitute for, human healthcare providers. The concept of augmented intelligence is based upon achieving enhanced performance when the ML-based model is used in conjunction with the trained healthcare professional, compared to the performance of either alone. Examples of augmented intelligence in pathology include assistive tools that may impact pathologists' diagnostic capabilities, such as identifying suspicious regions of interest on a slide for review, probabilistic-based protein electrophoresis interpretation, or tumor methylation–based classification to support clinical decision making.7,27,28
Development of formal guidelines to deploy ML-based models in medical practice is challenging. Medical practice has arguably been founded on knowledge-based expert systems. Human-derived knowledge has been encoded as rules and relies on information available from experts in a specified domain. Decisions made through these rule-based systems and the decision-making process underlying the algorithm can be easily traced. Rule-based expert systems use human expert knowledge to solve real-world problems that would normally necessitate human intelligence, and such systems encoded in knowledge bases have been developed since the 1970s.29–32 The advancement in technology and computing resources, coupled with a digital transformation of the pathology field, has demonstrated a need to provide further guidance for systems that incorporate emerging technologies such as ML-based CDS. Typically, when organizations put forth formal validation guidelines, they are based on extensive meta-analysis and expert review. The College of American Pathologists (CAP) Pathology and Laboratory Quality Center for Evidence-Based Guidelines has published guidelines for clinical implementation of whole slide imaging, digital quantitative image analysis of human epidermal growth factor receptor 2 (HER2) immunohistochemical stains, and molecular testing.33–35 The final recommended guidelines encompass substantial evidence-based data, including systematic literature review, a strength-of-evidence framework, open comment feedback, expert panel consensus, and advisory panel review. Due to the relatively few ML-based applications currently used in clinical settings, the evidence available to formulate guidelines is lacking. Additionally, ML systems differ fundamentally from rule-based expert systems, conventional digital image processing or computer vision techniques, or other traditional software or hardware implementations in the laboratory. Conventional digital image processing (ie, image analysis) uses relatively defined image properties that are explainable and reproducible, relying on manually engineered features (eg, cell size, contours, entropy, skewness, brightness, contrast, and other Markovian features).36–40 In contrast, ML-based systems "learn" features or patterns from the training data, and the resultant ML model may have inherent biases. ML models are generally tested on data not used in the training of the model (ie, unseen held-out test data) to estimate the generalization performance.41 In addition, explainable ML models may aid in performance evaluation, safeguard against bias, facilitate regulatory evaluation, and offer insights into the augmented intelligence decision-making process.42,43
The increasing utilization of ML-based approaches has facilitated the development of ML models relevant for many applications pertinent to pathology practice. Implementing ML-based CDS will transform pathology practice. ML-based systems may be authorized for an intended use by regulatory bodies for clinical applications, or they may be deployed clinically by a laboratory (eg, local deployment site), as a laboratory-developed test (LDT). There are commercially available systems, including some authorized by the FDA. Pathologists need to know how to evaluate the analytical and clinical performance of ML models and understand how ML-based CDS systems integrate with the patient care paradigm. Currently, data on real-world clinical use is insufficient to develop formal evidence-based guidelines, as clinically deployed benchmarks of ML model performance evaluations are still emerging; thus, no published guidelines exist for clinical evaluation of ML-based systems in the pathology and laboratory medicine domain. This paper provides proposed recommendations relating to performance evaluation of an ML-based system intended for clinical use.
Scope of Guidance
This guidance document intends to (1) provide information on performance evaluation of systems that utilize ML derived from laboratory-generated data and that are intended to guide pathology clinical reporting of patient samples; (2) discuss aspects pertinent to evaluating the performance of a pretrained ML model intended to be verified and/or validated at a deployment laboratory site; (3) discuss ML models, either procured commercially or established as LDTs, that include ML in any part of the test and are used in reporting of patient samples; and (4) provide a proposed framework to inform performance evaluation of ML-based software devices with a focus on verification and validation (eg, analytical, clinical) of a given ML model implemented at a laboratory using real-world data from the local deployment site.
These ML models may be used in conjunction with imaging-based microscopy in surgical pathology or laboratory medicine or use clinical nonimaging data. The goal of performance evaluation is to assess the model’s ability to generalize the features learned from the training data and render predictions from the local deployment site laboratory data. To improve the generalizability of the ML model, additional training (eg, recalibration) using data retrieved from the deployment-site laboratory can be performed prior to verification and/or validation, and ultimately before use in a clinical setting as an LDT.
This guidance document provides recommendations on performance evaluation of clinical tests that utilize ML in any part of the preanalytic, analytic, or postanalytic workflow. Outside of the scope of this document are (1) ML models deployed outside the clinical laboratory and pathology, even if they use laboratory data (eg, electronic medical record); (2) development of an ML-based model; (3) guidance on study design for model training and testing;44 (4) guidance on local site retuning of parameters or retraining using deployment-site local data; and (5) software that solely retrieves information, organizes, or optimizes processes. ML basics will not be reviewed in detail; readers may reference a prior publication introducing ML in pathology.45 The authors encourage publication of additional literature to generate data to strengthen these recommendations and future formal guideline creation. The generation of evidence-based data will enable model credibility and build trust surrounding the predictive capability of a given model.
EVALUATION CONSIDERATIONS
Pathology and laboratory medicine healthcare professionals are familiar with incorporating new diagnostic testing procedures and technologies into clinical practice. This includes new laboratory instrumentation, deployment of digital pathology systems, onboarding and optimization of new immunohistochemistry clones, and the required verification and/or validation of new procedures and technologies before clinical testing on patient samples. If certain clinical tests are not commercially available, laboratories have conventionally generated their own tests and testing protocols, referred to as an LDT, to support emergent clinical needs and clinical decision making. ML models are similarly available commercially through industry entities or can also be deployed as LDTs if being used for patient testing.
As laboratories historically have evaluated, and continue to evaluate, performance of new tests being offered clinically, laboratories should pursue performance evaluation of ML models based on intended use cases; the evaluation should be carried out in the same fashion as the test will be clinically implemented. Performance evaluation should include inputs related to the intended use case and assessed similarly to clinical practice (eg, imaging modalities, tissue type, procedure, stain, task, etc). Each component of the ML-based system should be evaluated in the assessment as the system will be clinically deployed. At the completion of any change to the evaluated system, reevaluation should be conducted, to ensure the system is working as intended and performance characteristics are defined.
Terminologies and definitions relating to ML performance evaluation, such as validation, vary based on the respective domains (eg, clinical healthcare, regulatory, ML science). Table 1 provides a glossary of terms as they are defined and used in this document. Laboratory evaluation of an unmodified FDA-authorized (eg, cleared, approved) ML model, which has already undergone external validation, may only require limited verification showing that the deployment laboratory's use case fits those defined in the authorized use and that performance matches that predicted in the regulatory review. Using a modified FDA-authorized ML model, or one used outside the stipulations of the intended use, requires additional development of an LDT, establishment of the performance characteristics, and approval from the laboratory medical director. The Clinical Laboratory Improvement Amendments of 1988 (CLIA)46 are federal standards that provide certification and oversight to laboratories performing clinical testing in the United States. CLIA regulations require LDTs to be developed in a certified clinical laboratory and the performance characteristics to be evaluated prior to reporting patient test results.47 The laboratory should continue to follow mandated regulatory standards for maintaining quality systems for nonwaived testing46 (Figure 1).
In evaluating the performance characteristics of a test, clinical verification and validation procedures are commonplace in the laboratory. Table 2 reflects the definitions for verification and validation based on various laboratory, standards, and regulatory organizations. LDTs currently do not require FDA authorization; however, CLIA requires the laboratory to establish performance characteristics relating to the analytical validity prior to reporting test results from patient samples. The LDT is limited to the deploying CLIA-certified laboratory and may not be meaningful to, or used clinically by, other nonaffiliated laboratories. In the pathology literature, analytical validation studies using ML-based models predominate; however, relatively few have included comprehensive true clinical validation of an end-to-end system with a pathology healthcare professional using these tools and reporting results compared to a reference standard.27,48–54
Risk Assessment
The CAP supports the regulatory framework that proposes a risk-based stratification of ML models.55 The scope of the performance evaluation should be guided by a comprehensive risk assessment to evaluate both potential sources of variability and error and the potential for patient harm. The risk assessment must include failure mode effects analysis, a process to identify the sources, frequency, and impact of potential failures and errors for a testing process. Aspects of the risk assessment include the following:
Preanalytic, analytic, and postanalytic phases of testing (see Predeployment Considerations section);
Intended medical uses of the software and impact if inaccurate results are reported (clinical risk);
System components (environment, specimen, instrumentation, reagents, and testing personnel); and
Manufacturer/developer instructions and recommendations for the intended use (if applicable).
The laboratory director must consider the laboratory's clinical, ethical, and legal responsibilities for providing accurate and reliable patient test results. Published data and information may be used to supplement the risk assessment but are not substitutes for the laboratory's own studies and evaluation. Representative testing personnel from the laboratory should be involved in the risk assessment and performance evaluation.56
There are different approaches to risk stratification of ML models. The risk classification schema from the SaMD risk categorization framework is based on the significance of the information provided by SaMD toward healthcare decisions (eg, inform or drive clinical management, diagnose, or treat) and the severity of the healthcare situation or condition itself (eg, nonserious, serious, critical).55,57 A risk-stratification approach put forth by the American Society of Mechanical Engineers (ASME), the risk assessment matrix of ASME V&V40-2018, considers the combination of the model influence and the consequence the output decision might carry.58 The model's influence can be defined as the measure of its contribution to the resultant decision compared to other available information, or the power of the model output in the decision-making process. The risk categorization of the ML-based CDS software may also depend on the characteristics of the underlying ML model (eg, static model versus adaptive/continuous-learning model) (Figure 2).
Predeployment Considerations
Prior to clinical laboratory deployment of ML-based CDS software, pathology personnel should be familiar with the laboratory integration points and be aware that preanalytic variables may have a downstream impact on the model outputs. Furthermore, laboratory processes can impact the performance of the software system, and an integrated clinical workflow design is needed for the deployment site. This ensures that the data received by the model and the outputs it returns are based on the intended clinical workflow.
Model Development and Properties
Understanding the model training and testing environment is helpful to evaluate the model for the intended deployment context. The intended use of a model will facilitate scoping the performance evaluation in the accepted and expected patient sample cohort.59 ML models range from detection tasks (eg, localization, segmentation) to classification, quantification, regression (eg, continuous variable), or clustering, amongst others. Dataset curation is one of the earliest steps of model development, a portion of which becomes the training data from which the selected model extracts features and on which it bases the outputs. Developers should provide the medical rationale (eg, clinical utility) of the model. The sample size and distribution of data used for model training, validation, selection, and the testing set should be described, including the composition of the training, validation, and test sets as well as the data sampling strategies and other weighting factors (eg, class weights used in the loss function). In addition, the description of the data should also cover any inclusion and exclusion criteria that were applied during curation of the datasets, or how missing data may have been handled during the model development process. If data were annotated, understanding how and by whom those annotations were provided can provide insights into model development and affect subsequent performance evaluation design. Furthermore, comprehending the reference standard used for initial model training and testing can provide confidence that the model is appropriate for evaluation in the clinical setting and fits the proposed use case for laboratories. The laboratory should be provided with the stand-alone performance of the model for each included sample type as it was tested; this includes accuracy, precision (eg, positive predictive value), recall (eg, sensitivity), specificity, and any additional information related to the reportable range or values.
Data Characteristics (Case Mix and Data Compatibility)
The difference in data distributions between the samples used to train the model and those used for performance evaluation at a deployment laboratory site can be analyzed through a data compatibility assessment. The case cohorts and data compatibility are essential to estimate generalizable performance of the pretrained model at the deployment laboratory. The specific data used to train the ML model should be explicitly stated in the operating procedures. The deployment site laboratory can infer cohort compatibility based on appropriate measures related to the intended patient and specimen characteristics. These should include the data properties from both clinical and technical standpoints. Clinical parameters include but are not limited to patient demographics, specimen type, specimen collection body site, specimen collection method (eg, biopsy, resection), stains and other additives (fixatives, embedding media), analyte, variant, additive, container (eg, tube type), and the diagnosis. Technical parameters include data type, file extension, instrumentation hardware and software versions, relevant data acquisition parameters, imaging resolution, units, output value set (eg, classes, reportable range), and visualizations. A thorough predeployment evaluation of the clinical and technical parameters will help ensure the ML model works as intended.
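As an illustration only, the sketch below compares one clinical parameter (specimen-type case mix) and one technical parameter (imaging resolution) between a developer-reported training cohort and deployment site data using standard two-sample tests. The parameter names, counts, and values are hypothetical assumptions, and a formal compatibility assessment should follow the laboratory's own procedures.

```python
# Hypothetical data compatibility sketch: compare training-cohort and
# deployment-site distributions for one categorical and one continuous parameter.
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp

# Specimen-type counts (biopsy, resection, cytology) reported by the developer
# versus counts observed at the deployment laboratory (hypothetical values).
training_counts = np.array([600, 300, 100])
deployment_counts = np.array([150, 200, 150])
chi2, p_cat, dof, expected = chi2_contingency(np.vstack([training_counts, deployment_counts]))
print(f"Specimen-type case mix differs between cohorts? p = {p_cat:.4f}")

# Continuous technical parameter, eg, scanned image resolution in microns per pixel
# (hypothetical samples drawn here only to make the example self-contained).
training_mpp = np.random.default_rng(0).normal(0.25, 0.01, 500)
deployment_mpp = np.random.default_rng(1).normal(0.50, 0.02, 200)
stat, p_cont = ks_2samp(training_mpp, deployment_mpp)
print(f"Resolution distributions differ between cohorts? p = {p_cont:.4f}")
```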
Ethics and Ethical Bias
Bias is inherent in almost all aspects of medical practice, including ML-based systems. Different types of bias can affect the model outputs and performance characterization. Algorithmic bias refers to systematic and unfair discrimination that may be present in a model; 2 predominant forms, representation bias and measurement bias, are described here. FAIR (findability, accessibility, interoperability, and reusability) guiding principles were described to support meaningful data management and stewardship, and can help mitigate bias.60–62 Representation bias (eg, evaluation bias, sample bias) refers to an ML model that omits a specific subgroup or characteristic within the training data compared to the eventual data the model will be used on clinically. Studies showing underrepresented populations in medical data are emerging, and such underrepresentation may impact model performance in various subclasses.63–65 Even within open-access medical databases, predictive modeling shows the ability to generate site-specific signatures from histopathology features and accurately predict the respective institution that submitted the data.66 Measurement bias refers to the model being trained on proxy data or features rather than the ideal intended target; for instance, a model trained on data from a single instrument and planned for deployment at a site with a similar instrument from a different vendor.67 Measurement bias can also be related to the data annotation process, where inconsistently labeled data may cause bias in the model outputs. A definitive reference standard should be used and qualified during the ML model performance evaluation (see Reference Standard section). Bias should be considered in the evaluation process and include equal and equitable review of the inputs and outputs. Algorithmic bias in ML-based models has been shown to produce demographic inequities.68 Equality ensures identical assets are provided for each group regardless of differences; equity acknowledges that differences exist between the groups and allocates weight to each group based on its needs. Evaluation of ethics across multiple domains has engendered the need for describing transparency, justice and fairness, nonmaleficence, responsibility, and privacy as part of ML model development and deployment.69,70
Laboratory Influences
Due to the variation in laboratory protocols and practices, inherent biases and influences exist that preanalytic processes may embed within the laboratory's data. These variables can impact downstream model development and deployment. To ensure appropriate model performance, the ML model should ideally be trained across the range of all possible patient samples to which the model will eventually be exposed in clinical practice. While this is not always possible, when evaluating model performance at deployment sites, careful attention should be given to patient sample subgroups, especially those with clinical significance. Variations in sample collection, preparation, processing, and handling may affect model outputs if the model has not been exposed to the observed laboratory data variation during model training.
While it is recommended to ensure good laboratory practices71–73 in order to sustain appropriate patient care, as well as maintain consistent data inputs for model predictions, there are inherent intralaboratory and interlaboratory variations that need to be taken into account. Laboratories should implement approved procedures to support a robust quality control (QC) process to mitigate and minimize defects. QC defects may impact ML models, and remediating such defects is arguably more important for ML-based evaluation than for human evaluation, because ML models may be more sensitive to these artifacts than human observers. Conversely, the QC process itself may introduce bias by removing artifacts commonly encountered in specimen preparation, testing, and result reporting. Furthermore, it is likely that QC protocols and operating procedures vary considerably across laboratories, leading to the potential for additional generalizability challenges. Pathologists, and ultimately the laboratory directors, are responsible for assessments in the predeployment phase and through ongoing monitoring of ML models implemented for clinical testing. Representative examples of preanalytic variables and their potential impact on model outcomes are listed in Table 3.
Evaluation Data Sourcing
Quality CDS systems in pathology result from thoughtful design in the developmental stages and careful attention to components of clinical workflow both upstream and downstream. The laboratory must first implement, verify, and validate the digital workflow including additional hardware and software, and be in accordance with pertinent regulatory guidelines.
Comprehensive development of the ML model with an inclusive and diverse training dataset is critical to the performance of the deployed tool. Proper verification or establishment of performance specifications using a deployment site’s real-world laboratory data as a substrate is required. The deployment site’s data (eg, laboratory instrument data, imaging data, sequencing data, etc) should reflect the spectrum of sample types and characteristics typically seen during clinical patient testing at the deployment site. The data should be readily accessible to complete the needed performance characterization of the ML model. All samples used in verification and/or validation of the pipeline should be documented as part of the clinical performance evaluation dataset. The evaluation data used for performance characterization should be further quarantined; it should not be included in any training or tuning of the ML model.
Using Real-World Data
Laboratories intending to use ML-based models in the clinical workflow will need to perform evaluation studies on their deployment site data. Based on the laboratory blueprint and intended uses, intralaboratory and interlaboratory variables need to be assessed. This may require the involvement of clinical informaticists or technical staff (eg, information technology staff) to oversee the evaluation. Laboratory information systems can be queried for patient specimens to identify appropriate retrospective data for determination of performance specifications. Prospective data can be used for evaluation if the ML model results are not clinically reported prior to verification and/or validation. It is imperative that all clinically meaningful variations are included in the diverse dataset for performance evaluation so that the model is exposed to, and its performance can be evaluated in, the respective subgroups.
Some ML-based models in pathology have been developed using open-access databases. National Institutes of Health databases, such as The Cancer Genome Atlas, have pioneered data sharing and governance for supporting research through a data commons.74,75 However, the literature shows open databases are not without challenges. Open datasets have limitations (eg, batch effects relating to local laboratory idiosyncrasies) that introduce biases in ML models.66,76–79 Private data used for model training may also suffer from local site biases and lack of transparency, and may not generalize well to deployment site data if the local sample characteristics differ from those used to train the model. Generalizability of the learned features in the model to the deployment laboratory data will vary depending on the similarity of the represented data distributions.
Intended Use and Indications for Use
Each clinically deployed ML model should have a clinical utility that can be described through clinical needs and use cases. Once the intended use of a pretrained model has been defined, the clinical pipeline from sample collection to clinical reporting should be determined for clinical verification and/or validation. Regulatory terms are helpful in understanding "what for" and "how" (eg, intended use) versus "who," "where," and "when" (eg, indications for use) an ML model is used in the clinical setting. The clinical utility determines the "why." The intended use is essentially the claimed model purpose, whereas the indications for use are the specific reasons or situations in which the laboratory will use the ML model for testing. The model performance evaluation should, therefore, be limited to the intended use and evaluated based on the indications for use. The clinical verification and/or validation procedures should be applicable to both of these terms, including the time point of use (eg, primary diagnosis, consultative, or quality assurance) and using a similar distribution of specimen types to be examined, with enrichment of variant subtypes that are expected to be reproducibly and accurately detected in the deployment site's clinical practice. These samples should be selected to evaluate the performance of the model using appropriate evaluation metrics. While not required, evaluating the ML model with out-of-domain samples (eg, those not indicated as part of the intended use or indications for use) may be appropriate to further characterize model performance and establish model credibility and trust. Out-of-domain testing can support establishing "classification ranges" for the validated model input data (eg, a model intended to detect prostate cancer in prostate biopsies is tested on a prostate biopsy with lymphoma and identifies a focus "suspicious for cancer").
Reference Standard
To appropriately evaluate the performance of a given ML model, a ground truth is needed to compare against the outputs of the ML model. The medical domain may contain noisy data; for example, data derived from archival specimens may reflect different diagnostic criteria, patient populations, disease prevalence, or treatments compared to current practice. Pathology data include diseases that may have been reclassified, or diagnostic criteria that may have changed over time. This temporal data heterogeneity is also inherent within pathology reports, electronic medical records, and other information systems in relation to data quality. These factors should be considered when using historical data inputs for verification and/or validation if they do not accurately and precisely reflect current pathology practice. Patients who have the same disease can demonstrate varying characteristics (eg, age, sex, demographics, comorbidities) that differ based on clinical setting. Prevalence of disease may also differ between geographic areas. Furthermore, certain disease states are challenging even for expert interpretation, causing label noise during the evaluation of the model. Traditional histopathology diagnosis or laboratory value reporting has been regarded as ground truth; however, based on the contextual use of the model, these may only represent a proxy for the ideal target of the model. For imaging data, the diagnosis may also only exist in the form of narrative free text, which may contain nuances and may not be readily comparable with model outputs to establish whether the predictions are accurate.
In certain settings, accuracy may be better expressed as concordance. The performance evaluation may need to include discordance analysis of the reference data to ensure that "inaccurate" model predictions are not in fact errors in the ground truth. A paradox exists in that pathologists' interpretations and laboratory data are clinically reported and fundamentally provide the basis of medical decision making; however, with any assessment there may be inherent human biases in the results.80 This is especially true when correlated with visual imaging quantification.81–83 Regardless, pathologists' diagnostic analyses have been utilized as the prime training labels for ML-based models. Furthermore, there exists a spectrum of confidence for the ground truth of a given test. For instance, predicting HER2 amplification from the morphology of a hematoxylin-eosin slide can be directly compared to in situ hybridization results of the same tissue block. Tumor heterogeneity aside, there can be relatively high confidence in comparing the model output and HER2 probe signals. Similarly, a regression model predicting sepsis from laboratory instrument inputs can be correlated to sepsis diagnostic criteria.84–86 However, for challenging diagnoses or those with high interobserver or interinstrument variability,87–101 a consensus or adjudication review panel may be appropriate for establishing the reference standard for a given evaluation cohort. The reference standard may differ for various models depending on their intended use, output, and interaction in the pathology workflow. Furthermore, the reference standard requires comparable labels for comparison to the model outputs. Weakly supervised techniques may obviate manual human expert curation by deriving the presence or absence of a tumor from synoptic reporting data elements linked directly to the digital data. However, large numbers of data elements linked to their respective metadata may be required.102 Some reference data, such as diagnostic free text, may not be structured and easily exportable, and may require manual curation to identify reference data points for comparison. Additionally, development of a true reference standard will require effort to ensure the labels are accurate. For example, in surgical pathology, the diagnostic report is conventionally described at the specimen part level, inclusive of findings across all slides of the surgical specimens. The digital slides used for evaluation of the model thus may need to be carefully selected; otherwise they may not be representative of the features being evaluated in the ML model. The final reference standard and testing protocol should be accepted and approved by the laboratory director prior to starting the performance evaluation process, to mitigate potential errors with use of the CDS tool.
PERFORMANCE EVALUATION
Performance evaluation of any test deployed in a clinical laboratory for patient testing is an accreditation and regulatory requirement under CLIA 1988.46 Pathology departments follow manufacturer instructions, regulatory requirements, guidelines, and established best practices to perform, develop, or provide clinical testing. Following regulatory requirements is necessary prior to patient testing, and performance characteristics should be specified for a given clinical test. Generalizability of an ML model is tested through performance evaluation, to ensure the model renders accurate predictions given the deployment site’s data, compared to the data that were used to train the model. Differences in data representation (eg, units, class distribution, interlaboratory differences) may bias the trained model and performance at the deployment site may be low (eg, underfit) due to the data mismatch. While there is literature supporting the hypothesis that models trained using data from multiple sites may have higher performance, it is the responsibility of the deployment site laboratory to verify and/or validate the model using the real-world data expected to represent the case mix and distribution of the implementing laboratory (eg, single site versus receiving samples from other sites).103–105
Software systems can be evaluated using white-box or black-box testing. In white-box testing, evaluation is performed through inspection of the internal workings or structures of the system and the details of the model design. In black-box testing, the evaluation is based entirely on examination of the system's functionality without any knowledge of design details. Evaluation is performed by passing an input to the system and comparing the returned output with the established reference standard. The system performance is then characterized using evaluation metrics that quantify the difference between the returned value (eg, model output) and the expected value (eg, reference ground truth label), aggregated over multiple samples. In most cases, evaluation of ML models for clinical practice relies on black-box testing, because the internal workings of the evaluated models cannot be inspected (eg, in cases of proprietary software) or are highly complex and difficult to interpret (eg, deep neural network models with innumerable parameters and many nonlinearities). The goal of ML model evaluation is to derive an estimate of the generalization error, which represents a measure of the ability of a model trained on a specific set of samples in the training set to generalize and make accurate predictions for unseen data in the validation or test set. It is important to note that performance evaluation only provides an estimate of the generalization error: the true generalization error remains unknown, as the full population of data on which the model could be tested is unbounded. The performance estimate depends on the size and composition of the evaluation set and how closely samples in the training data resemble the patient or specimen population for which the model will be used in clinical practice. Additionally, the performance characterization will be impacted by how closely the clinical test procedure resembles the intended use.
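A minimal sketch of black-box evaluation is shown below, assuming a hypothetical deployed_model object with a predict method and evaluation cases carrying reference-standard labels; the interface names are placeholders, not a prescribed API.

```python
# Black-box evaluation loop: the model is treated as an opaque function from
# inputs to outputs, its outputs are compared against the laboratory's
# reference standard, and the differences are aggregated with evaluation metrics.
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate_black_box(deployed_model, evaluation_cases):
    """Collect reference labels and model outputs for held-out evaluation cases."""
    y_true, y_pred = [], []
    for case in evaluation_cases:
        y_true.append(case["reference_label"])                      # reference standard
        y_pred.append(deployed_model.predict(case["input_data"]))   # opaque model output
    return y_true, y_pred

# Example aggregation over the evaluation set (placeholders assumed above):
# y_true, y_pred = evaluate_black_box(deployed_model, evaluation_cases)
# print(confusion_matrix(y_true, y_pred), accuracy_score(y_true, y_pred))
```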
Clinical Verification and Validation
Performance evaluation in a laboratory consists of at least 2 concepts: verification and validation. Verification derives from the Latin root verus (true), whereas validation derives from validus (strong). Clinical verification is the process of evaluating whether the ML model is trained correctly; clinical validation evaluates the ML model using laboratory real-world data to ensure it performs as expected. This is not to be confused with the technical ML concept of validation, which is intended to find data anomalies or detect differences between existing training and new data, and subsequently adjust hyperparameters of the model to optimize performance (eg, tuning). While beyond the scope of this document, satisfactory software code verification should be performed by the model developers prior to clinical verification or validation. Clinical validation confirms the performance characteristics of the ML model using the deployment laboratory's real-world data. Validation is needed for ML models that have been internally developed by the laboratory (ie, LDT), ML models that have been modified from the published documentation of the original manufacturer or process developer (eg, modified FDA-authorized ML model), or an unmodified ML model used beyond the manufacturer's statement of intended use.
Clinical Verification
Clinical verification focuses on development methodologies (eg, technical requirements). Performance of the software system may be tested in silico using a dedicated testing environment and a curated test dataset. This is similar to the regulatory concept of analytical validity, where ML scientists work in conjunction with subject matter experts to evaluate performance of a trained model on a specified dataset. Clinical verification may not specifically involve prospective clinical patient results reporting; instead, it gives the implementing laboratory visibility into model performance by comparing the model outputs with the known, specified reference standard. Curation and labeling of the dataset should be performed in close collaboration with, and under the supervision of, pathologists to ensure that the dataset is representative and ground truth labels are correct. The verification process may be used for an unmodified FDA-authorized ML model where the specifications of the manufacturer are used as directed. Verification of an unmodified FDA-authorized ML model should demonstrate that the performance characteristics (eg, accuracy, precision, reportable range, and reference intervals/range) match those of the manufacturer.
When the evaluation dataset is predetermined, the prevalence of the medical condition in that dataset is artificially set compared to the real-world clinical setting. To avert selection bias, it is imperative to select datasets for clinical verification performance evaluation that reflect the degree of variation of the intended condition(s) seen by the deployment site laboratory, and to compare the resulting performance against the performance characteristics of the ML model established by the manufacturer or developer. Quantifying the extent of relatedness of the verification samples is intended to mitigate selection bias during dataset selection to evaluate the performance of the ML model. Selected specimens can range from being "near identical" to being completely unrelated. Determining the relatedness of the ML model manufacturer's dataset to the deployment laboratory's data is helpful for interpreting the results of the performance evaluation and can suggest the generalizability of the ML model.106 Distribution of the sample characteristics includes preanalytic and predictor factors such as patient characteristics, laboratory processes, and hardware used to process the samples (eg, stainers, whole slide scanners, chemistry analyzers, genomic sequencers).
Clinical Validation
Clinical validation focuses on clinical requirements to evaluate the ML model in a clinical production environment. Clinical validation requires end-to-end testing, including each component of the complete integrated clinical workflow, using noncurated, real-world data. Clinical validation intends to demonstrate the acceptability of the ML model being used in a clinical setting. In the case of a CDS system, clinical validation generally compares pathologists' reporting of test results with and without access to the model outputs. Clinical validation is akin to the regulatory concept of clinical validity, or the ML concept of evaluating the performance on an external "validation" dataset. The ML concept of internal (technical) validation using k-fold cross validation or split-sample validation is out of scope for this paper, as these techniques are generally used for model selection during model development. Prior to, or as part of, the clinical validation, a controlled reader study may be performed where all components of the model are tested with pathologists in the loop. Clinical validation is used to confirm with objective evidence that a modified FDA-authorized ML model or LDT delivers reliable results for the intended application. The validating laboratory (eg, deployment site) is required to establish the performance characteristics that best capture the clinical requirements, which may include accuracy, precision, analytical sensitivity, analytical specificity, reportable range, reference interval, and any other performance characteristic deemed necessary for appropriate evaluation of the test system. Clinical validation can be performed as a clinical diagnostic cohort study where consecutive prospective patients are evaluated in the appropriate medical domain of the intended use of the ML model (eg, satisfying eligibility criteria based on intended use/indications for use). This ensures real-world clinical data from the deployment laboratory are included in the performance characterization and the natural spectrum and prevalence of the medical condition will be observed. Establishing a "limit of detection" for quantitative and qualitative assessment can help provide validation across appropriate clinically meaningful subgroups (eg, stratifying tumor detection by size, flow cytometry gating thresholds). The requirement to define reportable ranges as part of the test results can be exemplified for quantitative results (eg, 0%–100% for prediction of estrogen receptor nuclear-positive tumor cells) or specified values for output classification (eg, Gleason patterns for prostatic acinar carcinoma classification). Reportable range definitions should be included in the standard operating procedures and should guide validation procedures. Defining the reportable range will facilitate analysis of evaluation metrics and the resultant performance characteristics of the ML model. Clinical utility can be further demonstrated and validated where performance characteristics show that the clinical or diagnostic output enhances patient outcomes.107–112
Clinical verification and/or validation are required for all nonwaived tests, methods, or instrument systems prior to use in patient testing.113,114 Furthermore, due to the diversity of diagnostic tests, whether FDA-authorized tests, modified FDA-authorized tests, or LDTs, it is not possible to recommend a single experimental design for verification and/or validation of all ML models. However, the guidance in this document attempts to provide recommendations for evaluation procedures and metrics to support verification and validation of clinical tests that use ML models.
Evaluation Metrics
Laboratories have routinely evaluated performance of instruments, methodologies, and moderate-complexity to high-complexity tests. A variety of performance evaluation metrics are available to laboratorians based on the appropriate model (Table 4). While these evaluation metrics will be briefly discussed here, a more in-depth explanation of these metrics was previously published.45 The ideal metrics used for evaluating the performance of an ML model are guided by the data type of the generated output (eg, continuous values of a quantitative measurement generated by a regression model versus categorical values of a qualitative evaluation generated by a classification model). For outputs on an interval or ratio scale, such as the proportion (percentage) of tumor-infiltrating lymphocytes, quantitative evaluation aims to estimate systematic and proportionate bias between the test method and a reference method (eg, human reader). Qualitative evaluation applies to outputs on a nominal or categorical scale (eg, presence or absence of tumor), as well as to clinical diagnostic thresholds or cutoffs applied to interval or ratio outputs (eg, probabilistic output of a logistic regression model). The data outputs of an ML model can be either nominal or ordinal in a discriminative classification task, where the output is a specified label. In a discriminative regression task, the outputs are on an interval or ratio scale, and the output value is a number that can be a continuous variable (eg, time to recurrence) or a ratio (eg, serum free light chains [κ/λ]). The appropriate evaluation metrics should be selected based upon the specified model intended to be clinically deployed and the variables (independent or dependent) defining the output results. For classification models, discrimination and calibration are used to evaluate the performance.
Discrimination is the ability of a predictive model to separate the data into respective classes (eg, with disease, without disease); calibration evaluates the closeness of agreement between the model's predicted output probabilities and the observed outcomes. Discrimination is typically quantified using the area under the receiver operating characteristic curve (AUROC), or concordance statistic (C-statistic). In isolation, discrimination is inadequate to evaluate the performance of an ML model. For instance, a model may accurately discriminate the relative quantification of Ki67 nuclear stain positivity of specimen A to be "double" that of specimen B (ie, discriminatory relative quantification could be 2% and 1%, respectively); however, their true, absolute Ki67 nuclear stain positivity could be 30% and 15%. Alternatively, a model showing excellent calibration can still misrepresent relative risks (ie, poor discrimination). For example, a model predicting glomerular filtration rate may correctly capture a 50% relative increase in creatinine in 2 patients but may not be able to discriminate between patient A's creatinine rising from 0.4 mg/dL to 0.6 mg/dL and patient B's rising from 1.2 mg/dL to 1.8 mg/dL. This model is arguably of no use, as patient A remains within normal limits, whereas patient B should be evaluated for being at risk of developing acute kidney injury. This model would be considered to have poor discrimination.
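As a brief illustration, discrimination can be summarized with the AUROC (C-statistic) computed from reference labels and model-predicted probabilities; the sketch below uses scikit-learn with hypothetical values.

```python
# Quantifying discrimination with the AUROC (C-statistic) for a binary classifier.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                      # reference standard (1 = disease)
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])     # model probability outputs

auroc = roc_auc_score(y_true, y_prob)                 # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob)      # points for plotting the ROC curve
print(f"AUROC (C-statistic): {auroc:.2f}")
```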
Calibration measures how similar the predicted output probabilities are to the true, observed (eg, absolute) outcomes. Calibration can also be used to fine-tune model performance based on the difference between the actual output and the expected output. For a given set of input data, deployment site calibration can be used to change the model parameters and optimize the model output data to match the reference standard more closely. While calibration is used for performance evaluation, guidance on how to perform local site calibration or retraining of the entire ML model is out of scope for this proposed framework, as the performance of the recalibrated model may deviate from that of the original model. However, as a performance metric, calibration should be analyzed for the final iterated model intended to be clinically deployed. Calibration performance documentation should be assessed as part of the verification and/or validation prior to clinical deployment. If the ML model is updated using local site calibration, the tuning of hyperparameters of the ML model should be reflected as a new version of the model, and the final trained model should be considered the model intended for clinical deployment. Calibration can be evaluated using a graphical representation called a reliability diagram, which compares the model's predicted probabilities with the observed frequencies of the outcome. Calibration can also be evaluated for logistic regression models using the Pearson χ2 statistic or the Hosmer-Lemeshow test.
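The sketch below computes the points of a reliability diagram from hypothetical predicted probabilities using scikit-learn; the Brier score is shown only as one convenient overall summary, and the Hosmer-Lemeshow test mentioned above would require a dedicated statistical package.

```python
# Reliability diagram points and a simple calibration summary for a binary classifier.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])                              # observed outcomes
y_prob = np.array([0.05, 0.2, 0.6, 0.85, 0.3, 0.9, 0.15, 0.7, 0.55, 0.4])      # predicted probabilities

# Observed event frequency versus mean predicted probability per bin;
# a well-calibrated model tracks the diagonal of the reliability diagram.
obs_freq, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
for p, o in zip(mean_pred, obs_freq):
    print(f"mean predicted {p:.2f} -> observed frequency {o:.2f}")

# Brier score as an overall measure reflecting both calibration and discrimination
print(f"Brier score: {brier_score_loss(y_true, y_prob):.3f}")
```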
Performance evaluation of ML models may show good discrimination and poor calibration, or vice versa. A performant ML model should show high levels of discrimination and calibration across all classes. It is also possible that the model shows poor calibration amongst a subset of classes (eg, shows high calibration for Gleason pattern 3, but poor calibration for Gleason pattern 5). Calibration, or goodness of fit, has been termed the Achilles heel of predictive modeling: while critically important in evaluation, it is commonly overlooked or misunderstood, potentially leading to poor model performance.115 Poorly calibrated models can either underestimate or overestimate the intended outcome. If model calibration is poor, it may render the CDS tool ineffective or clinically harmful.115–117 The deployment site laboratory should not report clinical (patient) samples until performance characterization has been completed; ML model updating should be considered in case of poor discrimination or calibration, including retraining of the model with sufficient data, if appropriate. While it is the responsibility of the ML model developer to ensure appropriate discrimination and calibration, laboratories may perform local site calibration, model updating, and subsequent verification and/or validation to ensure patient testing is accurate.118,119
Most literature supports analyzing accuracy as a performance metric, expressed as the ratio between the number of correctly classified samples and the overall number of samples. When datasets are imbalanced (eg, an increased number of samples in one class compared to the other classes), accuracy as defined above may not be the most reliable measure, as it can overestimate the classifier's ability to discriminate between the majority class and the less prevalent class (eg, minority class). A prime example is prostate adenocarcinoma grading in surgical pathology. Gleason pattern 5 is less prevalent than Gleason pattern 3 and may be less represented in the verification and/or validation cases. Gleason pattern 5, while less common, has significant prognostic implications for patients, and a model intended to grade prostate adenocarcinoma can appear to have excellent performance if the validation cases are imbalanced and do not include samples representing the higher Gleason pattern.120,121 For highly imbalanced datasets, where one class is disproportionately underrepresented, maximizing the accuracy of a prediction tool favors optimization of specificity over sensitivity, whereby false positives are minimized at the expense of false negatives (the accuracy paradox). This has led some investigators to recommend use of alternative metrics. The index of balanced accuracy (IBA) is a performance metric for skewed class distributions that favors classifiers with better results for the positive (eg, tumor present) and generally most important class. IBA represents a trade-off between a global performance measure and an index that reflects how balanced the individual accuracies are: high values of the IBA are obtained when the accuracies of both classes are high and balanced.122 Sampling techniques for verification dataset curation can be considered, where oversampling of the minority class or undersampling of the majority class can provide a better distribution for evaluation. While the Cohen kappa, originally developed to test interrater reliability, has been used for assessing classifier accuracy in multiclass classification, there are notable limitations to its utility as a performance metric, especially when there is class imbalance.123,124 The Matthews correlation coefficient, introduced in the binary setting and later generalized to the multiclass case, is now recognized as a reference performance measure, especially for unbalanced datasets.125–127 Designing an appropriate validation study should include a diverse and inclusive representation of real-world data that the model could be exposed to at the deployment laboratory. It is the responsibility of the laboratory to ensure sufficient samples are included for verification and/or validation of a given model and that the appropriate evaluation metrics are used.
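As an illustration of the accuracy paradox, the sketch below contrasts naive accuracy with balanced accuracy and the Matthews correlation coefficient on a hypothetical imbalanced dataset; the IBA itself is not shown, as it is typically obtained from specialized imbalanced-learning libraries.

```python
# Comparing naive accuracy with metrics that are more robust to class imbalance.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

# Hypothetical evaluation set: 95 negative samples and 5 positive samples;
# the model correctly identifies only 1 of the 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 95 + [0, 0, 0, 0, 1])

print(f"Accuracy:          {accuracy_score(y_true, y_pred):.2f}")           # 0.96, misleadingly high
print(f"Balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.2f}")  # 0.60
print(f"Matthews corr.:    {matthews_corrcoef(y_true, y_pred):.2f}")        # ~0.44
```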
Evaluation Metrics for Imaging
In evaluating ML-based CDS software for computer-assisted diagnosis, free-response receiver operating characteristic (FROC) curves are an appropriate evaluation metric. The model output is deemed correct when the detection and localization of the appropriate label are accurate. For instance, a model intended to detect parasites in peripheral blood smear slides can miss a region that indeed has red blood cells with Plasmodium falciparum, and separately label a different region of normal red blood cells incorrectly as being positive for parasitic infection; the model has produced both a false positive and a false negative. If the validating laboratory considered only the slide-level label output, it would appear as if the model correctly identified the slide as having a parasitic infection (eg, true positive); however, in reality the model should be penalized for the inaccuracies. Model evaluation should therefore allow multiple false-negative or false-positive labels in a given sample.51,128 Another evaluation metric for ML models that perform image segmentation is the Dice similarity coefficient.129 Segmentation is the pixel-wise classification in an image of a given label or class. For instance, if a model is intended to identify and segment mitoses in an image, given a verification dataset that has an expert pathologist's manual annotation of mitoses, the model's performance can be evaluated by measuring the degree of overlap (eg, Dice similarity coefficient or intersection over union) between the model segmentation output and the human expert manual annotation.130
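A minimal sketch of these overlap metrics is shown below, assuming binary prediction and ground truth masks; the example masks are hypothetical.

```python
# Pixel-wise overlap metrics for a segmentation output versus an expert annotation mask.
import numpy as np

def dice_coefficient(pred_mask: np.ndarray, truth_mask: np.ndarray) -> float:
    """Dice = 2|A intersect B| / (|A| + |B|) for binary masks."""
    intersection = np.logical_and(pred_mask, truth_mask).sum()
    return 2.0 * intersection / (pred_mask.sum() + truth_mask.sum())

def intersection_over_union(pred_mask: np.ndarray, truth_mask: np.ndarray) -> float:
    """IoU = |A intersect B| / |A union B| for binary masks."""
    intersection = np.logical_and(pred_mask, truth_mask).sum()
    union = np.logical_or(pred_mask, truth_mask).sum()
    return intersection / union

# Hypothetical 4x4 binary masks (1 = pixels annotated/predicted as mitosis)
truth = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
pred = np.array([[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
print(f"Dice: {dice_coefficient(pred, truth):.2f}, IoU: {intersection_over_union(pred, truth):.2f}")
```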
The ideal metrics used for evaluating the performance of an ML model are guided by the data type of the generated output (eg, continuous values of a quantitative measurement generated by a regression model versus categorical values of a qualitative evaluation generated by a classification model). For outputs on an interval or ratio scale, such as the proportion (percentage) of tumor-infiltrating lymphocytes, quantitative evaluation aims to estimate systematic and proportional bias between the test method and a reference method (eg, human reader). Qualitative evaluation applies to outputs on a nominal or categorical scale (eg, presence or absence of tumor), as well as to clinical diagnostic thresholds or cutoffs applied to interval or ratio outputs (eg, the probabilistic output of a logistic regression model). The data outputs of an ML model can be either nominal or ordinal in a discriminative classification task, where the output is a specified label. In a discriminative regression task, the outputs are on an interval or ratio scale, and the output value is a number that can be a continuous variable (eg, time to recurrence) or a ratio (eg, serum free light chains [κ/λ]).
Classification Test Statistics
To evaluate ML models with classification outputs, arguably the best-known evaluation framework is the confusion matrix for binary classification outputs (disease versus no disease), from which sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and receiver operating characteristic (ROC) curve calculations can be computed (Figure 3). Although the final output of an ML-based classification model may be binary (eg, tumor versus no tumor), it is typically derived from a preceding continuous variable (eg, a probability ranging from 0 to 1) to which a cut-point threshold is applied to convert it into a binary result. The sensitivity and specificity of the model vary depending on where the operating point is set (eg, higher threshold = decreased sensitivity and increased specificity; lower threshold = increased sensitivity and decreased specificity). Sensitivity (also referred to as “true positive rate” or “recall”) is the ratio of true positive predictions to all positive cases and represents a method’s ability to correctly identify positive cases. Specificity is the ratio of true negative predictions to all negative cases and represents a method’s ability to correctly identify negative cases. Positive predictive value and negative predictive value are directly related to prevalence. Positive predictive value (also referred to as precision) is the probability that, following a positive test result, an individual will truly have the disease or condition, and is the ratio of true positive predictions to all positive predictions. Negative predictive value is the probability that, following a negative test result, an individual will truly not have the disease/condition, and is the ratio of true negative predictions to all negative predictions. The overall or naive accuracy represents the proportion of all cases that were correctly identified. ROC curves are graphical representations plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at different classification thresholds. The ROC curve illustrates the diagnostic ability of a binary classifier system as its discrimination threshold or operating point is varied. The area under the ROC curve (AUROC) is a single scalar measure of the overall performance of a binary classifier across all possible thresholds, ranging from 0 (no accuracy) to 1 (perfect accuracy). A model with an AUROC of 1 (perfect accuracy) does not necessarily portend high performance of the ML model in clinical practice; the AUROC weights sensitivity and specificity equally. A model intended for screening, for example, may have a lower AUROC value and still maintain high sensitivity despite having lower specificity. Given that a specific operating point is needed when iterating and versioning an ML model, the sensitivity and specificity values at the selected operating-point threshold should be considered when measuring the model’s performance. The F-score (eg, F1-score) is a combined single metric (eg, the harmonic mean) of the sensitivity (eg, recall) and positive predictive value (eg, precision) of a model. For multiclass or multilabel classification problems, pairwise AUROC analyses are needed because a single ROC curve is defined only for binary outputs. For multiclass instances, one method compares an arbitrary class against the collective combination of all other classes at the same time.
For instance, one class is considered the positive class, and all other classes are collectively considered the negative class. This enables a “binary” classification using AUROC via the “one versus rest” technique. Another evaluation approach for multiclass models, termed “one versus one,” analyzes the discrimination between each pair of individual classes, with all permutations of individual classes compared against each other. In a 3-class output (eg, tumor grading of low, moderate, and high), there would be 6 separate one-versus-one comparisons. In both one versus rest and one versus one, averaging the AUC across classes (or class pairs) yields the final model performance. Generally accepted AUROC values less than 0.6 represent poor discrimination, where 0.5 is due to chance (eg, random assignment). AUROC values of 0.6 to 0.75 can be classified as helpful discrimination, and values greater than 0.75 as clearly useful discrimination.131,132
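The short sketch below (hypothetical labels and predicted probabilities; scikit-learn’s roc_auc_score is assumed for the one-versus-rest and one-versus-one macro averaging) illustrates how a fixed operating point yields sensitivity and specificity for a binary classifier and how multiclass AUROC is averaged.

# Hypothetical slide-level evaluation: binary tumor detection and a 3-class grader.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix

# Binary task: tumor (1) versus no tumor (0), with model probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.65, 0.2, 0.9, 0.55, 0.3])
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print("binary AUROC:", roc_auc_score(y_true, y_prob))

# Applying a fixed operating point converts probabilities to labels;
# sensitivity and specificity then follow from the confusion matrix.
y_pred = (y_prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))

# 3-class task (eg, low / moderate / high grade) with per-class probabilities
y_true3 = np.array([0, 1, 2, 1, 0, 2, 2, 1])
y_prob3 = np.array([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7],
                    [0.3, 0.5, 0.2], [0.6, 0.3, 0.1], [0.2, 0.3, 0.5],
                    [0.1, 0.1, 0.8], [0.4, 0.4, 0.2]])
print("one-vs-rest AUROC:", roc_auc_score(y_true3, y_prob3, multi_class="ovr"))
print("one-vs-one AUROC:", roc_auc_score(y_true3, y_prob3, multi_class="ovo"))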
Positive percent agreement and negative percent agreement are the preferred terms for sensitivity and specificity, respectively, when a nonreference standard—such as subjective expert assessment—is used to determine the “ground truth.” In other words, this approach measures how many positives/negatives a test identifies that are in agreement with another method or rater used on the same samples.133
Standalone verification of the model should include analysis of discordant cases (false negatives, false positives) to ensure the model outputs are checked against the reference standard. Discordance analysis should be performed for all false negatives and all false positives. Additionally, a sufficient number of true positives and true negatives should also be evaluated to mitigate selection bias. For clustering models, this includes evaluating the data points farthest from their assigned cluster centroids as well as a sufficient number of centrally clustered data points. Discordance analysis enables adjudication of false negatives and false positives that may have been labeled inappropriately during the ground truth process, or of cases counted as true positives in the positive class that are, in fact, false positives. The discordance analysis will result in final true positive, true negative, false positive, and false negative rates for verification and/or validation documentation. Discordance analysis should be performed, documented, and available for review.
Sample Size Estimations for Binary Classification
Since the laboratory is solely interested in knowing whether the locally determined accuracy is significantly lower than the manufacturer’s claimed accuracy, a one-sided Zα score of 1.645 is used instead of the 2-sided Zα/2 of 1.96 at an α level of 0.05, and Zβ = 0.84 at β = 0.20. Due to the properties of the binomial distribution, as the accuracy claim decreases, the sample size increases for a given tolerance level or deviation from the claimed accuracy.
Although sample size calculations for accuracy do not depend on the prevalence of the entity of interest in the population sampled, sample sizes for verifying sensitivity and specificity do. Sample sizes for sensitivity and specificity can also be derived using Equation 1; for sensitivity, the calculated estimate represents the total number of positive samples required, and for specificity, the calculated estimate represents the total number of negative samples required. For example, using Equation 2 above and targeting a prevalence of 50% in the accuracy evaluation cohort, one would require at least 75 positive samples and at least 75 negative samples to have 95% confidence and 80% power to detect a difference of 7.5% or more lower than the claimed sensitivity and specificity of 97.5%: that is, 150 cases in total. An imbalanced sample with a prevalence of 20% (20% positive, 80% negative) would require at least 92 positive cases and a total sample size of 460 to verify the sensitivity claim; the sample size for verification of the specificity claim would be 115 to ensure at least 92 negative cases.
Assuming an α level of 0.05 and β = 0.20, a sample size of at least 150 (75 positive and 75 negative cases) should be considered sufficient for verification of binary classification tasks, when the accuracy, sensitivity, and/or specificity claim is at least 95% or higher.
It should be noted that these are naïve assumptions and the actual number needed may depend on additional considerations for determining the estimated sample size for establishing diagnostic accuracy. These include, but are not limited to, study design (single reader versus multiple readers), the number of diagnostic methods being evaluated (single versus 2 or more), how cases will be identified (retrospective versus prospective), the estimated prevalence of the diagnosis of interest in the population being sampled (prospective design) or the actual prevalence in the study cohort (retrospective design), the summary measure of accuracy that will be employed, the conjectured accuracy of the diagnostic method(s), and regulatory clearance of the model.
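For illustration only, the sketch below codes a generic one-sided, one-sample proportion sample-size formula (normal approximation) using the Zα and Zβ values quoted above; it is an assumed stand-in for, not a reproduction of, the document’s Equation 1 or Equation 2, so its output will not necessarily match the case counts cited in this section. The prevalence scaling shown reflects the point above that a prospective total sample equals the required class-specific count divided by the expected prevalence.

# Illustrative sketch only: generic one-sided sample-size formula for testing a
# claimed proportion (normal approximation); an assumption of this example, not
# the white paper's Equation 1 or 2.
import math

def n_for_proportion_claim(p_claim, p_unacceptable, z_alpha=1.645, z_beta=0.84):
    """Cases of the relevant class needed to detect, at the stated alpha and power,
    that the true proportion is p_unacceptable rather than the claimed p_claim."""
    num = (z_alpha * math.sqrt(p_claim * (1 - p_claim))
           + z_beta * math.sqrt(p_unacceptable * (1 - p_unacceptable))) ** 2
    return math.ceil(num / (p_claim - p_unacceptable) ** 2)

n_pos = n_for_proportion_claim(0.975, 0.90)  # claimed sensitivity 97.5%, 7.5% tolerance
print(n_pos)                                  # positive cases needed for the sensitivity claim
print(math.ceil(n_pos / 0.20))                # total prospective sample at 20% prevalence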
Regression Model Test Statistics
For regression models, the outputs are measured on an interval or ratio scale rather than categorically. The purpose of the quantitative accuracy study is to determine whether there are systematic differences between the test method and reference method. This type of evaluation typically uses a form of regression analysis to estimate the slope and intercept with their associated error(s), and the coefficient of determination (R²), where applicable. Regression analysis is a technique that can measure the relation between 2 or more variables with the goal of estimating the value of 1 variable as a function of 1 or more other variables. Data analysis proceeds in a stepwise fashion, first using exploratory data analysis techniques, such as scatterplots, histograms, and difference plots, to visually identify any systematic errors or outliers.134 Specific tests for outliers such as a model-based, distance-based test, or influence-based test may also be performed.135
The purpose of the method comparison study for quantitative methods is to estimate constant and/or proportional systematic errors. There are 2 recognized approaches for method comparison of quantitative methods: Bland-Altman analysis and regression analysis. In Bland-Altman analysis, the absolute or relative difference between the test and comparator methods is plotted on the y-axis versus the average of the results by the 2 methods on the x-axis.136,137 Since the true value of a sample is unknown in many cases, except when a gold standard or reference method exists, using the average of the test and comparator results as the estimate of the true value is usually recommended. While visual assessment of the difference plot on its own does not provide sufficient information about the systematic error of the test method, t test statistics can be utilized to make a quantitative estimate of systematic error; however, this bias estimate is reliable only at the mean of the data if there is any proportional error present.138–140
Regression techniques are now generally preferred over t test statistics for calculating the systematic error at any decision level, as well as getting estimates of the proportional and constant components. However, it is important to note that the slope and intercept estimates derived from regression can be affected by lack of linearity, presence of outliers, and a narrow range of test results; choosing the right regression method and ensuring that basic assumptions are met remain paramount. If there is a constant standard deviation across the measuring interval, ordinary least squares (OLS) regression and/or constant standard deviation Deming regression can be used. If instead, the data exhibit proportional difference variability, then the assumptions for either OLS or constant standard deviation Deming regression are not met and instead, a constant coefficient of variation Deming regression should be performed. If there is mixed variability (standard deviation and coefficient of variation), then Passing-Bablok regression, a robust nonparametric method that is insensitive to the error distribution and data outliers, should be used. An assumption in OLS regression is that the reference method values are measured without error and that any difference between reference method and test method values is assignable to error in the test method; this assumption is seldom valid for clinical laboratory data unless a defined reference method or standard exists and, therefore, OLS regression is not appropriate in most cases for performance evaluation of quantitative outputs. In Deming regression, the errors between methods are assigned to both reference and test methods in proportion to the variances of the methods141 (Figure 4, A through C). Two special situations arise when the measurand is either a count variable (eg, inpatient length of stay in days) or a percentage bounded by the open interval (0, 1). For the former, Poisson or negative binomial regression techniques should be used depending on the dispersion of the data and for the latter, β regression techniques should be considered. Statistical techniques are also available for evaluation of models comparing ordinal to continuous variables, such as ordinal regression, but these are beyond the scope of this paper.
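As a minimal sketch (hypothetical paired measurements; ordinary least squares is shown only for the exploratory step, with Deming and Passing-Bablok regression left to dedicated method-comparison software), Bland-Altman bias with limits of agreement and an exploratory OLS fit can be computed as follows.

# Hypothetical paired results from a test method and a comparator method.
import numpy as np
from scipy import stats

test = np.array([10.2, 15.4, 19.8, 25.3, 30.9, 34.6, 40.5, 45.1])
comparator = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0])

# Bland-Altman analysis: mean bias and 95% limits of agreement of the differences
diff = test - comparator
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)
print("bias:", round(bias, 2),
      "limits of agreement:", (round(bias - loa, 2), round(bias + loa, 2)))

# Exploratory OLS regression of test on comparator (assumes an error-free comparator;
# otherwise Deming or Passing-Bablok regression is preferred, as noted above)
fit = stats.linregress(comparator, test)
print("slope:", round(fit.slope, 3), "intercept:", round(fit.intercept, 3),
      "R^2:", round(fit.rvalue ** 2, 3))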
Precision: Repeatability and Reproducibility
Precision refers to how closely test results obtained by replicate measurements under the same or similar conditions agree.142 Precision studies for ML models can be conducted similarly to test methodology evaluations in clinical laboratories; approaches for precision evaluation have been extensively described by the Clinical and Laboratory Standards Institute (CLSI). Repeatability reflects the variability among replicate measurements of a sample under experimental conditions held as constant as possible, that is, using the same operator(s), same instrument, same operating conditions, and same location over a short period of time (eg, typically considered within a day or a single run). Within-laboratory precision (an “intermediate” precision type) incorporates run-to-run and day-to-day sources of variation using a single instrument; these precision types may also include operator-to-operator variability and calibration cycle-to-cycle variability for procedures that require frequent calibration. As such, repeatability and within-laboratory precision are evaluated via a single-site study. Sometimes confused with “within-laboratory precision,” reproducibility (or between-laboratory reproducibility) refers to the precision between the results obtained at different laboratories, such as multiple laboratories operating under the same CLIA license. Reproducibility, while not always needed for a single-laboratory precision evaluation, may be beneficial when a model is being deployed at more than one site.
Evaluation of repeatability and within-laboratory precision and reproducibility can be conducted as separate experiments (“simple precision” design). For the simple repeatability study, samples are tested in multiple replicates within a single day by 1 operator and 1 instrument under controlled conditions. For simple within-laboratory precision studies, the same samples are run once a day for several days within the same site, varying the instrument and/or operator as appropriate and per standard operating procedure. In a complex precision design, a single experiment is performed with replicate measurements for several days and across several instruments or operators, where applicable. Detailed statistical techniques for analyzing complex precision studies are described elsewhere (for quantitative studies using nested repeated-measures ANOVA [regression outputs], see EP05–Evaluation of Precision of Quantitative Measurement Procedures; for qualitative studies [classification outputs], see ISO 16140).143,144 These techniques provide estimates for repeatability and between-run and between-day precision, as well as within-laboratory precision or reproducibility depending on the study design. They allow a laboratory to rigorously establish the precision profile of a modified FDA-approved or laboratory-developed model or verify the precision claims of a manufacturer for an unmodified FDA-approved model.
For an unmodified FDA-authorized model, the objective is to verify the manufacturer’s precision claim, such as the proportion of correct replicates and/or the proportion of samples that have 100% concordance across replicates or days in the manufacturer’s precision experiment. For example, in a repeatability experiment, the manufacturer processes 1 whole slide image each from 35 tumor biopsies and 36 benign biopsies in 3 replicates using 1 scanner or operator in the same analytical run. For biopsies containing tumor, 99.0% (104 of 105) (95% CI: 94.8%–99.8%) of all scans and 97.1% (34 of 35) of all slides produced correct results, while for benign biopsies, 94.4% (102 of 108) (95% CI: 88.4%–97.4%) of all scans and 88.9% (32 of 36) of all slides produced correct results. An acceptable number of replicates for the repeatability study can be derived from Equation 1, but 60 replicates per model class should be sufficient in most cases allowing for a 10% tolerance below the manufacturer’s stated precision claim. Therefore, a user can verify the manufacturer’s stated repeatability claim by analyzing a set of slides from tumor biopsies and benign biopsies (n = 10 each) obtained at their local institution in replicates (n = 6) using 1 scanner or operator in the same analytical run.
If the observed precision profile meets or exceeds the manufacturer’s claims, the precision study can be accepted. For precision studies where most of the ML model outputs pass but some fail, the medical director should review the study design and results to evaluate potential confounding causes of the imprecision. Failure to meet the stated claims does not mean that the observed precision is necessarily worse than the manufacturer’s claim, but it should prompt a 1-sample or 2-sample test of proportions to compare the observed precision profile to that described by the manufacturer. Alternatively, the laboratory may choose to qualitatively evaluate the precision profile of the model. The precision study may still be accepted if the laboratory director deems it appropriate; written justification and rationale for accepting the precision performance will be needed. Precision study results where all samples show 100% concordance across replicates, or where no more than 1 sample shows discordance, should be acceptable for most use cases. When discordance across replicates, runs, days, instruments, or operators is observed, the medical director should review the study design and results to evaluate confounding variables as a potential cause of the imprecision and may decide to repeat samples or runs, or the entire precision study if appropriate. In this scenario, close ongoing monitoring of the ML model is advised. For precision studies where many results fail or show a high level of imprecision, the deployment site may reject the ML model for use as a clinical test, troubleshoot laboratory influences affecting the output results, or design a new study to better evaluate precision of the model.144,145
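A minimal sketch of such a comparison is shown below (the counts are hypothetical, and the exact binomial test is one reasonable choice of a 1-sample test; the laboratory’s statistical plan and the medical director’s judgment govern acceptability).

# Hypothetical repeatability verification: 57 of 60 local replicates concordant,
# compared against a manufacturer's stated 99% per-replicate concordance claim.
from scipy.stats import binomtest

claimed = 0.99
observed_correct, total_replicates = 57, 60

result = binomtest(observed_correct, total_replicates, claimed, alternative="less")
print("observed proportion:", round(observed_correct / total_replicates, 3))
print("one-sided P value vs claim:", round(result.pvalue, 4))
# A small P value suggests the local precision truly falls below the claim rather
# than reflecting sampling variability; the study design should still be reviewed.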
CHANGE CONTROL AND CHANGE MANAGEMENT
Change control for ML-based testing in the laboratory is defined as the process by which changes to requirements for laboratory processes are managed. Change management is closely related and is defined as the process of adjusting to changes for a new implementation. The goal of change management is to control the life cycle of any modifications. Change control and change management constitute the fourth pillar of process management as defined by the CLSI, which describes the pillars as (1) analyzing and documenting laboratory processes, (2) verifying or validating a laboratory process, (3) implementing controls for monitoring process performance, and (4) change control and change management for approved laboratory processes.146 Change control encompasses alterations to preanalytical, analytical, and postanalytical variables. In the context of ML models in pathology, most alterations can be divided into changes in physical components (eg, reagents such as stains, specimen collection and transportation parameters, fixation times, specimen types, or instruments such as whole slide scanners, nucleic acid sequencers, chemistry and coagulation analyzers) and digital dependencies (eg, data inputs, algorithm code, file formats, resolution, software interoperability). Laboratory management should have a structured process for assessing changes in laboratory processes that involve or impact ML models deployed in the laboratory. Change management can become necessary for a variety of reasons, including identification of a problem in the quality control process, an external or internal need to update instruments, equipment, or reagents, or a response to the changing needs of the end user (Table 5).
The purpose of creating a change management system is to ensure that changes to the system do not adversely affect the safety and quality of laboratory testing. For laboratory tests that employ ML systems, the ML model performance has a critical impact on the overall test performance and patient safety. Any changes to a previously verified and validated model should require performance characterization of the updated model. In addition to evaluating how changes may affect clinical functionality or performance, model changes can also be interpreted through the lens of risk assessment, with proposed changes stratified by risk. Modifications should be evaluated with particular attention to the potential for changes that may introduce a new risk or modify an existing risk that could result in significant harm, with controls in place to prevent harm. In addition, changes to the model or processes that the model is dependent on need to be transparent and communicated to all personnel and stakeholders. A successful change management strategy should reduce the failed changes and disruptions to service, meet regulatory requirements through auditable evidence of change management, and identify opportunities for improvement in the change process. Documentation for change control depends on the type of change, the scope and anticipated impact of the change, and the risk associated with the change. The documentation and verification should be completed before the change is clinically implemented (examples of documentation to support the change management process are listed in Table 6). CLIA regulations for proficiency testing require periodic reevaluation (eg, twice annually), even if there are no changes, to verify tests or assess accuracy.
One of the main challenges with ML-based systems is that they can have many dependencies that may not be readily apparent to everyone in the organization. It is recommended to identify all possible dependencies during the initial implementation and to update this documentation when process changes are made. Having the process team participate in the preparation and proposed modification of a flow diagram can help ensure inclusion of the steps in the upstream and downstream workflows, especially when data inputs or processes are supplied by different personnel or different divisions in a laboratory. Implementation of ML-based CDS systems with dependencies across the laboratory means that the wider laboratory (eg, the histology laboratory, molecular pathology laboratory, chemistry laboratory, etc) should consider whether a new or revised process will affect any other processes, including any deployed ML-based CDS systems. If so, a timeline needs to be established for the wider systemic review and revision of the associated process flows. Team members from each of the processes impacted by a new or changed process should be included and should verify the updated flow chart. Once all documents associated with the revised process are prepared, validation and/or verification is needed to determine the performance characteristics of the revised test. With any revision of the model, the model version should be documented for auditing and tracking of performance. Verification is also performed when a laboratory intends to use a validated process in a similar manner for a different application. For example, consider a scenario where a laboratory has a validated ML-based CDS system on a specific whole slide scanning device for primary diagnosis. The laboratory now intends to install a new whole slide scanner from a different vendor and wants to use the same ML model with the new scanner. Reassessing the performance characteristics of the ML-based system is recommended after any model parameters or upstream components have changed, and the assessment should evaluate the final configuration as it would be clinically deployed. Recent draft guidance from the FDA entitled “Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence/Machine Learning (AI/ML)-Enabled Device Software Functions” speaks to these concepts.147 Ideally, predetermined change control plans should delineate manufacturer versus local medical director responsibilities relative to validation and verification. For the medical director, the predetermined change control plan should include both the verification process and a set of target verification metrics that have a clear relationship to clinical performance and are tailored to what is practical for local sites, including data volume and number of cases.
Deployment Site Ongoing Monitoring
After a model has been duly verified and/or validated and appropriate instructions for use (eg, a standard operating procedure) are in place, it is the responsibility of the laboratory to ensure the model performance is monitored for stability and reliability over time. Changes in the input data may result in an alteration of the model performance. The time it takes for model performance to deteriorate is determined by factors pertaining to the upstream and downstream processes or data that feed the model. Model performance may shift (eg, sudden loss of performance), drift (eg, gradual loss of performance), or show cyclical or recurring change depending on these influences. Shift may be obvious to the laboratory; for example, after an update in the laboratory information system in which the units of a specific input value have changed, the predictions of an ML model using that numerical data may be significantly different and show inferior performance. In this scenario, the results of any test including an ML system with shift should not be reported until the updated system undergoes verification as part of change management, with laboratory documentation updated as necessary. Certain changes causing shift or drift may require retraining or calibration of a new ML model and subsequent verification and/or validation of the updated model. Drift can be more challenging to identify, especially for very small incremental changes over time; it usually refers to changes in the distribution of the data with which the model was originally trained and for which it was validated. Drift may occur due to changes in patient demographics, updated practice patterns, or new workflows.105,112,148–150 The loss in performance may manifest as a loss of discrimination or calibration; however, depending on the change in the data, there may be loss of calibration without a significant change in discrimination, or vice versa.151–153 Changes in healthcare can introduce transformed patient populations, specimen collection, data inputs, and clinical workflows.154–160 Vigilance in laboratory testing and performance monitoring is crucial to patient safety, medical decision making, and good laboratory practice. When the performance of a verified and/or validated model has deteriorated (eg, shift, drift), clinical testing using the ML system should be halted until the performance characteristics have been remediated. Performance monitoring techniques such as calibration drift detection have been suggested and will likely play an important role in ongoing monitoring of clinically deployed ML models.161 Additional ongoing monitoring of model stability includes recognition of adversarial attacks on ML models, predominantly for image classification.162 These attacks apply adversarial perturbations (eg, skew vectors) to the medical image that are indiscernible to the human eye but may cause drastic changes in model performance.163 Appropriately diverse training data, decreasing the disparity between the development and target data, and the choice of model architecture have been shown to minimize the effect of adversarial attacks on ML model performance.164,165
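One simple, illustrative monitoring approach (an assumption of this example, not a technique prescribed by this document) is to track a population stability index comparing the distribution of a model input or output score in the current monitoring window against the verification baseline.

# Illustrative drift monitor: population stability index (PSI) between baseline
# and current score distributions. Thresholds such as PSI > 0.2 are rules of thumb.
import numpy as np

def population_stability_index(baseline, current, bins=10, eps=1e-6):
    edges = np.histogram_bin_edges(baseline, bins=bins)
    current = np.clip(current, edges[0], edges[-1])   # keep out-of-range values in end bins
    base_frac = np.histogram(baseline, bins=edges)[0] / len(baseline) + eps
    curr_frac = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

rng = np.random.default_rng(0)
baseline_scores = rng.normal(0.30, 0.10, 5000)  # model scores captured at verification
current_scores = rng.normal(0.38, 0.10, 1000)   # scores from the latest monitoring window
print("PSI:", round(population_stability_index(baseline_scores, current_scores), 3))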
HUMAN-AI (COMPUTER) INTERACTION
Often the primary method for evaluating the performance of ML-based software is a standalone or in silico evaluation of the ML-based CDS system’s output compared to a reference standard (eg, ground truth) on a series of samples that are not part of the system’s training or tuning set (ie, on which the model was not trained). As described above, these evaluations can provide useful overall statistical metrics (ie, sensitivity, specificity, positive predictive value, correlation, and mean square error analysis). Validation performance evaluations should also inform the user about the augmented workflow, that is, the pathologist’s use of the ML model during routine clinical work. While analytical validity is critical to establishing benchmark model performance, as augmented intelligence, a human interpreter (eg, pathologist) is still required to finalize and authenticate a patient’s report. Standalone (eg, in silico) evaluations fall short of capturing how the human ingests, interprets, and incorporates the results of the ML model when evaluating a sample. Just as there are disruptive clinical distractions, the ML-based CDS system itself may introduce disruptive, time-consuming visualizations, workflows, or false results that add to the mental demand such systems place on users in the medical domain. The effect of this on patient care has been shown to be significant in several studies.166,167 Moreover, user certainty (eg, digital confidence), uncertainty (eg, digital hesitancy), and comfort with the interface may likewise have a profound effect on issuing a timely and accurate result.168–170 Implicit in this evaluation is the impact of task-related stress associated with the human-computer interaction.171 Presentation and interaction complexity, display features, explainability of the model outputs, and information density may influence user-experienced stress and, hence, also reliability and adoption.
Studies have examined the human-computer interaction, capturing various metrics such as improvement in accuracy, patient outcomes, and productivity. In surgical pathology, this has been most studied for ML-based CDS systems related to prostate cancer diagnosis (detection, grading, or assessment of perineural invasion). The use of these systems has changed various aspects of diagnostic interpretation, although few studies assess CDS utility in a setting that mimics diagnostic practice.52,118,164 Generally, improvements are seen with the use of these highly accurate CDS systems. Optimization of the human-computer interaction is essential to ensuring that the user and the ML model show optimal performance when used together and exceed the performance of either the ML model or the user alone. Optimal performance of the human-AI interface can be evaluated in terms of accuracy and productivity. Central to this optimization are themes including trust, understanding the model, explainability, and usability (eg, user interface, visualizations).
Trust
Rigorous evaluation of ML-based systems is needed to build trust or to provide evidence not to trust the evaluated system. Regardless, ML systems can be met with skepticism from users.172 Distrust may stem from a lack of understanding, feeling threatened by the technology, or perhaps a feeling that automation may negatively impact quality of care. In contrast, excess trust in CDS systems can lead to overreliance on the technology and user complacency. User complacency might result in a convergence of the human’s performance characteristics with those of the ML system. This consequence is arguably desirable if an ML model outperforms the human for a given task; however, it is rightfully worrisome until proven otherwise. Instead, an approach of vigilance is recommended to support assistive workflows. The user (eg, pathologist) can correct or adjust the model output if incorrect or accept the output if deemed correct. For a model with high performance, this supervision may result in the most optimal result for the patient in that the combined benefits of human and machine will be incorporated into the final diagnostic assessment. Trust and, ultimately, confidence will result from a high-performance system (eg, minimal errors). The key to achieving this level of human vigilance relies on the user understanding the strengths and weaknesses of the system.
Understanding the Model
Both diagnosticians and the CDS systems designed to aid them are imperfect. Thoughtfully designed standalone studies of ML models, with involvement of the diagnosticians intended to use them, can provide invaluable information as to how human weaknesses or strengths can be complemented by the tool. For example, an ML model designed to detect invasive prostatic acinar carcinoma might show excellent standalone overall performance, but when tested against mimics of carcinoma, mimics of benign prostatic tissue, or other diagnostically challenging scenarios, the standalone performance might deteriorate. These situations may be less frequent; however, they represent the true diversity of conditions in which a human pathologist could mitigate potential error introduced by the ML model. Outside of biologically relevant situations that pose a challenge for a pathologist, the tool should be evaluated in scenarios that expose possible bias or errors, such as across a variety of ages, races, and ethnicities, and on sample sets with a range of quality.173,174 Additionally, the model developer may have selected an operating point to support a screening task workflow, and thus the output predictions may have a perceived high false positive rate. Knowing the model’s operating point and intended use allows the user to further understand the strengths and weaknesses of the ML system.
Explainability
The complexity of an ML model can impact the ability of humans to understand how it works and hinder the ability to predict the conditions in which the model may fail (eg, vulnerabilities to preanalytical variables or artifacts). The complexity of the model should be considered in terms of the number of parameters of the model and the extent to which nonlinear relationships are being modeled. For models that have highly nonlinear or multivariate properties, it can be challenging to relate the model’s behavior to the corresponding attributes that humans have studied in the medical domain. Explainability for ML systems is critical to the safety, approval, and acceptance of CDS systems.
Many ML models learn low-level features, and interrogating those models with higher-level interpretability methods can provide insights that humans can understand regarding how the model output was produced. Model transparency does not necessarily mean knowing the data or the features used to train the model, but rather visibility into the model itself and the relationships between the model and its outputs. ML systems may be considered a “black box” in which the model’s internal workings are not directly interpretable and cannot be communicated to the pathologist in an understandable way. For example, a model that uses a deep convolutional neural network to estimate the percentage of nuclear immunohistochemical DAB stain positivity may be too complex for the pathologist to thoroughly understand the mechanism by which it predicts an image to be “positive,” but such a model may also perform better than an image analysis-based tool that explicitly segments cells and measures pixel intensity on a cell-by-cell basis. Failures of the image analysis software due to poor staining or crush artifacts can be more easily understood and remediated, whereas failures of the deep network may have a less obvious solution. Explainability approaches for ML models include (1) returning exemplars to the pathologist to evaluate visually (eg, saliency maps) or (2) quantifying image attributes (eg, number of cells in an image patch) and mathematically relating such measurements to the output of the model. For fair interpretability, Shapley values, adapted from coalitional game theory, attribute to each feature its contribution to the difference between the actual prediction and the average prediction, given all features learned in the model. Newer tools to estimate Shapley values, such as Shapley additive explanation (SHAP) values, are model agnostic (eg, classification and regression), openly supported, and can easily interrogate an existing pretrained model. SHAP provides local accuracy (eg, the feature attributions sum to the difference between the expected ML model output and the output for a given instance), missingness (eg, missing features receive no attribution), and consistency (eg, if an ML model changes such that a feature’s marginal contribution increases or stays the same, its Shapley value should not decrease). Shapley values can provide meaningful information showing the effect of each feature on the ML model output for the prediction of that instance175,176 (Figure 5, A and B). Other approaches for model explainability, such as individual conditional expectation, local interpretable model-agnostic explanations, testing with concept activation vectors, or counterfactual explanations, offer varying fidelity based on the use case.177–179
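As a hedged sketch (the open-source shap package, a scikit-learn random forest, and synthetic tabular features are all assumptions of this example rather than tools endorsed by this document), SHAP attributions for a pretrained classifier can be obtained as follows.

# Hypothetical example: per-feature SHAP attributions for a tree-ensemble classifier
# trained on synthetic tabular data standing in for laboratory-derived features.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # tree-specific SHAP estimator
shap_values = explainer.shap_values(X[:50])  # attributions for the first 50 cases

# Each attribution reflects how much a feature pushed a given prediction away from
# the average model output; summary visualizations (eg, shap.summary_plot) can then
# be reviewed alongside the exemplar- or saliency-based explanations described above.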
There is also controversy surrounding the definitive need to make ML models humanly interpretable, as it is conceivable that some image features utilized by the model are intrinsically not interpretable by humans, and if they were, there would not be a need to use complex models in the first place. These analyses can be useful companions to ML performance metrics as they can help a pathologist surmise how the model works. However, these methods are not without limitations. Providing examples to the pathologist relies on a qualitative human assessment that makes assumptions from observed correlations but does not establish causality. Further work in explainable AI is needed to help bridge the gap between human interpretation and model behavior.43,180,181 In addition, user interfaces and visualization methods designed to present the results of such analyses to pathologists in an intuitive fashion must continue to be developed.
User Interface and Model Output Visualization
Most end users will not directly interact with the ML model in isolation, but rather with a graphical user interface or output visualization. An intuitive user interface is integral to evaluating an ML model output effectively and efficiently. Interface design is critical to minimize improper use of the CDS system.182 Careful consideration of user-required inputs and methods helps ensure both ergonomic and psychometric appropriateness. User interface and data presentation are also crucial components of efficient evaluation of ML model outputs. For instance, for detection of low-grade squamous intraepithelial lesions in a liquid-based cytology specimen, it may be more helpful to have a gallery of suspicious cells for ease of review than a binary slide-level label that does not provide the user with pixel-wise awareness of where the cells of interest are within the image. Similarly, tumor purity quantification of a molecular extraction specimen could provide numerical counts for each cell class, or highlight (eg, segment) the cells of each class for the user to view and assess. There are various visualizations to direct the user’s attention to specific representations of the model outputs (eg, arrows, crosshairs, heat maps). Additional studies reviewing the user interface/user experience of such tools are important to distinguish user perception from design principles. For instance, a large blinking red arrow around a region of interest may influence users to overcall a diagnosis compared with a static neutral-colored outline. The visualization type, if applicable, should be documented in the operating procedures. Not all model outputs will rely on specified visualization tools; clinical verification and/or validation should document whether human-AI interaction will be required to interpret the results. Ideally, the ML developer will have studied the tool in a setting that examines the human-AI interaction and, through user-centered design principles, will provide the most optimal user interface and visualization to support the use case of the model (Figure 6, A through F).
PERSONNEL TRAINING AND COMPETENCY EVALUATION
The training plan for an ML-based CDS system should include all professionals and supporting personnel who will interact with the system. This may include administrative, technical, and professional personnel. The scope of training provided should be tailored to the level of responsibility, delegated tasks, and amount of interaction anticipated for a given trainee. In general, the closest attention should be given to those responsible for case selection, dataset or input field selection, and the decision to use or reject the model outputs. Clerical or supporting technical personnel should be aware of inputs that may impact model outputs (eg, specimen source driving analysis by a specific ML model), particularly if such data are not automatically transferred from the laboratory information systems.
Verification and/or validation datasets should include sufficient variation to ensure acceptable performance across the range of possible inputs or classes to the model. Additionally, personnel training materials should include case materials that fall close to critical decision thresholds as well as cases that illustrate the model failure modes (eg, image digitization artifacts, insufficient data, improper input data, case selection, use of model output, reporting). For those responsible for reporting results, clinical competency to use the tool should be demonstrated by successful completion of training cases and satisfactory concordance on a representative spectrum of competency cases. Competency assessment should replicate the procedures planned for patient cases and include cases that should be rejected for evaluation.183
Training and competency materials can be prepared and provided by the manufacturer or derived from laboratory-validated cases from the user institution’s files. If manufacturer-provided training materials are used, the deployment laboratory should ensure that its in-house procedures and outputs are comparable to the materials provided by the manufacturer, as judged by tangible quality metrics.184,185 Good laboratory practices dictate that a tool should not be placed into use prior to completion of a written, lab director–approved procedure. Additionally, the training exercises and documentation should clearly delineate how to revert to downtime procedures in the event of model drift, shift, or other error. If any part of the model is consistently producing errors, the part of the testing process where the ML model is being used should cease immediately. The performance evaluation should include scenarios present within the intended clinical environment. This should also confirm that safety elements work properly. Confirmation of acceptable failure behavior in the clinical environment should be established, with plans for failsafe procedures in the event the model cannot be used. Failure modes, including inappropriate use of inputs or errors in the model outputs, should be definitively understood and remediated prior to clinical use. Training and competency assessment records should be retained in a retrievable manner for the full period of employment or use of the tool. Standard operating procedures should reflect retraining triggers clearly; retraining and education should be performed whenever procedures change significantly, such as an expansion of case selection criteria. Documentation of training and competency assessment of appropriate users should be retained similarly to other laboratory procedures as per good laboratory practice.78,186
CONCLUSIONS
ML models have enabled newfound functionality and workflows in pathology. Significant numbers of ML models are commercially available, and organizations with computational pathology resources may also develop ML models that can be introduced into clinical practice. ML-based models in pathology include imaging-based and non–image-based methodologies. These proposed recommendations are based on the available evidence and literature, and the authors hope to encourage additional peer-reviewed literature to support adoption of these novel technologies in clinical practice (Table 7). The combination of pathologists (or other practitioners) and ML-based systems creates a paradigm of augmented intelligence, in which the pathologist is assisted by ML throughout patient testing and reporting to enhance cognitive performance and clinical decision making. High-performance ML models will facilitate specific tasks for pathologists in an assistive fashion; the role of the ML model is limited to providing competent support to the healthcare provider.
Performance evaluation of ML models is critical to verification and/or validation of ML systems intended for clinical reporting of patient samples. Appropriate evaluation metrics, matched to the type of ML model, should be used to evaluate the performance characteristics of the model prior to clinical testing. The importance of incorporating laboratory director oversight and using local data for verification and/or validation cannot be overstated. Measuring the similarity between the development dataset and the verification and/or validation dataset can help assure that performance characteristics will generalize as indicated. Changes to the model (eg, retraining with additional data, intended use/indications for use, significant changes to the user interface) should require re-verification of the revised model and documentation of the model version. Additionally, any change to the workflow used to generate the input data needs to be considered. Preanalytic variables affecting biospecimen quality may impact model performance characteristics and should be understood. Furthermore, monitoring for performance defects over time (eg, shift, drift) should be conducted to detect relevant data effects on the ML models. If a model is found to have deteriorated performance, the portion of the clinical test where the ML model is incorporated should immediately be stopped from use and remediated appropriately.
Pathology laboratories should be responsible for determining the clinical utility of the ML model prior to verifying and/or validating and for monitoring performance over time. Proper documentation of training and competency should be completed prior to clinical use of the ML-based system. As pathologists verify and validate ML models, they must learn about the scope of application, and its strengths and limitations, in preparation for a future that will incorporate powerful new tools and new management responsibilities.
The authors thank James Harrison, MD, for guidance on this concept paper and valuable commentary, Mary Kennedy and Kevin Schap for administrative support, and the College of American Pathologists committees and members who otherwise contributed to these recommendations.
References
Author notes
All authors are members from the Machine Learning Workgroup, Informatics Committee, Digital and Computational Pathology Committee, and Council on Informatics and Pathology Innovation of the College of American Pathologists, except Souers, who is an employee of the College of American Pathologists. Hanna is a consultant for PaigeAI, PathPresenter, and VolastraTX. Krishnamurthy is a consultant on the breast pathology faculty advisory board for Daiichi Sankyo, Inc. and AstraZeneca and serves as a scientific advisory board member for AstraZeneca. Krishnamurthy received an investigator-initiated sponsored research award from PathomIQ Research, sponsored research funding from IBEX Research, and an investigator-initiated sponsored research award from Caliber Inc. Raciti has stock options at Paige (<1%) and has employment and stock compensation at Janssen. Mays’ affiliation with the MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author. The other authors have no relevant financial interest in the products or companies described in this article. Hanna is now located at the Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania.