Context.—

Machine learning applications in the pathology clinical domain are emerging rapidly. As decision support systems continue to mature, laboratories will increasingly need guidance to evaluate their performance in clinical practice. Currently there are no formal guidelines to assist pathology laboratories in verification and/or validation of such systems. These recommendations are being proposed for the evaluation of machine learning systems in the clinical practice of pathology.

Objective.—

To propose recommendations for performance evaluation of in vitro diagnostic tests on patient samples that incorporate machine learning as part of the preanalytical, analytical, or postanalytical phases of the laboratory workflow. Topics described include considerations for machine learning model evaluation including risk assessment, predeployment requirements, data sourcing and curation, verification and validation, change control management, human-computer interaction, practitioner training, and competency evaluation.

Data Sources.—

An expert panel performed a review of the literature, Clinical and Laboratory Standards Institute guidance, and laboratory and government regulatory frameworks.

Conclusions.—

Review of the literature and existing documents enabled the development of proposed recommendations. This white paper pertains to performance evaluation of machine learning systems intended to be implemented for clinical patient testing. Further studies with real-world clinical data are encouraged to support these proposed recommendations. Performance evaluation of machine learning models is critical to verification and/or validation of in vitro diagnostic tests using machine learning intended for clinical practice.

Adoption of machine learning (ML) in the pathology clinical domain has gained momentum and is rapidly advancing. As ML-based clinical decision support (CDS) systems continue to mature, laboratories will increasingly need support to evaluate their performance in clinical practice. The medical domain of pathology has traditionally contributed vast nonimaging data to clinical information systems (eg, chemistry analyzers, hematology results, microbiology species, bioinformatic molecular pipelines, mass spectrometry). With pathology continuing to undergo digital transformation, a myriad of digital imaging data (eg, whole slide imaging, digital peripheral blood smears, digital plate reading, imaging mass spectrometry) are being generated by devices in pathology laboratories and used for clinical decision making.1  Image-based CDS systems in pathology have emerged for a variety of use cases, including white blood cell classification in peripheral blood smears, screening of gynecologic liquid-based cytology specimens, serum protein electrophoresis interpretation, microbiology automated culture plate evaluation, parasite classification in stool samples, and detection of prostatic carcinoma in whole slide images of tissue section specimens.2–7  However, ML-based CDS systems are not limited to imaging; they also include the analysis of nonimaging data. Use cases include evaluation of preanalytic testing errors,8,9  patient misidentification via wrong blood in tube errors,10,11  redundant analyte testing,12,13  establishing reference intervals,14  amino acid profiles/mass spectrometry,15,16  and autoverification.17,18  Additionally, electronic health records can leverage laboratory instrument data and other patient information, including multimodal data,19  to train ML models for prediction of specific patient outcomes (eg, sepsis prediction,20  acute kidney injury,21  patient length of stay,22  etc). The vision for clinical use of ML-based CDS systems is apparent; however, evidence that these systems are safe and efficacious for widespread clinical use is lacking. Patient care is a safety-critical application, where errors may have significant negative consequences. ML-based systems intended to support or guide clinical decisions require a level of assurance beyond that needed for research or other nonclinical applications.23  The use of CDS in pathology must be assured with sufficient evidence to be fit for purpose (ie, diagnostic testing), have defined performance characteristics, be integrated into clinical workflows (eg, with verification and/or validation), and undergo vigilant ongoing monitoring.

In principle, CDS using ML-based systems can be designed either for use with pathologists in the loop as a human-managed workflow or for autonomous operation that directly influences an intended use. Autonomous ML-based systems in pathology can be defined as any artificial intelligence (AI)–based methods that operate to perform any critical part of a pathologist’s clinical duties (eg, diagnostic reporting) without the agreement and oversight of a qualified pathology healthcare professional. Evidence to support clinical use of autonomous ML-based systems is lacking. Furthermore, professional societies in the field of radiology have recommended the US Food and Drug Administration (FDA) refrain from authorizing autonomous ML-based software as a medical device (SaMD) until sufficient data are available to advocate for these devices to be safe for patient care.24  As pathologists practice medicine using ML-based systems, the combined competencies of the human and the computational model are referred to as “augmented intelligence.”25,26  The American Medical Association (AMA) has adopted this term to represent the use of ML-based systems that supplement, rather than replace, human healthcare providers. The concept of augmented intelligence is based upon achieving an enhanced performance when the ML-based model is used in conjunction with the trained healthcare professional, compared with either’s individual performance alone. Examples of augmented intelligence in pathology include assistive tools that may impact pathologists’ diagnostic capabilities, such as identifying suspicious regions of interest on a slide for review, probabilistic-based protein electrophoresis interpretation, or tumor methylation–based classification to support clinical decision making.7,27,28 

Development of formal guidelines to deploy ML-based models in medical practice is challenging. Medical practice has arguably been founded on knowledge-based expert systems. Human-derived knowledge has been encoded as rules and relies on information available from experts in a specified domain. Decisions made through these rule-based systems and the decision-making process underlying the algorithm can be easily traced. Rule-based expert systems use human expert knowledge to solve real-world problems that would normally necessitate human intelligence. Rule-based expert systems encoded in knowledge bases have been developed since the 1970s.29–32  The advancement in technology and computing resources coupled with the digital transformation of the pathology field has demonstrated a need to provide further guidance for systems that incorporate emerging technologies such as ML-based CDS. Typically, when organizations put forth formal validation guidelines, they are based on extensive meta-analysis and expert review. The College of American Pathologists (CAP) Pathology and Laboratory Quality Center for Evidence-Based Guidelines has proffered guidelines for clinical implementation of whole slide imaging, digital quantitative image analysis of human epidermal growth factor receptor 2 (HER2) immunohistochemical stains, and molecular testing.33–35  The final recommended guidelines encompass substantial evidence-based data including systematic literature review, strength of evidence framework, open comment feedback, expert panel consensus, and advisory panel review. Because relatively few ML-based applications are currently used in clinical settings, the evidence available to formulate guidelines is lacking. Additionally, ML systems differ fundamentally from rule-based expert systems, conventional digital image processing or computer vision techniques, or other traditional software or hardware implementations in the laboratory. Conventional digital image processing (ie, image analysis) uses relatively defined image properties that are explainable and reproducible, relying on manually engineered features (eg, cell size, contours, entropy, skewness, brightness, contrast, and other Markovian features).36–40  In contrast, ML-based systems “learn” features or patterns from the training data, with the resultant ML model having potential inherent biases. ML models are generally tested on data not used in the training of the model (ie, unseen held-out test data) to estimate the generalization performance.41  In addition, explainable ML models may aid in performance evaluation, safeguard against bias, facilitate regulatory evaluation, and offer insights into the augmented intelligence decision-making process.42,43 

The increasing utilization of ML-based approaches has facilitated the development of ML models relevant for many applications pertinent to pathology practice. Implementing ML-based CDS will transform pathology practice. ML-based systems may be authorized for an intended use by regulatory bodies for clinical applications, or they may be deployed clinically by a laboratory (eg, local deployment site) as a laboratory-developed test (LDT). There are commercially available systems, including some authorized by the FDA. Pathologists need to know how to evaluate the analytical and clinical performance of ML models and understand how ML-based CDS systems integrate with the patient care paradigm. Currently, data on real-world clinical use are insufficient to develop formal evidence-based guidelines, as clinically deployed benchmarks of ML model performance evaluations are still emerging; thus, no published guidelines exist for clinical evaluation of ML-based systems in the pathology and laboratory medicine domain. This paper provides proposed recommendations relating to performance evaluation of an ML-based system intended for clinical use.

Scope of Guidance

This guidance document intends to (1) provide information on performance evaluation of systems that utilize ML derived from laboratory-generated data and that are intended to guide pathology clinical reporting of patient samples; (2) discuss aspects pertinent to evaluating the performance of a pretrained ML model intended to be verified and/or validated at a deployment laboratory site; (3) discuss ML models, either procured commercially or established as LDTs, that include ML in any part of the test and are used in reporting of patient samples; and (4) provide a proposed framework to inform performance evaluation of ML-based software devices with a focus on verification and validation (eg, analytical, clinical) of a given ML model implemented at a laboratory using real-world data from the local deployment site.

These ML models may be used in conjunction with imaging-based microscopy in surgical pathology or laboratory medicine or use clinical nonimaging data. The goal of performance evaluation is to assess the model’s ability to generalize the features learned from the training data and render predictions from the local deployment site laboratory data. To improve the generalizability of the ML model, additional training (eg, recalibration) using data retrieved from the deployment-site laboratory can be performed prior to verification and/or validation, and ultimately before use in a clinical setting as an LDT.

This guidance document provides recommendations on performance evaluation of clinical tests that utilize ML in any part of the preanalytic, analytic, or postanalytic workflow. Outside of the scope of this document are (1) ML models deployed outside the clinical laboratory and pathology, even if they use laboratory data (eg, electronic medical record); (2) development of an ML-based model; (3) providing guidance on study design for model training and testing;44  (4) guidance on local site retuning of parameters or retraining using deployment-site local data; and (5) software that solely retrieves information, organizes, or optimizes processes. ML basics will not be reviewed in detail; readers may reference a prior publication introducing ML in pathology.45  The authors encourage publication of additional literature to generate data to strengthen these recommendations and future formal guideline creation. The generation of evidence-based data will enable model credibility and build trust surrounding the predictive capability of a given model.

Pathology and laboratory medicine healthcare professionals are familiar with incorporating new diagnostic testing procedures and technologies into clinical practice. This includes new laboratory instrumentation, deployment of digital pathology systems, onboarding and optimization of new immunohistochemistry clones, and the required verification and/or validation of new procedures and technologies before clinical testing on patient samples. If certain clinical tests are not commercially available, laboratories have conventionally generated their own tests and testing protocols, referred to as an LDT, to support emergent clinical needs and clinical decision making. ML models are similarly available commercially through industry entities or can also be deployed as LDTs if being used for patient testing.

As laboratories historically have evaluated, and continue to evaluate, performance of new tests being offered clinically, laboratories should pursue performance evaluation of ML models based on intended use cases; the evaluation should be carried out in the same fashion as the test will be clinically implemented. Performance evaluation should include inputs related to the intended use case and be assessed similarly to clinical practice (eg, imaging modalities, tissue type, procedure, stain, task, etc). Each component of the ML-based system should be evaluated as the system will be clinically deployed. At the completion of any change to the evaluated system, reevaluation should be conducted to ensure the system is working as intended and performance characteristics are defined.

Terminologies and definitions relating to ML performance evaluation, such as validation, vary based on the respective domains (eg, clinical healthcare, regulatory, ML science). Table 1 provides a glossary of terms as they are defined and used in this document. Laboratory evaluation of an unmodified FDA-authorized (eg, cleared, approved) ML model, which has already undergone external validation, may only require limited verification showing that the deployment laboratory’s use case fits those defined in the approved use and that performance matches that predicted in the regulatory review. Using a modified FDA-authorized ML model not in compliance with the stipulations of the intended use requires additional development of an LDT, assertion of the performance characteristics, and approval from the laboratory medical director. The Clinical Laboratory Improvement Amendments of 1988 (CLIA)46  regulations are federal standards that provide certification and oversight to laboratories performing clinical testing in the United States. CLIA regulations require LDTs to be developed in a certified clinical laboratory and to evaluate the performance characteristics prior to reporting patient test results.47  The laboratory should continue to follow mandated regulatory standards for maintaining quality systems for nonwaived testing46  (Figure 1).

Figure 1.

Representative workflow for use of a clinically verified/validated machine learning model in pathology practice. A test order is first placed as required in the pathology laboratory. The specimen is analyzed on the designated laboratory instrument, and data output(s) are generated. These data can be used as inputs for a machine learning model. In conjunction with a pathologist, a test result is generated and will finally be electronically verified into a clinical patient report.

Table 1.

Definitions


In evaluating the performance characteristics of a test, clinical verification and validation procedures are commonplace in the laboratory. Table 2 reflects the definitions for verification and validation based on various laboratory, standards, and regulatory organizations. LDTs currently do not require FDA authorization; however, CLIA requires the laboratory to establish performance characteristics relating to the analytical validity prior to reporting test results from patient samples. The LDT is limited to the deploying CLIA-certified laboratory and may not be meaningful to, or used clinically by, other nonaffiliated laboratories. In the pathology literature, analytical validation studies using ML-based models predominate; however, relatively few have included comprehensive true clinical validation of an end-to-end system with a pathology healthcare professional using these tools and reporting results compared to a reference standard.27,48–54 

Table 2.

Variation of Verification and Validation Terminologies by Different Domains


Risk Assessment

The CAP supports the regulatory framework that proposes a risk-based stratification of ML models.55  The scope of the performance evaluation should be guided by a comprehensive risk assessment to evaluate both potential sources of variability and error and the potential for patient harm. The risk assessment must include failure mode effects analysis, a process to identify the sources, frequency, and impact of potential failures and errors for a testing process. Aspects of the risk assessment include the following:

  1. Preanalytic, analytic, and postanalytic phases of testing (see Predeployment Considerations section);

  2. Intended medical uses of the software and impact if inaccurate results are reported (clinical risk);

  3. System components (environment, specimen, instrumentation, reagents, and testing personnel); and

  4. Manufacturer/developer instructions and recommendations for the intended use (if applicable).

The laboratory director must consider the laboratory's clinical, ethical, and legal responsibilities for providing accurate and reliable patient test results. Published data and information may be used to supplement the risk assessment but are not substitutes for the laboratory's own studies and evaluation. Representative testing personnel from the laboratory should be involved in the risk assessment and performance evaluation.56 

There are different risk-stratification approaches for ML models. Risk classification schema from the SaMD risk categorization framework is based on the significance of the information provided by SaMD toward healthcare decisions (eg, inform or drive clinical management, diagnose, or treat) and the severity of the healthcare situation or condition itself (eg, nonserious, serious, critical).55,57  A risk-stratification approach put forth by the American Society of Mechanical Engineers (ASME), the risk assessment matrix of ASME V&V40-2018, compares the combination of the model’s influence and the consequence of the decision informed by the model output.58  The model’s influence can be defined as the measure of contribution to the resultant decision compared to other available information, or power of the model output in the decision-making process. The risk categorization of the ML-based CDS software may also depend on the characteristics of the underlying ML model (eg, static model versus adaptive/continuous-learning model) (Figure 2).

Figure 2.

Risk classification schema. Risk stratification adapted from the FDA SaMD framework proposal and ASME V&V40-2018 approach. Red designates machine learning–based clinical decision support software that would be categorized as high risk. Green shows those that would be categorized as low risk. Abbreviations: ASME, American Society of Mechanical Engineers; FDA, US Food and Drug Administration; SaMD, software as a medical device.


Predeployment Considerations

Prior to clinical laboratory deployment of ML-based CDS software, pathology personnel should be familiar with the laboratory integration points and be aware that preanalytic variables may have a downstream impact on the model outputs. Furthermore, laboratory processes can impact the performance of the software system, and an integrated clinical workflow design is needed for the deployment site to ensure that the data received by the model and the outputs it returns are based on the intended clinical workflow.

Model Development and Properties

Understanding the model training and testing environment is helpful to evaluate the model for the intended deployment context. The intended use of a model will facilitate scoping the performance evaluation to the accepted and expected patient sample cohort.59  ML models range from detection tasks (eg, localization, segmentation) to classification, quantification, regression (eg, continuous variable), or clustering, amongst others. Dataset curation is one of the earliest steps of model development, a portion of which becomes the training data from which the selected model extracts features and on which it bases the outputs. Developers should provide the medical rationale (eg, clinical utility) of the model. The sample size and distribution of the data used for model training, validation, model selection, and testing should be described, including the composition of the training, validation, and test sets, as well as the data sampling strategies and other weighting factors (eg, class weights used in the loss function). In addition, the description of the data should cover any inclusion and exclusion criteria that were applied during curation of the datasets, or how missing data may have been handled during the model development process. If data were annotated, understanding how and by whom those annotations were provided can provide insights into model development and affect subsequent performance evaluation design. Furthermore, comprehending the reference standard used for initial model training and testing can provide confidence that the model is appropriate for evaluation in the clinical setting and fits the proposed use case for laboratories. The laboratory should be provided with the stand-alone performance of the model for each included sample type as it was tested; this includes accuracy, precision (eg, positive predictive value), recall (eg, sensitivity), specificity, and any additional information related to the reportable range or values.

Data Characteristics (Case Mix and Data Compatibility)

The difference in data distributions between the samples used to train the model and those used for performance evaluation at a deployment laboratory site can be analyzed through data compatibility. The case cohorts and data compatibility are essential to estimate generalizable performance of the pretrained model to the deployment laboratory. As ML models are trained, specific data are used and should be explicitly stated in the operating procedures. The deployment site laboratory can infer cohort compatibility based on appropriate measures related to the intended patient and specimen characteristics. These should include the data properties from both clinical and technical standpoints. Clinical parameters include but are not limited to patient demographics, specimen type, specimen collection body site, specimen collection method (eg, biopsy, resection), stains and other additives (fixatives, embedding media), analyte, variant, additive, container (eg, tube type), and the diagnosis. Technical parameters include data type, file extension, instrumentation hardware and software versions, relevant data acquisition parameters, imaging resolution, units, output value set (eg, classes, reportable range), and visualizations. A thorough predeployment evaluation of the clinical and technical parameters will help ensure the ML model works as intended.
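
As a minimal, hypothetical sketch of such a data compatibility review, the snippet below compares one categorical and one continuous parameter between a developer-described training cohort and local deployment data; the file names, column names, and statistical tests are illustrative assumptions rather than prescribed methods.

```python
# Minimal, hypothetical sketch of a data compatibility check; file names, column
# names, and the choice of tests are illustrative assumptions, not requirements.
import pandas as pd
from scipy import stats

train_cases = pd.read_csv("training_cohort_cases.csv")   # developer-described cohort (hypothetical)
local_cases = pd.read_csv("deployment_site_cases.csv")   # deployment site cases (hypothetical)

# Categorical parameter (eg, specimen type): chi-square test on the case-mix table.
mix = pd.concat([train_cases["specimen_type"].value_counts(),
                 local_cases["specimen_type"].value_counts()], axis=1).fillna(0)
chi2_stat, p_cat, _, _ = stats.chi2_contingency(mix)

# Continuous parameter (eg, patient age): two-sample Kolmogorov-Smirnov test.
ks_stat, p_cont = stats.ks_2samp(train_cases["age"], local_cases["age"])

# Small p values flag parameters whose local distribution may differ from the
# training cohort and that warrant closer review before verification/validation.
print(f"specimen type mix: p = {p_cat:.3f}; age distribution: p = {p_cont:.3f}")
```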

Ethics and Ethical Bias

Bias is inherent in almost all aspects of medical practice, including ML-based systems. Different types of bias can affect the model outputs and performance characterization; several forms of bias are described here. Algorithmic bias refers to systematic and unfair discrimination that may be present in a model. FAIR (findability, accessibility, interoperability, and reusability) guiding principles were described to support meaningful data management and stewardship, and can help mitigate bias.60–62 Representation bias (eg, evaluation bias, sample bias) refers to an ML model that omits a specific subgroup or characteristic within the training data compared to the eventual data the model will be used on clinically. Studies showing underrepresented populations in medical data are emerging and may impact model performance in various subclasses.63–65  Even within open-access medical databases, predictive modeling shows the ability to generate site-specific signatures from histopathology features and accurately predict the respective institution that submitted the data.66 Measurement bias refers to the model being trained on proxy data or features rather than the ideal intended target; for instance, a model trained on data from a single instrument but planned for deployment at a site with a similar instrument from a different vendor.67  Measurement bias can also be related to the data annotation process, where inconsistently labeled data may cause bias in the model outputs. A definitive reference standard should be used and qualified during the ML model performance evaluation (see Reference Standard section). Bias should be considered in the evaluation process, which should include equal and equitable review of the inputs and outputs. Algorithmic bias in ML-based models has been shown to produce demographic inequities.68 Equality ensures identical assets are provided for each group regardless of differences; equity acknowledges that differences exist between the groups and allocates weight to each group based on its needs. Evaluation of ethics across multiple domains has engendered the need for describing transparency, justice and fairness, nonmaleficence, responsibility, and privacy as part of ML model development and deployment.69,70 
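
The following is a minimal, illustrative sketch (not a prescribed method) of one way a laboratory might screen for representation-related performance disparities by computing a metric per patient subgroup; the column names and toy data are assumptions.

```python
# Minimal sketch (illustrative only): checking for performance disparities across
# patient subgroups as part of a bias review. Columns and values are hypothetical.
import pandas as pd
from sklearn.metrics import recall_score

results = pd.DataFrame({
    "sex":    ["F", "F", "M", "M", "F", "M", "F", "M"],
    "y_true": [1, 0, 1, 1, 1, 0, 0, 1],
    "y_pred": [1, 0, 0, 1, 1, 0, 1, 1],
})

# Sensitivity (recall) per subgroup; large gaps may indicate representation bias.
for group, subset in results.groupby("sex"):
    sens = recall_score(subset["y_true"], subset["y_pred"])
    print(f"Subgroup {group}: sensitivity = {sens:.2f}")
```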

Laboratory Influences

Due to the variation in laboratory protocols and practices, inherent biases and influences exist that preanalytic processes may embed within the laboratory’s data. These variables can impact downstream model development and deployment. To ensure appropriate model performance, the ML model should ideally be trained across the range of all possible patient samples to which the model will eventually be exposed in clinical practice. While this is not always possible, when evaluating model performance at deployment sites, careful attention should be given to patient sample subgroups—especially those with clinical significance. Variations in sample collection, preparation, processing, and handling may affect model outputs if the model has not been exposed to the observed laboratory data variation during model training.

While it is recommended to ensure good laboratory practices71–73  in order to sustain appropriate patient care, as well as maintain consistent data inputs for model predictions, there are inherent intralaboratory and interlaboratory variations that need to be taken into account. Laboratories should implement approved procedures to support a robust quality control (QC) process to mitigate and minimize defects. QC defects may impact ML models, and remediating such defects is arguably more important for ML models than for human evaluation because humans may be less sensitive to such artifacts. Conversely, the QC process itself may introduce bias by removing artifacts commonly encountered in specimen preparation, testing, and result reporting. Furthermore, it is likely that QC protocols and operating procedures vary considerably across laboratories, leading to the potential for additional generalizability challenges. Pathologists, and ultimately the laboratory directors, are responsible for assessments in the predeployment phase and through ongoing monitoring of ML models implemented for clinical testing. Representative examples of preanalytic variables and their potential impact on model outcomes are listed in Table 3.

Table 3.

Representative Preanalytic Variables and Potential Measurement Impact


Evaluation Data Sourcing

Quality CDS systems in pathology result from thoughtful design in the developmental stages and careful attention to components of clinical workflow both upstream and downstream. The laboratory must first implement, verify, and validate the digital workflow, including additional hardware and software, in accordance with pertinent regulatory guidelines.

Comprehensive development of the ML model with an inclusive and diverse training dataset is critical to the performance of the deployed tool. Proper verification or establishment of performance specifications using a deployment site’s real-world laboratory data as a substrate is required. The deployment site’s data (eg, laboratory instrument data, imaging data, sequencing data, etc) should reflect the spectrum of sample types and characteristics typically seen during clinical patient testing at the deployment site. The data should be readily accessible to complete the needed performance characterization of the ML model. All samples used in verification and/or validation of the pipeline should be documented as part of the clinical performance evaluation dataset. The evaluation data used for performance characterization should be further quarantined; they should not be included in any training or tuning of the ML model.

Using Real-World Data

Laboratories intending to use ML-based models in the clinical workflow will need to perform evaluation studies on their deployment site data. Based on the laboratory blueprint and intended uses, intralaboratory and interlaboratory variables need to be assessed. This may require the involvement of clinical informaticists or technical staff (eg, information technology staff) to oversee the evaluation. Laboratory information systems can be queried for patient specimens to identify appropriate retrospective data for determination of performance specifications. Prospective data can be used for evaluation if the ML model results are not clinically reported prior to verification and/or validation. It is imperative that all clinically meaningful variations are included in the diverse dataset for performance evaluation so that the model is exposed to, and its performance can be evaluated in, the respective subgroups.

Some ML-based models in pathology have been developed using open-access databases. National Institutes of Health databases, such as The Cancer Genome Atlas, have pioneered data sharing and governance for supporting research through a data commons.74,75  However, literature shows open databases are not without challenges. Open datasets have limitations (eg, batch effects relating to local laboratory idiosyncrasies) that introduce biases in ML models.66,76–79  Private data used for model training may also suffer from local site biases and limited transparency, and may not generalize well to deployment site data if the local sample characteristics differ from those represented in the model’s training data. Generalizability of the learned features in the model to the deployment laboratory data will vary depending on the similarity of the represented data distribution.

Intended Use and Indications for Use

Each clinically deployed ML model should have a clinical utility that can be described through clinical needs and use cases. Once the intended use of a pretrained model has been defined, the clinical pipeline from sample collection to clinical reporting should be determined for clinical verification and/or validation. Regulatory terms are helpful in understanding “what for” and “how” (eg, intended use) versus “who,” “where,” and “when” (eg, indications for use) an ML model is used in the clinical setting. The clinical utility determines the “why.” The intended use is essentially the claimed model purpose, whereas the indications for use are the specific reasons or situations the laboratory will use the ML model for testing. The model performance evaluation should, therefore, be limited to the intended use and evaluated based on the indications for use. The clinical verification and/or validation procedures should be applicable to both of these terms, including the time point of use (eg, primary diagnosis, consultative, or quality assurance) and using a similar distribution of specimen types to be examined with enrichment of variant subtypes that are expected to be reproducibly and accurately detected in the deployment site’s clinical practice. These samples should be selected to evaluate the performance of the model using appropriate evaluation metrics. While not required, evaluating the ML model with out-of-domain samples (eg, those not indicated as part of the intended use or indications for use) may be appropriate to further characterize model performance and establish model credibility and trust. Out-of-domain testing can support establishing “classification ranges” for the validated model input data (eg, a model intended to detect prostate cancer in prostate biopsies is tested on a prostate biopsy with lymphoma and identifies a focus “suspicious for cancer”).

Reference Standard

To appropriately evaluate the performance of a given ML model, a ground truth is needed to compare against the outputs of the ML model. The medical domain may contain noisy data; for example, models trained on data from archival specimens may reflect different diagnostic criteria, patient populations, disease prevalence, or treatments compared to current practice. Pathology data include diseases that may have been reclassified, or diagnostic criteria that may have changed over time. This temporal data heterogeneity is also inherent within pathology reports, electronic medical records, and other information systems in relation to data quality. These factors should be considered when using historical data inputs for verification and/or validation if they do not accurately and precisely reflect current pathology practice. Patients who have the same disease can demonstrate varying characteristics (eg, age, sex, demographics, comorbidities) that differ based on clinical setting. Prevalence of disease may also differ between geographic areas. Furthermore, certain disease states are challenging even for expert interpretation and can cause label noise during the evaluation of the model. Traditional histopathology diagnosis or laboratory value reporting have been regarded as ground truth; however, based on the contextual use of the model, these may only represent a proxy for the ideal target of the model. For imaging data, the diagnosis may also only exist in the form of narrative free-text, which may contain nuances and may not be readily comparable with model outputs to establish whether the predictions are accurate.

In certain settings, accuracy may be better expressed as concordance. The performance evaluation may need to evaluate the reference data to ensure “inaccurate” model predictions are not in fact errors in ground truth through discordance analysis. A paradox exists: pathologists’ interpretations and laboratory data are clinically reported and fundamentally provide the basis of medical decision making; however, with any assessment there may be inherent human biases in the results.80  This is especially true when correlated with visual imaging quantification.81–83  Regardless, pathologists’ diagnostic analyses have been utilized as the prime training labels for ML-based models. Furthermore, there exists a spectrum of confidence for the ground truth of a given test. For instance, predicting HER2 amplification from the morphology of a hematoxylin-eosin slide can be directly compared to in situ hybridization results of the same tissue block. Tumor heterogeneity aside, there can be relatively high confidence in comparing the model output and HER2 probe signals. Similarly, a regression model predicting sepsis from laboratory instrument inputs can be correlated to sepsis diagnostic criteria.84–86  However, for challenging diagnoses or those with high interobserver or interinstrument variability,87–101  a consensus or adjudication review panel may be appropriate for establishing the reference standard for a given evaluation cohort. The reference standard may differ for various models depending on their intended use, output, and interaction in the pathology workflow. Furthermore, the reference standard requires comparable labels for comparison to the model outputs. Weakly supervised techniques may obviate manual human expert curation by inferring the presence or absence of a tumor from synoptic reporting data elements linked directly to the digital data. However, large numbers of data elements linked to their respective metadata may be required.102  Some reference data, such as diagnostic free-text, may not be structured and easily exportable. Manual curation may be required to identify referential data points against which model outputs can be compared. Additionally, development of a true reference standard will require effort to ensure the labels are accurate. For example, in surgical pathology, the diagnostic report is conventionally described at the specimen part level—inclusive of findings across all slides of the surgical specimens. The digital slides that are used for evaluation of the model thus may need to be carefully selected; otherwise, they may not be representative of the features being evaluated in the ML model. The final reference standard and testing protocol should be accepted and approved by the laboratory director prior to starting the performance evaluation process, ensuring mitigation of potential errors with use of the CDS tool.
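
As one illustrative and deliberately simplified way to operationalize discordance analysis, the sketch below flags cases where the model output disagrees with the reference label so they can be routed to consensus or adjudication review; the case identifiers and labels are hypothetical.

```python
# Minimal sketch (hypothetical data): flagging discordant cases between model
# outputs and the reference standard for adjudication review, so that apparent
# model errors can be distinguished from errors in the ground-truth labels.
import pandas as pd

cases = pd.DataFrame({
    "case_id":   ["case-001", "case-002", "case-003"],
    "reference": ["benign", "carcinoma", "carcinoma"],
    "model":     ["benign", "benign", "carcinoma"],
})

discordant = cases[cases["reference"] != cases["model"]]
# Each discordant case is reviewed to determine whether the model is wrong or the
# reference label itself reflects label noise and needs correction.
print(discordant[["case_id", "reference", "model"]])
```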

Performance evaluation of any test deployed in a clinical laboratory for patient testing is an accreditation and regulatory requirement under CLIA 1988.46  Pathology departments follow manufacturer instructions, regulatory requirements, guidelines, and established best practices to perform, develop, or provide clinical testing. Following regulatory requirements is necessary prior to patient testing, and performance characteristics should be specified for a given clinical test. Generalizability of an ML model is tested through performance evaluation to ensure the model renders accurate predictions on the deployment site’s data, which may differ from the data that were used to train the model. Differences in data representation (eg, units, class distribution, interlaboratory differences) may bias the trained model, and performance at the deployment site may be low (eg, underfit) due to the data mismatch. While there is literature supporting the hypothesis that models trained using data from multiple sites may have higher performance, it is the responsibility of the deployment site laboratory to verify and/or validate the model using the real-world data expected to represent the case mix and distribution of the implementing laboratory (eg, single site versus receiving samples from other sites).103–105 

Software systems can be evaluated using white-box or black-box testing. In white-box testing, evaluation is performed through inspection of the internal workings or structures of the system and the details of the model design. In black-box testing, the evaluation is entirely based on the examination of the system’s functionality without any knowledge of design details. Evaluation is performed by passing an input to the system and comparing the returned output with the established reference standard. The system performance is then characterized using evaluation metrics that quantify the difference between the returned value (eg, model output) and the expected value (eg, reference ground truth label) aggregated across multiple samples. In most cases, evaluation of ML models for clinical practice relies on black-box testing, because the internal workings of the evaluated models cannot be inspected (eg, in cases of proprietary software) or are highly complex and difficult to interpret (eg, deep neural network models with innumerable parameters and many nonlinearities). The goal of ML model evaluation is to derive an estimate of the generalization error, which represents a measure of the ability of a model trained on a specific set of samples in the training set to generalize and make accurate predictions for unseen data in the validation or test set. It is important to note that performance evaluation only provides an estimate of the generalization error: the true generalization error remains unknown, as the population of possible test data is effectively unbounded. The performance estimation depends on the size and composition of the evaluation set and how closely samples in the training data resemble the patient or specimen population for which the model will be used in clinical practice. Additionally, the performance characterization will be impacted by how closely the clinical test procedure resembles the intended use.
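
A minimal sketch of black-box testing is shown below: the model is treated as an opaque prediction function whose outputs are compared with the reference standard and aggregated into an estimated (not true) generalization performance. The function name predict_fn and the toy data are assumptions for illustration.

```python
# Minimal black-box testing sketch: the system under test is an opaque function
# whose returned outputs are compared with the reference standard and aggregated.
from sklearn.metrics import accuracy_score, confusion_matrix

def black_box_evaluate(predict_fn, inputs, reference_labels):
    """Pass each input to the system and compare outputs with the reference standard."""
    outputs = [predict_fn(x) for x in inputs]
    acc = accuracy_score(reference_labels, outputs)
    cm = confusion_matrix(reference_labels, outputs)
    return acc, cm

# Example usage with a trivial stand-in for the opaque ML system (an assumption):
predict_fn = lambda x: int(x > 0.5)
inputs = [0.2, 0.7, 0.9, 0.4]
reference_labels = [0, 1, 1, 1]
estimated_accuracy, cm = black_box_evaluate(predict_fn, inputs, reference_labels)
print(estimated_accuracy)  # only an estimate of the true generalization performance
```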

Clinical Verification and Validation

Performance evaluation in a laboratory consists of at least 2 concepts: verification and validation. Verification derives from the root word “verus” (ie, true), whereas the root word of validation is “validus” (ie, strong). Clinical verification is the process of evaluating whether the ML model is trained correctly; clinical validation evaluates the ML model using laboratory real-world data to ensure it performs as expected. This is not to be confused with the technical ML concept of validation that is intended to find data anomalies or detect differences between existing training and new data, then subsequently adjusting hyperparameters of the model to optimize performance (eg, tuning). While beyond the scope of this document, satisfactory software code verification should be performed by the model developers prior to clinical verification or validation. Clinical validation confirms the performance characteristics of the ML model using the deployment laboratory’s real-world data. Validation is needed for ML models that have been internally developed by the laboratory (ie, LDT), for ML models that have been modified from published documentation by the original manufacturer or process developer (eg, modified FDA-authorized ML model), or for an unmodified ML model used beyond the manufacturer’s statement of intended use.

Clinical Verification

Clinical verification focuses on development methodologies (eg, technical requirements). Performance of the software system may be tested in silico using a dedicated testing environment and a curated test dataset. This is similar to the regulatory concept of analytical validity, where ML scientists work in conjunction with subject matter experts to evaluate performance of a trained model on a specified dataset. Clinical verification may not specifically involve prospective clinical patient results reporting; instead, it gives the implementing laboratory visibility into the model performance by analyzing the model outputs compared to the known specified reference standard. Curation and labeling of the dataset should be performed in close collaboration with, and under the supervision of, pathologists to ensure that the dataset is representative and ground truth labels are correct. The verification process may be used for an unmodified FDA-authorized ML model where the specifications of the manufacturer are used as directed. Verification of an unmodified FDA-authorized ML model should demonstrate that the performance characteristics (eg, accuracy, precision, reportable range, and reference intervals/range) match those of the manufacturer.

By predetermining the dataset, the prevalence of the medical condition in the evaluation dataset is artificially determined compared to the real-world clinical setting. To avert selection bias, it is imperative to select datasets for clinical verification performance evaluation that reflect the degree of variation of the intended condition(s) seen by the deployment site laboratory and to compare the observed performance with the performance characteristics of the ML model established by the manufacturer or developer. Quantifying the extent of relatedness of the verification samples is intended to mitigate selection bias during dataset selection for evaluating the performance of the ML model. Selected specimens can range from being “near identical” to being completely unrelated. Determining the relatedness of the manufacturer’s dataset to the deployment laboratory’s data helps interpret the results of the performance evaluation and can suggest the generalizability of the ML model.106  Distribution of the sample characteristics includes preanalytic and predictor factors such as patient characteristics, laboratory processes, and hardware used to process the samples (eg, stainers, whole slide scanners, chemistry analyzers, genomic sequencers).

Clinical Validation

Clinical validation focuses on clinical requirements to evaluate the ML model in a clinical production environment. Clinical validation requires end-to-end testing including each component of the complete integrated clinical workflow, using noncurated, real-world data. Clinical validation intends to demonstrate the acceptability of the ML model being used in a clinical setting. In the case of a CDS system, clinical validation generally compares pathologists’ test reporting with and without access to the model outputs. Clinical validation is akin to the regulatory concept of clinical validity, or the ML concept of evaluating the performance of an external “validation” dataset. The ML concepts of internal (technical) validation using k-fold cross-validation or split-sample validation are out of scope of this paper, as they are generally used for model selection during model development. Prior to, or as part of, the clinical validation, a controlled reader study may be performed where all components of the model are tested with pathologists in the loop. Clinical validation is used to confirm with objective evidence that a modified FDA-authorized ML model or LDT delivers reliable results for the intended application. The validating laboratory (eg, deployment site) is required to establish the performance characteristics that best capture the clinical requirements, which may include accuracy, precision, analytical sensitivity, analytical specificity, reportable range, reference interval, and any other performance characteristic deemed necessary for appropriate evaluation of the test system. Clinical validation can be performed as a clinical diagnostic cohort study where consecutive prospective patients are being evaluated in the appropriate medical domain of the intended use of the ML model (eg, satisfying eligibility criteria based on intended use/indications for use). This ensures real-world clinical data from the deployment laboratory are included in the performance characterization and the natural spectrum and prevalence of the medical condition will be observed. Establishing a “limit of detection” for quantitative and qualitative assessment can help provide validation across appropriate clinically meaningful subgroups (eg, stratifying tumor detection by size, flow cytometry gating thresholds). The requirement to define reportable ranges as part of the test results can be exemplified for quantitative results (ie, 0%–100% for prediction of estrogen receptor nuclear-positive tumor cells) or specified values for output classification (ie, Gleason patterns for prostatic acinar carcinoma classification). Reportable range definitions should be included in the standard operating procedures and should guide validation procedures. Defining the reportable range will facilitate analysis of evaluation metrics and resultant performance characteristics of the ML model. Clinical utility can be further demonstrated and validated where performance characteristics show that the clinical or diagnostic output enhances patient outcomes.107–112 
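
As a brief, hypothetical illustration of subgroup stratification during clinical validation (eg, approximating a “limit of detection” by tumor size), the sketch below bins cases by size and reports sensitivity per bin; the data values and bin edges are assumptions.

```python
# Minimal sketch (hypothetical data and bin edges): stratifying validation
# performance by a clinically meaningful subgroup, eg, tumor size.
import pandas as pd
from sklearn.metrics import recall_score

cases = pd.DataFrame({
    "tumor_size_mm": [0.4, 0.8, 1.5, 2.5, 4.0, 6.0, 9.0, 12.0],
    "y_true":        [1,   1,   1,   1,   1,   1,   1,   1],
    "y_pred":        [0,   0,   1,   1,   1,   1,   1,   1],
})

cases["size_bin"] = pd.cut(cases["tumor_size_mm"], bins=[0, 1, 5, 100],
                           labels=["<1 mm", "1-5 mm", ">5 mm"])
for size_bin, subset in cases.groupby("size_bin", observed=True):
    print(size_bin, "sensitivity:", recall_score(subset["y_true"], subset["y_pred"]))
# Sensitivity that drops off below a given size bin suggests a practical detection limit.
```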

Clinical verification and/or validation is required for all nonwaived tests, methods, or instrument systems prior to use in patient testing.113,114  Furthermore, due to the diversity of diagnostic tests—with FDA-authorized tests, modified FDA-authorized tests, or LDTs—it is not possible to recommend a single experimental design for verification and/or validation of all ML models. However, the guidance in this document attempts to provide recommendations for evaluation procedures and metrics to support verification and validation of clinical tests that use ML models.

Evaluation Metrics

Laboratories have routinely evaluated performance of instruments, methodologies, and moderate-complexity to high-complexity tests. A variety of performance evaluation metrics are available to laboratorians based on the appropriate model (Table 4). While these evaluation metrics will be briefly discussed here, a more in-depth explanation of these metrics was previously published.45  The ideal metrics used for evaluating the performance of an ML model are guided by the data type of the generated output (eg, continuous values of a quantitative measurement generated by a regression model versus categorical values of a qualitative evaluation generated by a classification model). For outputs on an interval or ratio scale, such as proportion (percentage) of tumor-infiltrating lymphocytes, quantitative evaluation aims to estimate systematic and proportionate bias between the test method and a reference method (eg, human reader). Qualitative evaluation applies to outputs on a nominal or categorical scale (eg, presence or absence of tumor), as well as to clinical diagnostic thresholds or cutoffs applied to interval or ratio outputs (eg, probabilistic output of a logistic regression model). The data outputs of an ML model can be either nominal or ordinal in a discriminative classification task, where the output is a specified label. In a discriminative regression task, the outputs can be either an interval or a ratio, and the output value is a number that can be a continuous variable (eg, time to recurrence) or ratio (eg, serum free light chains [κ/λ]). The appropriate evaluation metrics should be selected based upon the specified model intended to be clinically deployed and the variables (independent or dependent) defining the output results. For classification models, discrimination and calibration are used to evaluate the performance.
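
For a quantitative (regression-type) output, a minimal sketch of estimating systematic and proportional bias against a reference method might look like the following; the paired values are illustrative, and the simple linear regression shown is only one of several acceptable method-comparison approaches.

```python
# Minimal sketch for a quantitative output, eg, percentage of tumor-infiltrating
# lymphocytes: estimating systematic and proportional bias against a reference
# method. The paired values are illustrative assumptions.
import numpy as np
from scipy import stats

reference = np.array([5, 10, 20, 35, 50, 70, 85])     # reference method (%)
model_out = np.array([6, 12, 21, 33, 54, 73, 90])     # ML model output (%)

slope, intercept, r, p, se = stats.linregress(reference, model_out)
mean_diff = np.mean(model_out - reference)             # Bland-Altman-style mean bias

print(f"Proportional bias (slope): {slope:.2f}")        # ~1.0 suggests no proportional bias
print(f"Systematic bias (intercept): {intercept:.2f}")  # ~0 suggests no constant offset
print(f"Mean difference: {mean_diff:.2f} percentage points")
```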

Table 4.

Representative Evaluation Metrics for Performance Specification Based on Application


Discrimination is the ability of a predictive model to separate the data into respective classes (eg, with disease, without disease); calibration evaluates the closeness of agreement between the model’s observed output and predicted output probability. Discrimination is typically quantified using the area under the receiver operating characteristic curve (AUROC) or concordance statistic (C-statistic). In isolation, discrimination is inadequate to evaluate the performance of an ML model. For instance, a model may accurately discriminate the relative quantification of Ki67 nuclear stain positivity of specimen A to be “double” that of specimen B (ie, discriminatory relative quantification could be 2% and 1%, respectively); however, their true, absolute Ki67 nuclear stain positivity could be 30% and 15%. Alternatively, a model showing excellent calibration can still misrepresent the relative risks (eg, discrimination). For example, a model predicting glomerular filtration rate may correctly show a 50% relative increase in creatinine in each of 2 patients, but may not be able to discriminate between the risk implied by patient A’s creatinine rising from 0.4 mg/dL to 0.6 mg/dL and patient B’s rising from 1.2 mg/dL to 1.8 mg/dL. This model is arguably of no use, as patient A is within normal limits and patient B should be evaluated for being at risk of developing acute kidney injury. This model would be considered to have poor discrimination.

Calibration measures how similar the predicted outputs are to the true, observed (eg, absolute), output classes. It is intended to fine-tune the model performance based on the probability of the actual output and the expected output. For a given set of input data, deployment site calibration can be used to change the model parameters and optimize the model output data to match the reference standard more closely. While calibration is used for performance evaluation, guidance on how to perform local site calibration or retraining of the entire ML model is out of the scope for this proposed framework, as the comparative performance of the calibrated model may deviate from the outputs of the original model. However, as a performance metric, calibration should be analyzed for the final iterated model intended to be clinically deployed. Documentation of calibration performance should be assessed as part of the verification and/or validation prior to clinical deployment. If the ML model is to be updated using local site calibration, the tuning of hyperparameters should be reflected as a new version of the model, and the final trained model should be considered the model intended for clinical deployment. Calibration can be evaluated graphically using a reliability diagram. Reliability diagrams compare the model’s observed predictions to the expected reference standard. Calibration can also be evaluated for logistic regression models using the Pearson χ2 statistic or the Hosmer-Lemeshow test.
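
A minimal sketch of calibration assessment is shown below, using binned predicted probabilities to build reliability-diagram data and a simplified Hosmer-Lemeshow-style statistic; the probabilities and bin count are illustrative assumptions.

```python
# Minimal calibration sketch (illustrative data): reliability-diagram points and a
# simplified Hosmer-Lemeshow-style statistic computed over g probability bins.
import numpy as np
from scipy.stats import chi2
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.65, 0.7, 0.8, 0.45, 0.9])

# Reliability diagram data: observed event rate vs mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=4)
print(list(zip(mean_pred, frac_pos)))  # points near the diagonal indicate good calibration

def hosmer_lemeshow(y_true, y_prob, g=4):
    """Simplified Hosmer-Lemeshow statistic; assumes nonzero expected counts per bin."""
    order = np.argsort(y_prob)
    hl = 0.0
    for b in np.array_split(order, g):
        obs, exp, n = y_true[b].sum(), y_prob[b].sum(), len(b)
        hl += (obs - exp) ** 2 / (exp * (1 - exp / n))
    return hl, chi2.sf(hl, g - 2)  # p value with g - 2 degrees of freedom

print(hosmer_lemeshow(y_true, y_prob))
```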

Performance evaluation of ML models may show good discrimination and poor calibration, or vice versa. A performant ML model should show high levels of discrimination and calibration across all classes. It is also possible that the model shows poor calibration amongst a subset of classes (eg, shows high calibration for Gleason pattern 3, but poor calibration for Gleason pattern 5). Calibration, or goodness of fit, has been termed the Achilles heel of predictive modeling: while critically important in evaluation, it is commonly overlooked or misunderstood, potentially leading to poor model performance.115  Poorly calibrated models can either underestimate or overestimate the intended outcome. If model calibration is poor, it may render the CDS tool ineffective or clinically harmful.115–117  The deployment site laboratory should not report clinical (patient) samples until performance characterization has been completed; ML model updating should be considered in case of poor discrimination or calibration, including retraining of the model with sufficient data, if appropriate. While it is the responsibility of the ML model developer to ensure appropriate discrimination and calibration, laboratories may perform local site calibration, model updating, and subsequent verification and/or validation to ensure patient testing is accurate.118,119 

Most literature supports analyzing accuracy as a performance metric to show the ratio between the number of correctly classified samples and the overall number of samples. When datasets are imbalanced (eg, increased number of samples in one class compared to the other classes), accuracy as defined above may not be considered the most reliable measure, as it can overestimate the performance of the classifier’s ability to discriminate between the majority class and the less prevalent class (eg, minority class). A prime example is prostate adenocarcinoma grading in surgical pathology. Gleason pattern 5 is less prevalent than Gleason pattern 3 and may be less represented in the verification and/or validation cases. Gleason pattern 5, while less common, has significant prognostic implications for patients, and a model intended to grade prostate adenocarcinoma can appear to have excellent performance if the validation cases are imbalanced and do not include samples representing the higher Gleason pattern.120,121  For highly imbalanced datasets, where one class is disproportionately underrepresented, maximizing the accuracy of a prediction tool favors optimization of specificity over sensitivity, such that false positives are minimized at the expense of false negatives (the accuracy paradox). This has led some investigators to recommend use of alternative metrics. The index of balanced accuracy (IBA) is a performance metric for skewed class distributions that favors classifiers with better results for the positive (eg, tumor present) and generally most important class. IBA represents a trade-off between a global performance measure and an index that reflects how balanced the individual accuracies are: high values of the IBA are obtained when the accuracies of both classes are high and balanced.122  Sampling techniques for verification dataset curation can be considered, where oversampling of the minority class or undersampling of the majority class can provide a better distribution for evaluation. While the Cohen kappa, originally developed to test interrater reliability, has been used for assessing classifier accuracy in multiclass classification, there are notable limitations to its utility as a performance metric, especially when there is class imbalance.123,124  The Matthews correlation coefficient, introduced in the binary setting and later generalized to the multiclass case, is now recognized as a reference performance measure, especially for unbalanced datasets.125–127  Designing an appropriate validation study should include a diverse and inclusive representation of real-world data that the model could be exposed to at the deployment laboratory. It is the responsibility of the laboratory to ensure sufficient samples are included for verification and/or validation of a given model and that the appropriate evaluation metrics are used.
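
The contrast between naive accuracy and imbalance-aware metrics can be illustrated with a small sketch such as the following, in which a classifier that never predicts the minority class still achieves 90% accuracy; the labels are toy data.

```python
# Minimal sketch (toy labels): naive accuracy versus imbalance-aware metrics.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, matthews_corrcoef)

# Imbalanced ground truth: 18 benign, 2 malignant; the model misses both malignant cases.
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20

print(accuracy_score(y_true, y_pred))           # 0.90 — looks deceptively good
print(balanced_accuracy_score(y_true, y_pred))  # 0.50 — chance-level on the minority class
print(cohen_kappa_score(y_true, y_pred))        # 0.0
print(matthews_corrcoef(y_true, y_pred))        # 0.0
```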

Evaluation Metrics for Imaging

In evaluating models for ML-based CDS software for computer-assisted diagnosis, free-response receiver operating characteristic curves are an appropriate evaluation metric. The model output is deemed correct when the detection and localization of the appropriate label are accurate. For instance, a model intended to detect parasites in peripheral blood smear slides can miss a region that indeed has red blood cells with Plasmodium falciparum and separately label a different region of normal red blood cells incorrectly as being positive for parasitic infection; in this case the model has produced both a false positive and a false negative. If the validating laboratory considered only the slide-level label, the model would appear to have correctly identified the slide as having a parasitic infection (eg, a true positive); in reality, the model should be penalized for both inaccuracies. Model evaluation should therefore account for multiple false negative or false positive labels in a given sample.51,128  Another evaluation metric for ML models that perform image segmentation is the Dice similarity coefficient.129  Segmentation is the pixel-wise classification in an image of a given label or class. For instance, if a model is intended to identify and segment mitoses in an image, given a verification dataset that has an expert pathologist’s manual annotation of mitoses, the model’s performance can be evaluated by measuring the degree of overlap (eg, intersection over union) between the model segmentation output and the human expert manual annotation.130
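A minimal sketch of these segmentation overlap metrics (plain Python with NumPy; the masks shown are hypothetical 4 × 4 examples) computes the Dice similarity coefficient and intersection over union between a model mask and an expert annotation mask:

```python
# Minimal sketch (illustrative): Dice similarity coefficient and intersection over union
# between a model's binary segmentation mask and an expert's manual annotation mask.
import numpy as np

def dice_and_iou(pred_mask: np.ndarray, truth_mask: np.ndarray) -> tuple[float, float]:
    """Both inputs are boolean/0-1 arrays of identical shape (pixel-wise labels)."""
    pred = pred_mask.astype(bool)
    truth = truth_mask.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    dice = 2.0 * intersection / (pred.sum() + truth.sum())   # 2|A∩B| / (|A| + |B|)
    iou = intersection / np.logical_or(pred, truth).sum()    # |A∩B| / |A∪B|
    return float(dice), float(iou)

# Hypothetical 4x4 masks: predicted mitosis pixels vs pathologist annotation
pred = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
truth = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
print(dice_and_iou(pred, truth))   # Dice = 2*3/(4+3) ≈ 0.857, IoU = 3/4 = 0.75
```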

The ideal metrics used for evaluating the performance of an ML model are guided by the data type of the generated output (eg, continuous values of a quantitative measurement generated by a regression model versus categorical values of a qualitative evaluation generated by a classification model). For outputs on an interval or ratio scale, such as the proportion (percentage) of tumor-infiltrating lymphocytes, quantitative evaluation aims to estimate systematic and proportionate bias between the test method and a reference method (eg, human reader). Qualitative evaluation applies to outputs on a nominal or categorical scale (eg, presence or absence of tumor), as well as to clinical diagnostic thresholds or cutoffs applied to interval or ratio outputs (eg, the probabilistic output of a logistic regression model). The data outputs of an ML model can be nominal or ordinal in a discriminative classification task, where the output is a specified label. In a discriminative regression task, the outputs are measured on an interval or ratio scale, and the output value is a number that can be a continuous variable (eg, time to recurrence) or a ratio (eg, serum free light chains [κ/λ]).

Classification Test Statistics

To evaluate ML models with classification outputs, arguably the best-known evaluation tool is the confusion matrix for binary classification outputs (disease versus no disease), from which sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and receiver operating characteristic (ROC) curve calculations can be computed (Figure 3). While for ML-based classification models the final outputs may be represented as a binary output (eg, tumor versus no tumor), a cut-point threshold is applied to a preceding continuous variable (eg, a probability ranging from 0 to 1) to convert it into a binary result. The sensitivity and specificity of the model vary depending on where the operating point is set (eg, higher threshold = decreased sensitivity and increased specificity; lower threshold = increased sensitivity and decreased specificity). Sensitivity (also referred to as “true positive rate” or “recall”) is the ratio of true positive predictions to all positive cases and represents a method’s ability to correctly identify positive cases. Specificity is the ratio of true negative predictions to all negative cases and represents a method’s ability to correctly identify negative cases. Positive predictive value and negative predictive value are directly related to prevalence. Positive predictive value (also referred to as precision) is the probability that, following a positive test result, an individual will truly have the disease or condition, and is the ratio of true positive predictions to all positive predictions. Negative predictive value is the probability that, following a negative test result, an individual will truly not have the disease/condition, and is the ratio of true negative predictions to all negative predictions. The overall or naive accuracy represents the proportion of all cases that were correctly identified. ROC curves are graphical representations plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at different classification thresholds. The ROC curve is plotted to illustrate the diagnostic ability of a binary classifier system as its discrimination threshold or operating point is varied. The area under the ROC curve (AUROC) is a single scalar measure of the overall performance of a binary classifier across all possible thresholds, ranging from 0 (no accuracy) to 1 (perfect accuracy). A model with an AUROC of 1 (perfect accuracy) does not necessarily portend high performance of the ML model in clinical practice; the AUROC weights sensitivity and specificity equally. A model whose specific intended use is screening may have a lower AUROC value and still maintain high sensitivity despite having lower specificity. Given that a specific operating point must be fixed when iterating and versioning an ML model, the sensitivity and specificity values at the selected operating-point threshold should be considered when measuring the model’s performance. The F-score (eg, F1-score) is a combined single metric (eg, harmonic mean) of the sensitivity (eg, recall) and positive predictive value (eg, precision) of a model. For multiclass or multilabel classification problems, pairwise AUROC analyses are needed, since a single ROC curve is intended for evaluating binary outputs. For multiclass instances, one method compares an arbitrary class against the collective combination of all other classes at the same time.
For instance, one class is considered the positive class, and all other classes are considered the negative class. This enables a “binary” classification AUROC using the “one versus rest” technique. Another evaluation approach for multiclass models, termed “one versus one,” analyzes the discrimination of each individual class against each other class, with all permutations of individual classes compared against each other. In a 3-class output (eg, tumor grading of low, moderate, and high), there would be 6 separate one-versus-one comparisons. In both one versus rest and one versus one, averaging the AUC values across classes (or class pairs) yields the final summary of model discrimination. Generally accepted AUROC values of less than 0.6 represent poor discrimination, where 0.5 is due to chance (eg, random assignment). AUROC values of 0.6 to 0.75 can be classified as helpful discrimination, and values greater than 0.75 as clearly useful discrimination.131,132
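A minimal sketch (assuming Python with scikit-learn; labels and scores are hypothetical placeholders) of operating-point metrics derived from a confusion matrix and of multiclass AUROC averaging using the one-versus-rest and one-versus-one strategies:

```python
# Minimal sketch (illustrative): binary confusion-matrix metrics at a chosen operating
# point, plus multiclass AUROC via one-versus-rest and one-versus-one averaging.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score

# Binary case: probabilities thresholded at an operating point of 0.5
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.3, 0.9, 0.2, 0.7, 0.6])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # recall / true positive rate
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                  # precision
npv = tn / (tn + fn)
print(sensitivity, specificity, ppv, npv,
      f1_score(y_true, y_pred), roc_auc_score(y_true, y_prob))

# Multiclass case (eg, a 3-grade output): average AUROC over class pairings
y_true_mc = np.array([0, 1, 2, 1, 0, 2, 2, 1, 0, 2])
y_prob_mc = np.random.default_rng(0).dirichlet(np.ones(3), size=10)  # placeholder scores
print(roc_auc_score(y_true_mc, y_prob_mc, multi_class="ovr"))  # one versus rest
print(roc_auc_score(y_true_mc, y_prob_mc, multi_class="ovo"))  # one versus one
```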

Figure 3.

Contingency table and selected metrics. Common evaluation metrics and corresponding calculations for a classification model shown in a contingency table. A contingency table is a visual representation showing the numbers of observations for 2 categorical variables, to analyze their relationship and associations. Abbreviations: FN, false negative; FP, false positive; NPV, negative predictive value; PPV, positive predictive value; TN, true negative; TP, true positive.


Positive percent agreement and negative percent agreement are the preferred terms for sensitivity and specificity, respectively, when a nonreference standard—such as subjective expert assessment—is used to determine the “ground truth.” In other words, this approach measures how many positives/negatives a test identifies that are in agreement with another method or rater used on the same samples.133 

Standalone verification performance of the model should undergo analysis of discordant cases (false negatives, false positives) to ensure the model outputs are checked against the reference standard. Discordance analysis should be performed for all false negatives and all false positives. Additionally, a minimum sufficient number of true positives and true negatives should also be evaluated to mitigate evidence of selection bias. For clustering models, an analogous review includes evaluating the data points farthest from their assigned cluster centroids as well as a sufficient number of centrally clustered data points. Discordance analysis enables adjudication of false negatives and false positives that may have been labeled inappropriately during the ground truth process, or of cases where predictions counted in the positive class are, in fact, false positives. The discordance analysis will result in final true positive, true negative, false positive, and false negative rates for verification and/or validation documentation. Discordance analysis should be performed, documented, and available for review.

Sample Size Estimations for Binary Classification

The size of the accuracy evaluation dataset is determined by the scope of the performance evaluation (eg, validation versus verification). The sample size calculation for verification of the accuracy (or sensitivity or specificity) of a method relies on knowledge of the predetermined accuracy (or sensitivity or specificity) from the package insert (denoted as P0). In comparing the method’s accuracy to P0, the null and alternative hypotheses are represented by the following:

$$H_0\colon P = P_0 \qquad \text{versus} \qquad H_a\colon P = P_1$$

where P1 is the value of accuracy (or sensitivity or specificity) under the alternative hypothesis. A power-based sample size formula for comparison of a proportion to a fixed value can be applied for evaluation of a single diagnostic method using the normal approximation of the binomial distribution. With (1 − α)% confidence level and (1 − β)% power for detecting an effect of P1 − P0, the required sample size for cases is obtained from the following (Equation 1):

$$n = \frac{\left[\,Z_{\alpha}\sqrt{P_0(1-P_0)} + Z_{\beta}\sqrt{P_1(1-P_1)}\,\right]^{2}}{(P_1-P_0)^{2}}$$
For example, if the laboratory wishes to compare the locally determined accuracy of a software or algorithm to the vendor’s claim of 95%, the sample size required to have 95% confidence and 80% power to detect a difference of 5% or more lower than the claimed accuracy of 95% would be calculated as follows (Equation 2):

$$n = \frac{\left[\,1.645\sqrt{0.95(1-0.95)} + 0.84\sqrt{0.90(1-0.90)}\,\right]^{2}}{(0.90-0.95)^{2}} \approx 150$$

Since the laboratory is solely interested in knowing whether the locally determined accuracy is significantly lower than the manufacturer’s claimed accuracy, a one-sided Zα score of 1.645 is used instead of the 2-sided Zα/2 of 1.96 at an α level of 0.05, and Zβ = 0.84 at β = 0.20. Due to the properties of the binomial distribution, as the accuracy claim decreases, the sample size increases for a given tolerance level or deviation from the claimed accuracy.
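As a worked illustration of Equations 1 and 2, a minimal sketch (assuming Python with SciPy; the function and parameter names are hypothetical) reproduces the approximately 150 cases:

```python
# Minimal sketch (illustrative): power-based sample size for verifying a proportion
# (accuracy, sensitivity, or specificity) against a claimed value P0, using the
# normal approximation of the binomial distribution (Equation 1).
from math import sqrt, ceil
from scipy.stats import norm

def sample_size_vs_claim(p0: float, p1: float, alpha: float = 0.05,
                         power: float = 0.80, one_sided: bool = True) -> int:
    """Cases needed to detect a drop from the claimed proportion p0 to p1."""
    z_alpha = norm.ppf(1 - alpha) if one_sided else norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = (z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))) ** 2 / (p1 - p0) ** 2
    return ceil(n)

# Worked example from the text: claimed accuracy of 95%, detect a drop of 5% or more
print(sample_size_vs_claim(p0=0.95, p1=0.90))   # ≈ 150 cases
```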

Although sample size calculations for accuracy do not depend on the prevalence of the entity of interest in the population sampled, sample sizes for verifying sensitivity and specificity do. Sample sizes for sensitivity and specificity can also be derived using Equation 1; for sensitivity, the calculated estimate represents the total number of positive samples and for specificity, the calculated estimate represents the total number of negative samples. For example, using Equation 2 above and targeting a prevalence of 50% in the accuracy evaluation cohort, one would require at least 75 positive samples and at least 75 negative samples to have 95% confidence and 80% power to detect a difference of 7.5% or more lower than claimed sensitivity and specificity of 97.5%: that is, 94 cases in total. An imbalanced sample with prevalence of 20% (20% positive, 80% negative) would require at least 92 positive cases and a total sample size of 460 to verify the sensitivity claim; the sample size for verification of the specificity claim would be 115 to ensure at least 92 negative cases.

Assuming an α level of 0.05 and β = 0.20, a sample size of at least 150 (75 positive and 75 negative cases) should be considered sufficient for verification of binary classification tasks, when the accuracy, sensitivity, and/or specificity claim is at least 95% or higher.

When establishing an accuracy claim (P) for a classification model, a precision-based sample size calculation is applicable, using a 2-sided Zα/2 score of 1.96 (α = 0.05) and a prespecified confidence interval half-width, or margin of error (e), for the accuracy metric:

$$n = \frac{Z_{\alpha/2}^{2}\,P(1-P)}{e^{2}}$$

For example, during development, a binary classification model for prostate cancer identification in hematoxylin-eosin whole slide images demonstrated an accuracy of 94.2%. If the laboratory wishes to establish an accuracy claim for the model of 95% ± 2.5%, then the required sample size would be:

$$n = \frac{(1.96)^{2}(0.95)(1-0.95)}{(0.025)^{2}} \approx 292$$
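A minimal sketch of this precision-based calculation (again assuming Python with SciPy; the function name is hypothetical) reproduces the approximately 292 cases:

```python
# Minimal sketch (illustrative): precision-based sample size for establishing an
# accuracy claim P with a prespecified confidence-interval half-width e.
from math import ceil
from scipy.stats import norm

def sample_size_for_claim(p: float, e: float, alpha: float = 0.05) -> int:
    z = norm.ppf(1 - alpha / 2)                  # two-sided critical value (1.96 at alpha=0.05)
    return ceil(z ** 2 * p * (1 - p) / e ** 2)

# Worked example from the text: accuracy claim of 95% with a ±2.5% interval
print(sample_size_for_claim(p=0.95, e=0.025))    # ≈ 292 cases
```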

It should be noted that these are naïve assumptions and the actual number needed may depend on additional considerations for determining the estimated sample size for establishing diagnostic accuracy. These include, but are not limited to, study design (single reader versus multiple readers), the number of diagnostic methods being evaluated (single versus 2 or more), how cases will be identified (retrospective versus prospective), the estimated prevalence of the diagnosis of interest in the population being sampled (prospective design) or the actual prevalence in the study cohort (retrospective design), the summary measure of accuracy that will be employed, the conjectured accuracy of the diagnostic method(s), and regulatory clearance of the model.

Regression Model Test Statistics

For regression models, the outputs are measured on an interval or ratio scale rather than categorically. The purpose of the quantitative accuracy study is to determine whether there are systematic differences between the test method and reference method. This type of evaluation typically uses a form of regression analysis to estimate the slope and intercept with their associated error(s), and the coefficient of determination (R2), where applicable. Regression analysis is a technique that can measure the relation between 2 or more variables with the goal of estimating the value of 1 variable as a function of 1 or more other variables. Data analysis proceeds in a stepwise fashion, first using exploratory data analysis techniques, such as scatterplots, histograms, and difference plots, to visually identify any systematic errors or outliers.134  Specific tests for outliers such as a model-based, distance-based test, or influence-based test may also be performed.135 

The purpose of the method comparison study for quantitative methods is to estimate constant and/or proportional systematic errors. There are 2 recognized approaches for method comparison of quantitative methods: Bland-Altman analysis and regression analysis. In Bland-Altman analysis, the absolute or relative difference between the test and comparator methods is plotted on the y-axis versus the average of the results by the 2 methods on the x-axis.136,137  Since the true value of a sample is unknown in many cases, except when a gold standard or reference method exists, using the average of the test and comparator results as the estimate of the true value is usually recommended. While visual assessment of the difference plot on its own does not provide sufficient information about the systematic error of the test method, t test statistics can be utilized to make a quantitative estimate of systematic error; however, this bias estimate is reliable only at the mean of the data if there is any proportional error present.138–140 
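A minimal sketch of a Bland-Altman analysis (plain Python with NumPy; the paired measurements are hypothetical) estimating the mean bias and 95% limits of agreement:

```python
# Minimal sketch (illustrative): Bland-Altman bias and limits of agreement between a
# test method (eg, an ML regression output) and a comparator method.
import numpy as np

def bland_altman(test: np.ndarray, comparator: np.ndarray):
    """Returns mean bias, 95% limits of agreement (bias ± 1.96 SD of differences),
    and the pairwise means used as the x-axis of the difference plot."""
    diff = test - comparator
    mean_pair = (test + comparator) / 2.0
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd), mean_pair

# Hypothetical paired measurements (eg, percentage of tumor-infiltrating lymphocytes)
test = np.array([12.0, 18.5, 25.0, 33.0, 41.0, 55.5, 62.0])
comparator = np.array([11.0, 20.0, 24.0, 35.0, 40.0, 54.0, 60.0])
bias, loa, _ = bland_altman(test, comparator)
print(f"bias = {bias:.2f}, 95% limits of agreement = {loa[0]:.2f} to {loa[1]:.2f}")
```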

Regression techniques are now generally preferred over t test statistics for calculating the systematic error at any decision level, as well as getting estimates of the proportional and constant components. However, it is important to note that the slope and intercept estimates derived from regression can be affected by lack of linearity, presence of outliers, and a narrow range of test results; choosing the right regression method and ensuring that basic assumptions are met remain paramount. If there is a constant standard deviation across the measuring interval, ordinary least squares (OLS) regression and/or constant standard deviation Deming regression can be used. If instead, the data exhibit proportional difference variability, then the assumptions for either OLS or constant standard deviation Deming regression are not met and instead, a constant coefficient of variation Deming regression should be performed. If there is mixed variability (standard deviation and coefficient of variation), then Passing-Bablok regression, a robust nonparametric method that is insensitive to the error distribution and data outliers, should be used. An assumption in OLS regression is that the reference method values are measured without error and that any difference between reference method and test method values is assignable to error in the test method; this assumption is seldom valid for clinical laboratory data unless a defined reference method or standard exists and, therefore, OLS regression is not appropriate in most cases for performance evaluation of quantitative outputs. In Deming regression, the errors between methods are assigned to both reference and test methods in proportion to the variances of the methods141  (Figure 4, A through C). Two special situations arise when the measurand is either a count variable (eg, inpatient length of stay in days) or a percentage bounded by the open interval (0, 1). For the former, Poisson or negative binomial regression techniques should be used depending on the dispersion of the data and for the latter, β regression techniques should be considered. Statistical techniques are also available for evaluation of models comparing ordinal to continuous variables, such as ordinal regression, but these are beyond the scope of this paper.
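A minimal sketch of constant standard deviation Deming regression (plain Python with NumPy; the method-comparison data and the assumed error variance ratio of 1.0 are hypothetical) estimating proportional and constant bias:

```python
# Minimal sketch (illustrative): constant standard deviation Deming regression, which
# assigns measurement error to both the test and reference methods. error_ratio is
# delta = (error variance of y) / (error variance of x); 1.0 assumes equal imprecision
# in both methods (an assumption made for this example only).
import numpy as np

def deming_regression(x: np.ndarray, y: np.ndarray, error_ratio: float = 1.0):
    x_mean, y_mean = x.mean(), y.mean()
    s_xx = ((x - x_mean) ** 2).sum() / (len(x) - 1)
    s_yy = ((y - y_mean) ** 2).sum() / (len(y) - 1)
    s_xy = ((x - x_mean) * (y - y_mean)).sum() / (len(x) - 1)
    d = error_ratio
    slope = (s_yy - d * s_xx + np.sqrt((s_yy - d * s_xx) ** 2 + 4 * d * s_xy ** 2)) / (2 * s_xy)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical method-comparison data: reference method (x) versus ML test method (y)
x = np.array([5.0, 10.0, 20.0, 40.0, 60.0, 80.0])
y = np.array([5.4, 9.7, 21.0, 41.5, 58.9, 82.1])
slope, intercept = deming_regression(x, y)
print(f"proportional bias (slope) = {slope:.3f}, constant bias (intercept) = {intercept:.3f}")
```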

Figure 4.

Selected regression techniques based on data representation. A, Ordinary least squares regression assumes that there is only error or imprecision in the y variable (test method) and no error in the x variable (reference method). B, Constant standard deviation Deming regression assumes error or imprecision in both the y variable (test method) and the x variable (reference method). C, When proportionate bias is present, the assumptions for either ordinary least squares or constant standard deviation Deming regression are not met and instead, a constant coefficient of variation Deming regression should be performed.


Precision: Repeatability and Reproducibility

Precision refers to how closely test results obtained by replicate measurements under the same or similar conditions agree.142  Precision studies for ML models can be conducted similarly to test methodology evaluations in clinical laboratories; approaches for precision evaluation have been extensively described by the Clinical and Laboratory Standards Institute (CLSI). Repeatability reflects the variability among replicate measurements of a sample under experimental conditions held as constant as possible, that is, using the same operator(s), same instrument, same operating conditions, and same location over a short period of time (eg, typically considered within a day or a single run). Within-laboratory precision (an “intermediate” precision type) incorporates run-to-run and day-to-day sources of variation using a single instrument; these precision types may also include operator-to-operator variability and calibration cycle-to-cycle variability for procedures that require frequent calibration. As such, repeatability and within-laboratory precision are evaluated via a single-site study. Sometimes confused with “within-laboratory precision,” reproducibility (or between-laboratory reproducibility) refers to the precision between the results obtained at different laboratories, such as multiple laboratories operating under the same CLIA license. Reproducibility, while not always needed for single-lab precision evaluation, may be beneficial when a model is being deployed at more than one site.

Evaluation of repeatability and within-laboratory precision and reproducibility can be conducted as separate experiments (“simple precision” design). For the simple repeatability study, samples are tested in multiple replicates within a single day by 1 operator and 1 instrument under controlled conditions. For simple within-laboratory precision studies, the same samples are run once a day for several days within the same site, varying the instrument and/or operator as appropriate and per standard operating procedure. In a complex precision design, a single experiment is performed with replicate measurements for several days and across several instruments or operators, where applicable. Detailed statistical techniques for analyzing complex precision studies are described elsewhere (for quantitative studies using nested repeated-measures ANOVA [regression outputs], see EP05–Evaluation of Precision of Quantitative Measurement Procedures; for qualitative studies [classification outputs], see ISO 16140).143,144  These techniques provide estimates for repeatability and between-run and between-day precision, as well as within-laboratory precision or reproducibility depending on the study design. They allow a laboratory to rigorously establish the precision profile of a modified FDA-approved or laboratory-developed model or verify the precision claims of a manufacturer for an unmodified FDA-approved model.

For an unmodified FDA-authorized model, the objective is to verify the manufacturer’s precision claim, such as the proportion of correct replicates and/or the proportion of samples that have 100% concordance across replicates or days in the manufacturer’s precision experiment. For example, in a repeatability experiment, the manufacturer processes 1 whole slide image each from 35 tumor biopsies and 36 benign biopsies in 3 replicates using 1 scanner or operator in the same analytical run. For biopsies containing tumor, 99.0% (104 of 105) (95% CI: 94.8%–99.8%) of all scans and 97.1% (34 of 35) of all slides produced correct results, while for benign biopsies, 94.4% (102 of 108) (95% CI: 88.4%–97.4%) of all scans and 88.9% (32 of 36) of all slides produced correct results. An acceptable number of replicates for the repeatability study can be derived from Equation 1, but 60 replicates per model class should be sufficient in most cases allowing for a 10% tolerance below the manufacturer’s stated precision claim. Therefore, a user can verify the manufacturer’s stated repeatability claim by analyzing a set of slides from tumor biopsies and benign biopsies (n = 10 each) obtained at their local institution in replicates (n = 6) using 1 scanner or operator in the same analytical run.

If the observed precision profile meets or exceeds the manufacturer’s claims, the precision study can be accepted. For precision studies where most of the ML model outputs pass, but some fail, the medical director should review the study design and results to evaluate potential confounding causes of the imprecision. Failure to meet the stated claims does not mean that the observed precision is necessarily worse than the manufacturer’s claim, but it should require a 1-sample or 2-sample test of proportions to compare the observed precision profile to that described by the manufacturer. Alternatively, the laboratory may choose to qualitatively evaluate the precision profile of the model. The precision study may still be accepted if the laboratory director deems it appropriate; written justification and rationale for accepting the precision performance will be needed. Precision study results where all samples show 100% concordance across replicates, or where no more than 1 sample shows discordance, should be acceptable for most use cases. When discordance across replicates, runs, days, instruments, or operators is observed, the medical director should review the study design and results to evaluate confounding variables as a potential cause of the imprecision and may decide to repeat samples or runs, or the entire precision study if appropriate. In this scenario, close ongoing monitoring of the ML model is advised. For precision studies where many results failed or showed a high level of imprecision, the deployment site may reject the ML model for use as a clinical test, troubleshoot laboratory influences affecting the output results, or design a new study to better evaluate precision of the model.144,145 
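As one option for the test of proportions mentioned above, a minimal sketch (assuming Python with SciPy’s exact binomial test; the counts and claim are hypothetical) compares an observed repeatability proportion to a manufacturer’s claim:

```python
# Minimal sketch (illustrative): comparing an observed repeatability proportion to a
# manufacturer's claim with a 1-sample (exact binomial) test.
from scipy.stats import binomtest

observed_correct = 56       # hypothetical: correct replicate results in the local study
total_replicates = 60
claimed_proportion = 0.99   # manufacturer's stated repeatability claim

# One-sided test: is the observed proportion significantly lower than the claim?
result = binomtest(observed_correct, total_replicates, claimed_proportion, alternative="less")
print(f"observed = {observed_correct / total_replicates:.3f}, p-value = {result.pvalue:.4f}")
# A small p-value suggests the local precision falls short of the claim and warrants
# medical director review before the study is accepted.
```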

Change control for ML-based testing in the laboratory is defined as the process by which changes to laboratory processes and their requirements are managed. Change management is closely related and is defined as the process for adapting to changes when implementing them. The goal of change management is to control the life cycle of any modifications. It is the fourth pillar of process management as defined by the CLSI, which describes the pillars as (1) analyzing and documenting laboratory processes, (2) verifying or validating a laboratory process, (3) implementing controls for monitoring process performance, and (4) change control and change management to approved laboratory processes.146  Change control encompasses alterations to preanalytical, analytical, and postanalytical variables. In the context of ML models in pathology, most alterations can be divided into changes in physical components (eg, reagents such as stains, specimen collection and transportation parameters, fixation times, specimen types, or instruments such as whole slide scanners, nucleic acid sequencers, chemistry and coagulation analyzers) and digital dependencies (eg, data inputs, algorithm code, file formats, resolution, software interoperability). Laboratory management should have a structured process for assessing changes in laboratory processes that involve or impact ML models deployed in the laboratory. Change management can become necessary for a variety of reasons, including identification of a problem in the quality control process, an external or internal need to update instruments, equipment, or reagents, or a need to respond to the changing needs of the end user (Table 5).

Table 5.

Conditions That Trigger Change Management Processes


The purpose of creating a change management system is to ensure that changes to the system do not adversely affect the safety and quality of laboratory testing. For laboratory tests that employ ML systems, the ML model performance has a critical impact on the overall test performance and patient safety. Any changes to a previously verified and validated model should require performance characterization of the updated model. In addition to evaluating how changes may affect clinical functionality or performance, model changes can also be interpreted through the lens of risk assessment, with proposed changes stratified by risk. Modifications should be evaluated with particular attention to the potential for changes that may introduce a new risk or modify an existing risk that could result in significant harm, with controls in place to prevent harm. In addition, changes to the model or processes that the model is dependent on need to be transparent and communicated to all personnel and stakeholders. A successful change management strategy should reduce the failed changes and disruptions to service, meet regulatory requirements through auditable evidence of change management, and identify opportunities for improvement in the change process. Documentation for change control depends on the type of change, the scope and anticipated impact of the change, and the risk associated with the change. The documentation and verification should be completed before the change is clinically implemented (examples of documentation to support the change management process are listed in Table 6). CLIA regulations for proficiency testing require periodic reevaluation (eg, twice annually), even if there are no changes, to verify tests or assess accuracy.

Table 6.

Examples of Documentation That Should Be Prepared Prior to Implementation of a Change


One of the main challenges with ML-based systems is that they can have many dependencies that may not be readily apparent to everyone in the organization. It is recommended to identify all possible dependencies in the initial implementation and to update this documentation when process changes are made. Having the process team participate in the preparation and proposed modifications of a flow diagram can help ensure inclusion of the steps in the upstream and downstream workflows, especially when data inputs or processes are supplied by different personnel or different divisions in a laboratory. Implementation of ML-based CDS systems with dependencies across the laboratory will mean that the wider laboratory (eg, the histology laboratory, molecular pathology laboratory, chemistry laboratory, etc) should consider whether a new or revised process will affect any other processes, including any deployed ML-based CDS systems. If so, a timeline needs to be established for the wider systemic review and revision of the associated process flows. Team members from each of the processes impacted by a new or changed process should be included and should verify the updated flow chart. Once all documents associated with the revised process are prepared, validation and/or verification is needed to determine the performance characteristics of the revised test. With any revision of the model, the model version should be documented for auditing and tracking of performance. Verification is also performed when a laboratory intends to use a validated process in a similar manner for a different application. For example, consider a scenario where a laboratory has a validated ML-based CDS system on a specific whole slide scanning device for primary diagnosis. The laboratory now intends to install a new whole slide scanner from a different vendor and wants to use the same ML model with the new scanners. In this scenario, the model’s performance should be verified on images produced by the new scanner before clinical use. More generally, reassessing the performance characteristics of the ML-based system is recommended after any model parameters or dependencies have changed, so that performance is assessed in the configuration in which the final model will be clinically deployed. Recent draft guidance from the FDA entitled “Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence/Machine Learning (AI/ML)-Enabled Device Software Functions” speaks to these concepts.147  Ideally, predetermined change control plans should delineate manufacturer versus local medical director responsibilities relative to validation and verification. For the medical director, the predetermined change control plan should include both the verification process and a set of target verification metrics that have a clear relationship to clinical performance and are tailored to what is practical for local sites, including data volume and number of cases.

Deployment Site Ongoing Monitoring

After a model has been duly verified and/or validated and appropriate instruction for use (eg, standard operating procedure) is in place, it is the responsibility of the laboratory to ensure the model performance is monitored for stability and reliability over time. Changes in the input data may result in an alteration of the model performance. The time it takes for the model to deteriorate in performance is determined by factors pertaining to the upstream and downstream processes or data that correspond to the model. Model performance may shift (eg, sudden loss of performance), drift (eg, gradual loss of performance), or show cyclical or recurring change depending on these influences. Shift may be obvious to the laboratory and can be exemplified by an update in the laboratory information system where the units of a specific input value have changed, and the ML model using that numerical data for predictions now behaves significantly differently and shows inferior performance. In this scenario, the results of any test including an ML system with shift should not be reported until the updated system undergoes verification as part of change management, with updating of laboratory documentation as necessary. Certain changes causing shift or drift may require retraining or calibration of a new ML model and subsequent verification and/or validation of the updated model. Drift can be more challenging to identify, especially for very small incremental changes over time; it usually refers to changes in the data distribution that the model was originally trained with and validated for. Drift may occur due to changes in patient demographics, updated practice patterns, or new workflows.105,112,148–150  The loss in performance may show a loss of discrimination or calibration; however, depending on the change of the data, there may be loss of calibration without a significant change in discrimination, or vice versa.151–153  Changes in healthcare can introduce altered patient populations, specimen collection practices, data inputs, and clinical workflows.154–160  Vigilance in laboratory testing and performance monitoring is crucial to patient safety, medical decision making, and good laboratory practice. When the performance of a verified and/or validated model has deteriorated (eg, shift, drift), clinical testing using the ML system should be halted until the performance characteristics have been remediated. Performance monitoring techniques such as calibration drift detection have been suggested and will likely play an important role for ongoing monitoring of clinically deployed ML models.161  Additional ongoing monitoring of model stability includes recognition of adversarial attacks on ML models, predominantly for image classification.162  These attacks apply adversarial perturbations (eg, skew vectors) to the medical image that are indiscernible to the human eye but may cause drastic changes in model performance.163  Appropriately diverse training data, decreasing the disparity between the development and target data, and appropriate choices of model architecture have been shown to minimize the effect of adversarial attacks on ML model performance.164,165
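As one example of an ongoing monitoring check, a minimal sketch (plain Python with NumPy; the data and thresholds are hypothetical, and the population stability index is only one of several possible drift measures, not one prescribed here) compares the distribution of a model input feature at verification time with recent production samples:

```python
# Minimal sketch (illustrative): a simple population stability index (PSI) check on one
# model input feature to flag potential data drift between the verification baseline
# and recent production samples.
import numpy as np

def population_stability_index(baseline: np.ndarray, recent: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0] = min(edges[0], recent.min()) - 1e-9     # widen end bins to capture out-of-range values
    edges[-1] = max(edges[-1], recent.max()) + 1e-9
    base_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    recent_frac = np.histogram(recent, bins=edges)[0] / len(recent)
    base_frac = np.clip(base_frac, 1e-6, None)        # avoid division by zero / log of zero
    recent_frac = np.clip(recent_frac, 1e-6, None)
    return float(np.sum((recent_frac - base_frac) * np.log(recent_frac / base_frac)))

rng = np.random.default_rng(1)
baseline = rng.normal(100, 10, 2000)     # feature distribution at verification time
recent = rng.normal(108, 12, 500)        # shifted distribution observed in production
print(f"PSI = {population_stability_index(baseline, recent):.3f}")
# A common rule of thumb flags PSI > 0.25 as meaningful drift warranting review.
```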

Often the primary method for evaluating the performance of ML-based software is a standalone or in silico evaluation of the ML-based CDS system’s output compared to a reference standard (eg, ground truth) on a series of samples that are not part of the system’s training or tuning set (ie, on which the model was not trained). As described above, these evaluations can provide useful overall statistical metrics (ie, sensitivity, specificity, positive predictive value, correlation, and mean square error analysis). Clinical validation performance evaluations, in contrast, inform the user about the augmented workflow in which the pathologist uses the ML model during routine clinical practice. While analytical validity is critical to establishing benchmark model performance, as augmented intelligence, a human interpreter (eg, pathologist) is still required to finalize and authenticate a patient’s report. Standalone (eg, in silico) evaluations fall short of capturing how the human ingests, interprets, and incorporates the results of the ML model when evaluating a sample. Just as there are disruptive clinical distractions, the ML-based CDS system may also introduce disruptive, time-consuming visualizations, workflows, or false results that add to the mental demand such systems place on users in the medical domain. The effect of this on patient care has been shown to be significant in several studies.166,167  Moreover, user certainty (eg, digital confidence), uncertainty (eg, digital hesitancy), and comfort with the interface may likewise have a profound effect on issuing a timely and accurate result.168–170  Implicit in this evaluation is the impact of the stress-related tasks related to the human-computer interaction.171  Presentation interaction complexity, display features, explainability of the model outputs, and information density may influence user-experienced stress and, hence, also reliability and adoption.

Studies have examined the human-computer interaction, capturing various metrics, such as improvement in accuracy, patient outcomes, and productivity. In surgical pathology, this has been most studied for ML-based CDS systems related to prostate cancer diagnosis (detection, grading, or assessment of perineural invasion). The use of these systems has changed various aspects of diagnostic interpretation, although few studies assess CDS utility in a setting that mimics diagnostic practice.52,118,164  Generally, improvements are seen with the use of these highly accurate CDS systems. Optimization of the human-computer interaction is essential to ensuring that the user and the ML model show optimal performance when used together and exceed the performance of either the ML model or the user alone. Optimal performance of the human-AI interface can be evaluated in terms of accuracy and productivity. Central to this optimization are themes including trust, understanding the model, explainability, and usability (eg, user interface, visualizations).

Trust

Rigorous evaluation of ML-based systems is needed to build trust or provide evidence not to trust the evaluated system. Regardless, ML systems can be met with skepticism from users.172  Distrust may stem from a lack of understanding, feeling threatened by the technology, or perhaps feeling that automation may negatively impact quality of care. In contrast, excess trust in CDS systems can lead to overreliance on the technology and user complacency. User complacency might result in a convergence of the human’s performance characteristics with those of the ML system. This consequence is arguably desirable if an ML model outperforms the human for a given task; however, it is rightfully worrisome until proven otherwise. Instead, an approach of vigilance is recommended to support assistive workflows. The user (eg, pathologist) can correct or adjust the model output if incorrect or accept the output if deemed correct. For a model with high performance, this supervision may result in the most optimal result for the patient in that the combined benefits of human and machine will be incorporated into the final diagnostic assessment. Trust and, ultimately, confidence will result from a high-performance system (eg, minimal errors). The key to achieving this level of human vigilance relies on the user understanding the strengths and weaknesses of the system.

Understanding the Model

Both diagnosticians and the CDS systems designed to aid them are imperfect. Thoughtfully designed standalone studies of ML models, with involvement of the diagnosticians intended to use them, can provide invaluable information as to how human weaknesses or strengths can be complemented by the tool. For example, an ML model designed to detect invasive prostatic acinar carcinoma might show excellent standalone overall performance, but when tested against mimics of carcinoma, mimics of benign prostatic tissue, or other diagnostically challenging scenarios, the standalone performance might deteriorate. These situations may be less frequent; however, they represent the true diversity of conditions in which a human pathologist could mitigate potential error introduced by the ML model. Outside of biologically relevant situations that pose a challenge for a pathologist, the tool should be evaluated in scenarios that expose possible bias or errors, such as across a variety of ages, races, and ethnicities, and on a sample set with a range of quality.173,174  Additionally, the model developer may have selected an operating point to support a screening task workflow, and thus the output predictions may have a perceived high false positive rate. Knowing the model’s operating point and intended use can allow the user to further understand the strengths and weaknesses of the ML system.

Explainability

The complexity of an ML model can impact the ability of humans to understand how it works and hinder the ability to predict the conditions in which the model may fail (eg, vulnerabilities to preanalytical variables or artifacts). The complexity of the model should be considered in terms of the number of parameters of the model and the extent to which nonlinear relationships are being modeled. For models that have highly nonlinear or multivariate properties, it can be challenging to relate the model’s behavior to the corresponding attributes that humans have studied in the medical domain. Explainability for ML systems is critical to the safety, approval, and acceptance of CDS systems.

Many ML models learn low-level features, and interrogating those models with higher-level interpretability methods can provide insights that humans can understand regarding how the model output was produced. Model transparency does not necessarily mean knowing the data or what features were used to train the model, but rather visibility into the model itself and the relationships between the model and its outputs. ML systems may be considered a “black box” where the model’s internal workings are not directly interpretable and cannot be communicated to the pathologist in an understandable way. For example, a model that uses a deep convolutional neural network to estimate the percentage of nuclear immunohistochemical DAB stain positivity may be too complex for the pathologist to thoroughly understand the mechanism by which it predicts an image to be “positive,” but such a model may also perform better than an image analysis-based tool that explicitly segments cells and measures pixel intensity on a cell-by-cell basis. Failures of the image analysis software due to poor staining or crush artifacts can be more easily understood and remediated, whereas failures of the deep network may have a less obvious solution. Approaches to explainability for ML models include (1) returning exemplars to the pathologist to evaluate visually (eg, saliency maps) or (2) quantifying image attributes (eg, number of cells in an image patch) and mathematically relating such measurements to the output of the model. For fair interpretability, Shapley values, adapted from coalitional game theory, provide insight into the contribution of each feature to the difference between the actual prediction and the average prediction, given all features learned in the model. Newer tools to estimate Shapley values, such as Shapley additive explanation (SHAP) values, are model agnostic (eg, classification and regression), openly supported, and can easily interrogate an existing pretrained model. SHAP can provide local accuracy (eg, differentiating between the expected ML model output and the output of a given instance), “missingness” (eg, missing features can be attributed), and consistency (eg, if an ML model changes and the weight of a given feature is modified, the change should be directly proportional to the corresponding Shapley value). Shapley values can provide meaningful information showing the effects of the features on the ML model output for the prediction of that instance175,176  (Figure 5, A and B). Other approaches for model explainability, such as individual conditional expectation, local interpretable model-agnostic explanations, testing with concept activation vectors, or counterfactual explanations, offer varying fidelity based on the use case.177–179
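A minimal sketch of SHAP-based explanation (assuming Python with scikit-learn and the open-source shap package, whose return shapes can vary by version; the model, data, and feature names are hypothetical placeholders):

```python
# Minimal sketch (illustrative): estimating SHAP values for a trained tree-based
# regression model on tabular laboratory data and summarizing feature effects.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["analyte_a", "analyte_b", "analyte_c", "analyte_d"]  # hypothetical inputs
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 200)   # synthetic quantitative target

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)       # model-specific explainer for tree ensembles
shap_values = explainer.shap_values(X)      # per-sample, per-feature attributions

# Summary (beeswarm) plot relating each feature's value to its impact on the model output
shap.summary_plot(shap_values, X, feature_names=feature_names)
```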

Figure 5.

Representative graphical plot to interrogate model performance. A, Violin plots show effects representations of numerical data. They show the probabilistic density of the model’s inputs across the full distribution of the data. The features are shown on the x-axis; the y-axis demonstrates relationships to the model output. B, For categorical data, SHAP summary plot can be used to interrogate representations of model performance. SHAP summary plots characterize the inputs on the y-axis, and whether the effect of that value caused a higher or lower prediction on the x-axis. The red to blue color spectrum shows whether the feature was high or low for that specific data point. This summarization plot illustrates the relationship between the value of a feature and the impact that value holds for the machine learning model’s output. Abbreviation: SHAP, Shapley additive explanation.


There is also controversy surrounding the definitive need to make ML models humanly interpretable, as it is conceivable that some image features utilized by the model are intrinsically not interpretable by humans, and if they were, there would not be a need to use complex models in the first place. These analyses can be useful companions to ML performance metrics as they can help a pathologist surmise how the model works. However, these methods are not without limitations. Providing examples to the pathologist relies on a qualitative human assessment that makes assumptions from observed correlations but does not establish causality. Further work in explainable AI is needed to help bridge the gap between human interpretation and model behavior.43,180,181  In addition, user interfaces and visualization methods designed to present the results of such analyses to pathologists in an intuitive fashion must continue to be developed.

User Interface and Model Output Visualization

Most end users will not directly interact with the ML model in isolation, but rather with a graphical user interface or output visualization. An intuitive user interface is integral to evaluating an ML model output effectively and efficiently. Interface design is critical to minimize improper use of the CDS system.182  Careful consideration of user-required inputs and methods ensures both ergonomic and psychometric appropriateness. User interface and data presentation are also crucial components to efficient evaluation of ML model outputs. For instance, for detection of a low-grade squamous intraepithelial lesion in a liquid-based cytology specimen, it may be more helpful to have a gallery of suspicious cells for ease of review than a binary slide-level label that does not provide the user with pixel-wise awareness of where the cells of interest are within the image. Similarly, tumor purity quantification of a molecular extraction specimen could provide numerical counts for each cell class, or highlight (eg, segment) the cells of each class for the user to view and assess. There are various visualizations to direct the user’s attention to specific representations of the model outputs (eg, arrows, crosshairs, heat maps). Additional studies reviewing user interface/user experience in using such tools are important to distinguish user perception from design principles. For instance, a large blinking red arrow around a region of interest may influence users to overcall a diagnosis compared to a static neutral-colored outline. The visualization type, if applicable, should be documented in the operating procedures. Not all model outputs will rely on specified visualization tools; clinical verification and/or validation should document whether human-AI interaction will be required to interpret the results. Ideally, the ML developer will have studied the tool in a setting that examines the human-AI interaction and, through user-centered design principles, will provide the most optimal user interface and visualization to support the use case of the model (Figure 6, A through F).

Figure 6.

Select visualization techniques to evaluate model performance. A and D, Histograms showing differing distributions of data outputs. Histograms show data spread and skewness, including whether the data conform to A, unimodal; D, bimodal; or multimodal (not shown) distributions. B and E, Segmentation of bacterial colony growth on an agar media plate, with a static image of a nutrient agar plate showing B, microbial colony growth seen by the human eye, and E, automated segmentation of bacterial colonies to highlight and quantify regions on the image predicted to be bacterial growth. C and F, Hematoxylin-eosin stain with pixel attribution visualization (eg, class activation map, heat map, saliency): C, the hematoxylin-eosin region of interest of invasive prostatic acinar carcinoma and F, the corresponding class activation map indicating areas of high (red) to low (blue) probabilistic output of invasive carcinoma. Abbreviation: AKI, acute kidney injury (hematoxylin-eosin, original magnification ×40).


The training plan for an ML-based CDS system should include all professionals and supporting personnel who will interact with the system. This may include administrative, technical, and professional personnel. The scope of training provided should be tailored to the level of responsibility, delegated tasks, and amount of interaction anticipated for a given trainee. In general, closest attention should be given to those responsible for case selection, dataset or input field selection, and decision support to use or reject the model outputs. Clerical or supporting technical personnel should be aware of inputs that may impact model outputs (eg, specimen source driving analysis by a specific ML model), particularly if such data are not automatically transferred from the laboratory information systems.

Verification and/or validation datasets should include sufficient variation to ensure acceptable performance across a range of possible inputs or classes to the model. Additionally, personnel training materials should include case materials that fall close to critical decision thresholds and also include cases to illustrate the model failure modes (eg, image digitization artifacts, insufficient data, improper input data, case selection, use of model output, reporting). For those responsible for reporting results, clinical competency to use the tool should be demonstrated by successful completion of training cases and satisfactory concordance on a representative spectrum of competency cases. Competency assessment should replicate the procedures planned for patient cases and include cases that should be rejected for evaluation.183

Training and competency materials can be prepared and provided by the manufacturer or derived from laboratory-validated cases from the user institution files. If manufacturer-provided training materials are used, the deployment laboratory should ensure that their in-house procedures and outputs are comparable to the materials provided by the manufacturer by tangible quality metrics.184,185  Good laboratory practices dictate that a tool should not be placed into use prior to completion of a written, lab director–approved procedure. Additionally, the training exercises and documentation should clearly delineate how to revert to downtime procedures in the event of model drift, shift, or other error. If any part of the model is consistently producing errors, the part of the testing process where the ML model is being used should cease immediately. The performance evaluation should include scenarios present within the intended clinical environment. This should also confirm that safety elements work properly. Confirmation of acceptable failure behavior in the clinical environment should be established, with plans for failsafe procedures in the event the model cannot be used. Definitive understanding of the failure modes including inappropriate use of inputs or errors in the model outputs should be remediated prior to clinical use. Training and competency assessment records should be retained in a retrievable manner for the full period of employment, or use of the tool. Standard operating procedures should reflect retraining triggers clearly; retraining and education should be performed whenever procedures change significantly, such as an expansion of case selection criteria. Documentation of training and competency assessment of appropriate users should be retained similarly to other laboratory procedures as per good laboratory practice.78,186 

ML models have enabled newfound functionality and workflows in pathology. Significant numbers of ML models are commercially available, and organizations with computational pathology resources may also develop ML models that can be introduced into clinical practice. ML-based models in pathology include imaging-based and non–image-based methodologies. These proposed recommendations are based on the available evidence and literature, and the authors hope to encourage additional peer-reviewed literature to support adoption of these novel technologies in clinical practice (Table 7). The combination of pathologists (or other qualified professionals) and ML-based systems creates a paradigm of augmented intelligence, in which the pathologist is assisted by ML throughout patient testing and reporting to enhance cognitive performance and clinical decision making. High-performance ML models will facilitate specific tasks for pathologists in an assistive fashion; the role of the ML model is limited to providing competent support to the healthcare provider.

Table 7.

Summary of Recommendations for Performance Evaluation of Machine Learning–Based Clinical Decision Support Systems in Pathology


Performance evaluation of ML models is critical to verification and/or validation of ML systems intended for clinical reporting of patient samples. Evaluation metrics appropriate to the type of ML model should be used to establish its performance characteristics prior to clinical testing. The importance of incorporating laboratory director oversight and using local data for verification and/or validation cannot be overstated. Measuring the similarity between the development dataset and the verification and/or validation dataset can help assure that performance characteristics will generalize as indicated. Changes to the model (eg, retraining with additional data, changes to the intended use or indications for use, significant changes to the user interface) should require re-verification of the revised model and documentation of the model version. Additionally, any change to the workflow used to generate the input data needs to be considered, and the preanalytic variables affecting biospecimen quality should be understood because they may impact model performance characteristics. Furthermore, monitoring for performance defects over time (eg, shift, drift) should be conducted to detect relevant data effects on the ML models. If a model's performance is found to have deteriorated, the portion of the clinical test that incorporates the ML model should immediately be taken out of use and remediated appropriately.
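
One way (among several) to quantify similarity between the development dataset and a local verification and/or validation dataset, and to monitor for shift or drift over time, is the population stability index computed over a model input feature or output score. The Python sketch below is illustrative only; the binning scheme and the commonly cited 0.1/0.25 interpretation cutoffs are rules of thumb, not requirements of these recommendations.

```python
# Illustrative sketch: population stability index (PSI) between development-set
# and local verification-set score distributions. Data, bin count, and cutoffs
# are assumptions for demonstration.
import numpy as np

def population_stability_index(expected: np.ndarray, observed: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between two score distributions using bins set on the expected data."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    o_counts, _ = np.histogram(observed, bins=edges)   # values outside the expected range are ignored
    e_frac = np.clip(e_counts / e_counts.sum(), 1e-6, None)   # avoid log(0)
    o_frac = np.clip(o_counts / o_counts.sum(), 1e-6, None)
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

rng = np.random.default_rng(0)
dev_scores = rng.beta(2, 5, size=5000)        # development-set model scores (simulated)
local_scores = rng.beta(2.5, 5, size=500)     # local verification-set scores (simulated)
psi = population_stability_index(dev_scores, local_scores)
print(f"PSI={psi:.3f}")   # <0.1 often read as stable, >0.25 as a marked shift (rule of thumb)
```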

Pathology laboratories should be responsible for determining the clinical utility of the ML model prior to verification and/or validation and for monitoring its performance over time. Proper documentation of training and competency should be completed prior to clinical use of the ML-based system. As pathologists verify and validate ML models, they must learn each model's scope of application and its strengths and limitations, in preparation for a future that will incorporate powerful new tools and new management responsibilities.

The authors thank James Harrison, MD, for guidance on this concept paper and valuable commentary, Mary Kennedy and Kevin Schap for administrative support, and the College of American Pathologists committees and members who otherwise contributed to these recommendations.

1. Wians FH, Gill GW. Clinical and anatomic pathology test volume by specialty and subspecialty among high-complexity CLIA-certified laboratories in 2011. Lab Med. 2013;44(2):163–167.
2. US Food and Drug Administration. FDA authorizes software that can help identify prostate cancer. https://www.fda.gov/news-events/press-announcements/fda-authorizes-software-can-help-identify-prostate-cancer. Accessed November 9, 2021.
3. US Food and Drug Administration. 510(k) premarket notification. X100 with full field peripheral blood smear (PBS) application. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K201301. Accessed November 9, 2021.
4. US Food and Drug Administration. 510(k) premarket notification. CellaVision. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfPMN/pmn.cfm?start_search=1&productcode=JOY&knumber=&applicant=CELLAVISION%20AB. Accessed November 9, 2021.
5. US Food and Drug Administration. 510(k) premarket notification. APAS independence with urine analysis module. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K183648. Accessed December 30, 2021.
6. US Food and Drug Administration. Premarket approval (PMA). ThinPrep integrated imager. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpma/pma.cfm?id=P950039S036. Accessed November 9, 2021.
7. Chabrun F, Dieu X, Ferre M, et al. Achieving expert-level interpretation of serum protein electrophoresis through deep learning driven by human reasoning. Clin Chem. 2021;67(10):1406–1414.
8. Punchoo R, Bhoora S, Pillay N. Applications of machine learning in the chemical pathology laboratory. J Clin Pathol. 2021;74(7):435–442.
9. Baron JM, Mermel CH, Lewandrowski KB, Dighe AS. Detection of preanalytic laboratory testing errors using a statistically guided protocol. Am J Clin Pathol. 2012;138(3):406–413.
10. Rosenbaum MW, Baron JM. Using machine learning-based multianalyte delta checks to detect wrong blood in tube errors. Am J Clin Pathol. 2018;150(6):555–566.
11. Farrell CJL, Giannoutsos J. Machine learning models outperform manual result review for the identification of wrong blood in tube errors in complete blood count results. Int J Lab Hematol. 2022;44(3):497–503.
12. Luo Y, Szolovits P, Dighe AS, Baron JM. Using machine learning to predict laboratory test results. Am J Clin Pathol. 2016;145(6):778–788.
13. Lidbury BA, Richardson AM, Badrick T. Assessment of machine-learning techniques on large pathology data sets to address assay redundancy in routine liver function test profiles. Diagn Berl Ger. 2015;2(1):41–51.
14. Poole S, Schroeder LF, Shah N. An unsupervised learning method to identify reference intervals from a clinical database. J Biomed Inform. 2016;59:276–284.
15. Wilkes EH, Emmett E, Beltran L, Woodward GM, Carling RS. A machine learning approach for the automated interpretation of plasma amino acid profiles. Clin Chem. 2020;66(9):1210–1218.
16. Lee ES, Durant TJS. Supervised machine learning in the mass spectrometry laboratory: a tutorial. J Mass Spectrom Adv Clin Lab. 2021;23:1–6.
17. Yu M, Bazydlo LAL, Bruns DE, Harrison JH. Streamlining quality review of mass spectrometry data in the clinical laboratory by use of machine learning. Arch Pathol Lab Med. 2019;143(8):990–998.
18. Demirci F, Akan P, Kume T, Sisman AR, Erbayraktar Z, Sevinc S. Artificial neural network approach in laboratory test reporting: learning algorithms. Am J Clin Pathol. 2016;146(2):227–237.
19. Lipkova J, Chen RJ, Chen B, et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell. 2022;40(10):1095–1110.
20. Wong A, Otles E, Donnelly JP, et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern Med. 2021;181(8):1065–1070.
21. Rank N, Pfahringer B, Kempfert J, et al. Deep-learning-based real-time prediction of acute kidney injury outperforms human predictive performance. Npj Digit Med. 2020;3(1):1–12.
22. Abd-Elrazek MA, Eltahawi AA, Abd Elaziz MH, Abd-Elwhab MN. Predicting length of stay in hospitals intensive care unit using general admission features. Ain Shams Eng J. 2021;12(4):3691–3702.
23. Ashmore R, Calinescu R, Paterson C. Assuring the machine learning lifecycle: desiderata, methods, and challenges. ACM Comput Surv. 2021;54(5):1–39.
24. Schaffter T, Buist DSM, Lee CI, et al. Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Netw Open. 2020;3(3):e200265.
25. American Medical Association. Augmented intelligence in health care. https://www.ama-assn.org/system/files/2019-01/augmented-intelligence-policy-report.pdf. Accessed November 9, 2021.
26. H-480.940. Augmented intelligence in health care. American Medical Association Web site. https://policysearch.ama-assn.org/policyfinder/detail/augmented%20intelligence?uri=%2FAMADoc%2FHOD.xml-H-480.940.xml. Accessed November 9, 2021.
27. da Silva LM, Pereira EM, Salles PG, et al. Independent real-world application of a clinical-grade automated prostate cancer detection system. J Pathol. 2021;254(2):147–158.
28. Capper D, Jones DTW, Sill M, et al. DNA methylation-based classification of central nervous system tumours. Nature. 2018;555(7697):469–474.
29. Aikins JS. Prototypes and production rules: an approach to knowledge representation for hypothesis formation. In: International Joint Conference on Artificial Intelligence; 1979. https://openreview.net/forum?id=rk44fBMuWr. Accessed April 18, 2022.
30. Aikins JS, Kunz JC, Shortliffe EH, Fallat RJ. PUFF: an expert system for interpretation of pulmonary function data. Comput Biomed Res Int J. 1983;16(3):199–208.
31. Aikins JS. Prototypical knowledge for expert systems: a retrospective analysis. Artif Intell. 1993;59(1):207–211.
32. Perry CA. Knowledge bases in medicine: a review. Bull Med Libr Assoc. 1990;78(3):271–282.
33. Evans AJ, Brown RW, Bui MM, et al. Validating whole slide imaging systems for diagnostic purposes in pathology: guideline update from the College of American Pathologists in collaboration with the American Society for Clinical Pathology and the Association for Pathology Informatics. Arch Pathol Lab Med. 2022;146(4):440–450.
34. Bui MM, Riben MW, Allison KH, et al. Quantitative image analysis of human epidermal growth factor receptor 2 immunohistochemistry for breast cancer: guideline from the College of American Pathologists. Arch Pathol Lab Med. 2019;143(10):1180–1195.
35. Aziz N, Zhao Q, Bry L, et al. College of American Pathologists' laboratory standards for next-generation sequencing clinical tests. Arch Pathol Lab Med. 2015;139(4):481–493.
36. Pressman NJ. Markovian analysis of cervical cell images. J Histochem Cytochem. 1976;24(1):138–144.
37. Levine GM, Brousseau P, O'Shaughnessy DJ, Losos GJ. Quantitative immunocytochemistry by digital image analysis: application to toxicologic pathology. Toxicol Pathol. 1987;15(3):303–307.
38. Cornish TC. Clinical application of image analysis in pathology. Adv Anat Pathol. 2020;27(4):227–235.
39. Gil J, Wu HS. Applications of image analysis to anatomic pathology: realities and promises. Cancer Invest. 2003;21(6):950–959.
40. Webster JD, Dunstan RW. Whole-slide imaging and automated image analysis: considerations and opportunities in the practice of pathology. Vet Pathol. 2014;51(1):211–223.
41. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–444.
42. Explainable AI: the basics. Policy Briefing. The Royal Society Web site. https://royalsociety.org/-/media/policy/projects/explainable-ai/AI-and-interpretability-policy-briefing.pdf. Accessed April 18, 2022.
43. Tosun AB, Pullara F, Becich MJ, Taylor DL, Fine JL, Chennubhotla SC. Explainable AI (xAI) for anatomic pathology. Adv Anat Pathol. 2020;27(4):241–250.
44. Chen PHC, Liu Y, Peng L. How to develop machine learning models for healthcare. Nat Mater. 2019;18(5):410–414.
45. Harrison JH, Gilbertson JR, Hanna MG, et al. Introduction to artificial intelligence and machine learning for pathology. Arch Pathol Lab Med. 2021;145(10):1228–1254.
46. Clinical Laboratory Improvement Amendments of 1988 (CLIA), Title 42: The Public Health and Welfare, Subpart 2: Clinical Laboratories (42 U.S.C. 263a). https://www.govinfo.gov/content/pkg/USCODE-2011-title42/pdf/USCODE-2011-title42-chap6A-subchapII-partF-subpart2-sec263a.pdf. Accessed April 18, 2022.
47. Standard: establishment and verification of performance specifications. 42 CFR § 493.1253. LII/Legal Information Institute Web site. https://www.law.cornell.edu/cfr/text/42/493.1253. Accessed November 9, 2022.
48. Pantanowitz L, Hartman D, Qi Y, et al. Accuracy and efficiency of an artificial intelligence tool when counting breast mitoses. Diagn Pathol. 2020;15(1):80.
49. Sandbank J, Nudelman A, Krasnitsky I, et al. Implementation of an AI solution for breast cancer diagnosis and reporting in clinical practice. USCAP 2022 Abstracts: informatics (977–1017). Mod Pathol. 2022;35(Suppl 2):1163–1210.
50. Sandbank J, Sebag G, Arad A, et al. Validation and clinical deployment of an AI-based solution for detection of gastric adenocarcinoma and Helicobacter pylori in gastric biopsies. USCAP 2022 Abstracts: gastrointestinal pathology (372–507). Mod Pathol. 2022;35(Suppl 2):493–639.
51. Ehteshami Bejnordi B, Veta M, Johannes van Diest P, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318(22):2199–2210.
52. Perincheri S, Levi AW, Celli R, et al. An independent assessment of an artificial intelligence system for prostate cancer detection shows strong diagnostic accuracy. Mod Pathol. 2021;34(8):1588–1595.
53. Bulten W, Kartasalo K, Chen PHC, et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat Med. 2022;28(1):154–163.
54. Steiner DF, MacDonald R, Liu Y, et al. Impact of deep learning assistance on the histopathologic review of lymph nodes for metastatic breast cancer. Am J Surg Pathol. 2018;42(12):1636–1646.
55. US Food and Drug Administration. Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device (SAMD). US FDA Artificial Intelligence and Machine Learning Discussion Paper. https://www.fda.gov/media/122535/download. Accessed December 30, 2021.
56. College of American Pathologists. Individualized quality control plan (IQCP) frequently asked questions. https://documents.cap.org/documents/iqcp-faqs.pdf. Accessed April 18, 2022.
57. US Food and Drug Administration. Software as a medical device (SAMD): clinical evaluation—guidance for industry and Food and Drug Administration staff. https://www.fda.gov/media/100714/download. Accessed November 10, 2022.
58. American Society of Mechanical Engineers. Assessing Credibility of Computational Modeling Through Verification and Validation: Application to Medical Devices. New York: American Society of Mechanical Engineers; 2018.
59. Meaning of intended uses. 21 CFR 801.4. https://www.ecfr.gov/current/title-21/chapter-I/subchapter-H/part-801/subpart-A/section-801.4. Accessed November 10, 2022.
60. Wilkinson MD, Dumontier M, Aalbersberg IjJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018.
61. Kush RD, Warzel D, Kush MA, et al. FAIR data sharing: the roles of common data elements and harmonization. J Biomed Inform. 2020;107:103421.
62. Barocas S, Hardt M, Narayanan A. Fairness and machine learning. http://www.fairmlbook.org. Accessed April 19, 2022.
63. Sjoding MW, Dickson RP, Iwashyna TJ, Gay SE, Valley TS. Racial bias in pulse oximetry measurement. N Engl J Med. 2020;383(25):2477–2478.
64. Buolamwini J, Gebru T. Gender shades: intersectional accuracy disparities in commercial gender classification. In: Proceedings of the 1st Conference on Fairness, Accountability and Transparency. Proc Machine Learning Res. 2018;81:77–91. https://proceedings.mlr.press/v81/buolamwini18a.html. Accessed April 19, 2022.
65. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–453.
66. Howard FM, Dolezal J, Kochanny S, et al. The impact of site-specific digital histology signatures on deep learning model accuracy and bias. Nat Commun. 2021;12(1):4423.
67. Leo P, Lee G, Shih NNC, Elliott R, Feldman MD, Madabhushi A. Evaluating stability of histomorphometric features across scanner and staining variations: prostate cancer diagnosis from whole slide images. J Med Imaging. 2016;3(4):047502.
68. Panch T, Mattie H, Atun R. Artificial intelligence and algorithmic bias: implications for health systems. J Glob Health. 2019;9(2):020318.
69. Jobin A, Ienca M, Vayena E. The global landscape of AI ethics guidelines. Nat Machine Intell. 2019;1(9):389–399.
70. Jackson BR, Ye Y, Crawford JM, et al. The ethics of artificial intelligence in pathology and laboratory medicine: principles and practice. Acad Pathol. 2021;8:2374289521990784.
71. Howerton D, Anderson N, Bosse D, Granade S, Westbrook G. Good laboratory practices for waived testing sites: survey findings from testing sites holding a certificate of waiver under the clinical laboratory improvement amendments of 1988 and recommendations for promoting quality testing. MMWR Recomm Rep. 2005;54(RR-13):1–25; quiz CE1–4.
72. Ezzelle J, Rodriguez-Chavez IR, Darden JM, et al. Guidelines on good clinical laboratory practice. J Pharm Biomed Anal. 2008;46(1):18–29.
73. Tworek JA, Henry MR, Blond B, Jones BA. College of American Pathologists Gynecologic Cytopathology Quality Consensus Conference on good laboratory practices in gynecologic cytology: background, rationale, and organization. Arch Pathol Lab Med. 2013;137(2):158–163.
74. Gutman DA, Cobb J, Somanna D, et al. Cancer digital slide archive: an informatics resource to support integrated in silico analysis of TCGA pathology data. J Am Med Inform Assoc. 2013;20(6):1091–1098.
75. Fedorov A, Longabaugh WJR, Pot D, et al. NCI imaging data commons. Cancer Res. 2021;81(16):4188–4193.
76. Choi JH, Hong SE, Woo HG. Pan-cancer analysis of systematic batch effects on somatic sequence variations. BMC Bioinformatics. 2017;18(1):211.
77. Goh WWB, Wang W, Wong L. Why batch effects matter in omics data, and how to avoid them. Trends Biotechnol. 2017;35(6):498–507.
78. Kothari S, Phan JH, Stokes TH, Osunkoya AO, Young AN, Wang MD. Removing batch effects from histopathological images for enhanced cancer diagnosis. IEEE J Biomed Health Inform. 2014;18(3):765–772.
79. Tom JA, Reeder J, Forrest WF, et al. Identifying and mitigating batch effects in whole genome sequencing data. BMC Bioinformatics. 2017;18(1):351.
80. Aeffner F, Wilson K, Martin NT, et al. The gold standard paradox in digital image analysis: manual versus automated scoring as ground truth. Arch Pathol Lab Med. 2017;141(9):1267–1275.
81. Stålhammar G, Fuentes Martinez N, Lippert M, et al. Digital image analysis outperforms manual biomarker assessment in breast cancer. Mod Pathol. 2016;29(4):318–329.
82. Nielsen TO, Leung SCY, Rimm DL, et al. Assessment of Ki67 in breast cancer: updated recommendations from the International Ki67 in Breast Cancer Working Group. J Natl Cancer Inst. 2021;113(7):808–819.
83. Dolan M, Snover D. Comparison of immunohistochemical and fluorescence in situ hybridization assessment of HER-2 status in routine practice. Am J Clin Pathol. 2005;123(5):766–770.
84. Singer M, Deutschman CS, Seymour CW, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA. 2016;315(8):801–810.
85. American College of Chest Physicians/Society of Critical Care Medicine Consensus Conference: definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. Crit Care Med. 1992;20(6):864–874.
86. Goh KH, Wang L, Yeow AYK, et al. Artificial intelligence in sepsis early prediction and diagnosis using unstructured data in healthcare. Nat Commun. 2021;12(1):711.
87. Elmore JG, Longton GM, Carney PA, et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA. 2015;313(11):1122–1132.
88. Viswanathan K, Patel A, Abdelsayed M, et al. Interobserver variability between cytopathologists and cytotechnologists upon application and characterization of the indeterminate category in the Milan system for reporting salivary gland cytopathology. Cancer Cytopathol. 2020;128(11):828–839.
89. Tummers P, Gerestein K, Mens JW, Verstraelen H, van Doorn H. Interobserver variability of the International Federation of Gynecology and Obstetrics staging in cervical cancer. Int J Gynecol Cancer. 2013;23(5):890–894.
90. Thomas S, Hussein Y, Bandyopadhyay S, et al. Interobserver variability in the diagnosis of uterine high-grade endometrioid carcinoma. Arch Pathol Lab Med. 2016;140(8):836–843.
91. Pentenero M, Todaro D, Marino R, Gandolfo S. Interobserver and intraobserver variability affecting the assessment of loss of autofluorescence of oral mucosal lesions. Photodiagn Photodyn Ther. 2019;28:338–342.
92. Ortonne N, Carroll SL, Rodriguez FJ, et al. Assessing interobserver variability and accuracy in the histological diagnosis and classification of cutaneous neurofibromas. Neuro-Oncol Adv. 2020;2(Suppl 1):i117–i123.
93. Kwak HA, Liu X, Allende DS, Pai RK, Hart J, Xiao SY. Interobserver variability in intraductal papillary mucinous neoplasm subtypes and application of their mucin immunoprofiles. Mod Pathol. 2016;29(9):977–984.
94. Klaver CEL, Bulkmans N, Drillenburg P, et al. Interobserver, intraobserver, and interlaboratory variability in reporting pT4a colon cancer. Virchows Arch Int J Pathol. 2020;476(2):219–230.
95. Kang HJ, Kwon SY, Kim A, et al. A multicenter study of interobserver variability in pathologic diagnosis of papillary breast lesions on core needle biopsy with WHO classification. J Pathol Transl Med. 2021;55(6):380–387.
96. Horvath B, Allende D, Xie H, et al. Interobserver variability in scoring liver biopsies with a diagnosis of alcoholic hepatitis. Alcohol Clin Exp Res. 2017;41(9):1568–1573.
97. Burchardt M, Engers R, Müller M, et al. Interobserver reproducibility of Gleason grading: evaluation using prostate cancer tissue microarrays. J Cancer Res Clin Oncol. 2008;134(10):1071–1078.
98. Bektas S, Bahadir B, Kandemir NO, Barut F, Gul AE, Ozdamar SO. Intraobserver and interobserver variability of Fuhrman and modified Fuhrman grading systems for conventional renal cell carcinoma. Kaohsiung J Med Sci. 2009;25(11):596–600.
99. Allard FD, Goldsmith JD, Ayata G, et al. Intraobserver and interobserver variability in the assessment of dysplasia in ampullary mucosal biopsies. Am J Surg Pathol. 2018;42(8):1095–1100.
100. Rodriguez FJ, Giannini C. Oligodendroglial tumors: diagnostic and molecular pathology. Semin Diagn Pathol. 2010;27(2):136–145.
101. Samorodnitsky E, Datta J, Jewell BM, et al. Comparison of custom capture for targeted next-generation DNA sequencing. J Mol Diagn. 2015;17(1):64–75.
102. Campanella G, Hanna MG, Geneslaw L, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med. 2019;25(8):1301–1309.
103. Shipe ME, Deppen SA, Farjah F, Grogan EL. Developing prediction models for clinical use using logistic regression: an overview. J Thorac Dis. 2019;11(Suppl 4):S574–S584.
104. Park SH, Han K. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology. 2018;286(3):800–809.
105. Moons KG, Kengne AP, Grobbee DE, et al. Risk prediction models: II. External validation, model updating, and impact assessment. Heart. 2012;98(9):691–698.
106. Debray TPA, Vergouwe Y, Koffijberg H, Nieboer D, Steyerberg EW, Moons KGM. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol. 2015;68(3):279–289.
107. Wu L, Zhang J, Zhou W, et al. Randomised controlled trial of WISENSE, a real-time quality improving system for monitoring blind spots during esophagogastroduodenoscopy. Gut. 2019;68(12):2161–2169.
108. Wang P, Liu X, Berzin TM, et al. Effect of a deep-learning computer-aided detection system on adenoma detection during colonoscopy (CADe-DB trial): a double-blind randomised study. Lancet Gastroenterol Hepatol. 2020;5(4):343–351.
109. Repici A, Badalamenti M, Maselli R, et al. Efficacy of real-time computer-aided detection of colorectal neoplasia in a randomized trial. Gastroenterology. 2020;159(2):512–520.e7.
110. Wijnberge M, Geerts BF, Hol L, et al. Effect of a machine learning-derived early warning system for intraoperative hypotension vs standard care on depth and duration of intraoperative hypotension during elective noncardiac surgery: the HYPE randomized clinical trial. JAMA. 2020;323(11):1052–1060.
111. Wang P, Berzin TM, Glissen Brown JR, et al. Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study. Gut. 2019;68(10):1813–1819.
112. INFANT Collaborative Group. Computerised interpretation of fetal heart rate during labour (INFANT): a randomised controlled trial. Lancet. 2017;389(10080):1719–1729.
113. Clinical Laboratory Improvement Amendments (CLIA). CLIA verification of performance specifications. https://www.cms.gov/regulations-and-guidance/legislation/clia/downloads/6064bk.pdf. Accessed April 19, 2022.
114. College of American Pathologists. CAP all common checklist. Test method validation and verification. https://appsuite.cap.org/appsuite/learning/LAP/FFoC/ValidationVerificationStudies/story_content/external_files/checklistrequirements.pdf. Accessed April 19, 2022.
115. Van Calster B, McLernon DJ, van Smeden M, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230.
116. Van Hoorde K, Van Huffel S, Timmerman D, Bourne T, Van Calster B. A spline-based tool to assess and visualize the calibration of multiclass risk predictions. J Biomed Inform. 2015;54:283–293.
117. van der Ploeg T, Nieboer D, Steyerberg EW. Modern modeling techniques had limited external validity in predicting mortality from traumatic brain injury. J Clin Epidemiol. 2016;78:83–89.
118. Pantanowitz L, Quiroga-Garza GM, Bien L, et al. An artificial intelligence algorithm for prostate cancer diagnosis in whole slide images of core needle biopsies: a blinded clinical validation and deployment study. Lancet Digit Health. 2020;2(8):e407–e416.
119. Davis SE, Greevy RA, Fonnesbeck C, Lasko TA, Walsh CG, Matheny ME. A nonparametric updating method to correct clinical prediction model drift. J Am Med Inform Assoc. 2019;26(12):1448–1457.
120. Epstein JI, Zelefsky MJ, Sjoberg DD, et al. A contemporary prostate cancer grading system: a validated alternative to the Gleason score. Eur Urol. 2016;69(3):428–435.
121. Hattab EM, Koch MO, Eble JN, Lin H, Cheng L. Tertiary Gleason pattern 5 is a powerful predictor of biochemical relapse in patients with Gleason score 7 prostatic adenocarcinoma. J Urol. 2006;175(5):1695–1699; discussion 1699.
122. García V, Mollineda RA, Sánchez JS. Index of balanced accuracy: a performance measure for skewed class distributions. In: Araujo H, Mendonça AM, Pinho AJ, Torres MI, eds. Pattern Recognition and Image Analysis. Lecture Notes in Computer Science. Berlin, Heidelberg, Germany: Springer; 2009:441–448.
123. Delgado R, Tibau XA. Why Cohen's kappa should be avoided as performance measure in classification. PloS One. 2019;14(9):e0222916.
124. Ben-David A. Comparison of classification accuracy using Cohen's weighted kappa. Expert Syst Appl. 2008;34(2):825–832.
125. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6.
126. Gorodkin J. Comparing two K-category assignments by a K-category correlation coefficient. Comput Biol Chem. 2004;28(5):367–374.
127. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000;16(5):412–424.
128. Moskowitz CS. Using free-response receiver operating characteristic curves to assess the accuracy of machine diagnosis of cancer. JAMA. 2017;318(22):2250–2251.
129. Park SH, Choi J, Byeon JS. Key principles of clinical validation, device approval, and insurance coverage decisions of artificial intelligence. Korean J Radiol. 2021;22(3):442–453.
130. Vu QD, Graham S, Kurc T, et al. Methods for segmentation and classification of digital microscopy tissue images. Front Bioeng Biotechnol. 2019;7:53.
131. D'Agostino R, Nam B. Evaluation of the performance of survival analysis models: discrimination and calibration measures. Amsterdam, The Netherlands: Elsevier; 2003:1–25.
132. Hosmer DW, Lemeshow S. Assessing the fit of the model. In: Hosmer DW, Lemeshow S, eds. Applied Logistic Regression. John Wiley & Sons, Ltd; 2000:143–202.
133. US Department of Health and Human Services, Food and Drug Administration, Center for Devices and Radiological Health. Statistical guidance on reporting results from studies evaluating diagnostic tests—guidance for industry and FDA staff. US Food and Drug Administration Web site. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/statistical-guidance-reporting-results-studies-evaluating-diagnostic-tests-guidance-industry-and-fda. Accessed September 5, 2022.
134. Morgenthaler S. Exploratory data analysis. WIREs Comput Stat. 2009;1(1):33–44.
135. Ben-Gal I. Outlier detection. In: Maimon O, Rokach L, eds. Data Mining and Knowledge Discovery Handbook. Tel Aviv, Israel: Springer US; 2005:131–146.
136. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307–310.
137. Bland JM, Altman DR. Statistical methods for assessing agreement between measurements. Lancet. 1986;1(8476):307–310.
138. Petersen PH, Stöckl D, Blaabjerg O, et al. Graphical interpretation of analytical data from comparison of a field method with reference method by use of difference plots. Clin Chem. 1997;43(11):2039–2046.
139. Hollis S. Analysis of method comparison studies. Ann Clin Biochem. 1996;33(1):1–4.
140. Stöckl D. Beyond the myths of difference plots. Ann Clin Biochem. 1996;33(Pt 6):575–577.
141. Cornbleet PJ, Gochman N. Incorrect least-squares regression coefficients in method-comparison analysis. Clin Chem. 1979;25(3):432–438.
142. Bureau International des Poids et Mesures (BIPM). International Vocabulary of Metrology – Basic and General Concepts and Associated Terms (VIM). 3rd ed. https://www.bipm.org/documents/20126/2071204/JCGM_200_2012.pdf/f0e1ad45-d337-bbeb-53a6-15fe649d0ff1. Accessed June 1, 2022.
143. McEnroe RJ, Durham AP, Goldford MD, et al. Evaluation of Precision of Quantitative Measurement Procedures; Approved Guideline. 3rd ed. CLSI document EP05-A3. Wayne, PA: Clinical and Laboratory Standards Institute; 2014. https://clsi.org/standards/products/method-evaluation/documents/ep05/. Accessed June 1, 2022.
144. International Organization for Standardization. ISO 16140-1:2016 - Microbiology of the food chain - Method validation - Part 1: Vocabulary. Geneva, Switzerland: International Organization for Standardization; 2016.
145. Carey RN, Durham AP, Hauck WW, et al. User Verification of Precision and Estimation of Bias; Approved Guideline. 3rd ed. CLSI document EP15-A3. Wayne, PA: Clinical and Laboratory Standards Institute; 2014.
146. Berte LM. Process Management. Wayne, PA: Clinical and Laboratory Standards Institute; 2015.
147. Center for Devices and Radiological Health (CDRH). Marketing submission recommendations for a predetermined change control plan for artificial intelligence/machine learning (AI/ML)-enabled device software functions. US Food and Drug Administration Web site. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/marketing-submission-recommendations-predetermined-change-control-plan-artificial. Accessed May 13, 2023.
148. Jenkins DA, Sperrin M, Martin GP, Peek N. Dynamic models to predict health outcomes: current status and methodological challenges. Diagn Progn Res. 2018;2(1):23.
149. Toll DB, Janssen KJM, Vergouwe Y, Moons KGM. Validation, updating and impact of clinical prediction rules: a review. J Clin Epidemiol. 2008;61(11):1085–1094.
150. Kappen TH, Vergouwe Y, van Klei WA, van Wolfswinkel L, Kalkman CJ, Moons KGM. Adaptation of clinical prediction models for application in local settings. Med Decis Making. 2012;32(3):E1–E10.
151. Davis SE, Lasko TA, Chen G, Siew ED, Matheny ME. Calibration drift in regression and machine learning models for acute kidney injury. J Am Med Inform Assoc. 2017;24(6):1052–1061.
152. Diamond GA. What price perfection?: calibration and discrimination of clinical prediction models. J Clin Epidemiol. 1992;45(1):85–89.
153. Davis SE, Lasko TA, Chen G, Matheny ME. Calibration drift among regression and machine learning models for hospital mortality. AMIA Annu Symp Proc. 2018;2017:625–634.
154. Sinard JH. An analysis of the effect of the COVID-19 pandemic on case volumes in an academic subspecialty-based anatomic pathology practice. Acad Pathol. 2020;7:2374289520959788.
155. Mann DM, Chen J, Chunara R, Testa PA, Nov O. COVID-19 transforms health care through telemedicine: evidence from the field. J Am Med Inform Assoc. 2020;27(7):1132–1135.
156. Calabrese F, Pezzuto F, Fortarezza F, et al. Pulmonary pathology and COVID-19: lessons from autopsy. The experience of European pulmonary pathologists. Virchows Arch. 2020;477(3):359–372.
157. Di Toro F, Gjoka M, Di Lorenzo G, et al. Impact of COVID-19 on maternal and neonatal outcomes: a systematic review and meta-analysis. Clin Microbiol Infect. 2021;27(1):36–46.
158. Hanna MG, Reuter VE, Ardon O, et al. Validation of a digital pathology system including remote review during the COVID-19 pandemic. Mod Pathol. 2020;33(11):2115–2127.
159. Vigliar E, Cepurnaite R, Alcaraz-Mateos E, et al. Global impact of the COVID-19 pandemic on cytopathology practice: results from an international survey of laboratories in 23 countries. Cancer Cytopathol. 2020;128(12):885–894.
160. Tang YW, Schmitz JE, Persing DH, Stratton CW. Laboratory diagnosis of COVID-19: current issues and challenges. J Clin Microbiol. 2020;58(6):e00512-20.
161. Davis SE, Greevy RA, Lasko TA, Walsh CG, Matheny ME. Detection of calibration drift in clinical prediction models to inform model updating. J Biomed Inform. 2020;112:103611.
162. Finlayson SG, Bowers JD, Ito J, Zittrain JL, Beam AL, Kohane IS. Adversarial attacks on medical machine learning. Science. 2019;363(6433):1287–1289.
163. Allyn J, Allou N, Vidal C, Renou A, Ferdynus C. Adversarial attack on deep learning-based dermatoscopic image recognition systems: risk of misdiagnosis due to undetectable image perturbations. Medicine (Baltimore). 2020;99(50):e23568.
164. Laleh NG, Truhn D, Veldhuizen GP, et al. Adversarial attacks and adversarial robustness in computational pathology. Nat Commun. 2022;13(1):5711.
165. Bortsova G, González-Gonzalo C, Wetstein SC, et al. Adversarial attack vulnerability of medical image analysis systems: unexplored factors. Med Image Anal. 2021;73:102141.
166. Raciti P, Sue J, Ceballos R, et al. Novel artificial intelligence system increases the detection of prostate cancer in whole slide images of core needle biopsies. Mod Pathol. 2020;33(10):2058–2066.
167. Nishikawa RM, Bae KT. Importance of better human-computer interaction in the era of deep learning: mammography computer-aided diagnosis as a use case. J Am Coll Radiol. 2018;15(1 Pt A):49–52.
168. Burgoon JK, Bonito JA, Bengtsson B, Cederberg C, Lundeberg M, Allspach L. Interactivity in human-computer interaction: a study of credibility, understanding, and influence. Comput Hum Behav. 2000;16(6):553–574.
169. Jensen M, Meservy T, Burgoon J, Nunamaker J. Automatic, multimodal evaluation of human interaction. Group Decis Negot. 2010;19:367–389.
170. Lee EJ. Factors That Enhance Consumer Trust in Human-Computer Interaction: An Examination of Interface Factors and Moderating Influences [dissertation]. Knoxville: University of Tennessee. https://trace.tennessee.edu/utk_graddiss/2148. Accessed April 19, 2022.
171. Szalma JL, Hancock PA. Noise effects on human performance: a meta-analytic synthesis. Psychol Bull. 2011;137(4):682–707.
172. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Machine Intell. 2019;1(5):206–215.
173. Banerjee I, Bhimireddy AR, Burns JL, et al. Reading race: AI recognises patient's racial identity in medical images. ArXiv. 2107.10356 [Cs.Eess]. Preprint posted online July 21, 2021. doi:arxiv.org/abs/2107.10356
174. Schömig-Markiefka B, Pryalukhin A, Hulla W, et al. Quality control stress test for deep learning-based diagnostic model in digital pathology. Mod Pathol. 2021;34(12):2098–2108.
175. Shapley LS. A value for n-person games. In: Kuhn HW, Tucker AW, eds. Contributions to the Theory of Games (AM-28), Volume II. Princeton, NJ: Princeton University Press; 1953:307–318.
176. Lundberg S, Lee SI. A unified approach to interpreting model predictions. arXiv. Preprint posted online November 25, 2017.
177. Molnar C. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/. Accessed May 16, 2022.
178. Kim B, Wattenberg M, Gilmer J, et al. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). arXiv. Preprint posted online June 7, 2018.
179. Ribeiro MT, Singh S, Guestrin C. "Why should I trust you?": explaining the predictions of any classifier. arXiv. Preprint posted online February 6, 2016.
180. Evans T, Retzlaff CO, Geißler C, et al. The explainability paradox: challenges for xAI in digital pathology. Future Gener Comput Syst. 2022;133:281–296.
181. Linardatos P, Papastefanopoulos V, Kotsiantis S. Explainable AI: a review of machine learning interpretability methods. Entropy Basel Switz. 2020;23(1):E18.
182. Sears A, Jacko JA, eds. Human-Computer Interaction Fundamentals. Boca Raton, FL: CRC Press; 2010.
183. Fitzgibbons PL, Bradley LA, Fatheree LA, et al. Principles of analytic validation of immunohistochemical assays: guideline from the College of American Pathologists Pathology and Laboratory Quality Center. Arch Pathol Lab Med. 2014;138(11):1432–1443.
184. College of American Pathologists. Laboratory general checklist. https://medicalcourier.com/wp-content/uploads/CAP-Lab-Master-Checklist-2017.pdf. Accessed May 12, 2022.
185. Centers for Medicare and Medicaid Services. What do I need to do to assess personnel competency? https://www.cms.gov/regulations-and-guidance/legislation/clia/downloads/clia_compbrochure_508.pdf. Accessed May 12, 2022.
186. Centers for Disease Control and Prevention. Competency guidelines for public health laboratory professionals. https://www.cdc.gov/mmwr/pdf/other/su6401.pdf. Accessed May 12, 2022.

Author notes

All authors are members of the Machine Learning Workgroup, Informatics Committee, Digital and Computational Pathology Committee, and Council on Informatics and Pathology Innovation of the College of American Pathologists, except Souers, who is an employee of the College of American Pathologists. Hanna is a consultant for PaigeAI, PathPresenter, and VolastraTX. Krishnamurthy is a consultant on the breast pathology faculty advisory board for Daiichi Sankyo, Inc. and AstraZeneca and serves as a scientific advisory board member for AstraZeneca. Krishnamurthy received an investigator-initiated sponsored research award from PathomIQ Research, sponsored research funding from IBEX Research, and an investigator-initiated sponsored research award from Caliber Inc. Raciti has stock options at Paige (<1%) and has employment and stock compensation at Janssen. Mays' affiliation with the MITRE Corporation is provided for identification purposes only and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author. The other authors have no relevant financial interest in the products or companies described in this article. Hanna is now located at the Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania.