Automated prostate cancer detection using machine learning technology has led to speculation that pathologists will soon be replaced by algorithms. This review covers the development of machine learning algorithms and their reported effectiveness specific to prostate cancer detection and Gleason grading.
The objective of this review is to examine current algorithms with regard to their accuracy and classification abilities. We provide a general explanation of the technology and how it is being used in clinical practice. The challenges to the application of machine learning algorithms in clinical practice are also discussed.
The literature for this review was identified and collected using a systematic search. Criteria were established prior to the sorting process to effectively direct the selection of studies. A 4-point system was implemented to rank the papers according to their relevancy. For papers accepted as relevant to our metrics, all cited and citing studies were also reviewed. Studies were then categorized based on whether they implemented binary or multi-class classification methods. Data were extracted from papers that contained accuracy, area under the curve (AUC), or κ values in the context of prostate cancer detection. The results were visually summarized to present accuracy trends between classification abilities.
It is more difficult to achieve high accuracy metrics for multiclassification tasks than for binary tasks. The clinical implementation of an algorithm that can assign a Gleason grade to clinical whole slide images (WSIs) remains elusive. Machine learning technology is currently not able to replace pathologists but can serve as an important safeguard against misdiagnosis.
The adoption of whole slide image (WSI) scanners in clinical practice was accelerated by US Food and Drug Administration approval in 2017, which allowed primary pathologic diagnoses to be made on scanned images. Images in the digital domain allow the application of pathology artificial intelligence (AI), including clinical decision support with algorithms performing specific diagnoses.1,2 These algorithms, if trained properly, could go beyond the ability of human observation to detect and quantify features that are not recognizable by human perception.1,3,4
Prostate cancer is an ideal target for AI diagnostic support. Criteria for prostate adenocarcinoma diagnosis from histologic slides are well defined, and a corresponding grading system known as the Gleason system provides prognostic information and guidelines for treatment.5 Prostate cancer is the second most frequent cancer and the fifth leading cause of death in men. Pathologists diagnose prostate cancer by examining 6 to 12 needle core biopsies of the prostate. The already-problematic shortage of pathologists is expected to worsen, driving the need for the implementation of AI-based screening tools.6,7 The Gleason grade system, which denotes the advancement of cancer in the tissue, suffers from diagnostic variation among pathologists. Pathology AI is intended to reduce diagnostic variation and increase the productivity of pathologists, leading to improved patient treatment.8
This review summarizes recent literature regarding the use of AI to diagnose prostate adenocarcinoma. Its purpose is to provide an accessible explanation of the current machine learning methods and an overview of the current state of classification ability for working surgical pathologists. It begins by outlining the systematic review process that was followed to produce this literature review. A general machine learning algorithm based on the algorithms uncovered in the literature will then be described, following which the methods used to compare these algorithms will be presented and justified. Finally, a summary of the results found, both qualitative and quantitative, will be given and future implications will be considered.
SYSTEMATIC REVIEW PROCESS
The research process used to conduct this literature review is summarized in Figure 1. The primary questions guiding this review were: What is the current state of AI-driven automatic Gleason grading of prostate WSIs? Are any machine learning algorithms currently being implemented in clinical practice, and if not, what are the most pressing and current functional challenges to the implementation of this technology? What are the common grading methods used by machine learning technology?
Process of the systematic literature review. The sorting and screening sections refer to how papers were selected based on their relevancy. The ranking section describes the in-depth review that was performed to guide organization and writing.
Search Terms and Criteria
The search expression was broadly descriptive of AI, as there are many terms commonly used to describe machine learning. Although broad, the search expression required that a given paper include the words “Gleason” and “prostate.” The full search expression was “Gleason” AND (“cancer” OR “adenocarcinoma”) AND “prostate” AND (“machine learning” OR “artificial intelligence” OR “neural network” OR “deep learning”) AND (“H&E” OR “hematoxylin and eosin”) AND (“WSI” OR “whole slide”).
Point System Filter
The database selected was Google Scholar; only studies published after 2017 were considered. The initial search was conducted on March 31, 2022; it yielded a total of 474 results. Relevant papers that were published while this literature review was being written were noted and included.
For the filtering process, we created a system based on 4 key factors. A study received 1 point for each factor it satisfied. These factors were established to maximize uniformity and minimize human subjectivity in the sorting and classification processes. The 4 factors were: Does the paper use a machine learning model aimed at detecting prostate cancer? Does the paper contain a quantifiable measurement of its machine learning model's performance? Does the paper compare its performance with that of actual pathologists, or does it consider the model's performance in a real-world clinical setting? Does the paper present a high-impact feature, such as a considerable number of citations (>20) or another novel feature that may make it useful in comparison?
Of the 474 papers we initially reviewed, 40 papers were awarded a score of 4; 78 were awarded a 3; 145 were awarded a 2; and 211 were awarded a 1 or 0.
Forwards/Backwards Search
From this point, we conducted a forward and backward search of the literature that received at least 3 points. For this, all papers cited by a relevant paper and those citing it were manually investigated and considered as potentially relevant. Papers added during the forward and backward search were also subjected to the point-system filter; accordingly, 24 papers were added to the existing pool of 118 papers scoring 3 or higher. We ultimately identified a total of 142 studies with a score of at least 3 points (see the full list in the Supplemental Table in the supplemental digital content at https://meridian.allenpress.com/aplm in the May 2024 table of contents).
In-Depth Review
To further categorize these 142 studies, we examined each paper in depth according to 4 designated topics. The 4 topics were data sets, preprocessing techniques, training strategies, and testing approach. For each article, the relevant information regarding these topics was extracted and recorded.
MACHINE LEARNING
Introduction
The field of computer-aided diagnosis for prostate cancer relies on the skillful application of deep learning systems. Deep learning systems imitate the human brain to glean patterns from copious amounts of data.9 When applied to cancer, these systems accurately detect and classify prostate tissue in WSIs.10 The modern prostate cancer detection algorithm uses a multilayered neural network as its backbone.11–14 The network of choice is almost exclusively the convolutional neural network (CNN), which mimics the human visual system.15 The first CNN was introduced in 1989, when LeCun et al16 created LeNet to recognize handwritten zip code digits in a data set provided by the US Postal Service. Today, dozens of competitive CNNs are being used to detect cancer. The generic machine learning process that will be explained in this section is summarized in Figure 2.
General representation of the process by which a neural network is trained to differentiate whole slide images. At the end of the flow chart the line is broken into 4 categories; for detailed discussion see the Methods section.
Data Sets
Regarding data sets, the idiom “what you get out is what you put in” applies quite well. If the data set used to train an algorithm is small and of inferior quality, then the algorithm will produce poor results or have low generalizability to clinical practice.17–19 Machine learning algorithms must be trained on large, high-quality data sets, which are ideally composed of WSIs representative of those seen in clinical practice.20–24 Quality as used here refers to the resolution of the WSI scan and the kind of annotation used to label it. The annotation of the WSI in most data sets is at least overseen by a pathologist and can range from a single slide–level label of an overall Gleason score (GS) or Gleason grade25 to individual gland–level Gleason pattern assignments.26,27 Some studies attempt to strengthen the generalizability of their algorithm by using multiple data sets because annotation variation exists in clinical practice.26,28–33
Another point that must be considered is that annotations are only as good as the person making them. Observer variability is a known threat to the consistency of Gleason grading among pathologists,34 leading some researchers to look at possible genetic indicators for cancer detection instead of hematoxylin and eosin (H&E) slide annotation.35 A number of factors could explain differences in Gleason grading among pathologists, including the experience of the pathologist assigning the grade, the time spent examining the slide, and the quality of the image. Assigning a Gleason grade is, to a degree, subjective to the individual pathologist. Training an algorithm on data that may be specific to an individual or a small group of pathologists can therefore limit its generalizability to clinical practice.27 One solution that has been implemented to enhance generalizability is a model wherein multiple pathologists' annotations are merged and averaged into a single ground truth label for training purposes.36
Pretraining/Transfer Learning
Just as humans learn patterns in one domain and apply those patterns elsewhere, strategically designed algorithms can also transfer what they learn. The modern prostate cancer detection algorithm first learns patterns from a generic data set. Popular generic data sets used in the literature are ImageNet, Microsoft Common Objects in Context (COCO), and the Canadian Institute for Advanced Research (CIFAR) data sets, which range from tens of thousands to millions of annotated images spanning tens to thousands of classes.37,38 A model trained on a generic data set learns general image detection principles, such as the detection of edges, shapes, and objects; this is an algorithm's version of first learning to walk before learning to run.
A pretrained model has a head start when applied to the image detection task of classifying prostate tissue WSIs.39 The already-experienced model is better at detecting cancer than a nascent model, which has not learned general patterns from pretraining. Today's cancer detection machine learning models often learn patterns from one type of cancer and apply these patterns to another type of cancer. This method of machine learning, known as cross-domain transfer learning, makes cancer detection more accurate and bridges the gaps between models trained on different types of tissue. For example, a model that pretrains on breast cancer WSIs and applies its knowledge to prostate cancer WSIs is far more accurate than a model that begins training with a blank slate, without pretraining.29,40
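For readers curious how pretraining is wired up in practice, the following is a minimal sketch, assuming PyTorch and torchvision; the checkpoint path, class count, and choice of frozen layers are illustrative assumptions rather than the approach of any specific reviewed study.

```python
# Minimal sketch of transfer learning for tile classification (illustrative only).
# The checkpoint path and number of classes are placeholders.
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights: the "head start" described above.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Optionally warm-start from a model fine-tuned on another cancer domain
# (cross-domain transfer learning), e.g., breast cancer tiles.
# state = torch.load("breast_cancer_checkpoint.pt")  # hypothetical checkpoint
# model.load_state_dict(state, strict=False)

# Replace the final layer so the network predicts prostate tissue classes
# (e.g., benign vs malignant, or several Gleason patterns).
num_classes = 4  # placeholder
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Freeze early layers so generic edge/shape detectors are kept,
# and fine-tune only the later layers on prostate tiles.
for name, param in model.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```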
Image Preprocessing
Image preprocessing prepares images in a data set for use in training. It enhances the contents of a data set to improve the performance of the overall machine learning model.41–43 In this process the overall qualities of each image, such as color, orientation, and size, are normalized.27 Hospitals and medical centers commonly have unique staining protocols and varying whole slide scanners, so prostate tissue data sets suffer from intercenter variation in magnification, stain color saturation, image noise, etc.44–46 To combat this variation, a cancer detection algorithm uses a preprocessing step to make a data set's images ideally formatted for the machine learning algorithm.47–53
Generally, a cancer detection algorithm would perform 3 main functions during image preprocessing: filtering (or tiling), color normalization, and image denoising. A generic image preprocessing step would proceed as follows:
First, a sliding box, termed “filter,” is passed over a WSI at different magnifications.28 Each frame of reference produced as the box slides over the WSI creates a unique subimage, called a tile.17,29,54 Therefore, a region of tissue may exist in multiple tiles if the tiles are of different sizes.55 This allows the machine learning model to learn from multiple magnification levels of the same tissue.
Next, each tile is color normalized to produce a set of images with homogeneous stain distribution.29,56 If 2 tiles have significantly different saturations, then they are transformed into tiles of similar stain profiles.57
Third, the tiles are denoised.58 In a data set, image noise exists as unwanted, random brightness or color variation produced by a scanner's image sensor. Noise, which negatively affects an algorithm's performance, can be reduced before each tile is passed onward in the network.27,46 It is clear that images in a data set can be preprocessed to increase the performance of a model; a model's accuracy can also be further improved by augmenting the resulting tiles from the preprocessing stage.59
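A highly simplified sketch of these 3 preprocessing steps is shown below, assuming NumPy and OpenCV. Published pipelines typically read WSIs with dedicated libraries and use dedicated stain normalization methods such as Macenko or Vahadane; the per-channel statistics matching here is only a stand-in for illustration.

```python
# Illustrative preprocessing sketch: tiling, crude color normalization, denoising.
# The reference stain statistics, tile size, and denoising parameters are placeholders.
import numpy as np
import cv2  # OpenCV, used here for denoising

def tile_image(wsi: np.ndarray, tile_size: int, stride: int):
    """Slide a square 'filter' over the image and yield tiles."""
    h, w, _ = wsi.shape
    for y in range(0, h - tile_size + 1, stride):
        for x in range(0, w - tile_size + 1, stride):
            yield wsi[y:y + tile_size, x:x + tile_size]

def match_color(tile: np.ndarray, ref_mean, ref_std) -> np.ndarray:
    """Shift each RGB channel toward a reference stain profile (a crude stand-in)."""
    tile = tile.astype(np.float32)
    normalized = (tile - tile.mean(axis=(0, 1))) / (tile.std(axis=(0, 1)) + 1e-6)
    return np.clip(normalized * ref_std + ref_mean, 0, 255).astype(np.uint8)

def denoise(tile: np.ndarray) -> np.ndarray:
    """Suppress sensor noise before the tile is passed onward in the network."""
    return cv2.fastNlMeansDenoisingColored(tile, None, 5, 5, 7, 21)

# Example: tile at one magnification; repeat at other levels for multi-scale input.
# wsi = ...  # an RGB region read from a whole slide image
# tiles = [denoise(match_color(t, ref_mean, ref_std))
#          for t in tile_image(wsi, tile_size=256, stride=256)]
```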
Data Augmentation
An algorithm's learning is not limited to the tiles produced in the image preprocessing phase. In fact, most algorithms augment these tiles to produce a more robust data set. A model's performance depends on the size of the data set it is trained on.21,23 An algorithm can leverage this principle by artificially increasing, or augmenting, the size of a data set.24 It is also necessary to balance the distribution of tissue classes, so that the data set contains comparable numbers of examples for each class of tissue. The most straightforward augmentations are achieved by transforming tiles.
The most common tile transformations used to augment data sets include mirroring, rotating, differentiable zooming, and scaling tiles.60–62 Class balancing is commonly achieved by oversampling underrepresented classes (reusing the same data points repeatedly) or undersampling overrepresented classes (selecting a random subsample of data points).
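As an illustration, the sketch below applies common tile transformations and oversamples underrepresented classes, assuming torchvision transforms and a list of (tile, label) pairs; the transformation parameters and target counts are placeholders.

```python
# Sketch of tile augmentation and class balancing (illustrative only).
import random
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                           # mirroring
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=90),                       # rotating
    transforms.RandomResizedCrop(size=256, scale=(0.8, 1.0)),    # zooming/scaling
])

def oversample(samples, target_per_class):
    """Reuse tiles from underrepresented classes so every class reaches the target count."""
    by_class = {}
    for tile, label in samples:
        by_class.setdefault(label, []).append((tile, label))
    balanced = []
    for label, items in by_class.items():
        balanced.extend(items)
        extra = target_per_class - len(items)
        balanced.extend(random.choices(items, k=max(extra, 0)))
    return balanced

# balanced = oversample(samples, target_per_class=1000)   # placeholder target
# augmented_tile = augment(tile)                           # applied on the fly in training
```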
Segmentation
After preprocessing and augmenting, the machine learning model examines each tile at a granular level.65,66 As pixel-wise occurrences such as artifacts, cribriform patterns, or Gleason patterns are detected in each tile, the network associates the related pixels with a certain class.67,68 When 2 separate instances of the same tissue differentiation occur, the algorithm classifies them as similar but unique cases.69 A region of tissue can be segmented multiple times, once for each magnification level. Segmentation for prostate cancer detection and classification produces regions of interest (ROIs), which are areas of notable tissue differentiation.30,70–73 The ROIs at different magnifications are passed onward so the model can extract features from them; the uninteresting, nondifferentiated tissue is excluded.74–76
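The following sketch illustrates one simple way an ROI could be selected from the pixel-wise probability map produced by a segmentation network; the thresholds are assumptions for illustration, not values reported in the reviewed studies.

```python
# Sketch of turning a pixel-wise segmentation map into regions of interest (ROIs).
# Assumes a segmentation model (e.g., a U-Net-style CNN) has already produced a
# per-pixel class probability map for each tile; thresholds are placeholders.
import numpy as np

def extract_roi(prob_map: np.ndarray, tile: np.ndarray, min_fraction: float = 0.05):
    """Keep the tile as an ROI if enough pixels show notable tissue differentiation.

    prob_map: HxW array of per-pixel probabilities of belonging to a non-benign class.
    tile:     the corresponding RGB tile.
    """
    mask = prob_map > 0.5              # pixels the network associates with a class
    if mask.mean() >= min_fraction:    # notable tissue differentiation is present
        return tile, mask              # pass the ROI (and its mask) onward
    return None                        # uninteresting, nondifferentiated tissue is excluded

# rois = [r for r in (extract_roi(m, t) for m, t in zip(prob_maps, tiles)) if r]
```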
Feature Extraction
During feature extraction, a network detects features in ROIs to simplify complex data. Features are numbers that represent specific qualities of the ROI.77 Prostate cancer detection algorithms extract features that describe the concentration of nuclei, the sizes of tumors, and even the borders of malignant tissue.78,79 With tiles and ROIs of the same tissue area present at different magnifications, the features must be combined in a meaningful way.29,65 The tiles' data are merged to decrease computational expense while maintaining algorithm performance; each tile of prostate tissue is represented by a grid of numbers, also known as a matrix. The matrices are stacked to form a three-dimensional matrix stack that contains the feature information of the preliminary matrices.1,10 The higher the priority of a feature, the more weight it holds in the new matrix.80,81 After the data are condensed in a process called pooling, the feature information is used to assign ROIs to specific classes during the classification step.82,83
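The sketch below illustrates, in PyTorch, how a small convolutional backbone turns one ROI into a stack of feature maps and how pooling condenses them into a compact feature vector; the layer sizes and shapes are arbitrary choices made only for illustration.

```python
# Sketch of feature extraction and pooling for a single ROI (illustrative shapes).
import torch
import torch.nn as nn

backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
)

roi = torch.rand(1, 3, 256, 256)        # one RGB ROI: (batch, channels, height, width)
feature_stack = backbone(roi)           # shape (1, 64, 256, 256): 64 stacked feature matrices

# Pooling condenses each feature map, reducing computational expense while
# retaining the most salient responses.
pooled = nn.functional.adaptive_avg_pool2d(feature_stack, output_size=1)
feature_vector = pooled.flatten(1)      # shape (1, 64): the ROI's feature summary

# Features from multiple magnifications can be merged, e.g., by concatenation:
# combined = torch.cat([features_low_mag, features_high_mag], dim=1)
```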
Classification
Classification is the culmination of a prostate cancer detection algorithm.84 All previous processes, transformations, and computations lead to the final class predictions of a model. These predictions are made by a fully connected (FC) head, the last step, or layer, in a neural network. The flattened data from the previous pooling layer are given to the FC head, which ultimately decides how an ROI should be classified according to the probability that the region belongs to a certain class.28,85 Classification methods can be binary (between 2 outputs) or multi-class (among more than 2 outputs); the FC head's classification method is determined before the neural network is trained.86–91
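This final step can be illustrated in a few lines of PyTorch: a fully connected head maps the pooled feature vector to class probabilities, and the number of output classes determines whether the task is binary or multi-class. The dimensions below are carried over from the pooling sketch above and are illustrative only.

```python
# Sketch of a classification head operating on a 64-dimensional feature vector.
import torch
import torch.nn as nn

num_classes = 2                        # binary; use >2 for multi-class Gleason grading
head = nn.Linear(64, num_classes)      # the final, fully connected layer

logits = head(torch.rand(1, 64))       # flattened, pooled features go in
probs = torch.softmax(logits, dim=1)   # probability that the ROI belongs to each class
prediction = probs.argmax(dim=1)       # the class with the highest probability is chosen
```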
METHODS
Classification Groups
After the in-depth review, we performed an aggregation of the reported classification methods. Studies with explicit numerical results were examined and categorized according to 4 classification categories, which are depicted in Figure 3.74 We found 65 papers containing results useful for comparison. These categories reflect 4 general types of classification approaches taken in the literature reviewed.
Overview of risk levels associated with common whole slide image labels. Color key maps these labels against 4 classes of algorithms found in the literature (see Methods section for definition of these classes). Abbreviations: GG, Gleason Grade Group; GP, Gleason pattern; GS, Gleason score.
The first group, labeled Binary 1, contains studies that implemented machine learning methods to distinguish between cancerous and noncancerous regions of the prostate WSI. Also included in this category were studies that labeled regions of the WSI as benign or malignant and those that labeled slides as suspicious or nonsuspicious. Some applied a binary label to the entire WSI, whereas others applied the label to specific regions on the slide. This category is designated as Binary 1 because the decisions made by the machine learning models were binary in nature: suspicious/nonsuspicious, benign/malignant, cancerous/noncancerous.
The second group, titled Binary 2, refers to studies in which machine learning models made binary decisions but in the context of a Gleason classification. Included in Binary 2 are those models that could distinguish between WSIs with a GS of greater than 6 or less than or equal to 6 (GS >6 versus GS ≤6). Some distinguished between slides with a Gleason Grade Group of 3 or greater than 3, and others distinguished between a Gleason pattern of 3 and a Gleason pattern of 4 or greater. As seen in the chart, these distinctions are similar in their nature and difficulty; thus, they were combined into one category.
The third group, called Multiclassification 1, contains studies that attempted to perform more than one classification, specifically in terms of Gleason Grade Groups. Studies included in this category distinguish Gleason Grade Groups 0 through 5, with Gleason Grade Group 0 in this context referring to benign tissue. Studies that distinguished between Gleason Grade Groups 2 through 5 or 3 through 5 were also included in this category. These studies are distinct from those in Binary 2, which classified only between slides that had a Gleason Grade Group of 3 or greater than 3. Although the distinction between Gleason Grade Groups 3 and 4 is critical, the distinction between Gleason Grade Groups 4 and 5 also affects treatment plans and the prospects of surgery.
Finally, the fourth group, which is called Multiclassification 2, refers to studies that performed more than one classification but not specifically in terms of a Gleason Grade Group. These studies classified Gleason patterns 1 through 5 or 3 through 5. Some studies classified their slides as Gleason score 6 through 10.
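To make the 4 groups concrete, the sketch below maps primary and secondary Gleason patterns to the label schemes described above, using the standard Grade Group definitions; the function names are ours and do not come from any reviewed study.

```python
# Illustrative mapping from Gleason patterns to the label schemes used by the
# four study groups defined above (standard ISUP Grade Group definitions).
def grade_group(primary: int, secondary: int) -> int:
    """Map primary + secondary Gleason patterns to an ISUP Grade Group (1-5)."""
    score = primary + secondary
    if score <= 6:
        return 1
    if score == 7:
        return 2 if primary == 3 else 3   # 3+4 -> GG2, 4+3 -> GG3
    if score == 8:
        return 4
    return 5                              # Gleason score 9-10

def binary2_label(primary: int, secondary: int) -> int:
    """Binary 2 style label: 1 if Gleason score > 6, else 0."""
    return int(primary + secondary > 6)

# Multiclassification 1 targets grade_group(...) on a 0-5 scale (0 = benign),
# whereas Binary 1 targets a simple cancerous/noncancerous flag per slide or region.
```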
Measurements Taken Into Consideration
The κ values, AUC values, and accuracy values were collected for each of the 4 classification groups. All 3 values were collected and included in Figure 4 when they were reported. For example, if both a κ and an AUC value were reported, both values were added to the figure. This could raise the concern of counting the same paper twice, but because these values are separated into distinct categories, the additional data points facilitate comparison among the different accuracy measurements. Some algorithms were tested on multiple data sets or trials and thus had multiple reported values for the same measurement; in these cases, 1 of each given measurement, whether accuracy, κ, or AUC, was selected. Some studies included additional measurements such as F1 scores, sensitivity, and specificity to quantify accuracy, but these were excluded because their infrequent use prevented comparison.
Scatterplot of area under the curve (AUC) values, accuracy values, and κ values for the studies examined. As machine learning algorithms progress from binary classification to multiclassification efforts, a greater variability can be observed in the results reported.
In many cases the exact methodology for how the accuracy value was calculated was not reported. In some cases, the accuracy value was a mean taken over several trials performed with the machine learning model.93
The AUC value summarizes the trade-off between the true-positive rate and the false-positive rate. It is quite literally the area under a curve plotted with the false-positive rate on the x-axis and the true-positive rate on the y-axis, each on a scale of 0 to 1.69,89,94,95 A model whose predictions are always wrong has an AUC value of 0, and, inversely, a model whose predictions are always correct has an AUC value of 1.39 AUC values are often used to quantify accuracy, but, as can be observed in the scatterplot, there is little variability in the individual values among studies. Thus, the AUC value is a poor metric with which to quantify accuracy because it does very little to distinguish among developed algorithms.96 A better accuracy metric for comparison is the κ value.
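For illustration, an AUC value can be computed in a few lines with scikit-learn; the labels and predicted probabilities below are invented.

```python
# Minimal sketch of computing an AUC value (example data is made up).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]                 # ground truth: 1 = cancerous, 0 = benign
y_prob = [0.1, 0.7, 0.3, 0.8, 0.9, 0.65]    # model's predicted probability of cancer

auc = roc_auc_score(y_true, y_prob)         # area under the ROC curve, between 0 and 1
print(f"AUC = {auc:.2f}")
```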
A κ value of 1 indicates perfect agreement between raters, whereas a κ value of 0 indicates nothing more than chance agreement among them.92 In some situations a weighted κ value was used, which ensures that highly erroneous classifications are penalized more heavily.97 The κ value accounts for agreement expected by chance, making it better suited to settings with observer variability, and it is commonly used in this domain of machine learning.98 Put simply, the κ value quantifies the concordance between a defined ground truth and the machine learning algorithm's classification.30,99 It is also used to compare pathologists' scores with each other to quantify differences in individual ratings.58
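A κ value, weighted or unweighted, can likewise be computed with scikit-learn, as sketched below with invented Grade Group labels.

```python
# Sketch of an (optionally weighted) κ computation comparing a model's Gleason
# Grade Group predictions against a reference standard; the labels are made up.
from sklearn.metrics import cohen_kappa_score

reference = [1, 2, 3, 3, 4, 5, 1, 2]    # reference-standard Grade Groups
predicted = [1, 2, 2, 3, 5, 5, 1, 3]    # algorithm's Grade Groups

kappa = cohen_kappa_score(reference, predicted)
weighted_kappa = cohen_kappa_score(reference, predicted, weights="quadratic")
# Quadratic weighting penalizes a miss of two Grade Groups more heavily
# than a miss of one, as described above.
print(f"kappa = {kappa:.2f}, quadratic-weighted kappa = {weighted_kappa:.2f}")
```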
RESULTS
The quantitative results of this literature review are summarized in Figure 4, which depicts the accuracy values across the 4 categories of classification methods. Moving from left to right in the scatterplot (from binary to multiclassification), accuracy values trend downward and their variability increases. This shows that as algorithms move beyond binary classification, it becomes more difficult to achieve high accuracy values.80 It is also important to note that the algorithms performing a binary classification task were in much better agreement with experts. These results suggest that binary classification of prostate tissue from H&E WSIs is, in many cases, accurate and reliable.
The increased variability seen in the multiclassification tasks can be explained by the increasing difficulty of the task and by the type of statistical measurement used to quantify accuracy. Most studies that performed binary classification were able to report a nearly perfect AUC value, which suggests the AUC value is not ideal for comparison among algorithms. There is much less variation among AUC values in comparison with κ values, which were more commonly used to quantify accuracy among studies that performed multiclassification tasks. The κ values incorporate more information into the statistical measure, which allows for better comparison of performance among algorithms. Overall, the graph shows the increasing difficulty of multiclassification tasks and presents a general overview of the current state of the literature with regard to prostate cancer detection.
The qualitative results of this literature review were extracted and compiled during the in-depth review. The clinical implementation of this technology remains minimal and has been hindered by observer variability.
DISCUSSION
Observer variability is a challenge that must be taken seriously in the development of machine learning algorithms.34,100 Because Gleason grading is somewhat subjective to the grader, it can be difficult to assign a “correct” grade to every image for a machine learning model.101 In a case study, an expert team of 4 pathologists, including a genitourinary specialist, labeled a set of 331 slides with Gleason grades. Afterward, 29 pathologists were tasked with assigning individual Gleason grades to the same slides. Upon comparison, an average Gleason grading accuracy value of 0.61 was found between the expert team and individual pathologists, with individual accuracies ranging between 0.31 and 0.74.102 Another study found the agreement among 24 pathologists in the context of Gleason grading to be a κ value of 0.67.98
Given the discrepancy among grades assigned by pathologists, it is difficult to define a minimum “pathologist accuracy” threshold that a machine learning algorithm needs to pass. Comparing a machine learning output directly against one specific pathologist is a poor performance indicator because of differing experience levels and specializations. There is a significant amount of variation among pathologists when it comes to Gleason grading of WSIs.103 This variation makes it difficult to train an algorithm, because the annotations of a single pathologist do not establish an absolute truth in terms of Gleason grading. The existing variance in label quality suggests that annotations from a variety of pathologists should be combined to create a better ground truth for Gleason grading. Rather than assigning a single value, the AI should also report a confidence score that reflects the likely discrepancy within an expert group, potentially encoded as grade variance. As it stands, training an algorithm that classifies Gleason grades with 100% accuracy is impossible, because “accuracy” here is the interpretive consensus of a group of pathologists. Instead of an expert panel consensus, recording the distribution of Gleason grades assigned by individual pathologists would yield a better label set. In this situation an AI would be trained to predict an expected distribution of expert opinions rather than the opinion of a single expert. We think that such a model would be far better at supporting pathologists, as it would de facto represent a crowd response.
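A minimal sketch of this idea, assuming PyTorch, is shown below: each slide carries the distribution of Grade Groups assigned by several hypothetical pathologists, and the model is trained to reproduce that distribution rather than a single consensus label.

```python
# Sketch of training against a distribution of expert opinions (soft labels).
# The votes, logits, and class count are invented for illustration.
import torch
import torch.nn.functional as F

# Hypothetical slide graded by 5 pathologists as GG 2, 2, 3, 2, 4 -> soft label.
votes = torch.tensor([0., 0., 3., 1., 1., 0.])    # counts for Grade Groups 0-5
soft_label = votes / votes.sum()                  # expected distribution of opinions

logits = torch.rand(1, 6)                         # model output for this slide (placeholder)
log_probs = F.log_softmax(logits, dim=1)

# KL divergence between the predicted and expert distributions serves as the loss;
# the predicted distribution also doubles as a per-slide confidence estimate.
loss = F.kl_div(log_probs, soft_label.unsqueeze(0), reduction="batchmean")
```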
In the absence of such a model, it may be best to view the clinical implementation of machine learning algorithms as a way to focus pathologists on the most relevant tissue areas, to help reduce variability among pathologists, and to provide a second opinion at little cost, which would enhance care and treatment options for patients104,105 and serve as a safety check against any misdiagnosis.106–109
As expected, multiclassification tasks are very difficult for current AI models. As mentioned previously and highlighted in Figure 4, there is a noticeable downward trend in accuracy from the binary classification to the multiclassification tasks. Unsurprisingly, an increasing number of categories degrades precision and accuracy, because each additional category represents an increasingly smaller slice of all the possible outcomes. The lower accuracy of multiclassification tasks is also related to the subjectivity inherent in the Gleason grading system: Gleason Grade Groups are not based on mathematical measurements such as tumor length, quantity, or circumference. Although these could be taken into consideration by a pathologist, there is no set mathematical scale by which Gleason Grade Groups are classified. They are roughly based on the morphologic features present, but again, this is subject to the judgment of the pathologist or experts examining the H&E stain. Even among experts there is considerable variation in the classification of WSIs.103
As Gleason grading is a difficult, subjective measure, other features considered by pathologists may present attractive alternative objectives for AI to learn. Such features include the quantification of tumors, perineural invasion, and intraductal and cribriform patterns. Perineural invasion describes the invasion of cancer into the immediate periphery of nerves; because nerves can serve as a route for metastatic spread, its presence can portend metastasis. Some algorithms were trained to recognize this element in addition to other factors with marked success.110 Intraductal carcinoma, a proliferation of cancer in prostate ducts, can also be a training objective, and some algorithms take this morphologic feature into consideration when grading a WSI.107 Cribriform pattern within a tumor is associated with aggressive carcinoma and can serve as an indicator for areas of the prostate biopsy that need additional attention. Several developed algorithms can detect the presence of cribriform pattern.77,78
The presence of cribriform pattern, perineural invasion, intraductal carcinoma, or other details such as tumor length and density can factor into an expert's diagnosis and treatment plan. Future AI algorithms would be well advised to consider these features, as their prediction could aid pathologists greatly in their efficiency and decision making.
For those who worry that pathologists will be replaced by AI algorithms, comfort can be found in recent examples of parallel AI applications failing to adapt to the level of variation that exists in the real world. IBM attempted to develop an AI doctor, Watson, beginning in 2011, but after billions of dollars were invested the project was scrapped entirely by 2014.111 Although AI may one day be capable of replacing doctors or pathologists, that day has not yet arrived. On that day, if AI ever achieves general human-level intelligence, pathology will be just one among countless professions threatened by AI. Until then, refusing to use AI as a tool for augmenting a pathologist may be compared with stubbornly refusing email in favor of paper mail. Although counterarguments can be made, efficiency will likely favor willing pathologists who thoroughly vet and carefully choose suitable AI to amplify their output. Based on current trends, AI will most likely be limited to serving as an assistant for pathologists in the foreseeable future.
CONCLUSIONS
Clinical Implementation
In conclusion, machine learning algorithms will not replace human pathologists in the near future but may instead offer a useful tool to help decrease the work burden and increase the accuracy of practicing pathologists.69,80,84,112–114 One of the most well-known tools implemented to date is Paige Prostate (Paige AI).115,116 Currently, the Paige Prostate software is able to detect and label carcinoma as well as provide a Gleason score and other quantifiable measurements of the tumors. Paige Prostate has also demonstrated its ability to decrease the time pathologists spend examining slides and to increase the overall accuracy of assigned Gleason grades.99 One important consideration is that for a pathologist's time to be saved, there must be a level of trust in the algorithm; if there is little trust, no time will be saved, because the pathologist will spend time double-checking the algorithm's output.
Others found comparable results regarding accuracy improvement when machine learning methods were used.34,58 One specific tool, developed by researchers at the University of Wisconsin, improved the accuracy of 3 pathologists (from κ = 0.56–0.70 to κ = 0.88–0.93).58 Another tool, Galen Prostate (Ibex-AI), has been in use in Israel as a quality control tool for pathologists.110 Two other prostate screening tools recognized in the literature and cleared for clinical implementation by a governing regulatory body are Deep-Dx117 (DeepBio) and Inify94 (Inify Laboratories).
The expanding field of AI-integrated pathology presents promising possibilities.118 This innovative technology may decrease workloads and increase accuracy for practicing pathologists.116 One hurdle to clinical implementation, which may be better viewed as an opportunity, is the need for widespread adoption of whole slide scanners, because machine learning algorithms require high-quality digital slides that can be uploaded and evaluated easily by pathologists.21,119 Looking beyond the United States and Europe, the ability to upload slides and view them digitally from afar can help improve access to health care in the developing world, where pathologists are scarce.22,120
Limitations of This Study
This study is limited to the results returned by the initial Google Scholar search and those derived from the forward and backward search. We also limited our literature to studies published between 2018 and late 2022. Furthermore, some aspects of the sorting process were inherently subjective to the research assistant responsible for reviewing and categorizing the literature. Some generalizations were also made to present the 4 categories of classification.
We thank TJ Hart for his contributions in identifying relevant algorithms. We thank Brigham Young University for providing undergraduate research funds.
Author notes
Frewing and Gibson contributed equally to this work.
The authors have no relevant financial interest in the products or companies described in this article.
Supplemental digital content is available for this article at https://meridian.allenpress.com/aplm in the May 2024 table of contents.