Nodal metastasis of a primary tumor influences therapy decisions for a variety of cancers. Histologic identification of tumor cells in lymph nodes can be laborious and error-prone, especially for small tumor foci.
Our objective was to evaluate the application and clinical implementation of a state-of-the-art deep learning–based artificial intelligence algorithm (LYmph Node Assistant or LYNA) for detection of metastatic breast cancer in sentinel lymph node biopsies.
Whole slide images were obtained from hematoxylin-eosin–stained lymph nodes from 399 patients (publicly available Camelyon16 challenge dataset). LYNA was developed by using 270 slides and evaluated on the remaining 129 slides. To measure reproducibility, we also evaluated LYNA on slides from an independent laboratory (108 slides from 20 patients/86 blocks) digitized with a different scanner.
LYNA achieved a slide-level area under the receiver operating characteristic (AUC) of 99% and a tumor-level sensitivity of 91% at 1 false positive per patient on the Camelyon16 evaluation dataset. We also identified 2 “normal” slides that contained micrometastases. When applied to our second dataset, LYNA achieved an AUC of 99.6%. LYNA was not affected by common histology artifacts such as overfixation, poor staining, and air bubbles.
Artificial intelligence algorithms can exhaustively evaluate every tissue patch on a slide, achieving higher tumor-level sensitivity than, and comparable slide-level performance to, pathologists. These techniques may improve the pathologist's productivity and reduce the number of false negatives associated with morphologic detection of tumor cells. We provide a framework to aid practicing pathologists in assessing such algorithms for adoption into their workflow (akin to how a pathologist assesses immunohistochemistry results).
Reviewing sentinel lymph node biopsies for evidence of metastasis is an important component of breast cancer staging, directly informing both clinical stage and treatment decisions.1 However, reviewing lymph nodes for the presence of tumor cells is a tedious, time-consuming, and potentially error-prone process. Although obtaining additional sections from the tissue block and performing immunohistochemical (IHC) staining improve detection sensitivity, these techniques are associated with increased workload, costs, and reporting delays.
With the approval2 and gradual implementation of whole slide imaging for primary diagnosis, utilization of computer-aided image analysis is becoming more feasible in routine diagnostic settings. In recent years, deep learning,3 a kind of computer algorithm loosely inspired by biological neural networks, has significantly improved the ability of computers to identify objects in images.4,5 In medicine, deep learning has been used to diagnose referable diabetic retinopathy, diabetic macular edema, and skin cancer with accuracy comparable to that of board-certified ophthalmologists and dermatologists.6–8 Many other works using machine learning or deep learning for breast and other malignancies have been published.9,10 Moreover, deep learning–based algorithms accurately detect metastatic breast cancer in lymph nodes, based on both slide-level and tumor-level receiver-operating-characteristic performance metrics.11–13 The diagnostic accuracy of these algorithms is comparable to that of pathologists without time constraint, and significantly better than that of pathologists in a simulated time-constrained (1 minute per slide) environment.14
Computer algorithms may significantly improve a pathologist's workflow. However, from a clinical perspective, they have not achieved wide-scale acceptance, and from a regulatory perspective, they have not yet been fully examined. Despite good performance metrics, the safety and quality15 profiles of these algorithms have not been completely addressed. The practicing pathologist often has little technical understanding of the underlying algorithms, their diagnostic accuracy and error rates, or the utility of these programs in clinical practice.
In this study, we applied the current state-of-the-art algorithm (LYmph Node Assistant, or LYNA) to the Cancer Metastases in Lymph Nodes 2016 challenge dataset (Camelyon16),14 as well as a set of 108 images from a different source. We analyze LYNA's performance at 2 levels: the whole slide, and individual tissue patches or regions of interest (ROIs). We show that LYNA is unaffected by common histology artifacts such as poor fixation, overstaining, chatter, and coverslip bubbles. As such, the technology can be used by institutions with different histopathology laboratory procedures and digital infrastructure. LYNA did, however, at times falsely identify giant cells, germinal centers, and histiocytes as tumor foci. While these misidentified cells are readily diagnosed by an experienced pathologist, we hypothesize that image distortion may have confused the algorithm. We discuss these findings and show preliminary data on a small cohort of images indicating that, even without further modification, LYNA is capable of identifying other metastatic tumors in lymph nodes. Lastly, we anticipate that our quantitative performance analysis, examination of how the algorithm works, and comprehensive tissue-level error analysis can serve as a framework for evaluating future algorithms and implementing them into the clinical workflow.
MATERIALS AND METHODS
Study Design and Image Acquisition
We obtained data from 2 sources: the Camelyon1614 challenge containing 399 slides, and a separate dataset (“DS2”) that we digitized, containing 108 slides from 20 patients (86 tissue blocks). In the Camelyon16 dataset, 244 slides were digitized with a 3DHISTECH (Budapest, Hungary) Pannoramic 250 Flash II digital slide scanner and 155 slides were digitized with a Hamamatsu (Hamamatsu, Japan) XR C12000 digital slide scanner. DS2 slides were digitized with an Aperio (Leica Biosystems, Buffalo Grove, Illinois) AT2 whole-slide scanner. All slides were scanned at a resolution of approximately 0.24 μm per pixel. The number of slides containing metastasis and the purpose of each dataset are detailed in Table 1. The institutional review board waived the need for informed consent for the use of Camelyon16 slides. All work related to the DS2 dataset was approved by the contributing institution's review board.
Ground Truth Determination for Training and Validation
The Camelyon16 ground truth diagnoses were provided by the organizers and are described in detail elsewhere.14 Briefly, the slides were reviewed by 1 of 2 pathologists and cases that were interpreted as negative on hematoxylin-eosin (H&E) alone, or were otherwise considered diagnostically challenging, were subjected to IHC confirmation (anti-cytokeratin, CAM 5.2 [BD Biosciences, San Jose, California]).
For DS2, 2 US board-certified pathologists (with at least 10 years of experience) independently graded each slide as macrometastasis, micrometastasis, isolated tumor cells (ITCs), or negative. The pathologists also provided a brief description of the tumor location within micrometastasis- and ITC-containing slides for ease of adjudication. A third pathologist reviewed the slides when the 2 pathologists rendered conflicting diagnoses. The third pathologist referred to additional sections from the tissue block or reviewed the corresponding cytokeratin IHC stain for challenging cases. To remain consistent with the Camelyon16 challenge evaluation and College of American Pathologists guidelines (since ITCs are considered N0), we excluded 2 DS2 slides that contained only ITCs. For transparency, a pathologic analysis of LYNA's performance on these 2 slides is included in the Results section.
Computer Image Analysis: LYNA (Algorithm) Development
Our deep learning–based image analysis workflow is divided into 2 stages: algorithm development and algorithm application (Figure 1). To develop the algorithm, we randomly sampled square image patches of size 128 pixels at high power (≈32 μm on a side). This patch size was selected to encompass several cells and was also used by Litjens et al.11 The algorithm takes as input a larger square patch of size 299 pixels (≈75 μm) to provide additional context akin to how a pathologist reviews a slide. We used 299 pixels as the input size because it is the default input size of the deep learning architecture used, Inception (V3).16
A deep learning architecture is a series of mathematical operations arranged in a hierarchy of layers. Earlier layers tend to produce low-level image features (such as edges), and later layers use the low-level features to construct more abstract features (such as shapes).4 A simple tool for visual exploration of deep learning features is available at playground.tensorflow.org. While the operations and their order are predetermined by the architecture, the parameters of the operations are learned automatically, a process called training in the machine learning literature. The correct answers, termed labels, were provided by pathologists who outlined the tumors (if any) at the pixel level on each slide in the Camelyon16 dataset. When extracting image patches, we also extract the corresponding label of each tissue patch (benign: 0 or tumor: 1) and train the algorithm by repeatedly adjusting its weights to reduce its error on the image patches it has seen.
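The patch-and-label extraction described above can be sketched as follows. This is an illustrative example with hypothetical helper names and a toy array standing in for a slide, not the implementation used in this study:

```python
import numpy as np

INPUT_SIZE = 299   # context patch fed to the network (≈75 μm)
LABEL_SIZE = 128   # central region the prediction refers to (≈32 μm)

def extract_patch(slide, tumor_mask, cy, cx):
    """Hypothetical helper: crop a 299-pixel context patch centered at
    (cy, cx) and derive a 0/1 label from the central 128-pixel region
    of the pathologist-annotated tumor mask."""
    half_in, half_lb = INPUT_SIZE // 2, LABEL_SIZE // 2
    patch = slide[cy - half_in:cy + half_in + 1,
                  cx - half_in:cx + half_in + 1]
    label_region = tumor_mask[cy - half_lb:cy + half_lb,
                              cx - half_lb:cx + half_lb]
    label = int(label_region.any())   # tumor if any annotated pixel in the core
    return patch, label

# toy 1000 x 1000 "slide" with a small annotated tumor blob
slide = np.zeros((1000, 1000, 3), dtype=np.uint8)
mask = np.zeros((1000, 1000), dtype=bool)
mask[480:520, 480:520] = True
patch, label = extract_patch(slide, mask, 500, 500)
```

In practice, the 299-pixel context patch and its 128-pixel labeled core are sampled from gigapixel whole slide images; the toy array above merely illustrates the geometry.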
We improve upon a previously published algorithm12 by increasing the ratio of normal to tumor patches seen by the algorithm to 4:1 to reduce false positives. We further enhanced the computational efficiency of the training process, which improved the diversity of tissues “seen” by the algorithm during its training phase. These changes substantially increased both tumor-level sensitivity and slide-level area under the receiver operating characteristic (AUC, see Results). A detailed explanation of our deep learning algorithm is included in the Supplemental Digital Content available at www.archivesofpathology.org in the July 2019 table of contents.
Computer Image Analysis: Algorithm Usage
After training, LYNA was applied exhaustively across each slide. This creates a 2-dimensional table of numbers, where each number indicates the predicted tumor likelihood of the corresponding ≈32-μm square tissue patch. In practice, we only applied LYNA to patches that contained tissue, using a conservative threshold similar to that of Janowczyk and Madabhushi.9 These predictions were visualized as a heatmap, where blue indicates low and red indicates high likelihood, or with regions of high tumor likelihood highlighted (Figure 1). To obtain a slide-level prediction, we used the maximum predicted value across all ≈100 000 high-magnification (×40) patches in each slide.
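The exhaustive application and slide-level maximum can be illustrated with the following sketch; the function names are hypothetical, and any callable returning a tumor probability can stand in for LYNA:

```python
import numpy as np

def slide_heatmap(slide_gray, model, stride=128, tissue_thresh=0.8):
    """Illustrative sketch: score every stride x stride patch that contains
    tissue (darker than near-white background), producing a 2-D grid of
    tumor likelihoods. `model` is any callable returning a value in [0, 1]."""
    h, w = slide_gray.shape
    rows, cols = h // stride, w // stride
    heatmap = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = slide_gray[r * stride:(r + 1) * stride,
                               c * stride:(c + 1) * stride]
            if patch.mean() < tissue_thresh:   # skip near-white background
                heatmap[r, c] = model(patch)
    return heatmap

# toy 256 x 256 "slide" in which every patch counts as tissue
toy = np.full((256, 256), 0.5)
heatmap = slide_heatmap(toy, lambda p: float(p.mean()))
slide_score = float(heatmap.max())   # slide-level prediction = maximum patch score
```

The real pipeline runs at ×40 magnification over ≈100 000 patches per slide; the small toy array is for illustration only.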
Computer Image Analysis: Color Normalization
Hematoxylin-eosin slides typically vary dramatically in appearance across institutions because of factors such as tissue preparation, staining protocols, and oxidation in the laboratory. When these slides are digitized by a whole slide scanner, additional variation can be introduced (eg, by digital white balance). We found that normalizing for these variations12 improved algorithm performance when used in conjunction with the other improvements described above. Briefly, we transformed the colors into a hue-saturation-density space17 that accounts for the nonlinear relationship between stain (such as H&E) amount and pixel intensity values. We then applied a slide-specific transformation18 to match each slide's color statistics to those of a reference slide. This is a simplified version of the approach described by Bejnordi et al.19 For this work, we used the median color statistics across the training set as the reference; these statistics were not optimized for pathologist review. Figure 2, A through F, shows sample image patches before and after normalization by our approach. This color normalization was applied only for LYNA review and not for pathologists' review of the images in this study.
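A deliberately simplified sketch of statistics-matching color normalization is shown below. It matches per-channel mean and standard deviation in optical-density space to a reference slide; the published method operates in a hue-saturation-density space,17,18 so this is an approximation for illustration only:

```python
import numpy as np

def match_color_stats(img, ref_mean, ref_std, eps=1e-6):
    """Simplified sketch (not the study's implementation): shift each
    channel of an optical-density image so that its mean and standard
    deviation match a reference slide's statistics."""
    od = -np.log((img.astype(float) + 1) / 256.0)       # RGB -> optical density
    mu, sd = od.mean(axis=(0, 1)), od.std(axis=(0, 1))  # per-channel statistics
    od_norm = (od - mu) / (sd + eps) * ref_std + ref_mean
    out = 256.0 * np.exp(-od_norm) - 1                  # back to pixel intensities
    return np.clip(out, 0, 255).astype(np.uint8)

# demo on a random image with hypothetical reference statistics
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
normed = match_color_stats(img,
                           ref_mean=np.array([0.7, 0.9, 0.6]),
                           ref_std=np.array([0.3, 0.3, 0.3]))
```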
Unpacking the Black Box: Examining Mechanisms of LYNA's Predictions
We examined how LYNA made its predictions by computing the amount that each pixel affects LYNA's output prediction, in other words, the gradient of the output with respect to each input pixel. Next, we smoothed these gradients by averaging across several versions of the input image that have some noise artificially added to the pixels, a technique called SmoothGrad.20 In each input patch, we selected the 5% most important pixels. Using the percentile instead of a fixed value allowed this threshold to be invariant to the absolute gradient magnitudes in each image. Finally, we visualized the pixels in the original image that were within 1 μm of these “important pixels.” From our experience, because truly important pixels tended to cluster around a few cells, this final “importance dilation step” typically highlighted whole cells, making the final image more easily interpretable (Results).
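The SmoothGrad-style procedure can be sketched as follows, with `grad_fn` standing in for the computation of the network's input gradient (here replaced by a fixed toy gradient field) and the final 1-μm dilation step omitted for brevity:

```python
import numpy as np

def smoothgrad_mask(x, grad_fn, n=25, sigma=0.1, top_pct=5, seed=0):
    """Illustrative sketch: average the input gradient over noisy copies of
    the image, then keep the top 5% most important pixels. Using a percentile
    rather than a fixed cutoff makes the threshold invariant to the absolute
    gradient magnitudes in each image."""
    rng = np.random.default_rng(seed)
    grads = np.zeros_like(x, dtype=float)
    for _ in range(n):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        grads += np.abs(grad_fn(noisy))      # a real grad_fn would backpropagate
    grads /= n
    thresh = np.percentile(grads, 100 - top_pct)
    return grads >= thresh                   # boolean importance mask

# toy demo: a fixed "gradient field" stands in for the network gradient
w = np.arange(400, dtype=float).reshape(20, 20)
mask = smoothgrad_mask(np.ones((20, 20)), lambda z: w)
```

In our workflow, a final dilation of the selected pixels (by ≈1 μm) typically highlights whole cells, which makes the visualization easier to interpret.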
Evaluation Metrics
LYNA was evaluated by using metrics based on receiver-operating-characteristic (ROC) curves at the slide level and the tumor level, as in the Camelyon16 challenge.14 The slide-level area under the ROC curve (AUC) was used to assess the ability of LYNA to discriminate between benign and metastasis-containing slides. The tumor-level ROC (called free-response ROC, or FROC) was used to assess the sensitivity of LYNA for individual tumor foci at various numbers of false positives flagged per slide. We report the sensitivities at several false-positive rates: 0, 0.25, 1, and 8 per slide. The last 3 values were chosen to match the official Camelyon16 evaluation metrics, while “0” allows comparison with the reported performance of a pathologist in the challenge who made no false-positive diagnoses.
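A minimal sketch of the tumor-level (FROC-style) sensitivity computation is shown below; the candidate list, tumor identifiers, and function name are illustrative and do not reproduce the official Camelyon16 evaluation code:

```python
def froc_sensitivity(candidates, n_tumors, n_neg_slides, fp_per_slide):
    """Illustrative sketch: given candidate detections as (score, tumor_id)
    pairs, where tumor_id is None for a false positive, find the loosest
    threshold that stays within the allowed false positives per
    tumor-negative slide and report the fraction of tumor foci detected."""
    budget = fp_per_slide * n_neg_slides
    fps, detected = 0, set()
    for score, tumor_id in sorted(candidates, key=lambda c: -c[0]):
        if tumor_id is None:
            fps += 1
            if fps > budget:     # false-positive budget exhausted
                break
        else:
            detected.add(tumor_id)
    return len(detected) / n_tumors

# hypothetical candidates: 3 true tumor foci and 2 false positives
candidates = [(0.9, "t1"), (0.8, None), (0.7, "t2"), (0.6, None), (0.5, "t3")]
sens_at_1 = froc_sensitivity(candidates, n_tumors=3, n_neg_slides=1, fp_per_slide=1)
sens_at_0 = froc_sensitivity(candidates, n_tumors=3, n_neg_slides=1, fp_per_slide=0)
```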
RESULTS
Quantitative Evaluation of LYNA
LYNA operates by exhaustively generating a prediction for each tissue patch on every slide. Some examples of these predictions are shown in Figure 3, A through P. The maximum value (corresponding to the most suspicious tissue patch) across the ≈100 000 predictions in each slide is the slide-level prediction. Thus, we evaluated both the slide-level and the patch-level predictions. Using the Camelyon16 evaluation set, LYNA achieved a slide-level AUC (nodal metastasis: present or absent) of 99.3% (95% CI, 98.1%–100%), similar to the previous best of 99.4%. The second- and third-ranked teams achieved 97.6% and 96.4%, respectively.14 For comparison, the Camelyon16 challenge also tasked a practicing pathologist with evaluating the same slides digitally; they achieved an AUC of 96.6% without any time constraint. Under a time constraint of 1 minute per slide, the average AUC of 11 pathologists was significantly lower, at 81.0%. On the Camelyon16 dataset, setting LYNA's threshold to capture all of the positive cases (ie, 100% sensitivity and negative predictive value), which might be used, for example, to prioritize review of positive slides, results in a positive predictive value of 79% (49 of 62 slides). A threshold that captures all of the negative cases (ie, 100% specificity and positive predictive value), used, for example, to order IHC stains in advance (if consistent with institutional practice for reviewing negative cases), results in a negative predictive value of 96% (80 of 83 slides).
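The 2 triage thresholds described above (capture all positives and report positive predictive value; capture all negatives and report negative predictive value) can be sketched as follows, with hypothetical slide scores and labels:

```python
import numpy as np

def triage_values(scores, labels):
    """Illustrative sketch: the loosest threshold that keeps every positive
    slide (100% sensitivity; report PPV among flagged slides), and the
    tightest threshold below which every negative slide falls
    (100% specificity; report NPV among slides called negative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    t_all_pos = scores[labels].min()              # catch every positive slide
    ppv = labels[scores >= t_all_pos].mean()      # precision among flagged slides
    t_all_neg = scores[~labels].max()             # everything above is truly positive
    npv = (~labels)[scores <= t_all_neg].mean()   # fraction of "negatives" correct
    return float(ppv), float(npv)

# hypothetical slide-level scores (1 = metastasis present)
ppv, npv = triage_values([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0])
```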
However, when evaluated for its ability to detect all of the tumors on every slide, LYNA performed significantly better than the winning algorithm in Camelyon16. LYNA detected all 40 macrometastases without any false positives, and achieved 72% sensitivity (161 of 225 tumor foci; 95% CI, 63%–78%) for all metastases before the first false-positive patch-level prediction. When allowed 1 patch-level false-positive prediction per tumor-negative slide, LYNA achieved a sensitivity of 91% (205 of 225 tumor foci), halving the false-negative rate relative to the previous best result of 81%14 (183 of 225 tumor foci). The second- and third-ranked teams achieved 75% (168 of 225 tumor foci) and 73% (164 of 225 tumor foci), respectively.14 For comparison, the Camelyon16 practicing pathologist achieved a sensitivity of 72% (163 of 225 tumor foci) after spending 30 hours reviewing the 129 slides. Results at other false-positive thresholds are reported in Table 2.
LYNA also detected micrometastases in 2 “normal” slides (Normal_086 and Normal_144; Figure 3, A through F) that had not been detected by other Camelyon16 participants. These findings were brought to the attention of the Camelyon16 challenge organizers, who confirmed that tumor was indeed present in these slides. Fortunately, these were data processing errors and the patients' diagnoses were correct.
Of the 108 slides in the DS2 dataset, 2 contained only ITCs and were excluded from the performance metrics to match diagnostic guidelines21 and the Camelyon16 protocol,14 in which no slides contained only ITCs. LYNA confidently detected the ITCs in 1 of these 2 slides but not the other (Figure 3, M through P). On the remaining 106 slides in DS2, LYNA achieved a similar slide-level AUC of 99.6% (95% CI, 98.8%–100%). On the DS2 dataset, a threshold that captures all of the positive cases results in a positive predictive value of 87% (54 of 62 slides), and a threshold that captures all of the negative cases results in a negative predictive value of 95% (52 of 55 slides). We did not collect pixel-level annotations for DS2 and thus were unable to compute the tumor-level sensitivity. However, we conducted an exhaustive analysis of the patch-level errors (detailed in a following section).
In general, we observed that LYNA was insensitive to a variety of artifacts such as cautery, air bubbles, cutting artifacts, hemorrhage, necrosis, and poor processing. A sample field of view containing many of these artifacts and LYNA's predictions is shown in Figure 3, G and H.
False-Negative and False-Positive Slides
Next, we exhaustively analyzed all of the false positives in Camelyon16 and DS2 by using a threshold that resulted in no slide-level false negatives. The threshold used for the 2 datasets was slightly different because the case mixes were different (eg, macrometastasis versus micrometastasis; Table 1). We then divided these errors by their causes into either the histology or scanning workflow component. These results are summarized in Table 3. The histology-related errors involved fixation and tissue processing quality, and floaters or contaminants; the errors originating from the scanning workflow included out-of-focus (OOF) germinal centers, histiocytes, and multinucleated giant cells. These were, in general, easy for pathologists to rule out during the review process.
With these no-false-negative thresholds (100% sensitivity in 49 tumor-containing cases), LYNA had 84% specificity (67 of 80 slides) and a positive predictive value of 79% (49 of 62 slides) in the Camelyon16 dataset. At the level of tumor detection, the macrometastasis sensitivity was 100% (22 of 22 slides) at zero patch-level false positives. In other words, all false negatives were small foci. Frequently, these small foci were technically above the 200-μm size cutoff for micrometastasis but contained far fewer cells than the other micrometastasis criterion of 200 cells (usually <20 cells). In 1 slide (Test_051), many small extranodal tumor foci were surrounded by fat and in poor focus, resulting in missed detections. In cases such as this, the (larger) intranodal foci were detected correctly, leading to a correct case- and slide-level diagnosis. In the DS2 dataset, LYNA had 85% specificity (44 of 52 slides) and a positive predictive value of 87% (54 of 62 slides).
Unpacking the Black Box
To understand how LYNA worked, we examined the most important pixels used to make the prediction for each input image. In the metastatic cancer, LYNA focused its attention primarily on nuclear features, such as pleomorphism and hyperchromasia (Figure 4, A through D). In fields of view containing giant cells, LYNA focused on the crowded and overlapping nuclei (Figure 4, E and F). In fields of view containing OOF nuclei, LYNA focused on nuclei from different focal depths that appeared stacked together (Figure 4, G and H). In a field of view containing a capsular nevus (discussed in more detail in the next section), LYNA focused on the stacked and crowded nuclei (Figure 4, I and J). In a field of view containing a floater, LYNA again focused on the pleomorphic, hyperchromatic nuclei (Figure 4, K and L).
False-Positive Regions of Interest
Next, we reviewed all of the top patch-level predictions (ROIs) for each slide (including those lower than the 100% slide-level sensitivity threshold above). We discovered that several predictions would have triggered additional workflows such as deeper levels of H&E or IHC in actual clinical practice. Therefore, LYNA flagging these areas for review would benefit the clinical workflow by enabling pathologists to determine the next steps.
The first category of actionable false positives consisted of floaters or contaminants (henceforth termed floaters for brevity), which were detected by LYNA in 4 slides: Test_018, Test_044, Test_054, and Test_101 (Figure 5, A through D). The first 3 are suspicious and likely cancer, while the last is consistent with normal colonic tissue. To determine whether to penalize the algorithm for identifying these regions, and to better understand the clinical significance of these results, we conducted the following study. We asked 5 board-certified pathologists to review these 4 lymph node images as per their usual clinical workflow. To verify that any missed floaters were true errors rather than simply unreported findings, we asked the pathologists if they had seen the foci on their initial review. All of the pathologists arrived at the correct slide-level diagnosis (negative) for all 4 slides. Two of the 5 pathologists detected the floater in the first case, none detected the floater in the second case, all detected the floater in the third case, and 3 of the 5 detected the floater in the fourth case. Across the 4 cases, the 5 pathologists detected floaters in 1 of 4, 1 of 4, 2 of 4, 3 of 4, and 3 of 4 cases, respectively, on initial review, indicating that the use of LYNA might help detect a significant number of floaters or contaminants that would otherwise be missed in routine clinical settings.
In addition, LYNA detected epithelioid-appearing cells in the capsule in 2 images (Test_037 and Test_063; Figure 5, E and F). The challenge organizers (with access to the original slides, IHC, and the original case, eg, melanoma or breast cancer) confirmed that these were capsular nevi. Previous studies have shown that these cells, when present in small clusters in the capsule, can be diagnostically challenging on H&E alone when using glass slides.22,23 They can be even more difficult when reviewed in a digital format where color contrast and focus can be slightly different than when reviewing the corresponding glass slides using a microscope.
Thus, to determine whether to penalize the algorithm for finding these capsular nevi and to better understand the significance of these results, we conducted the following study. We asked 5 board-certified pathologists to review each slide. After this unbiased review, we asked whether their diagnosis or decisions changed after being directed to the specific region. For Test_037, 5 of 5 pathologists detected the ROI on first review, but requested IHC stains (5 of 5 for cytokeratin, and 3 of 5 for a melanocytic marker such as S100, Sox10, HMB-45, or MART1). Their decisions did not change when we directed them to the specific region. For Test_063, 4 of 5 pathologists detected the ROI on first review. Three of 5 requested cytokeratin IHC, and 1 also requested melanocytic markers. Interestingly, one of the pathologists who had seen the region on the initial review (but considered it benign) requested an IHC stain when we directed them to the region. The remaining pathologist, who had not identified the region on initial review, considered the region a nevus when asked. We expect that with the assistance of LYNA, nevi or melanoma (if present) will be more consistently detected and included in the differential diagnosis for lymph nodes examined for metastatic disease.
COMMENT
Our analysis shows that LYNA generated accurate slide-level and patch-level predictions while ignoring many types of artifacts and benign mimics of cancer (Figure 3, A through N; Table 2). LYNA detected all the macrometastases at a threshold corresponding to zero false positives, with a combined macrometastasis and micrometastasis sensitivity just below that of a human pathologist. Although detecting all of the individual tumor foci does not directly reflect clinical workflow (eg, after the detection of a macrometastasis, detecting additional smaller foci does not affect clinical staging), it is a useful proxy for tumor detection ability under the assumption that any given tumor could have appeared as the only focus in that case or slide. In this scenario, missed detections by either a pathologist or an algorithm will result in a false-negative case-level diagnosis. In addition, the tumor-level sensitivity assesses the ability both to detect metastasis-containing slides and to correctly locate each focus. This correctly penalizes an algorithm that produced a correct slide-level prediction by falsely identifying a benign region as tumor while missing the true tumor focus.
One of the evaluation metrics, the “zero false positive” tumor sensitivity, reveals additional insight into the algorithm's mode of operation and how it and similar algorithms can best be evaluated and used. First, note in Table 2 that the confidence intervals are wide but narrow with increasing allowed false positives. Because LYNA (and other similar algorithms) operates via exhaustive search of every slide at high-power magnification (approximately 100 000 fields of view per slide), there are many opportunities for error on each slide; a single false positive corresponds to a patch-level specificity of 99.999%. Therefore, if this performance metric is used for primary evaluation, it can be highly variable across studies. When the threshold is loosened, LYNA achieves a significantly higher tumor-level sensitivity with narrower confidence intervals. This also suggests that pathologists can leverage LYNA's exhaustive search and resultant high tumor-level sensitivity by first reviewing the top predicted tumor regions for each slide, ignoring false positives, and interpreting only the true-positive regions, such as for size measurements and vascular invasion. In this manner, algorithms such as LYNA can raise “alerts” for ROIs (such as tumor) and leave the interpretation of the tissue to pathologists.
Actionable False-Positive ROI
In our study, we find that in addition to metastases, LYNA detected 2 types of actionable false positives: floaters/contaminants and capsular nevi. The detection of floaters and contaminants, while often of marginal clinical interest, may occasionally prompt institution- or pathologist-specific protocols to investigate the origins of these findings.23–25 Our study of 4 slides indicates that consistent detection is challenging, likely because of the small size of these fragments and often random location on the slide. Sensibly, the detection rate of floaters and contaminants directly correlated with the size of these fragments: in order from smallest to largest floater, the detection rate was 0 of 5, 2 of 5, 3 of 5, and 5 of 5. In particular, the case with the smallest floater (≈75 μm; Figure 5, B) was not detected by any of the 5 pathologists on initial review. While LYNA was not developed to detect floaters, the fact that it does identify several (that were missed by pathologists simulating a routine workflow) presents another benefit of leveraging algorithms during sign-out.
Based on our study, capsular nevi were similarly challenging to diagnose by H&E alone: the 2 cases prompted cytokeratin IHC requests from 5 of 5 and 3 of 5 pathologists, respectively, and melanocytic IHC requests from 3 of 5 pathologists for the first image and 1 of 5 for the second. These data indicate that the capsular nevi prompted suspicion of breast cancer metastasis as well as a second malignancy of melanoma, or potentially a mislabeled slide. While our algorithm was not trained to detect these suspicious-looking cells from H&E alone, the fact that it does identify them could also be of benefit to practicing pathologists.
Other False-Positive ROI: Out of Focus
In our study, other classes of false positives (such as OOF giant cells, histiocytes, and germinal centers; see Figure 6, A through H) were easy for pathologists to rule out. However, in a select subset of instances, some cells were concerning enough to warrant either a glass-slide review or IHC staining to rule out tumor cells. A common theme among these false positives was “local” OOF that affects either individual cells or cellular compartments, and “regional” OOF that affects larger patches of tissue, such as entire scan lanes (Figure 6, I and J) or entire slides. We have also observed entire slides that were OOF (“global” OOF) resulting from dust or dirt on the glass slide or coverslip; these were resolved by cleaning the slide and rescanning. Local OOF might cause otherwise benign tissue to take on morphologic characteristics of tumor (such as indistinct cell-to-cell boundaries and packed nuclei). This level of OOF stems from the fundamental fact that each tissue section's thickness can exceed the optics' depth of field. The issue is exacerbated at higher magnifications, since depth of field decreases as magnification increases; this is why we constantly refocus when reviewing glass slides. Such “local” OOF affects the algorithm more than the pathologist in most, but not all, scenarios (difficult cases are still challenging to evaluate on a digitized slide with no ability to refocus). Correspondingly, solving this OOF issue may involve the use of scanners that enable refocusing. Regional OOF, on the other hand, can obscure high-power review of tissue entirely for both a pathologist and LYNA (Figure 6, J). In contrast to local OOF, regional and global OOF are problems that must be solved by improved scanner capability to correctly detect the focal plane of the tissue.
Unpacking the Black Box and Extension to Nonbreast Cancer
Based on the “unpacking the black box” study above, LYNA appeared to be learning sensible morphologic features of malignancy in lymph nodes. For example, LYNA seemed to be sensitive to large and pleomorphic nuclei and ductlike structures. We reasoned that LYNA might therefore generalize to other cancers present in lymph nodes. Although an in-depth analysis of multiple cancer types is beyond the scope of the current study, we had unintentionally digitized a few slides from nonbreast cancer cases in the course of validating our results. Nodes from 3 of these cases were positive for metastatic cancer, and LYNA correctly detected all 3. The 3 metastatic cases were adenocarcinoma of the colon (Figure 7, A and B), signet ring cell carcinoma of the colon (Figure 7, C and D), and papillary thyroid carcinoma (Figure 7, E and F), respectively. Although anecdotal, these results suggest that despite not having been developed with nonbreast specimens, LYNA may generalize to other metastatic cancers in the lymph node, possibly by identifying common tumor morphologic characteristics. Interestingly, signet ring cell carcinoma of the breast might similarly be detected by LYNA. We hypothesize that further development using other cancer types as training data will enable development of a general cancer detection algorithm for lymph nodes.
Utility of Image Analysis Algorithms in Clinical Practice
Despite some debate, there is general agreement that computer-assisted diagnosis using technologies such as deep learning could augment pathologists' workflows.24–26 One slide-level use case is to automatically flag negative or challenging slides for IHC staining before pathologist review, streamlining the review process.11 Used in this manner, all of the negative cases could be verified by up-front IHC at an excess cost of 3 slides in both datasets studied (staining 83 slides to verify the 80 negative slides in Camelyon16, and 55 slides to verify the 52 negatives in DS2). Conversely, the algorithm could be used to prioritize the review of positive cases to speed up sign-out for cases with positive nodes. Our data suggest that reviewing the 62 slides predicted to have the highest likelihood of tumor in either dataset would capture all of the positives, corresponding to reviewing 48% (62 of 129) of the slides in Camelyon16 and 58% (62 of 106) of the slides in DS2. Skipping review of the remaining slides (all of which were negative in our study) is possible in principle, but would require additional validation in other datasets. Another use case is a “second read” that flags missed metastases for review, particularly as part of an institutional or individual quality assurance protocol. Finally, a patch-level assisted-read mode could guide pathologists to highly suspicious regions, similar to the role of a junior resident indicating ROIs with marking ink on a physical glass slide. Concretely, at the patch level, LYNA predictions could be filtered to show only the highest tumor-likelihood regions based on a (potentially adjustable) threshold depending on the use case. For example, a “looser” threshold that indicates more regions might be desirable for an in-depth review, whereas a “tighter” threshold that highlights fewer regions might be desirable for a second read of negative cases. These filtered predictions can then be displayed in a small number of colors (such as 1 or 2) to help prioritize review.
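The triage and thresholding workflow described above can be sketched as follows. The function names and data layout are illustrative assumptions for this sketch, not LYNA's actual interface:

```python
# Illustrative sketch: slide triage and adjustable patch-level filtering.
# `slides` maps a slide id to its list of per-patch tumor likelihoods in [0, 1].

def slide_score(patch_likelihoods):
    """Summarize a slide by its highest patch-level tumor likelihood."""
    return max(patch_likelihoods)

def triage_order(slides):
    """Order slide ids for review, most suspicious first."""
    return sorted(slides, key=lambda s: slide_score(slides[s]), reverse=True)

def filter_patches(patch_likelihoods, threshold):
    """Keep only (index, likelihood) pairs at or above the threshold.

    A "looser" (lower) threshold surfaces more regions for in-depth review;
    a "tighter" (higher) threshold highlights fewer regions for a second read.
    """
    return [(i, p) for i, p in enumerate(patch_likelihoods) if p >= threshold]
```

For example, with `slides = {"A": [0.02, 0.97, 0.10], "B": [0.01, 0.03], "C": [0.40, 0.55]}`, `triage_order(slides)` returns `["A", "C", "B"]`, and `filter_patches(slides["A"], 0.5)` returns `[(1, 0.97)]`.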
On another note, the implementation of digital pathology and computer algorithms such as this could enable more accurate data collection for future American Joint Committee on Cancer guidelines.21 In breast cancer, for example, measuring a metastatic focus by its largest dimension is error prone: unless the focus is a perfect sphere, the true largest dimension can be missed, depending on the sectioning protocol and specimen orientation. Automated analysis using computer algorithms would enable more exhaustive sectioning and more accurate measurement of tumor size for both clinical workflow and research purposes.
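As a minimal sketch of such an automated measurement, the largest dimension of a segmented focus within one section can be computed as the maximum distance between any two tumor pixels in a binary mask. The mask format and function name here are illustrative assumptions; a production implementation would use an efficient convex-hull method rather than this brute-force pairwise search:

```python
import math
from itertools import combinations

def largest_dimension(mask, pixel_size_mm=1.0):
    """Largest dimension (maximum Feret diameter) of a tumor focus.

    `mask` is a 2D list of 0/1 values from a hypothetical tumor
    segmentation; the result is the maximum Euclidean distance between
    any two tumor pixels, scaled to millimeters. Brute force, O(n^2)
    in the number of tumor pixels: adequate for a sketch only.
    """
    points = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    if len(points) < 2:
        return 0.0
    return max(math.dist(a, b) for a, b in combinations(points, 2)) * pixel_size_mm
```

For example, a mask with tumor pixels at (0, 0) and (3, 4) yields a largest dimension of 5.0 pixel units, or 1.25 mm at an assumed 0.25-mm pixel size.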
We have conducted a thorough analysis of the error modes of the LYNA algorithm to allow pathologists to evaluate its strengths and weaknesses, much as we need to understand the potential false positives and false negatives of an IHC stain before clinical use.27,28 In addition, we have evaluated LYNA's generalizability to specimens from a different institution, prepared by using a different protocol, and digitized by using a different scanner. Finally, we have identified the image features that trigger LYNA in order to “open the black box.” We hope that this work provides a template for the evaluation of future histopathology artificial intelligence algorithms for clinical use.
Despite promising results, our study has some limitations. First, LYNA operates on a 75-μm field of view, so it lacks context about the anatomic position of the current field of view and cannot automatically make position-dependent determinations such as extranodal extension and lymph-vascular invasion. More generally, LYNA currently cannot compare the current field of view with similar cells in less ambiguous regions of the same slide or case, as a pathologist would. Moreover, although our analysis provided valuable insight, the study would be strengthened by additional cases and slides to detect the rarer error modes. Finally, this work does not directly evaluate the effects on work efficiency or accuracy when LYNA is used for diagnosis of lymph node slides; this is the subject of future work.29
We thank Greg Corrado, PhD, and Philip Nelson, PhD, for their advice and guidance in enabling this work, Craig Mermel, MD, PhD, for helpful comments on the manuscript, James Wren, MPH, for administrative support, and Josh Pomorski, BS, for data collection. We thank members of the Google AI Pathology team for software infrastructure and logistical support, and slide digitization services. Gratitude also goes to pathologists Kathy Brady, MD, Imok Cha, MD, Steve Cordero, MD, Chris Kim, MD, and one other pathologist for assistance in interpreting images as part of the floater or nevi studies, or the DS2 dataset. Thanks also go to Hossein Talebi, PhD, for helpful discussions about color normalization. Last but not least, we are grateful to the Camelyon16 organizers for creating the challenge, data access, and helpful discussions in clarifying image findings and performance evaluation.
Supplemental digital content is available for this article at www.archivesofpathology.org in the July 2019 table of contents.
Drs Liu, Kohlberger, Norouzi, Dahl, Peng, Hipp, and Stumpe are employees of Google Inc and own stock in the company. Dr Smith was compensated by Google LLC for pathology annotations for a prostate cancer project and primary cancer classification unrelated to this manuscript. Drs Mohtashamian and Olson have no relevant financial interest in the products or companies described in this article.
The views expressed in this article are those of the authors and do not necessarily reflect the official policy or position of the Department of the Navy, Department of Defense, or the US Government.