Context.—

Breast carcinoma grade, as determined by the Nottingham Grading System (NGS), is an important criterion for determining prognosis. The NGS is based on 3 parameters: tubule formation (TF), nuclear pleomorphism (NP), and mitotic count (MC). The advent of digital pathology and artificial intelligence (AI) has broadened interest in virtual microscopy using digital whole slide imaging (WSI).

Objective.—

To compare concordance in breast carcinoma grading between AI and a multi-institutional group of breast pathologists using digital WSI.

Design.—

We developed an automated NGS framework using deep learning. Six pathologists and the AI independently reviewed 1 digitally scanned slide from each of 137 invasive carcinomas and assigned a grade based on scoring of TF, NP, and MC.

Results.—

Interobserver agreement for the pathologists and AI for overall grade was moderate (κ = 0.471). Agreement was good (κ = 0.681), moderate (κ = 0.442), and fair (κ = 0.368) for grades 1, 3, and 2, respectively. Observer pair concordance for AI and individual pathologists ranged from fair to good (κ = 0.313–0.606). Perfect agreement was observed in 25 cases (27.4%). Interobserver agreement for the individual components was best for TF (κ = 0.471), followed by NP (κ = 0.342), and was worst for MC (κ = 0.233). Concordance did not differ between pathologists alone and pathologists + AI.

Conclusions.—

Ours is the first study comparing concordance in breast carcinoma grading between a newly developed WSI AI methodology and a multi-institutional group of breast pathologists using virtual microscopy. Using explainable methods, AI demonstrated concordance similar to that of pathologists alone.

Over the years, microscopic evaluation of hematoxylin-eosin–stained slides from patient samples has been the gold standard for cancer diagnosis and grading of malignancy. The Nottingham Grading System (NGS) is widely used for breast cancer grading and is based on 3 variables: tubule formation (TF), nuclear pleomorphism (NP), and mitotic count (MC).1 Grading of breast carcinoma is a critical prognostic feature in the evaluation of early-stage breast carcinoma.1–6 As such, the most recent edition of the American Joint Committee on Cancer (AJCC) Cancer Staging Manual incorporates grade (along with biomarker status, ie, estrogen receptor, progesterone receptor, and human epidermal growth factor receptor 2 [HER2]) and molecular testing results7 into staging groups. Doing so has resulted in a staging system that outperforms TNM alone.8

Although pathologists undergo several years of extensive training to become proficient in the NGS, reproducibility amongst pathologists is imperfect.4,9–15 Therefore, tools to improve breast carcinoma grading concordance are continually being sought and evaluated.

Digital whole slide imaging (WSI) allows glass slides to be digitally scanned at a high resolution for viewing on a screen. Developments in this technology have allowed virtual microscopy (VM) to gain traction beyond educational, research, image analysis, and quality assurance settings, and into the clinical domain.16–21 With the US Food and Drug Administration approval of some platforms for diagnostic use, increased utilization of digital pathology in routine practice is expected.22 While data regarding the variability in breast carcinoma grading using VM are limited, we demonstrated moderate concordance in breast carcinoma grading amongst a multi-institutional group of breast pathologists using VM.11 Along with VM, artificial intelligence (AI) and machine learning techniques have pervaded the domain of digital pathology in recent years. Artificial intelligence coupled with WSI can offer a potential solution in situations with unacceptable interobserver variation, low reproducibility, high subjectivity, and low accuracy. To date, there have been several diagnostic applications of AI in breast pathology, such as primary tumor detection,23–29 metastatic lymph node detection,30 breast cancer grading,31,32 breast cancer subtyping,31 receptor status and intrinsic subtype assessment,31,33–36 and assessment of tumor heterogeneity34,37 and tumor microenvironment.38–40 In addition to these applications, deep learning neural networks have been developed for identifying and quantifying mitosis,41–44 assessment of tubule formation,45 and detection and scoring of nuclear pleomorphism.37,46–48 While most studies have developed deep learning algorithms for the 3 individual components of the NGS, relatively few have applied machine learning to all 3 components and assessed AI's performance in the NGS compared to grading by pathologists.49

Considering the developments in AI and various novel algorithms being developed for the assessment and scoring of breast cancers, we sought to develop a combination of deep learning algorithms for the NGS and evaluate its performance against a multi-institutional group of 6 academic breast pathologists using digital WSI.

Patient Cohort

We identified consecutive cases of invasive breast carcinoma from a single year (2016) in the pathology files at New York-Presbyterian Hospital/Weill Cornell Medicine (New York, New York). Exclusion criteria included cases of microinvasive carcinoma, cases with insufficient tumor area to perform formal MCs, and cases treated with neoadjuvant chemotherapy. The final cohort consisted of 143 consecutive invasive breast carcinomas.

One pathologist (P.S.G.) reviewed archived hematoxylin-eosin slides, and 1 representative slide for each case was selected to be scanned into the digital WSI platform. Clinicopathologic variables including age, tumor size, hormone receptor status, HER2 status, and lymph node involvement were collected. The ethical considerations regarding our study are covered by the Institutional Review Board (IRB)–approved protocol entitled "Immunologic and Molecular Analysis of Breast Disease" at our institution. The IRB approval number is 0411007570, and the protocol was approved on December 17, 2020.

Digital Whole Slide Scanning

Slides were scanned at ×40 magnification in a single z-plane with an Aperio AT2 whole slide scanner (Leica Biosystems, San Diego, California). The scanned digital WSIs were evaluated for quality and to ensure that they were in focus. De-identified digital files in .svs format were stored on an image server for remote evaluation, which was performed with the Aperio ImageScope application (Leica Biosystems, Buffalo Grove, Illinois).

Breast Carcinoma Grading by Pathologists

Six pathologists (P.S.G., R.I., T.M.D., S.F., S.Ja., and M.H.) independently reviewed the digital WSIs and were instructed to grade the carcinomas according to established NGS criteria for TF, NP, and MC.1,7 To perform MC, the pathologists were instructed to annotate a total area of 2.38 mm², corresponding to 10 high-power fields viewed through an eyepiece with a field diameter of 0.55 mm. Within this area, mitotic counts of 8 or fewer, 9 to 17, and 18 or greater were scored as 1, 2, and 3, respectively. Pathologists had a median of 14 years of experience (range, 4–25 years), and all had subspecialty interest and/or fellowship training in breast pathology. The pathologists were blinded to the original reported grade and other clinicopathologic parameters.
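
For illustration, the mitotic count cutoffs above translate directly into code. The following is a minimal sketch of the scoring rule as stated, not the study's implementation:

```python
def mitotic_count_score(mitoses_in_area: int) -> int:
    """Map a mitotic count over 2.38 mm^2 (10 high-power fields at a
    0.55-mm field diameter) to a Nottingham MC score of 1 to 3, using
    the cutoffs stated above (<=8 -> 1, 9-17 -> 2, >=18 -> 3)."""
    if mitoses_in_area <= 8:
        return 1
    if mitoses_in_area <= 17:
        return 2
    return 3
```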

Machine Learning Methods

Machine learning methods were applied to the cohort of 143 cases. The files for 4 cases were corrupted, and 2 cases were removed because of computational issues. As such, the final cohort consisted of 137 WSIs, of which 46 were randomly chosen for the training set and the remaining 91 formed the test set. There were no significant differences in clinicopathologic variables between the training and test sets (Table 1). WSIs were annotated for the invasive tumor region and leading edge by a single pathologist (L.K.) (Figure 1). Areas of in situ carcinoma as well as wide areas of fibrosis and necrosis were excluded from the annotations. For the machine learning models, the ground truth was defined as the grade reported in the original pathology report.
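
A minimal sketch of this setup, assuming Python with SciPy; the study's split procedure and statistical code are not published, so the seed and the example contingency table below are illustrative:

```python
import random
from scipy.stats import chi2_contingency

def split_cases(case_ids, n_train=46, seed=0):
    """Randomly assign cases to training and test sets (46/91 here)."""
    rng = random.Random(seed)
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical 2 x k contingency table for one clinicopathologic variable
# (rows = training/test, columns = categories), analogous to Table 1.
table = [[12, 30, 4],
         [25, 58, 8]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, P = {p:.3f}")  # P >= .05: no significant difference
```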

Table 1

Clinicopathologic Features of Patients in Training and Test Sets

Figure 1

Block diagram of the artificial intelligence training and testing methods. Abbreviations: FCN, fully convolutional network; HPF, high-power field; MC, mitotic count; NP, nuclear pleomorphism; TF, tubule formation.


Artificial Intelligence Training for Tubule Formation

For the TF training model, we selected 25 cases from the 46 WSIs and used the remaining 21 cases for validation. From the 25 WSIs, we selected 50 images of 1024 × 1024 pixels at ×10 magnification within the tumor region. Images at ×10 capture tubule and sheet patterns more completely and require less computational power than images at ×40. We performed stain normalization on these images by using the Vahadane50 technique and extracted 1911 patches of 256 × 256 pixels, followed by data augmentation, to train the models. The annotations for training the semantic segmentation algorithm consisted of 1303 tubule annotations and 2375 sheet annotations. The model architecture consisted of a U-Net51 with a DenseNet52 backbone and preloaded ImageNet53 weights, trained for 12 epochs on an NVIDIA RTX 2080 Ti GPU (graphics processing unit), using categorical cross entropy and Dice loss as loss functions.
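
The study's training code is not published, but an equivalent configuration can be sketched with the open-source segmentation_models Keras package, which provides a U-Net with a DenseNet backbone and preloaded ImageNet weights; the optimizer, batch size, and class layout below are assumptions:

```python
import segmentation_models as sm

# U-Net decoder over a DenseNet121 encoder initialized with ImageNet weights.
model = sm.Unet(
    backbone_name="densenet121",
    encoder_weights="imagenet",
    classes=3,                 # tubule, sheet, background (assumed layout)
    activation="softmax",
)

# Categorical cross entropy plus Dice loss, as described above.
loss = sm.losses.CategoricalCELoss() + sm.losses.DiceLoss()
model.compile(optimizer="adam", loss=loss, metrics=[sm.metrics.IOUScore()])

# x_train: (N, 256, 256, 3) stain-normalized patches; y_train: one-hot masks.
# model.fit(x_train, y_train, epochs=12, batch_size=8)
```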

To grade TF, we used the semantic segmentation output to calculate the percentage area of tubules with respect to the total tumor area within the annotated region. During validation experiments, we found that the percentage cutoffs given by the official guideline did not correlate well with the original reported scores. Reasoning that the human eye estimates the percentages of tubule and sheet area, whereas our models compute these areas exactly, we modified the guideline cutoffs accordingly. Supplemental Table 1 (see Supplemental digital content at https://meridian.allenpress.com/aplm in the November 2022 table of contents) shows the official and modified guidelines. With these modified cutoffs for tubule scoring, we observed improved correlation between AI and ground truth for grading TF (Supplemental Table 2).
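
Because the modified cutoffs live in Supplemental Table 1, the sketch below uses the official Nottingham cutoffs (>75% tubules = score 1; 10%–75% = score 2; <10% = score 3) over a predicted segmentation mask; substituting the modified cutoffs changes only the two thresholds:

```python
import numpy as np

def tubule_formation_score(mask: np.ndarray) -> int:
    """Score TF from a semantic segmentation mask in which 1 = tubule
    and 2 = sheet (label layout assumed). Uses the official Nottingham
    percentage cutoffs; the study's modified cutoffs work analogously."""
    tubule_area = np.count_nonzero(mask == 1)
    tumor_area = np.count_nonzero(mask > 0)
    pct_tubules = 100.0 * tubule_area / max(tumor_area, 1)
    if pct_tubules > 75:
        return 1
    if pct_tubules >= 10:
        return 2
    return 3
```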

Artificial Intelligence Training for Nuclear Pleomorphism

For the NP training model, we selected 25 cases from the 46 WSIs; the remaining 21 cases were used for validation. From the 25 WSIs, we selected 50 images of 1024 × 1024 pixels at ×40 to capture finer nuclear detail for segmentation within the tumor region. These 50 images were annotated for instance segmentation, with 20 000 tumor cell annotations. Stain normalization was performed on the 50 images by using the Vahadane method, and 5000 patches were created, followed by data augmentation. We trained HoVerNet,54 with pretrained ImageNet ResNet50-Preact55 weights, for 100 epochs on an NVIDIA Tesla V100 GPU, using categorical and binary cross entropy, Dice loss, and the HoVer loss54 as loss functions.
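
The composite objective can be sketched as follows; this is a simplified, same-spirit stand-in for the published HoVer-Net loss (cross-entropy and Dice terms on the nuclear-pixel branch plus a regression term on the horizontal/vertical distance maps), with equal branch weights assumed:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-6):
    # Soft Dice loss over the predicted nucleus probability map.
    inter = tf.reduce_sum(y_true * y_pred)
    denom = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def hovernet_style_loss(np_true, np_pred, hv_true, hv_pred):
    """Cross entropy + Dice on the nuclear-pixel branch, plus mean
    squared error on the horizontal/vertical ("HoVer") distance maps."""
    ce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(np_true, np_pred))
    hover = tf.reduce_mean(tf.square(hv_true - hv_pred))
    return ce + dice_loss(np_true, np_pred) + hover
```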

For any given WSI, the model segmented individual tumor cells and contoured them as individual entities. Using these contours, the morphometry of each cell was computed, giving the area and perimeter of the cell or nucleus. We performed area thresholding and removed outliers from the tumor cell distribution by keeping the population within the interquartile range. We then measured the standard deviation of these tumor cell areas and scored NP according to Supplemental Table 3.
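
As a sketch of this postprocessing (the calibrated thresholds sit in Supplemental Table 3, so t1 and t2 below are placeholders):

```python
import numpy as np

def nuclear_pleomorphism_score(cell_areas, t1, t2):
    """Score NP from per-cell areas (in pixels) derived from the
    segmentation contours: drop outliers outside the interquartile
    range, then threshold the standard deviation of the remainder.
    t1 < t2 are placeholder cutoffs standing in for the calibrated
    thresholds in the study's Supplemental Table 3."""
    areas = np.asarray(cell_areas, dtype=float)
    q1, q3 = np.percentile(areas, [25, 75])
    kept = areas[(areas >= q1) & (areas <= q3)]  # IQR population only
    sd = kept.std()
    if sd < t1:
        return 1
    if sd < t2:
        return 2
    return 3
```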

Tumor-rich regions were identified during tubule scoring, which calculated the total tumor area. We then took the top 10 images of 4096 × 4096 pixels at ×40 from these tumor-rich regions, passed them through the model to segment tumor cells, and postprocessed the output to score NP. We used the 21 validation WSIs to develop thresholds for the standard deviation of the tumor cell area and to calibrate these thresholds for scoring NP. The thresholds ultimately set are shown in Supplemental Table 3.

Artificial Intelligence Training for Mitotic Count

To train the model for MC, we split the 46-WSI training data set into 36 cases for training and 10 cases for validation. We used more WSIs for MC because more data were required to train the model to recognize mitotic figures. We extracted 1320 patches of 1024 × 1024 pixels at ×40 to capture finer nuclear detail for detection; these patches were used for training and validating our models. Stain normalization was performed on the patches with the Vahadane method, followed by data augmentation of the training set. The model architecture consisted of a LinkNet56 with an EfficientNet-B457 backbone, trained for 15 epochs on an NVIDIA Tesla V100.

For the digital equivalent of a high-power field (HPF), we used an image of 2048 × 2048 pixels (approximating 1 HPF with a field diameter of 0.55 mm at a conversion factor of 1 pixel = 0.2476 μm). To score MC, we extracted HPFs of 2048 × 2048 pixels at ×40 from the leading-edge annotations.
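
A short arithmetic check of this conversion:

```python
# At 1 pixel = 0.2476 um, a 2048 x 2048-pixel tile spans ~0.507 mm per
# side (~0.257 mm^2), which the study treats as the digital equivalent
# of 1 HPF with a 0.55-mm field diameter.
PIXEL_SIZE_UM = 0.2476
side_mm = 2048 * PIXEL_SIZE_UM / 1000.0
print(f"tile side = {side_mm:.3f} mm")         # ~0.507 mm
print(f"tile area = {side_mm ** 2:.3f} mm^2")  # ~0.257 mm^2
```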

We attempted 2 counting methods to score mitosis: (1) we considered the 10 most mitotically active HPFs and aggregated their counts to determine the final MC, and (2) we considered the first 10 HPFs with at least 1 mitosis and aggregated their counts to determine the final MC. The second approach correlated better with the ground truth MC score (Supplemental Table 4), so we applied it to the test set. Figure 2, A through F, shows sample input-output predictions of AI for TF, NP, and MC.
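
The 2 aggregation strategies are easy to state in code (assuming, as the wording suggests, that "aggregated" means summing the per-HPF counts):

```python
def mc_top10(hpf_counts):
    """Method 1: sum the 10 most mitotically active HPFs."""
    return sum(sorted(hpf_counts, reverse=True)[:10])

def mc_first10_with_mitosis(hpf_counts):
    """Method 2 (applied to the test set): sum the first 10 HPFs, in
    order along the leading edge, that contain at least 1 mitosis."""
    nonzero = [c for c in hpf_counts if c >= 1]
    return sum(nonzero[:10])

# Hypothetical per-HPF AI mitosis detections along the leading edge;
# the aggregate then feeds the 1/2/3 cutoffs used for the MC score.
counts = [0, 2, 0, 1, 3, 0, 0, 1, 1, 0, 2, 4, 0, 1, 1, 0, 2, 1]
print(mc_top10(counts), mc_first10_with_mitosis(counts))
```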

Figure 2

Examples of artificial intelligence (AI) predictions for tubule formation, nuclear pleomorphism, and mitotic counts. A, Invasive carcinoma with varying degrees of tubule formation. B, AI predictions of tubules (yellow) and sheets (blue). C, Invasive carcinoma with varying nuclear pleomorphism. D, AI predictions show instance segmentation of tumor cells. E, Invasive carcinoma with mitosis. F, AI prediction of mitotic figure (green circle) (hematoxylin-eosin, original magnifications ×10 [A and B] and ×40 [C through F]).


Statistical Analysis

AI models for each parameter of the NGS were trained against the ground truth in the training set and subsequently evaluated in the independent test set. The association between clinicopathologic variables in the training and test sets was assessed by using the χ² test. Comparisons of agreement were performed on 2 groups: group 1 included pathologists only, and group 2 included pathologists and AI. The Fleiss κ for overall agreement amongst all pathologists and AI was calculated for overall grade and the individual components. The Cohen κ was calculated for pairwise comparisons. κ statistic levels of agreement were defined as follows: 0.20 or less = slight; 0.21 to 0.40 = fair; 0.41 to 0.60 = moderate; 0.61 to 0.80 = good; and 0.81 to 1.00 = very good.58,59 Statistical significance was based on a P value of <.05 (2-tailed). All analyses were performed with Python and Microsoft Excel.
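
These agreement statistics can be reproduced with standard Python packages; a minimal sketch, assuming statsmodels and scikit-learn (the toy ratings below are illustrative, with observers as columns):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from sklearn.metrics import cohen_kappa_score

# Toy data: 3 cases x 7 observers (6 pathologists + AI), grades 1-3.
ratings = np.array([[1, 1, 2, 1, 1, 2, 1],
                    [3, 3, 3, 2, 3, 3, 3],
                    [2, 2, 2, 2, 2, 2, 2]])

counts, _ = aggregate_raters(ratings)        # case x category count table
print("Fleiss kappa:", fleiss_kappa(counts))

# Pairwise Cohen kappa, eg, AI (last column) versus pathologist 1.
print("Cohen kappa:", cohen_kappa_score(ratings[:, -1], ratings[:, 0]))

def agreement_band(kappa):
    """Map a kappa value to the bands used in the study."""
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "good"), (1.00, "very good")]
    return next(label for cutoff, label in bands if kappa <= cutoff)
```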

Accuracy of Artificial Intelligence With Respect to Reported Grade

Compared to the reported grade in the test set, AI demonstrated an accuracy of 0.659 for overall grade, 0.758 for TF, 0.802 for NP, and 0.659 for MC. The accuracy of AI was 0.70 for grade 1 and 2 tumors and 0.52 for grade 3 tumors, compared to the reported grade.

Accuracy of Artificial Intelligence and All 6 Pathologists With Respect to Reported Grade

We used the weighted average of the F1-score to assess the performance of AI and the 6 pathologists with respect to the original reported grade. AI and P1 achieved comparable F1-scores of 0.664 and 0.663, respectively. Observers P2, P3, and P5 achieved F1-scores greater than 0.75 (0.78, 0.76, and 0.79, respectively), and P4 and P6 achieved F1-scores of 0.72 and 0.694, respectively.
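
The weighted-average F1-score weights each grade's F1 by its prevalence among the reported grades; a sketch with scikit-learn (toy labels):

```python
from sklearn.metrics import f1_score

y_true = [1, 2, 2, 3, 1, 3, 2]  # original reported grades (toy)
y_pred = [1, 2, 3, 3, 1, 2, 2]  # one observer's grades (toy)

# average="weighted": per-grade F1 weighted by support in y_true.
print(f1_score(y_true, y_pred, average="weighted"))
```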

Agreement Between Artificial Intelligence and All 6 Pathologists in Breast Carcinoma Grading

Among the 91 test cases, perfect agreement was observed between all 6 pathologists and AI in 25 cases (27.4%) for overall histologic grade (Table 2). Perfect agreement was achieved in 9 grade 1 carcinomas (9.89%), 8 grade 2 carcinomas (8.79%), and 8 grade 3 carcinomas (8.79%). Discordance between grades 1 and 2 only was observed in 15 cases (16.4%) and between grades 2 and 3 only in 42 cases (46.1%); no cases demonstrated a discrepancy between grades 1 and 3 only. Nine of 91 cases (9.89%) received all 3 grades (1, 2, and 3) across the 7 observers. The concordance between AI and the pathologists was similar to the concordance among pathologists alone.

Table 2

Cases With Perfect Agreement Amongst Pathologists Only and With Artificial Intelligence (AI) in Breast Carcinoma Grading (n = 91)


For the individual components, perfect agreement was reached for TF in 41 cases (45%), for NP in 23 cases (25.2%), and for MC in 11 cases (12.08%). Amongst the individual components, the highest perfect agreement among pathologists and AI was observed for grade 3 TF category (39 cases; 42.8%), grade 2 NP category (13 cases; 14.2%), and grade 1 MC category (10 cases; 10.9%).

Variability Between Pathologists and Artificial Intelligence in Breast Carcinoma Grading

Overall interobserver agreement for grade amongst all pathologists and AI was moderate (κ = 0.471), with the best agreement for grade 1 (κ = 0.681), followed by grade 3 (κ = 0.442), and only fair agreement for grade 2 (κ = 0.368) (Table 3). Similar interobserver agreement was observed when pathologists only were evaluated. For observer pairs, concordance for overall grade ranged from fair to good (κ = 0.313–0.623) (Table 4). For the individual components, concordance among observer pairs ranged from fair to good for TF (κ = 0.245–0.648), from slight to good for NP (κ = 0.148–0.616), and from slight to moderate for MC (κ = 0.009–0.439). Interobserver agreement amongst all pathologists and AI was fair to moderate for the individual components, with κ values of 0.233, 0.342, and 0.471 for MC, NP, and TF, respectively. For the individual categories of the grade components, the degree of agreement amongst all pathologists and AI ranged from slight to good, with the least concordance for MC score 2 (κ = 0.092) and the best concordance for TF score 3 (κ = 0.634).

Table 3

Interobserver Variability for Grade and Individual Grading Components Between Pathologists Only and With Artificial Intelligence (AI)

Table 4

Pairwise Cohen κ for Overall Grade, Tubule Formation, Nuclear Pleomorphism, and Mitotic Count


Variability Between Pathologists and Artificial Intelligence in Breast Carcinoma Grading Based on Type of Breast Carcinoma

Overall interobserver agreement for grade amongst all pathologists and AI was moderate for invasive ductal carcinoma (κ = 0.4938) and slight for invasive lobular carcinoma (κ = 0.0473). The best agreement was observed for cases of other special types of invasive carcinoma (κ = 0.6692), which included 1 tubular carcinoma, 2 invasive solid papillary carcinomas, 1 invasive tubulolobular carcinoma, and 1 invasive carcinoma with squamous metaplastic features (Table 5).

Table 5

Fleiss κ for Overall Grade Based on Type of Breast Carcinoma


Histologic grading is one of the most important prognostic features in the evaluation of early-stage breast carcinoma; however, manual grading of breast carcinoma by pathologists is imperfect, with only moderate to good interobserver agreement.12,60–65 As interest in digital WSI and VM increases, demonstrating reasonable concordance amongst pathologists using these digital platforms is of the utmost importance. Few studies have compared the performance of pathologists on digital WSI versus glass microscopy. Davidson et al9 found that while interobserver agreement for the NGS amongst pathologists using glass slides or WSI was similar, agreement for the NGS on WSI was slightly, but not significantly, lower. Our prior study using VM with a multi-institutional cohort of pathologists showed moderate concordance for breast cancer grading, similar to studies using light microscopy.11 As VM has become increasingly utilized, so have AI and machine learning techniques that may be used to reduce unacceptable interobserver variation in the histologic evaluation of breast specimens. Our study applied machine learning to WSIs to evaluate AI's performance in the NGS compared to grading by pathologists.

Agreement between pathologists only (group 1; 6 observers) and AI + pathologists (group 2; 7 observers) was comparable, with moderate concordance for overall histologic grade and TF and fair concordance for overall NP and overall MC. Except for NP, agreement for categories 1 and 3 was better than for category 2. Interestingly, with AI added as a seventh observer, agreement increased slightly from group 1 to group 2 for TF score 2 (κ = 0.242 versus κ = 0.281, respectively), TF score 3 (κ = 0.627 versus κ = 0.634), overall MC (κ = 0.225 versus κ = 0.233), MC score 1 (κ = 0.30 versus κ = 0.303), MC score 2 (κ = 0.082 versus κ = 0.092), and MC score 3 (κ = 0.339 versus κ = 0.352). When evaluating the individual components, the best concordance for both groups was in score 3 for TF, NP, and MC. Pairwise agreement between AI and each pathologist for overall histologic grade ranged from fair to good (κ = 0.31–0.61). With the exception of NP score 1, which could not be assessed, perfect agreement for overall grade and the individual components was similar to what has been previously reported for pathologists alone.11 A similar pattern emerged, with the highest rates of perfect agreement for TF, followed by NP, and then MC. When evaluating histologic types of breast carcinoma, agreement between the 2 groups was comparable, with moderate concordance for invasive ductal carcinoma, slight concordance for invasive lobular carcinoma, and good concordance for the other types of invasive carcinoma. Overall, these findings show that AI concordance does not differ from that of manual grading by pathologists on a digital platform.

Ours is one of the few studies to use AI for breast carcinoma grading based on the individual component (TF, NP, and MC) scores at the WSI level. One study that assessed the performance of AI in evaluating TF demonstrated an accuracy of 89% in quantifying the degree of TF at the tile level.45 Comparatively, we demonstrated an accuracy of 75.8% for TF at the WSI level. Compared to batch mode active learning at the patch level, which has been shown to achieve 84% to 86% accuracy on nuclear pleomorphism (ie, atypia), we achieved 80% accuracy with respect to the original reported score for the cohort at the WSI level.46 Compared to Khan et al,48 who used a patch-level approach for MC scoring to achieve an accuracy of 0.83 for Aperio and Hamamatsu scanned images, we achieved an accuracy of 0.659 at the WSI level. Given the minimal change in κ values when AI was added as a seventh observer, relative to the 6 pathologists alone, our method for MC scoring performed equivalently to manual scoring by pathologists at the WSI level. Similar results were seen in a study using a maximized inter-class weighted mean.42

Compared to a study that used deep learning on tissue microarrays (TMAs) to determine grade,31 we achieved an accuracy of 52% with respect to the original reported score for high-grade tumors, whereas that study achieved 72% accuracy in classifying high-grade tumors; when combining grades 1 and 2, we achieved an accuracy of 70%, compared to their 94%. That study, in which grading was established on 1 to 4 TMA cores per case at ×20 magnification, used an image classification approach that graded tumors directly without providing explainable features. To extract explainable features, we analyzed the different components at different magnifications (ie, TF at ×10, and NP and MC at ×40). We found that ×10, compared to ×20 or ×40, captured a larger area and allowed AI to better segment tubules and sheets for grading TF. Conversely, evaluating nuclei at ×40 captured nuclear detail better than ×20 and allowed us to detect mitoses and segment tumor cells for MC and NP grading, respectively. We combined the analyses of these individual components to determine the grade. We did not observe a specific pattern explaining why AI was least accurate for grade 3 tumors (ie, the inaccuracies were not consistently attributable to a single component). While our study shows similar rates of concordance between a multi-institutional group of pathologists and AI in histologic grading, pathologist agreement in grading on TMAs exceeded that of deep learning on TMAs (κ = 0.78 versus 0.64, respectively).31

We acknowledge that our study has limitations. We defined the ground truth for training AI on the basis of grades in the original reports, which were signed out by a number of pathologists. Given the known variability in grading amongst pathologists, a ground truth based on expert consensus might have been preferable for training AI. Our AI model requires pathologist annotation of the tumor (for TF and NP) and of the leading edge of the tumor (for MC). By comparison, the region of tumor evaluated by pathologists for MC was not annotated but was chosen at their discretion (ie, it may not always have been limited to the leading edge). Additionally, NP scoring in our model depends on the TF output. Our cohort was also biased toward NP scores of 2 and 3, so we were unable to assess cases with perfect agreement amongst pathologists and AI for NP score 1. Finally, we acknowledge that this AI model will need to be tested on other independent test sets to ultimately determine its utility.

Ours is the first study comparing concordance in breast carcinoma grading between a newly developed WSI AI methodology, which incorporates the individual grading components to determine overall grade, and a multi-institutional group of pathologists using VM. Our approach to grading is explainable: TF scoring is based on AI segmentation of tubules and sheets, NP scoring on statistical analysis of AI-segmented tumor cell morphometrics, and MC scoring on AI-detected mitoses within digital equivalents of HPFs. Using these explainable methods, AI demonstrated concordance similar to that of pathologists alone. As previously reported, MC continues to present challenges on the digital platform, and its value as a reproducible variable in grading deserves additional study. Further studies with larger and less biased cohorts may improve the accuracy of custom deep learning models and could incorporate additional features to make the grading approach even more explainable.

In this study, we sought to develop a new AI methodology for grading and to compare it to a multi-institutional group of breast pathologists using VM. We demonstrated reasonable concordance between AI and the pathologists; however, further studies would be beneficial to understand how access to this AI methodology may influence pathologists' grading or be used to improve the efficiency or concordance of grading across institutions. At this juncture, we cannot comment on whether this methodology should be used as an assistive or an authoritative tool, but we believe that future studies that give pathologists access to the AI grading, as well as comparisons of AI to trainees and/or general pathologists, may demonstrate the implications and limitations of using this AI methodology in practical settings.

References

1. Elston CW, Ellis IO. Pathological prognostic factors in breast cancer, I—the value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology. 1991;19(5):403–410.
2. Bloom HJ. Further studies on prognosis of breast carcinoma. Br J Cancer. 1950;4(4):347–367.
3. Bloom HJ. Prognosis in carcinoma of the breast. Br J Cancer. 1950;4(3):259–288.
4. Bloom HJ, Richardson WW. Histological grading and prognosis in breast cancer; a study of 1409 cases of which 359 have been followed for 15 years. Br J Cancer. 1957;11(3):359–377.
5. Elston CW. The assessment of histological differentiation in breast cancer. Aust N Z J Surg. 1984;54(1):11–15.
6. Rakha EA, El-Sayed ME, Lee AHS, et al. Prognostic significance of Nottingham histologic grade in invasive breast carcinoma. J Clin Oncol. 2008;26(19):3153–3158.
7. Amin MB, Edge SB, Greene FL, et al, eds. AJCC Cancer Staging Manual. 8th ed. Springer; 2017.
8. Li X, Zhang Y, Meisel J, et al. Validation of the newly proposed American Joint Committee on Cancer (AJCC) breast cancer prognostic staging group and proposing a new staging system using the National Cancer Database. Breast Cancer Res Treat. 2018;171(2):303–313.
9. Davidson TM, Rendi MH, Frederick PD, et al. Breast cancer prognostic factors in the digital era: comparison of Nottingham grade using whole slide images and glass slides. J Pathol Inform. 2019;10:11.
10. Delides GS, Garas G, Georgouli G, et al. Intralaboratory variations in the grading of breast carcinoma. Arch Pathol Lab Med. 1982;106(3):126–128.
11. Ginter PS, Idress R, D'Alfonso TM, et al. Histologic grading of breast carcinoma: a multi-institution study of interobserver variation using virtual microscopy. Mod Pathol. 2020;34(4):701–709.
12. Meyer JS, Alvarez C, Milikowski C, et al. Breast carcinoma malignancy grading by Bloom-Richardson system vs proliferation index: reproducibility of grade and advantages of proliferation index. Mod Pathol. 2005;18(8):1067–1078.
13. Rakha EA, Aleskandarani M, Toss MS, et al. Breast cancer histologic grading using digital microscopy: concordance and outcome association. J Clin Pathol. 2018;71(8):680–686.
14. Rakha EA, Aleskandarany MA, Toss MS, et al. Impact of breast cancer grade discordance on prediction of outcome. Histopathology. 2018;73(6):904–915.
15. Robbins P, Pinder S, de Klerk N, et al. Histological grading of breast carcinomas: a study of interobserver agreement. Hum Pathol. 1995;26(8):873–879.
16. Al-Janabi S, Huisman A, Van Diest PJ. Digital pathology: current status and future perspectives. Histopathology. 2012;61(1):1–9.
17. Allen TC. Digital pathology and federalism. Arch Pathol Lab Med. 2014;138(2):162–165.
18. Brachtel E, Yagi Y. Digital imaging in pathology—current applications and challenges. J Biophotonics. 2012;5(4):327–335.
19. Hedvat CV. Digital microscopy: past, present, and future. Arch Pathol Lab Med. 2010;134(11):1666–1670.
20. Kayser K. Introduction of virtual microscopy in routine surgical pathology—a hypothesis and personal view from Europe. Diagn Pathol. 2012;7:48.
21. Rocha R, Vassallo J, Soares F, Miller K, Gobbi H. Digital slides: present status of a tool for consultation, teaching, and quality control in pathology. Pathol Res Pract. 2009;205(11):735–741.
22. US Food & Drug Administration. FDA allows marketing of first whole slide imaging system for digital pathology. 2021.
23. Alom MZ, Yakopcic C, Nasrin MS, Taha TM, Asari VK. Breast cancer classification from histopathological images with inception recurrent residual convolutional neural network. J Digit Imaging. 2019;32(4):605–617.
24. Araujo T, Aresta G, Castro E, et al. Classification of breast cancer histology images using convolutional neural networks. PLoS One. 2017;12(6):e0177544.
25. Cruz-Roa A, Gilmore H, Basavanhally A, et al. Accurate and reproducible invasive breast cancer detection in whole-slide images: a deep learning approach for quantifying tumor extent. Sci Rep. 2017;7:46450.
26. Han Z, Wei B, Zheng Y, et al. Breast cancer multi-classification from histopathological images with structured deep learning model. Sci Rep. 2017;7(1):4172.
27. Mercan E, Mehta S, Bartlett J, et al. Assessment of machine learning of breast pathology structures for automated differentiation of breast cancer and high-risk proliferative lesions. JAMA Netw Open. 2019;2(8):e198777.
28. Qi Q, Li Y, Wang J, et al. Label-efficient breast cancer histopathological image classification. IEEE J Biomed Health Inform. 2019;23(5):2108–2116.
29. Wolberg WH, Street WN, Mangasarian OL. Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates. Cancer Lett. 1994;77(2-3):163–171.
30. Ehteshami Bejnordi B, Veta M, Johannes van Diest P, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318(22):2199–2210.
31. Couture HD, Williams LA, Geradts J, et al. Image analysis with deep learning to predict breast cancer grade, ER status, histologic subtype, and intrinsic subtype. NPJ Breast Cancer. 2018;4:30.
32. Loukas C, Kostopoulos S, Tanoglidi A, et al. Breast cancer characterization based on image classification of tissue sections visualized under low magnification. Comput Math Methods Med. 2013;2013:829461.
33. Adabor ES, Acquaah-Mensah GK. Machine learning approaches to decipher hormone and HER2 receptor status phenotypes in breast cancer. Brief Bioinform. 2019;20(2):504–514.
34. Jaber MI, Song B, Taylor C, et al. A deep learning image-based intrinsic molecular subtype classifier of breast tumors reveals tumor heterogeneity that may affect survival. Breast Cancer Res. 2020;22(1):12.
35. Naik N, Madani A, Esteva A, et al. Deep learning-enabled breast cancer hormonal receptor status determination from base-level H&E stains. Nat Commun. 2020;11(1):5727.
36. Rawat RR, Ortega I, Roy P, et al. Deep learned tissue "fingerprints" classify breast cancers by ER/PR/Her2 status from H&E images. Sci Rep. 2020;10(1):7275.
37. Lu C, Xu H, Xu J, et al. Multi-pass adaptive voting for nuclei detection in histopathological images. Sci Rep. 2016;6:33985.
38. Basavanhally AN, Ganesan S, Agner S, et al. Computerized image-based detection and grading of lymphocytic infiltration in HER2+ breast cancer histopathology. IEEE Trans Biomed Eng. 2010;57(3):642–653.
39. McIntire PJ, Irshaid L, Liu Y, et al. Hot spot and whole-tumor enumeration of CD8(+) tumor-infiltrating lymphocytes utilizing digital image analysis is prognostic in triple-negative breast cancer. Clin Breast Cancer. 2018;18(6):451–458.e451.
40. McIntire PJ, Zhong E, Patel A, et al. Hotspot enumeration of CD8+ tumor-infiltrating lymphocytes using digital image analysis in triple-negative breast cancer yields consistent results. Hum Pathol. 2019;85:27–32.
41. Balkenhol MCA, Tellez D, Vreuls W, et al. Deep learning assisted mitotic counting for breast cancer. Lab Invest. 2019;99(11):1596–1606.
42. Nateghi R, Danyali H, Helfroush MS. Maximized inter-class weighted mean for fast and accurate mitosis cells detection in breast cancer histopathology images. J Med Syst. 2017;41(9):146.
43. Veta M, van Diest PJ, Willems SM, et al. Assessment of algorithms for mitosis detection in breast cancer histopathology images. Med Image Anal. 2015;20(1):237–248.
44. Wahab N, Khan A, Lee YS. Two-phase deep convolutional neural network for reducing class skewness in histopathological images based breast cancer detection. Comput Biol Med. 2017;85:86–97.
45. Basavanhally A, Yu E, Xu J, et al. Incorporating domain knowledge for tubule detection in breast histopathology using O'Callaghan neighborhoods. Paper presented at: SPIE Medical Imaging; February 14–16, 2011; Lake Buena Vista (Orlando), FL.
46. Das A, Nair MS, Peter DS. Batch mode active learning on the Riemannian manifold for automated scoring of nuclear pleomorphism in breast cancer. Artif Intell Med. 2020;103:101805.
47. Das A, Nair MS, Peter SD. Computer-aided histopathological image analysis techniques for automated nuclear atypia scoring of breast cancer: a review. J Digit Imaging. 2020;33(5):1091–1121.
48. Khan AM, Sirinukunwattana K, Rajpoot N. A global covariance descriptor for nuclear atypia scoring in breast histopathology images. IEEE J Biomed Health Inform. 2015;19(5):1637–1647.
49. Srivastava A, Kulkarni C, Li Z, Parwani A, Machiraju R. Nottingham grading of breast invasive carcinoma utilizing deep learning models. Mod Pathol. 2019;32:145.
50. Vahadane A, Peng T, Sethi A, et al. Structure-preserving color normalization and sparse stain separation for histological images. IEEE Trans Med Imaging. 2016;35(8):1962–1971.
51. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. The SAO/NASA Astrophysics Data System. 2015. Accessed 2021.
52. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. The SAO/NASA Astrophysics Data System. 2016. Accessed 2021.
53. Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual Recognition Challenge. Int J Comput Vis. 2015;115(3):211–252.
54. Graham S, Vu QD, Raza SEA, et al. Hover-Net: simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med Image Anal. 2019;58:101563.
55. He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. The SAO/NASA Astrophysics Data System. 2016. Accessed 2021.
56. Chaurasia A, Culurciello E. LinkNet: exploiting encoder representations for efficient semantic segmentation. The SAO/NASA Astrophysics Data System. 2017. Accessed 2021.
57. Tan M, Le QV. EfficientNet: rethinking model scaling for convolutional neural networks. The SAO/NASA Astrophysics Data System. 2019. Accessed 2021.
58. Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. 3rd ed. Hoboken, NJ: J. Wiley; 2003:760.
59. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–174.
60. Boiesen P, Bendahl PO, Anagnostaki L, et al. Histologic grading in breast cancer—reproducibility between seven pathologic departments: South Sweden Breast Cancer Group. Acta Oncol. 2000;39(1):41–45.
61. Chowdhury N, Pai MR, Lobo FD, Kini H, Varghese R. Impact of an increase in grading categories and double reporting on the reliability of breast cancer grade. APMIS. 2007;115(4):360–366.
62. Harvey JM, de Klerk NH, Sterrett GF. Histological grading in breast cancer: interobserver agreement, and relation to other prognostic factors including ploidy. Pathology. 1992;24(2):63–68.
63. Longacre TA, Ennis M, Quenneville LA, et al. Interobserver agreement and reproducibility in classification of invasive breast carcinoma: an NCI breast cancer family registry study. Mod Pathol. 2006;19(2):195–207.
64. Postma EL, Verkooijen HM, van Diest PJ, et al. Discrepancy between routine and expert pathologists' assessment of non-palpable breast cancer and its impact on locoregional and systemic treatment. Eur J Pharmacol. 2013;717(1-3):31–35.
65. Rabe K, Snir OL, Bossuyt V, et al. Interobserver variability in breast carcinoma grading results in prognostic stage differences. Hum Pathol. 2019;94:51–57.

Author notes

This research was supported in part through the National Institutes of Health/National Cancer Institute Cancer Center Support Grant P30 CA008748. In addition, funding and project support for this research were provided in part by the Center for Translational Pathology at the Department of Pathology and Laboratory Medicine, Weill Cornell Medicine.

Supplemental digital content is available for this article at https://meridian.allenpress.com/aplm in the November 2022 table of contents.

Competing Interests

Authors Mantrala, Mitkari, Joshi, Prabhala, Ramachandra, Kini, and Koka are current employees of Onward Assist.

Mantrala and Ginter contributed equally to this work. Harigopal and Koka share senior authorship. Ginter is now located at NYU Langone Hospital, Mineola, New York.
