Context.—

Most deep learning (DL) studies have focused on neoplastic pathology, with the realm of inflammatory pathology remaining largely untouched.

Objective.—

To investigate the use of DL for nonneoplastic gastric biopsies.

Design.—

Gold standard diagnoses were blindly established by 2 gastrointestinal pathologists. For phase 1, 300 classic cases (100 normal, 100 Helicobacter pylori, 100 reactive gastropathy) that best displayed the desired pathology were scanned and annotated for DL analysis. A total of 70% of the cases for each group were selected for the training set, and 30% were included in the test set. The software assigned colored labels to the test biopsies, which corresponded to the area of the tissue assigned a diagnosis by the DL algorithm, termed area distribution (AD). For phase 2, an additional 106 consecutive nonclassical gastric biopsies from our archives were tested in the same fashion.

Results.—

For phase 1, receiver operating characteristic curves showed near perfect agreement with the gold standard diagnoses at an AD percentage cutoff of 50% for normal (area under the curve [AUC] = 99.7%) and H pylori (AUC = 100%), and 40% for reactive gastropathy (AUC = 99.9%). Sensitivity/specificity pairings were as follows: normal (96.7%, 86.7%), H pylori (100%, 98.3%), and reactive gastropathy (96.7%, 96.7%). For phase 2, receiver operating characteristic curves were slightly less discriminatory, with optimal AD cutoffs reduced to 40% across diagnostic groups. The AUCs were 91.9% for normal, 100% for H pylori, and 94.0% for reactive gastropathy. Sensitivity/specificity pairings were as follows: normal (73.7%, 79.6%), H pylori (95.7%, 100%), and reactive gastropathy (100%, 62.5%).

Conclusions.—

A convolutional neural network can serve as an effective screening tool/diagnostic aid for H pylori gastritis.

The recent approval of whole-slide images for primary diagnosis has rapidly accelerated the adoption of digital pathology in diagnostic settings. In addition, there are now large, publicly available data sets of digital images (eg, ATCG) that enable the rapid training and testing of deep learning (DL) algorithms. Deep learning is a specific type of machine learning that uses complex neural networks capable of unsupervised learning from labeled or unlabeled data sets. The application of DL to anatomic pathology is not surprising, given the field's dependence on visual pattern recognition. The DL algorithms continually (and automatically) improve the set of features needed to ensure the accuracy of the predictions through an automated feedback loop. They use a cascade of multilayered nonlinear data processing units based on a “neural network.” The key element of DL networks is the “strength” or the “weights” of the interconnections between these individual data processing “neurons,” which enables the prediction process.

In anatomic pathology, numerous studies using convolutional neural networks (CNNs) have shown promising results. The area that has perhaps garnered the most attention is breast pathology, with many studies focusing on carcinoma identification,1–3 detection of mitoses for Nottingham grade,4–6 hormone receptor status (estrogen receptor [ER], progesterone receptor [PR], human epidermal growth factor receptor 2 [HER2]),7–9 as well as lymph node metastases.10,11 Notably, CAMELYON16, a nodal metastasis detection contest, resulted in a GoogLeNet algorithm outperforming pathologists in time-limited classification of hematoxylin-eosin lymph node metastasis.11,12 Several studies have also investigated DL in the context of prostatic,10,13–19 neurologic,20–22 and colonic23–25 neoplasms.

Most DL studies have focused on neoplastic pathology, with the realm of medical/inflammatory pathology remaining largely untouched. In the current study, we investigated nonneoplastic gastric biopsies, a practical and high-volume specimen type for most pathology practices worldwide. Some of the most commonly encountered clinically significant diagnoses rendered on a gastric biopsy include Helicobacter pylori–associated gastritis, reactive (chemical) gastropathy, and histologically normal gastric mucosa. Gastric injury can incite various inflammatory patterns and reactive features that depend on the severity, type, and length of injury.26 However, there are certain histologic features that favor one etiology over another.

Chronic H pylori gastritis affects two-thirds of the world's population and is one of the most common chronic inflammatory conditions in humans.27,28 The prevalence of H pylori approaches 90% in adults and a large proportion of children in developing countries, indicating early exposure to the organism. In industrialized countries, infection tends to occur later in life, affecting up to 30% of adults aged 50 years.29,30 Although chronic H pylori gastritis can be asymptomatic in patients, its impact on human health is great,26 given the association with gastric mucosa-associated lymphoid tissue lymphomas and intestinal-type dysplasia and carcinoma.

A host of histologic features are suggestive of chronic H pylori infection, the most classic being prominent antral superficial bandlike dense lymphoplasmacytic inflammation, intraepithelial acute/neutrophilic inflammation, lymphoid aggregates with germinal centers, and, of course, the identification of curved, rod-shaped H pylori organisms at the apical surface of foveolar cells or superficial luminal mucin pools. Given the high prevalence of proton-pump inhibitor use, significant infections of body/fundic (oxyntic) mucosa are seen as well.31–39

The other classic and common pattern of injury investigated in this study is reactive gastropathy. Reactive gastropathy is typically antral predominant and is associated with duodenopancreatic (bile) reflux40,41 and various medication/chemical agents, including nonsteroidal anti-inflammatory drugs, acid, alkali, and ethanol.42–44 Given that millions of people worldwide ingest nonsteroidal anti-inflammatory drugs and ethanol daily, reactive gastropathy is one of the most common diagnoses of gastric biopsies in industrialized countries.26 Classic histologic features of reactive gastropathy include mucin depletion and thinning of foveolar epithelium, increased foveolar regeneration/hyperplasia (“corkscrewing”), increased mitoses, nuclear hyperchromasia, mucosal edema, lamina propria smooth muscle proliferation, and epithelial erosion.

Experienced gastrointestinal and general pathologists can discriminate between these histologic patterns with relative ease, and often at low power. Accordingly, these patterns are well suited to test the diagnostic capabilities of DL in the realm of inflammatory pathology. We aim to train and test DL technology (via a CNN) on gastric biopsies, and hypothesize, as an initial proof of concept, that the algorithm can discriminate between commonly encountered inflammatory patterns of gastric injury.

MATERIALS AND METHODS

This study was performed in accordance with the Institutional Review Board requirements at the University of New Mexico (Albuquerque, New Mexico).

Phase 1

Gastric biopsy cases that carried a diagnosis of H pylori–related gastritis (200 cases), reactive gastropathy (200 cases), and histologically normal gastric mucosa (200 cases) were identified from our archives for review. Cases of H pylori gastritis and normal gastric mucosa were obtained from 2018, whereas reactive gastropathy cases were obtained from 2016 to 2018. Two fellowship-trained gastrointestinal (GI) pathologists reviewed all 600 cases, selecting 300 cases (ie, 300 slides) from 300 different patients (100 H pylori gastritis, 100 reactive gastropathy, 100 normal) that best displayed the desired pathology. The GI pathologists' diagnoses based on this case selection were considered the gold standard diagnoses.

In general, for H pylori gastritis, cases showing dense superficial bandlike chronic inflammation (with or without lymphoid follicles with germinal centers), with at least focal active/neutrophilic inflammation, were selected. Only cases with a significant number of organisms, identified by either hematoxylin-eosin stain or immunohistochemistry, were included, and cases containing sparse organisms were excluded.

For reactive gastropathy, cases showing foveolar tortuosity with mucin depletion were included. Given that gastric mucosa adjacent to an ulcer can have histologic features similar to those of reactive gastropathy, biopsies submitted as “ulcer” were excluded, as well as cases containing more than mild chronic lamina propria inflammation. Cases containing only mild reactive features (ie, mild foveolar tortuosity or mild mucin depletion) were also excluded.

For the normal biopsies, cases of mild chronic inactive gastritis (ie, more than a few aggregates of plasma cells in the superficial lamina propria) were excluded. Of these 100 normal cases, 50 were selected to train on antral fragments and 50 cases were selected to train on oxyntic fragments. However, all the tissue fragments on the slide (both antral and oxyntic fragments) had to be histologically normal to be included.

The 300 cases were then scanned with the Aperio VERSA 200 slide scanner (Leica, Wetzlar, Germany) at ×40 magnification and imported into a computer containing a 12 Core, 2.2 GHz Intel Xeon Processor E5-2650 chip and an Nvidia Titan XP graphics card. HALO-AI image analysis software (Indica Labs, Corrales, New Mexico) was used to perform the training and testing. HALO-AI uses a fully convolutional version of the VGG architecture45  with padding removed.

All 300 cases were then manually annotated on the software. This was performed by consensus by 2 GI pathologists viewing the digitized slides. Entire tissue fragments best characterizing the underlying pathology were selected, unless a portion of the desired fragment contained prominent intestinal metaplasia (a nonspecific finding that can be seen in both H pylori gastritis and reactive gastropathy), in which case that portion of the fragment was excluded. To prevent gross underrepresentation or overrepresentation of particular biopsies, only 1 to 3 entire tissue fragments were selected per case. This was completed for all of the cases (both for training and test cases), and HALO-AI trained on and was tested on only these annotated fragments. Entire tissue fragments were arbitrarily designated with colored digital labels according to the HALO-AI CNN diagnosis. Helicobacter pylori was given a red digital label, chemical gastropathy a yellow label, and normal mucosa a green label.

Given that nearly all biopsies contained a certain amount of smooth muscle (muscularis mucosa) as well as white space (glass), the algorithm was also trained to categorize these regions. Muscle was given a pink label and white space a white label. In both the phase 1 and phase 2 data sets (see below), regions classified as muscle and white space were excluded from analysis.

Of the 300 cases, 210 (70%) were randomly selected for inclusion in the training set (70 H pylori, 70 reactive gastropathy, 70 normal), and 90 cases (30%) were randomly included in the test set (30 H pylori, 30 reactive gastropathy, 30 normal). Training was performed by HALO-AI on entire tissue fragments, broken down into “image patches” of 400 × 400 pixels (where 1 pixel = 1 μm) at a resolution corresponding to a ×5.5 digital view magnification. This magnification corresponded to a digital view in which the 2 GI pathologists could reliably confirm the gold standard diagnosis in a digital field. Approximately 30% to 50% of the biopsy fragment could be viewed at this magnification.
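The per-group 70/30 random split described above can be sketched as follows. This is a minimal illustration only: the case identifiers, the `stratified_split` helper, and the fixed random seed are hypothetical and are not taken from the study or from HALO-AI.

```python
import random

def stratified_split(cases_by_class, train_fraction=0.7, seed=0):
    """Randomly split each diagnostic group into training and test cases."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    train, test = {}, {}
    for label, cases in cases_by_class.items():
        shuffled = list(cases)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * train_fraction)
        train[label] = shuffled[:n_train]
        test[label] = shuffled[n_train:]
    return train, test

# Hypothetical case identifiers: 100 per diagnostic group, as in phase 1.
cases = {
    "h_pylori": [f"HP-{i:03d}" for i in range(100)],
    "reactive_gastropathy": [f"RG-{i:03d}" for i in range(100)],
    "normal": [f"NL-{i:03d}" for i in range(100)],
}
train, test = stratified_split(cases)
for label in cases:
    print(label, len(train[label]), len(test[label]))
```

Splitting within each diagnostic group, rather than across the pooled 300 cases, is what yields the balanced 70/70/70 training and 30/30/30 test sets.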

Within the confines of the previously annotated entire tissue fragments, the image patches (400 × 400 pixels within a 5.5× digital field view) analyzed by HALO-AI were generated by automated selection of random points and cropping a patch around each point. These patches were further augmented with random rotations and random shifts to hue, saturation, contrast, and brightness. Training was performed for a total of 74 444 analytic iterations using RMSProp46 (delta of 0.9) with an initial learning rate of 1e-3, reduced by 10% every 2000 iterations, and an L2 regularization of 5e-4. Because no padding was used, the tile size was increased to 1867 × 1867, thus enhancing performance without changing the output. During these iterations, the algorithm continuously adjusted the node weights based on the “gold standard call” of the 2 GI pathologists. The HALO-AI operator stopped the algorithm once an error rate/cross entropy rate of <0.01 was achieved.
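As an illustration of the learning-rate schedule described above, the sketch below assumes a staircase decay in which the rate is multiplied by 0.9 at every 2000th iteration. That reading of "reduced by 10%", and the function name `step_decay_lr`, are assumptions; the internal HALO-AI implementation is not described in the text.

```python
def step_decay_lr(iteration, base_lr=1e-3, decay=0.9, step=2000):
    """Learning rate after `iteration` training iterations.

    Assumes a staircase schedule: the rate is multiplied by `decay`
    once for every completed block of `step` iterations.
    """
    return base_lr * decay ** (iteration // step)

# Rate at the start, after the first reduction, and at the reported
# final iteration count of 74 444.
for it in (0, 2000, 74444):
    print(it, step_decay_lr(it))
```

Under these assumptions, the rate at the final iteration is roughly 2% of its starting value.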

After training was completed, the colored digital labels assigned by HALO-AI to the test set corresponded to the area of the tissue assigned to each diagnostic category by the DL algorithm. This measure was termed area distribution (AD), because the labels comprised a variable percentage of each tissue fragment for each case. All cases (both in the training and the test set) had been previously assigned a single gold standard diagnosis within the annotated region/fragment (see above). If a region matched the gold standard diagnosis, this was considered a true-positive AD. If a region was contrary to the gold standard diagnosis, this was considered a false-positive AD. All ADs were therefore quantified as percentages for each case.
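The AD calculation can be sketched as a simple normalization of labeled areas after muscle and white space are excluded. The label names, the `area_distribution` helper, and the pixel counts below are hypothetical; the counts were chosen only so that the result reproduces the 98.5%/0.9%/0.6% split reported for the most conclusive H pylori case.

```python
DIAGNOSTIC_LABELS = ("h_pylori", "reactive_gastropathy", "normal")

def area_distribution(label_areas):
    """Percent AD per diagnostic label for one case.

    `label_areas` maps a label to its total labeled area (any unit),
    summed over all annotated fragments of the case. Labels outside
    DIAGNOSTIC_LABELS (muscle, white space) are ignored.
    """
    total = sum(label_areas.get(label, 0.0) for label in DIAGNOSTIC_LABELS)
    if total <= 0:
        raise ValueError("case contains no diagnostically labeled area")
    return {
        label: 100.0 * label_areas.get(label, 0.0) / total
        for label in DIAGNOSTIC_LABELS
    }

# Hypothetical labeled areas (in pixels) for a single case.
areas = {
    "h_pylori": 98500,
    "reactive_gastropathy": 900,
    "normal": 600,
    "muscle": 12000,       # excluded from analysis
    "white_space": 40000,  # excluded from analysis
}
ad = area_distribution(areas)
print(ad)
```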

Phase 2

After this initial portion of the study, an additional 130 cases were identified (all from different patients) for additional testing. The 130 cases represented the last 130 consecutive gastric biopsies from our archives. The GI pathologists then reviewed these, obtaining diagnostic concurrence. Gastric polyps, malignancies, cases of autoimmune gastritis, and cases with abundant intestinal metaplasia were excluded. There were a few cases containing at least moderate chronic superficial gastritis in a pattern highly suggestive of H pylori gastritis, with concurrent negative immunostaining for H pylori. Because H pylori infection can be patchy in nature, and the histologic identification of the organism is entirely dependent on sampling by the endoscopist, these cases were excluded because we could not fit them into a diagnostic category with absolute certainty. Ultimately, 106 of the 130 cases were included in this second phase of the study.

Five diagnostic categories were identified in this test group: histologically normal mucosa (n = 29); mild chronic inactive gastritis, H pylori negative (n = 28); H pylori gastritis (n = 23); mild reactive gastropathy (n = 15); and reactive gastropathy (n = 11). These 106 cases were scanned and annotated as outlined above. All of the phase 2 cases were then tested using the previously trained HALO-AI algorithm from phase 1.

RESULTS

Phase 1

For the 90 phase 1 test cases (30 H pylori, 30 reactive gastropathy, 30 normal), the HALO-AI CNN assigned colored digital label(s) for each tissue fragment as described previously. The CNN highlighted various proportions of each fragment with different labels depending on its preferred regional diagnosis and also provided a percent AD for each diagnostic category (percentage of total annotated area per case for H pylori, reactive gastropathy, and normal after regions classified as muscle and white space were eliminated by investigators; Figure 1). The AD is therefore equal to the total CNN diagnostic percentage for each case regardless of the number of tissue fragments analyzed on the slide. For example, for the case with the most conclusive H pylori AD (HPAD), the CNN classified 98.5% of the biopsy fragments as H pylori, whereas only 0.9% was classified as reactive gastropathy and 0.6% as normal (see supplemental digital content tables 1 through 3 for phase 1 AD data at www.archivesofpathology.org in the March 2020 table of contents).

Figure 1

Representative phase 1 test cases. Normal gastric mucosa (A), characterized as normal (green label) by the algorithm. The green label corresponds to near 100% area distribution for a normal diagnosis in this case (B). An area containing muscularis mucosa is correctly characterized with the pink label. Helicobacter pylori gastritis (red label) without (C) and with (D) algorithm characterization (red indicates H pylori area distribution). Reactive gastropathy (yellow label) without (E) and with (F) algorithm characterization (yellow indicates reactive gastropathy area distribution) (hematoxylin-eosin, original magnification ×5.5).

For H pylori cases, the mean HPAD was 83.5%. For reactive gastropathy, the mean reactive gastropathy area distribution (RGAD) was 82.0%, and for normal cases, the mean normal AD (NAD) was 86.6%. This illustrates that, on average, the CNN algorithm assigned the correct digital label/digital diagnosis to the majority of the annotated tissue in each diagnostic category.

Receiver operating characteristic (ROC) curves were then generated for each diagnostic category to determine the AD percentage limits for which the algorithm performed best compared with the gold standard. Surprisingly, the ROC curves showed near perfect agreement with the gold standard diagnoses at an AD percentage cutoff of 50% for normal (area under the curve [AUC] = 99.7%) and H pylori (AUC = 100%), and 40% for reactive gastropathy (AUC = 99.9%; Figure 2).
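To make the AUC figures concrete, the sketch below computes the area under an ROC curve directly from per-case AD percentages, using the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen positive case has a higher AD than a randomly chosen negative case. The function name and the AD values are hypothetical and are not the study data; the text does not state which software generated the published curves.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both groups need at least one case")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical HPAD percentages for gold standard H pylori cases
# (positives) and for all other cases (negatives).
separated = auc_from_scores([98.5, 83.0, 50.0], [42.1, 12.0, 5.0, 0.6])
overlapping = auc_from_scores([80.0, 38.5], [42.1, 10.0])
print(separated, overlapping)  # 1.0 0.75
```

An AUC of 1.0 means every positive case outscored every negative case, which is the pattern reported for H pylori in phase 1.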

Figure 2

Receiver operating characteristic curves (ROCs) for phase 1 test set. Normal mucosa at area distribution (AD) cutoff 50% (A), reactive gastropathy at AD 40% (B), and Helicobacter pylori at AD 50% (C). Abbreviation: AUC, area under the curve.

Figure 3

Receiver operating characteristic curves (ROCs) for phase 2 test set. Normal mucosa at area distribution (AD) cutoff 40% (A), reactive gastropathy at AD 40% (B), and Helicobacter pylori at AD 40% (C). Abbreviation: AUC, area under the curve.

Because our CNN achieved optimal diagnostic discrimination with an AD cutoff of 50% (true positive) in 2 of the 3 diagnostic categories as determined by the ROCs (Figure 2), accuracy metrics (sensitivity/specificity pairings, etc) using this true-positive cutoff were calculated with false-positive AD rates defined at ≥10%, ≥20%, and ≥30% (false positives are percent ADs that are contrary to the gold standard diagnosis set by the GI pathologists). The greatest accuracy was achieved with a false-positive rate set at ≥30%.

For the purposes of calculating positive predictive value (PPV) and negative predictive value (NPV) for cases of H pylori gastritis (in both phase 1 and phase 2), a false-positive result was established as a case that met or exceeded this 30% threshold while simultaneously failing to meet its correct AD threshold of ≥50%. In other words, at least 30% of the tissue had to be labeled with an incorrect AD and <50% of the tissue had to be labeled with the correct AD for the case to be considered a false positive. For example, a case of H pylori gastritis with a ≥30% NAD call would be considered a false-positive normal HALO-AI diagnosis only if the HPAD did not meet or exceed 50%. If the HPAD was ≥50% for this same case, this AD determination took precedence and the case would be considered a true-positive H pylori result.
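The precedence rule described above can be written out as a short decision function. This is a sketch under stated assumptions: the function and label names are hypothetical, the "indeterminate" outcome for cases that meet neither cutoff is an assumption (the text does not say how such cases were handled), and choosing the largest incorrect AD when several exceed the cutoff is also an assumption. The RGAD values in the examples are inferred as the remainder needed for the three ADs to sum to 100%.

```python
def classify_case(ad, gold, tp_cutoff=50.0, fp_cutoff=30.0):
    """Apply the true-positive-first rule to one case.

    `ad` maps each diagnostic label to its percent AD; `gold` is the
    gold standard label. Returns (outcome, label).
    """
    # The correct label takes precedence when it meets its own cutoff.
    if ad[gold] >= tp_cutoff:
        return ("true_positive", gold)
    # Otherwise, any incorrect label at or above the false-positive cutoff
    # makes the case a false positive for that label.
    wrong = [label for label in ad if label != gold and ad[label] >= fp_cutoff]
    if wrong:
        return ("false_positive", max(wrong, key=ad.get))
    return ("indeterminate", None)  # assumption: not addressed in the text

# Phase 1 H pylori case with HPAD exactly 50% and NAD 33.7%.
case_a = classify_case(
    {"h_pylori": 50.0, "normal": 33.7, "reactive_gastropathy": 16.3},
    gold="h_pylori",
)
# Phase 1 normal case with HPAD 42.1% and NAD 57.9%.
case_b = classify_case(
    {"h_pylori": 42.1, "normal": 57.9, "reactive_gastropathy": 0.0},
    gold="normal",
)
# Phase 2 H pylori case (true-positive cutoff 40%) with HPAD 38.5%, NAD 59.8%.
case_c = classify_case(
    {"h_pylori": 38.5, "normal": 59.8, "reactive_gastropathy": 1.7},
    gold="h_pylori",
    tp_cutoff=40.0,
)
print(case_a, case_b, case_c)
```

The third example is the single phase 2 case the text counts as a false-negative H pylori diagnosis: it fails its own 40% cutoff and is labeled predominantly normal.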

Sensitivity, specificity, and F-score values under these parameters were established for each diagnostic category as follows: normal (96.7%, 86.7%, 0.87), H pylori (100%, 98.3%, 0.98), and reactive gastropathy (96.7%, 96.7%, 0.95). See Table 1 for complete results. Regarding the established gold standard H pylori diagnoses, no non–H pylori case met the false-positive HALO-AI HPAD threshold of ≥30% while simultaneously failing to meet its own true-positive HALO-AI AD threshold of ≥50%, indicating a PPV of 100%. One H pylori case was assigned a HALO-AI HPAD of exactly 50% with a 33.7% NAD. This was still considered a true-positive HALO-AI H pylori diagnosis (as opposed to a false negative) because it met the 50% HPAD cutoff for a final HALO-AI diagnosis of H pylori gastritis (Supplementary Table 1). Only 1 non–H pylori case exceeded the ≥30% HPAD false-positive cutoff because it was assigned an HPAD of 42.1% (Supplementary Table 3). However, this case carried a gold standard diagnosis of normal and was assigned an NAD of 57.9% and was therefore not considered a false-positive result. There were no false-negative CNN H pylori diagnoses, resulting in an NPV of 100%.

Table 1

Phase 1 Accuracy Metrics With a True-Positive Cutoff of 50% and a False-Positive Cutoff of 30%

Phase 2

Because the cases that were selected for training and testing in the phase 1 data set consisted of diagnostically unambiguous examples of normal mucosa, H pylori gastritis, and reactive gastropathy, the algorithm was only trained to classify tissue regions into 1 of these 3 diagnostic groups. The intent of phase 2 was to evaluate how well the algorithm performed on a morphologically heterogeneous set of cases (ie, cases that were not preselected to represent the best histologic example of a particular diagnosis). For H pylori cases, the mean HPAD was 83.5%. For reactive gastropathy, the mean RGAD was 74.1%, and for normal cases, the mean NAD was 65.7% (see Supplementary Tables 4 through 7 for phase 2 AD data). These were similar to phase 1 with the exception of the normal category, which showed a notably lower mean NAD in phase 2.

The ROCs (as generated in phase 1) showed slightly less discriminatory results, with optimal HALO-AI AD cutoffs reduced to 40% (true positive) for all diagnostic groups. The AUCs were: 91.9% for normal, 94.0% for reactive gastropathy, and 100% for H pylori (Figure 3).

Accuracy metrics were then performed as in phase 1. Similarly, the greatest accuracy was achieved with a false-positive AD defined as ≥30%, although the true-positive AD cutoff was set at ≥40% per the aforementioned ROC analysis (Figure 3).

Sensitivity, specificity, and F-score values under these parameters are as follows (Table 2): normal (73.7%, 79.6%, 0.77), H pylori (95.7%, 100%, 0.98), and reactive gastropathy (100%, 62.5%, 0.63). Regarding the established gold standard H pylori diagnoses, no non–H pylori case met the false-positive HALO-AI HPAD threshold of ≥30% while simultaneously failing to meet its own true-positive AD threshold of ≥40%, indicating a PPV of 100%. Two non–H pylori cases exceeded the ≥30% HPAD false-positive cutoff because they were assigned HPADs of 32.2% and 34.3% (Supplementary Tables 5 and 7). However, these cases carried gold standard diagnoses of mild chronic inactive gastritis and normal and were assigned NADs of 58.4% and 57.2%, respectively. They were therefore not considered false-positive results (Supplementary Tables 5 and 7), because HALO-AI diagnoses of normal and mild chronic inactive gastritis were considered clinically equivalent (further discussed below). There was only 1 false-negative HALO-AI H pylori diagnosis (HPAD = 38.5%, NAD = 59.8%), resulting in an NPV of 95.6% (Supplementary Table 4).
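As a worked check of the phase 2 H pylori figures, the sketch below computes sensitivity, specificity, and F-score from confusion counts. The counts are derived from the text rather than taken from Table 2: 23 gold standard H pylori cases with 1 false negative, and 83 non–H pylori cases (106 minus 23) with no false positives. The text does not define the F-score, so the standard F1 formula is assumed here.

```python
def sensitivity(tp, fn):
    """True-positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def f1_score(tp, fp, fn):
    """F1 score (assumed definition): 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Counts derived from the phase 2 text for H pylori (see lead-in).
tp, fn, fp, tn = 22, 1, 0, 83

sens = round(100 * sensitivity(tp, fn), 1)
spec = round(100 * specificity(tn, fp), 1)
f1 = round(f1_score(tp, fp, fn), 2)
print(sens, spec, f1)  # 95.7 100.0 0.98
```

These values agree with the sensitivity, specificity, and F-score reported for H pylori in phase 2.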

Table 2

Phase 2 Accuracy Metrics With a True-Positive Cutoff of 40% and a False-Positive Cutoff of 30%

To directly compare with the phase 1 data, we also calculated phase 2 sensitivity/specificity pairings with a true-positive AD cutoff of 50% and similar false-positive AD cutoff of 30%. These suffered in sensitivity but not in specificity and are as follows (sensitivity/specificity): normal (66.7%, 79.6%), H pylori (91.3%, 100%), and reactive gastropathy (92.3%, 62.5%).

We had an additional diagnostic category of “mild chronic inactive gastritis, H pylori negative” in phase 2 that did not exist in phase 1 (n = 28). Accordingly, HALO-AI was forced to assign HPAD, RGAD, and/or NAD to these cases in various proportions. Normal AD was the majority determination because this cohort showed a mean NAD of 50.5% (Supplementary Table 5) compared with a mean RGAD of 37.2% and a mean HPAD of 12.3%. We therefore chose to group these 28 cases into the normal category for the aforementioned accuracy analysis.

DISCUSSION

To our knowledge, this is the first study using DL technology in inflammatory gastrointestinal pathology. We addressed medical gastric biopsies, an important and high-volume specimen type. In phase 1 of the study, we trained the HALO-AI CNN on ideal examples of histologically normal gastric mucosa, H pylori gastritis, and reactive gastropathy. When this trained CNN was tested on additional classic-pattern biopsies (the same type of biopsies used for training), it achieved remarkable agreement with the gold standard diagnoses, as evidenced by our highly discriminatory ROC curves (Figure 2) and excellent sensitivity/specificity pairings at optimal true-positive AD cutoffs (Table 1). This level of accuracy shows that the tool can almost perfectly distinguish between these inflammatory patterns under near-ideal circumstances.

In phase 2 of the study, we tested an additional 106 nearly consecutive biopsies using the same DL algorithm we trained in phase 1. These biopsies included cases of normal mucosa, H pylori gastritis, and reactive gastropathy as before but were more representative of the histologic variation one would see in a typical practice. In other words, the phase 2 cases were not preselected to include only the most classic/pure examples for each category. For example, some of the H pylori cases contained infected oxyntic mucosa only, where the characteristic inflammatory bandlike pattern involved only a modest portion of the tissue fragment (Figure 4, A and B), and other cases may have contained only a few organisms that were visible only on immunostaining (compared with phase 1 cases, where organisms were often abundant).

Figure 4

Representative phase 2 test cases, hematoxylin-eosin. A case of Helicobacter pylori gastritis (A) with superficial involvement of oxyntic mucosa, in which a minority of the biopsy was characterized as H pylori (B). A case of reactive gastropathy (C), displaying the “top-heavy” nature of the injury pattern, with reactive regions highlighted at the luminal surface and normal regions highlighted at the bottom (D). A case showing mixed H pylori and reactive gastropathy patterns (E). The region highlighted as H pylori contains a lymphoid aggregate, whereas the region highlighted as reactive gastropathy contains more foveolar tortuosity with mucin depletion (F). Mild chronic inactive gastritis (G), predominantly characterized as normal (green label = normal area distribution), but with patchy areas characterized as reactive gastropathy (yellow label = reactive gastropathy area distribution) and H pylori (red label = H pylori area distribution) by the algorithm (H). A lymphoid aggregate is highlighted red, bottom left (original magnifications ×2.1 [A and B], ×2.5 [C and D], ×4 [E and F], and ×5.5 [G and H]).

Nonetheless, when the CNN analyzed these phase 2 samples, the results were once again impressive, attaining an accuracy of >99% for H pylori gastritis (sensitivity/specificity pairing of 95.7%/100% and PPV/NPV pairing of 100%/95.6%) at an optimal true-positive AD cutoff of 40% and false-positive AD cutoff of 30% (Table 2). This suggests that the inflammatory pattern of H pylori gastritis is very recognizable and histologically distinct to the CNN.

Although this study was primarily implemented as a “proof of concept” endeavor, these findings allow one to envision a scenario in which DL is used to rapidly screen gastric biopsies for inflammatory patterns highly suggestive of H pylori gastritis, and reflexively order a confirmatory immunostain, ready for the pathologist at sign-out. This could help alleviate the common yet costly and inefficient practice of ordering up-front H pylori stains on all gastric biopsies. In time, as high-power scanning resolutions and DL algorithms improve, a CNN may even be able to recognize H pylori on hematoxylin-eosin–stained slides, obviating the need for additional stains when organisms are sparse and difficult to identify.

Regarding the functionality of the CNN, it is important to note that very few cases achieved a perfect 100% AD for the corresponding gold standard diagnosis (Supplementary Tables 1 through 7). Both H pylori gastritis (Figure 4, A and B) and reactive gastropathy (Figure 4, C and D) are “top-heavy” histologic processes; H pylori–associated inflammation is characteristically denser at the superficial/luminal aspect of a biopsy fragment, and reactive gastropathy characteristically incites epithelial and lamina propria changes in the superficial/luminal foveolar compartment. Because of this, the glandular compartment/base of the biopsy is more often histologically normal, which leads the CNN to assign different diagnostic labels to each biopsy fragment (ie, different ADs, each corresponding to a different diagnostic category), and thus render a mixture of final AD determinations for each case. Figure 4, E and F, shows that some cases display both patterns of injury.

Accordingly, assigning final CNN diagnoses by determining the optimal true-positive AD cutoff for each diagnostic category proved to be the most effective way to generate a single DL diagnosis for each case. As stated above, this worked very well for our H pylori cases in both phase 1 and phase 2, but there was a significant drop in diagnostic specificity for cases of reactive gastropathy and normal mucosa in phase 2 compared with phase 1 (Tables 1 and 2). For example, for reactive gastropathy, the mean phase 1 RGAD was 82% (Supplementary Table 2) compared with a mean phase 2 RGAD of 74.1% (Supplementary Table 6). This is clearly a function of the phase 2 cohort containing cases of mild reactive gastropathy in which a larger proportion of the tissue in these “mild” cases looks histologically normal and is therefore designated as such by the CNN.

Regarding the normal biopsies, there was a drop in mean NAD from 86.8% in phase 1 to 65.7% in phase 2 (Supplementary Tables 3 and 7). This was primarily due to an increase in RGAD in phase 2 (mean RGAD of 8% for normal cases in phase 1 compared with mean RGAD of 29.5% for normal cases in phase 2). We interpret this to suggest that more of our phase 2 normal cases had some very mild reactive changes that were deemed insufficient at the time of review to warrant a gold standard diagnosis of “reactive gastropathy” or “mild reactive gastropathy.” Similar cases would not have been selected for inclusion in the phase 1 data set, given that we selected the “most normal” biopsies for training.

Given our study design, the HALO-AI CNN could only categorize mild chronic inactive gastritis (H pylori–negative) cases in phase 2 (Figure 4, G and H) with 3 labels (H pylori gastritis, reactive gastropathy, and normal). Despite this, most of the biopsies were categorized as “normal,” with a mean NAD of 50.5%, mean RGAD of 37.2%, and mean HPAD of 12.3% (Supplementary Table 5). A majority NAD was ultimately obtained in 64% (18 of 28) of the mild chronic inactive gastritis cases. Practically, it is not surprising for the CNN to categorize most cases of mild chronic inactive gastritis as normal, given that most clinicians consider mild chronic inactive gastritis nonactionable and requiring no follow-up (it is essentially equivalent clinically to a “normal” diagnosis).

We were surprised that in cases of mild chronic inactive gastritis, RGAD was greater than HPAD. The diagnosis implies increased lymphoplasmacytic lamina propria inflammation, which could be confused with an H pylori pattern. Nonetheless, no mild chronic inactive case met the 40% phase 2 HPAD threshold for a diagnosis of H pylori gastritis. The CNN recognized these cases as predominantly normal tissue biopsies, further demonstrating that the CNN accurately differentiates the H pylori pattern from the other patterns, and that these mildly inflamed cases are unlikely to be mistaken for a clinically significant H pylori infection.

To summarize, DL technology showed very promising results in the interpretation of medical gastric biopsies when trained and tested on classic examples of normal mucosa, H pylori gastritis, and reactive gastropathy (phase 1), with the CNN performing nearly perfectly for all 3 diagnoses as assessed by AD methods (AUCs of 99.7% or higher). In our histologically variable phase 2 cohort, most of the examples of normal mucosa, H pylori gastritis, and reactive gastropathy were still correctly categorized (AUCs of 91.9%, 100%, and 94.0%, respectively). Diagnostic accuracy was highest for H pylori gastritis, indicating that the CNN could accurately identify the H pylori gastritis pattern of injury in a routine practice environment.

With these results, we believe that this trained CNN could potentially serve as an effective screening tool/diagnostic aid for H pylori gastritis given the excellent sensitivity/specificity/PPV/NPV values it displayed. That said, this study was conducted in New Mexico, which has a high prevalence of H pylori infection. In the phase 2 group, 23 of 106 cases (22%) contained H pylori gastritis, which we expect is more than is seen in many practices in the United States. Because the predictive values of a test depend on disease prevalence, with PPV falling as prevalence falls, the performance of the CNN may not translate to all pathology practices. As scanners and viewing platforms improve in the future, CNNs could very well detect the organisms on hematoxylin-eosin with careful training.
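The dependence of predictive values on prevalence can be illustrated with the standard Bayes relationship between sensitivity, specificity, and prevalence. The sketch below is illustrative only: it applies the phase 1 H pylori sensitivity and specificity (100% and 98.3%) to the phase 2 prevalence of 22% and to a hypothetical lower prevalence of 5%, which is an assumed value and not a measured one. The phase 1 pair is used because, at the 100% specificity reported for H pylori in phase 2, this formula returns a PPV of 100% at any prevalence.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence.

    All arguments are proportions between 0 and 1.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos) if (true_pos + false_pos) > 0 else float("nan")
    npv = true_neg / (true_neg + false_neg) if (true_neg + false_neg) > 0 else float("nan")
    return ppv, npv


# Phase 1 H pylori sensitivity/specificity from this study.
SENS, SPEC = 1.00, 0.983

ppv_study, npv_study = predictive_values(SENS, SPEC, prevalence=0.22)  # phase 2 prevalence
ppv_low, npv_low = predictive_values(SENS, SPEC, prevalence=0.05)      # hypothetical practice

print(f"PPV at 22% prevalence: {ppv_study:.3f}")
print(f"PPV at 5% prevalence:  {ppv_low:.3f}")
```

Under these assumptions the PPV is noticeably lower at 5% prevalence than at 22%, whereas the NPV stays at 100% because the sensitivity used is 100%.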

Author notes

Supplemental digital content is available for this article at www.archivesofpathology.org in the March 2020 table of contents.

The authors have no relevant financial interest in the products or companies described in this article.
