More people receive a diagnosis of skin cancer each year in the United States than all other cancers combined. Many patients around the globe do not have access to highly trained dermatopathologists, and even when such access exists, biopsy diagnoses sometimes produce disagreement among specialists. Mechanomind has developed software based on a deep learning algorithm that classifies 40 diagnostic dermatopathology entities, with the goals of improving diagnostic accuracy, turnaround times, and effort allocation.
The objective of this study was to assess the value of machine learning for microscopic tissue evaluation in dermatopathology.
A retrospective study was conducted comparing the machine learning algorithm's classification of hematoxylin and eosin–stained glass slides with diagnoses rendered by 2 senior board-certified pathologists not involved in algorithm creation. A total of 300 glass slides (1 slide per patient's case) from 4 hospitals in the United States and Africa, with common variations in tissue preparation, staining, and scanning methods, were included in the study.
The automated algorithm demonstrated sensitivities of 89 of 91 (97.8%) for melanoma, 107 of 107 (100%) for nevi, and 101 of 102 (99.0%) for basal cell carcinoma in whole slide images, with corresponding specificities of 204 of 209 (97.6%), 189 of 193 (97.9%), and 198 of 198 (100%).
Appropriately trained deep learning image analysis algorithms demonstrate high specificity and high sensitivity sufficient for use in screening, quality assurance, and workload distribution in anatomic pathology.
More people receive a diagnosis of skin cancer each year in the United States than all other cancers combined.1 More than 18 million skin lesions are biopsied2–4 in the United States every year. The number of suspected skin lesions is growing because of an aging population as well as environmental and lifestyle factors.
Diagnosis in dermatopathology requires specialized training because of the large number of skin tumor subtypes and the significant variability in visual presentation within every morphologic class. Misdiagnosis5 and late diagnosis are consequences of high workloads and the difficulty of differentiation, which result in frequent disagreement among pathologists and affect, for example, the recognition of melanoma versus melanocytic nevi.6,7
The rising adoption of digital pathology8 provides an opportunity to use computer vision deep learning methods to address the error rate associated with human medical image interpretation and to capture efficiency gains in turnaround times and labor cost.9 Pathology laboratories can reap efficiency benefits not only from centralization10 but also from triaging caseloads according to diagnostic difficulty and distributing workloads optimally among pathologists, which can reduce diagnostic turnaround times and labor costs. For example, when cases of higher diagnostic difficulty or rarity are interpreted first by a dermatopathologist with a standard level of training, who then refers them to a more experienced subspecialist, delays and duplicated effort may occur. With an automated triage system in place, straightforward cases can be directed to general pathologists, unburdening senior dermatopathologists, eliminating bottlenecks, increasing laboratory capacity, preventing burnout, improving turnaround times, and reducing the cost of the diagnostic process.
Coupled with the overall increase in detected cancer, the growing shortage of pathologists11 creates a need for automated, artificial intelligence (AI)–based tools that support pathology workflows, improve diagnostic accuracy, and provide productivity benefits for dermatopathology laboratories.
Convolutional neural networks (CNNs) have proven capable of identifying diagnostically relevant patterns in pathology12–15 while dealing with unique challenges. Among these obstacles is the large size of microscopic images (up to 2 gigabytes per image at a single focal plane, depending on compression). Although recent progress has been made in compression algorithms, processing speeds, and scanning speeds, the high magnification and resolution of the images still require efficient analysis algorithms. Image quality also varies considerably because of variability in staining16 and tissue preparation, differences in scanning resolution among devices, artifacts such as folds or ink markings, and the wide variety of visual patterns within each pathology.17 Nevertheless, deep learning–based methods have recently shown promise in whole slide image (WSI) classification.18 Most applications, however, deal with binary classifications19 or classifications into broad categories,20 with less focus on actual diagnosis via multiclass categorization of all morphologic variants of various diseases, which is a necessity for a diagnostic support tool, for quality control, or for workload distribution among pathologists of different levels of expertise and compensation.
In this work, we present the performance of a supervised CNN–based algorithm trained to classify 40 morphologic diagnoses of skin tumors (Table 1) on WSIs of hematoxylin-eosin (H&E)–stained skin biopsies.
TECHNOLOGY
The WSI classification system is presented in Figure 1. The system receives a WSI as input and produces probability scores for each of the 40 classes on which it was trained. The training set consisted of punch, shave, and excisional skin biopsy cases from different body parts. Analysis of the WSI is performed in 2 stages: local and global. First, local image patches containing tissue are extracted from the WSI at 2 magnifications, and each patch is classified by a dedicated CNN for its magnification level. This produces a local semantic probabilistic description of the WSI in which each 32 × 32 pixel square of the input is represented by 2 vectors (1 per magnification) of local probability scores for each tissue type class. Because of the multiple-magnification approach, this feature map provides a good representation of both fine and coarse image features.
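As a minimal sketch, this first stage can be expressed in PyTorch as follows, assuming 2 fully convolutional patch classifiers (cnn_hi and cnn_lo, 1 per magnification) whose outputs have 1 spatial position per 32 × 32 pixel square and whose input tiles are pre-aligned to the same grid; the names and framework choice are illustrative assumptions, not the authors' implementation.

```python
import torch

N_CLASSES = 40  # diagnostic entities in the training set

def local_feature_map(tile_hi, tile_lo, cnn_hi, cnn_lo):
    """Stage-1 local description of one WSI tile.

    tile_hi / tile_lo: (1, 3, H, W) tensors covering the same tissue region
    at the 2 magnifications (assumed pre-aligned to the same grid);
    cnn_hi / cnn_lo: fully convolutional classifiers with one output
    position per 32 x 32 pixel square of the input.
    """
    with torch.no_grad():
        probs_hi = torch.softmax(cnn_hi(tile_hi), dim=1)  # (1, 40, H/32, W/32)
        probs_lo = torch.softmax(cnn_lo(tile_lo), dim=1)  # (1, 40, H/32, W/32)
    # Two probability vectors per 32 x 32 square, concatenated channelwise.
    return torch.cat([probs_hi, probs_lo], dim=1)         # (1, 80, H/32, W/32)
```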
At the second stage of classification, we used high probability values (local maxima) of the feature map described above as cues for possible tumor locations. At each such location we extracted a 200 × 200 patch of the feature map, which was then classified by a third convolutional neural network. This step allowed us to identify global WSI features, because each 200 × 200 patch of the feature map corresponds to a 6400 × 6400 pixel region of the original image.
Each local maximum thus yields a vector of class probabilities, and the classification of the entire WSI is based on the average of these vectors.
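A condensed sketch of this cue extraction and aggregation is shown below; for brevity it substitutes top-k selection for true local-maxima detection, and the cue count and function names are assumptions rather than published details.

```python
import torch
import torch.nn.functional as F

def classify_wsi(feature_map, stage2_cnn, k=8, patch=200):
    """Whole-slide classification from the stage-1 probability map.

    feature_map: (1, C, H, W) tensor of local class probabilities.
    stage2_cnn: network mapping a (1, C, patch, patch) crop to class logits.
    k: number of highest-probability cue locations (an assumed value).
    """
    # Use the maximum class probability at each location as a tumor cue;
    # top-k selection stands in for true local-maxima detection here.
    cue_map, _ = feature_map.max(dim=1)            # (1, H, W)
    _, flat_idx = torch.topk(cue_map.flatten(), k)
    W = cue_map.shape[-1]
    pad = patch // 2
    padded = F.pad(feature_map, (pad, pad, pad, pad))
    probs = []
    for i in flat_idx.tolist():
        r, c = divmod(i, W)
        crop = padded[:, :, r:r + patch, c:c + patch]  # centered on the cue
        probs.append(torch.softmax(stage2_cnn(crop), dim=1))
    # The WSI-level prediction is the average of the per-cue probability vectors.
    return torch.stack(probs).mean(dim=0)          # (1, n_classes)
```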
Employing transfer learning for both stages of classification, we used the ResNet architecture21 (ResNet34 for the first stage and ResNet18 for the second) pretrained on the ImageNet data set. For the first-stage classifiers, a fully convolutional version of ResNet was used in which the last, fully connected layer was replaced by a convolution, producing local class probabilities for each 32 × 32 pixel patch of the image.
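A sketch of this conversion, assuming torchvision's pretrained ResNet34 (the exact head and initialization details of the authors' networks are not published):

```python
import torch.nn as nn
from torchvision.models import resnet34

def make_fully_convolutional(n_classes=40):
    """Convert an ImageNet-pretrained ResNet34 into a fully convolutional
    classifier: dropping the average-pooling and fully connected layers and
    adding a 1 x 1 convolution yields one class-score vector per spatial
    position (one per 32 x 32 pixel square at ResNet's output stride of 32).
    """
    backbone = resnet34(weights="IMAGENET1K_V1")
    features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
    head = nn.Conv2d(512, n_classes, kernel_size=1)  # replaces the fc layer
    return nn.Sequential(features, head)
```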
During training of the first-stage classifiers, tissue patches at the 2 magnifications were selected randomly. The batch size was fixed at 256 images with a constant 1:4 ratio of tumor to normal patches in each batch. To account for color and cell size variations in tissue, noise with a randomness factor varying between 0.9 and 1.1 was added to the extracted patches. Cross-entropy loss with label smoothing11 was used as the cost function.
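One illustrative training step under these settings might look as follows; the smoothing value, the 51:205 split (approximately 1:4 within a 256-patch batch), and the sampler are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Cross-entropy with label smoothing; the smoothing value of 0.1 is an
# assumption, as the paper does not state it.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def training_step(model, optimizer, sample_patches):
    """One illustrative stage-1 training step (not the authors' code).

    sample_patches(n_tumor, n_normal) is an assumed sampler returning
    (images, labels); 51 tumor + 205 normal patches approximate the
    stated 1:4 ratio within a 256-patch batch.
    """
    images, labels = sample_patches(n_tumor=51, n_normal=205)
    # Multiplicative noise in [0.9, 1.1] mimicking stain/intensity variation
    # (an analogous factor could rescale patches for cell size variation).
    jitter = torch.empty(images.size(0), 1, 1, 1).uniform_(0.9, 1.1)
    images = images * jitter
    optimizer.zero_grad()
    loss = criterion(model(images), labels)  # model outputs (N, n_classes)
    loss.backward()
    optimizer.step()
    return loss.item()
```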
The second-stage classifier was trained on additional annotated WSIs, each analyzed by the first-stage classifiers to produce feature maps. These feature maps were augmented by adding small Gaussian noise with zero mean.
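A minimal sketch of that augmentation, with an assumed standard deviation (the paper specifies only small, zero-mean noise):

```python
import torch

def augment_feature_map(feature_map, sigma=0.01):
    """Add small zero-mean Gaussian noise to a stage-1 feature map; sigma
    is an assumed value (the paper states only 'small' with zero mean)."""
    noisy = feature_map + sigma * torch.randn_like(feature_map)
    return noisy.clamp(0.0, 1.0)  # keep entries valid as probabilities
```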
MATERIALS AND METHODS
H&E-stained glass slides for 300 punch, shave, and excisional biopsy cases of the face, neck, back, and arms were selected for the study. Each case consisted of a single slide. Of these, 261 (87%) were received from Associated Laboratory Physicians (Harvey, Illinois) and belonged to White patients, whereas 27 (9%) were received from Muhimbili National Hospital (Dar es Salaam, Tanzania) and 12 (4%) from Butaro District Cancer Hospital (Butaro, Burera District, Rwanda), for a total of 39 (13%) belonging to Black patients, introducing skin color variability (Table 2). All cases had been diagnosed previously by light microscopy. Two senior board-certified pathologists uninvolved in algorithm development served as the primary evaluators; both are medical directors of hospital pathology laboratories in the Chicago, Illinois, area. The 300 cases were randomly divided into 2 sets of 150, and each pathologist-evaluator blindly reassessed the scanned images of 1 set.
The slides showed notable variability in staining quality and histology and contained artifacts, such as folds and ink markings, representative of real-world pathology slides (Figure 2). The cases in the validation set had the following diagnoses confirmed by board-certified pathologists via light microscopy: 102 with basal cell carcinoma (BCC; 34% of 300), all from White patients; 59 with nodular melanoma (20%), including 33 (11%) from Black patients and 26 (9%) from White patients; 17 with lentigo maligna melanoma (6%), including 6 (2%) from Black patients and 11 (4%) from White patients; 15 with superficial spreading melanoma (5%) from White patients; 10 with dysplastic nevus (3%) from White patients; 79 with intradermal nevus (26%) from White patients; and 18 with compound nevus (6%) from White patients.
H&E-stained glass slides were scanned at ×40 magnification using a high-resolution Motic Digital Pathology EasyScan Pro 6 scanner at 0.26 μm per pixel and a Ventana iScan Coreo scanner. The scanned WSIs were then independently interpreted by the Mechanomind algorithm at the University of Chicago Ingalls Memorial Hospital medical campus (Harvey, Illinois) and classified into 1 of 3 diagnostic classes: BCC, melanoma, or nevus (Figure 3). BCC was selected for its high incidence among malignant cases and to address the recognition of keratinocytic versus melanocytic tumors. Melanoma was chosen as the most severe common skin cancer type. The nevus category was selected for its high overall incidence and to address the recognition of melanoma versus melanocytic nevi, a distinction that sometimes leads to disagreement among pathologists. No patient history and no clinical or gross examination information were used; only the de-identified WSI was presented to the software algorithm. Batches of 10 images were processed in up to 2 minutes per batch. One of the evaluators compared the algorithm's results with the confirmed diagnoses, classifying each case as a true positive, true negative, false positive, or false negative for each of the 3 classes.
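These per-class tallies map directly to sensitivity and specificity; a minimal sketch of that computation (the label strings are illustrative):

```python
def sensitivity_specificity(truths, predictions,
                            classes=("melanoma", "nevus", "bcc")):
    """Per-class sensitivity and specificity from one-vs-rest case outcomes.

    truths / predictions: parallel lists of class labels, one per case;
    assumes every class appears at least once among truths and predictions.
    """
    metrics = {}
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(truths, predictions))
        fn = sum(t == cls and p != cls for t, p in zip(truths, predictions))
        tn = sum(t != cls and p != cls for t, p in zip(truths, predictions))
        fp = sum(t != cls and p == cls for t, p in zip(truths, predictions))
        metrics[cls] = {
            "sensitivity": tp / (tp + fn),  # e.g., 89/91 = 97.8% for melanoma
            "specificity": tn / (tn + fp),  # e.g., 204/209 = 97.6% for melanoma
        }
    return metrics
```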
RESULTS
Compared with human-rendered diagnoses, the image recognition algorithm identified 89 of the 91 melanoma cases (97.8% sensitivity), 107 of the 107 nevus cases (100% sensitivity), and 101 of the 102 BCC cases (99.0% sensitivity). The algorithm also identified 204 of the 209 cases not containing melanoma (97.6% specificity), 189 of the 193 cases not containing a nevus (97.9% specificity), and 198 of the 198 cases not containing BCC (100% specificity), as seen in Table 3. Of the 2 melanoma cases that the algorithm did not recognize, one contained desmoplastic melanoma (Figure 4, A), which was absent from the training set, and the other contained superficial spreading melanoma (Figure 4, B) with an atypical presentation of small standalone clusters of cancer. The 1 case of BCC not recognized by the algorithm is shown in Figure 4, C. One case initially misdiagnosed as melanoma by the primary pathologist, and initially overlooked by the evaluator during validation, was correctly recognized by the algorithm as a nevus, as both evaluators later confirmed during the final review of results.
CONCLUSIONS
Although additional studies are needed to demonstrate diagnostic accuracy for rare entities, deep learning techniques applied to microscopic imaging have the potential to alleviate the worldwide shortage of pathologists, improve diagnostic accuracy and turnaround times, raise standards of care, reduce health care costs, bring diagnostic expertise to underserved locations, and improve access to quality care.
References
Competing Interests
Brodsky is on the advisory board and has stock options at Mechanomind Inc. Levine, Polak, and Chervony are employees of and own shares of Mechanomind Inc. The other authors have no relevant financial interest in the products or companies described in this article.