The goal of the lymphocytosis diagnosis approach is to classify the lymphocytosis as benign or neoplastic. Nevertheless, a nonnegligible percentage of laboratories fail in that classification.
To design and develop a machine learning model for lymphocytosis classification, using objective data from the DxH 800 analyzer (cell population data, leukocyte and absolute lymphoid counts, hemoglobin concentration, and platelet counts) together with age and sex.
A total of 1565 samples were included from 10 different lymphoid categories grouped into 4 diagnostic categories: normal controls (458), benign causes of lymphocytosis (567), neoplastic lymphocytosis (399), and spurious causes of lymphocytosis (141). The data set was distributed in a 60-20-20 scheme for training, testing, and validation stages. Six machine learning models were built and compared, and the selection of the final model was based on the minimum generalization error and 10-fold cross validation accuracy.
The selected neural network classifier achieved a global 10-class classification validation accuracy of 89.9%, which, when grouped into the aforementioned 4 diagnostic categories, corresponded to a diagnostic impact accuracy of 95.8%. Finally, a prospective proof of concept performed with 100 new cases yielded a global diagnostic accuracy of 91%.
The proposed machine learning model was feasible, with a high benefit-cost ratio, as the results were obtained within the complete blood count with differential. Finally, the diagnostic impact with high accuracies in both model validation and proof of concept encourages exploration of the model for real-world application on a daily basis.
An absolute lymphoid count (ALC) above 5 × 10³/μL and/or the presence of the atypical lymphoid flag triggers the analysis of the peripheral blood smear1 to classify the alteration as benign or malignant2,3 from the assessment of morphologic features. The consensus for morphologic diagnosis,4 and subsequently the International Council for Standardization in Hematology (ICSH),5 established that atypical lymphocytes should be further divided into “suspect reactive” (reactive lymphocytes), “suspect neoplastic” (abnormal lymphocytes), and “uncertain nature” categories.
The correct morphologic identification of lymphoproliferative neoplasms with peripheral expression is of utmost importance to trigger subsequent clinical decisions.2,3 Nevertheless, performance assessment through external quality assessment (EQA) schemes reveals that, on average, only 75% of participants correctly diagnose reactive cases,6–8 while correct neoplastic assignment remains below 70%.6–9
Modern analyzers, such as the Beckman Coulter DxH 800 (Miami, Florida), record cell population data (CPD) parameters of the leukocyte subpopulations by using impedance, conductivity, and scatter measures with the VCS (volume-conductivity-scatter) technology.10–12 These parameters have specific mean and standard deviation values for normal cells, and any variation points to a particular pathology.12
Machine learning (ML) is a tool that uses information from different sources, finding patterns in training data that allow predictions to be made for classification purposes.13,14 Building on prior experiences using ML and CPD to address lymphocytosis diagnosis in benign or neoplastic categories,15,16 excluding spurious lymphocytosis,17 the current work increases the diagnostic complexity by combining (1) 3 benign causes of lymphocytosis (viral infection [VI], persistent polyclonal B-cell lymphocytosis [PPBL], and postsplenectomy lymphocytosis); (2) 4 different types of neoplastic lymphocytosis (chronic lymphocytic leukemia [CLL], monoclonal B-cell lymphocytosis [MBL] CLL-like, and peripheral expression of follicular lymphoma [FL] and splenic marginal zone lymphoma [SMZL]); and (3) 2 spurious categories with lymphocytosis due to erythrocyte-lysis resistance (anemia and liver disease), which together with the normal controls make a total of 10 different categories.
The goal was to develop a CPD-based ML model as an aid to the lymphocytosis diagnosis approach and evaluate its feasibility, standardization, and benefit-cost ratio. Moreover, as it is expected that this ML model would be integrated into the general routine workflow, a proof of concept (POC) was also designed to evaluate its daily-basis performance.
MATERIALS AND METHODS
The Figure summarizes the different phases of the ML approach, following a scheme previously proposed for health care ML models.18 Data analysis and modeling were performed by using Python (3.6) and the Scikit-Learn (0.18.1) library.14,19,20
The focus was the classification of samples with lymphocytosis (ALC > 3.5 × 10³/μL) and/or the presence of an atypical lymphoid flag into the categories recommended by the ICSH5: (1) benign, (2) neoplastic, or (3) spurious.
A total of 1565 samples were included corresponding to (1) 458 normal controls; (2) categories with lymphocytosis including benign causes such as 346 cases of VI, 107 cases of expansion of polyclonal B cells (with or without lymphocytosis) with binucleated lymphocytes, and 114 cases involving postsplenectomy patients; (3) neoplastic cases including 191 CLL, 90 MBL CLL-type, and cases with peripheral blood expression of SMZL (n = 87) and FL (31 cases from 15 patients); and (4) spurious causes due to erythrocyte-lysis resistance in patients with anemia (n = 109) or liver disease (n = 32). Data use authorization was obtained from the institutional review board. The treatment of data was carried out in accordance with the requirements of the legislation issued for protection of the data and privacy of the patients.
Hairy cell leukemia and mantle cell lymphoma were not included in the present study. The former was excluded because hairy cells are misinterpreted as monocytes by the DxH 800, and previous studies have yielded specific CPD flags for the identification of hairy cells within the abnormal monocyte cluster.21 Mantle cell lymphoma was disregarded because information on cyclin D1 was not available in the suspected cases for diagnostic confirmation.
Venous blood samples were collected into tubes containing EDTA as an anticoagulant, and the analyses were performed within 6 hours after collection. The diagnostic confirmation and classification of the CLL, MBL, SMZL, and FL samples were based on flow cytometric immunophenotype, following the World Health Organization (WHO) consensus guideline and according to the CLL scoring system.22,23 The B-cell panel included CD19/CD5/CD23/CD10/CD103/CD79b/CD20/K/L/FMC7/CD38. The PPBL cases showed an expansion of B lymphocytes (CD19+) that were CD5−, CD23+ with representation of both κ and λ light chains.24 Flow cytometric analysis was performed with a BD FACSCalibur cytometer (Becton-Dickinson, Franklin Lakes, New Jersey). Mouse anti-human monoclonal antibodies (DAKO Corporation, Carpinteria, California) were used as the primary antibodies, while anti-human κ and λ light-chain polyclonal antibodies were used for immunoglobulin light-chain expression analysis (Thermo Fisher Scientific, Waltham, Massachusetts); all stains were performed with appropriate controls. Viral infections were confirmed by chemiluminescence immunoassays (Liaison, DiaSorin, Italy), with positivity for Epstein-Barr virus or cytomegalovirus immunoglobulin M antibodies being the inclusion criterion.
The included features corresponded to population parameters (age and sex), hematologic parameters including leukocyte counts and ALC, hemoglobin concentration, and platelet counts, and lymphoid CPD, using a Beckman Coulter DxH 800 instrument with software version 126.96.36.199. Table 1 details the clinical characteristics of the patients, the percentage of patients with lymphocytosis, and the analyzer flags. For adults, lymphocytosis was considered when the ALC was above 3.5 × 10³/μL, while for children the upper limit was defined by taking into account the age-specific normal ranges.10
All groups were composed of adult samples, with the exception of the VI group, with 53% of the samples corresponding to a pediatric population (younger than 18 years) owing to the high specific age-related frequency of VIs due to Epstein-Barr virus in adolescents and young adults.10
The DxH 800 provides CPD corresponding to cell volume, conductivity, and 5 scatter measurements; each of these 7 measurements is reported as a mean and an SD value, for a total of 14 CPD parameters.
Cell population data corresponding to scatter measurements were obtained by sensors placed at different angles around the flow cell collecting the scattered light of interest. Such scatter angles corresponded to 5.1° (low angle light scatter, LALS), 10° to 20° (lower median angle light scatter, LMALS), 20° to 42° (upper median angle light scatter, UMALS), a channel that is the sum of the UMALS and LMALS regions (median-angle light scatter), and the axial light loss at 0°.11
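As an illustration, the 20 input features described above (age, sex, the 4 hematologic values, and the 14 lymphoid CPDs) could be assembled into a single model input row as sketched below. The short measurement labels and the helper function are hypothetical, not the analyzer's or the authors' actual field names.

```python
# Hypothetical feature assembly: 7 lymphoid CPD measurements (volume,
# conductivity, and the 5 scatter channels), each reported as a mean and
# an SD, plus demographics and CBC-derived counts. The short names below
# are illustrative labels, not official DxH 800 field names.
CPD_MEASURES = ["volume", "conductivity", "LALS", "LMALS", "UMALS", "MALS", "ALL"]

def feature_row(age, sex, wbc, alc, hgb, plt, cpd_mean, cpd_sd):
    """Build one 20-feature input row: 6 clinical features + 14 CPDs."""
    assert set(cpd_mean) == set(CPD_MEASURES)
    assert set(cpd_sd) == set(CPD_MEASURES)
    row = [age, sex, wbc, alc, hgb, plt]     # sex numerically encoded
    row += [cpd_mean[m] for m in CPD_MEASURES]  # 7 mean values
    row += [cpd_sd[m] for m in CPD_MEASURES]    # 7 SD values
    return row
```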
Data Set Distribution
The 1565 cases from the total data set were distributed into the training, testing, and validation data sets in a 60-20-20 scheme. A total of 1278 cases (81.7%) were randomly assigned to the model-building data set (training-testing), and the remaining 287 cases (18.3%) were randomly assigned to the validation data set, used exclusively during the validation phase.13 The model-building data set was used for model tuning, training-testing, evaluation, and selection18,20 (Figure).
Subsequently, the model-building data set was randomly divided into training and testing data sets, controlled by setting the random state equal to zero.20 Of the 1278 cases from the model-building data set, 958 (75.0%) were included in the training data set, and the remaining 320 cases (25.0%) were included in the testing data set. The training-testing data sets represented 61.2% (n = 958) and 20.4% (n = 320) of the total data sets (n = 1565), respectively.
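The 60-20-20 distribution can be sketched with scikit-learn's train_test_split (the library the authors report using). The feature matrix below is a random stand-in, and the resulting case counts differ slightly from the published 958/320/287 split, which corresponded to 61.2/20.4/18.3 of the total.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in for the 1565-sample feature matrix and 10-class labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1565, 20))
y = rng.integers(0, 10, size=1565)

# Hold out ~20% as the validation set, untouched until the validation phase
X_build, X_val, y_build, y_val = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Split the model-building set 75/25 into training and testing sets,
# i.e., ~60% and ~20% of the full data set
X_train, X_test, y_train, y_test = train_test_split(
    X_build, y_build, test_size=0.25, random_state=0)
```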
Machine Learning Development
All the ML processes were performed by using 6 different classification algorithms: decision trees (DTs), random forests (RFs), naive Bayes classifier (NBC), k-nearest neighbor (KNN), neural networks (NNs), and support vector machines (SVMs).19 To build the KNN, SVM, and NN models, the features were first rescaled by using the Scikit-Learn MinMaxScaler transformer.14
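Under the assumption that the models were built with scikit-learn as the authors describe, the 6 classifiers and the scaling step for the scale-sensitive ones could be wired up as follows; the hyperparameters here are library defaults, not the tuned values of Table 2.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Only the distance- and gradient-based models (KNN, SVM, NN) receive the
# MinMaxScaler step; tree-based models and naive Bayes are insensitive to
# per-feature linear rescaling.
models = {
    "DT":  DecisionTreeClassifier(random_state=0),
    "RF":  RandomForestClassifier(random_state=0),
    "NBC": GaussianNB(),
    "KNN": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(MinMaxScaler(), SVC(random_state=0)),
    "NN":  make_pipeline(MinMaxScaler(), MLPClassifier(random_state=0, max_iter=500)),
}
```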
The selection of the best combination of parameters for each ML algorithm14 was performed with the Scikit-Learn GridSearchCV, using 10-fold cross (10-FC) validation.14,19 Table 2 details the parameters and the range evaluated for each classifier. For the remaining parameters, default values as described in the Scikit-Learn library were used.
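A minimal sketch of this grid search step, assuming scikit-learn's GridSearchCV with cv=10 as described; the synthetic data and the candidate grid are placeholders rather than the ranges in Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (200 samples, 4 classes)
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Exhaustive search over a toy parameter grid with 10-fold cross validation;
# unspecified parameters keep their Scikit-Learn defaults
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [3, 5, 7, 9]},
                    cv=10)
grid.fit(X, y)
best_k = grid.best_params_["n_neighbors"]
```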
Training-testing accuracies and the variance among them were obtained for each of the 6 ML algorithms to analyze model overfitting and its generalization error.14
Model Evaluation: 10-Fold Cross Validation
The selection of the final model for the validation and performance analysis was based on the minimum generalization error13 (from the training-testing analysis) and the 10-FC validation accuracy.
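The selection rule (highest 10-FC accuracy, with the smallest train-test gap as the tiebreaker) can be expressed directly. The NN figures below are the ones reported in the text; the SVM and RF training accuracies are illustrative placeholders, not published values.

```python
# (train accuracy, test accuracy, 10-fold CV accuracy); NN values as
# reported, SVM and RF train accuracies are illustrative placeholders
results = {
    "NN":  (0.950, 0.870, 0.838),
    "SVM": (0.970, 0.860, 0.830),
    "RF":  (0.995, 0.858, 0.795),
}

def generalization_gap(train_acc, test_acc):
    """Overfitting estimate: difference between training and testing accuracy."""
    return train_acc - test_acc

# Prefer the highest cross-validation accuracy; break ties with the
# smallest generalization gap
best = max(results, key=lambda m: (results[m][2],
                                   -generalization_gap(*results[m][:2])))
```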
Model Validation and Diagnosis Impact
The performance of the chosen model was evaluated by using the validation data set (n = 287) by means of the global accuracy, precision, recall (sensitivity), and F1 score performance metrics.14,20 To obtain the diagnostic impact, the distribution of the validation data set within the confusion matrix was grouped into the recommended diagnostic categories (normal controls, benign, neoplastic, and spurious).4,5 Globally weighted and individual class accuracies were obtained. Confusion matrices were detailed for both the grouped categories and the individual classes.
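The grouping of the 10 predicted classes into the 4 diagnostic categories, which is what allows the diagnostic impact accuracy to exceed the 10-class accuracy, can be sketched as follows; the class labels and the mapping helper are illustrative, not the authors' code.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Illustrative mapping of the 10 cell classes onto the 4 diagnostic categories
CATEGORY = {
    "control": "normal",
    "VI": "benign", "PPBL": "benign", "postsplenectomy": "benign",
    "CLL": "neoplastic", "MBL": "neoplastic",
    "FL": "neoplastic", "SMZL": "neoplastic",
    "anemia": "spurious", "liver_disease": "spurious",
}
LABELS = ["normal", "benign", "neoplastic", "spurious"]

def diagnostic_impact(y_true, y_pred):
    """Collapse 10-class labels into diagnostic categories and return the
    grouped accuracy plus the 4 x 4 confusion matrix."""
    t = [CATEGORY[c] for c in y_true]
    p = [CATEGORY[c] for c in y_pred]
    return accuracy_score(t, p), confusion_matrix(t, p, labels=LABELS)

# A CLL case predicted as MBL CLL-like is a 10-class error but still a
# correct neoplastic assignment at the diagnostic-category level
acc, cm = diagnostic_impact(["CLL", "MBL", "VI", "control"],
                            ["MBL", "CLL", "VI", "control"])
```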
Proof of Concept
Prospectively, a total of 100 cases with lymphocytosis and/or the presence of the atypical lymphoid flag were selected as a POC data set, to evaluate the model functioning beyond the development phase.13,25 The morphologic diagnosis was performed by 2 independent observers (1 clinical pathologist and 1 hematologist) as the ground truth. Cell morphology was evaluated with a microscope. The morphologic classification of the cases in the POC data set was performed according to ICSH recommendations for the standardization of peripheral blood cell morphologic features.5 A comparison between observers and ML diagnosis was performed. Finally, diagnostic category accuracy and weighted global accuracy were also obtained for the POC evaluation.
RESULTS
Machine Learning Development
The chosen parameters are detailed in Table 2. No tuning was performed for the NBC, and standard settings were used.
Table 2 also displays the obtained classification accuracies for the training, testing, and 10-FC validation of each particular model. The variance (model overfitting) reflects the difference between training and testing accuracies.
Overall, with the exception of the NBC, the algorithms provided training accuracies above 90.0%. Testing results revealed the highest accuracies for the NN, RF, and SVM classifiers, in that order, with values above 85%. Finally, the variance revealed that the KNN, DT, and RF classifiers presented the highest overfitting of the training data, with values between 14% and 25%.
Model Evaluation: 10-Fold Cross Validation
The best model after 10-FC validation corresponded to the NN classifier, with an accuracy of 83.8%, followed by the SVM (83.0%) and RF (79.5%) classifiers (Table 2). Analysis of variance (ANOVA) with the post hoc Tukey honest significance test revealed no statistically significant differences between the obtained 10-FC validation accuracies for the NN, SVM, and RF classifiers. ANOVA was applied because the accuracy distributions were Gaussian, as corroborated by the Shapiro-Wilk test.
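This statistical comparison can be reproduced in outline with SciPy. The per-fold accuracies below are simulated placeholders centered on the reported means (the paper publishes only the means, and the spread is an assumption); a post hoc Tukey test, e.g., statsmodels' pairwise_tukeyhsd, would then localize any pairwise difference.

```python
import numpy as np
from scipy.stats import f_oneway, shapiro

# Simulated per-fold 10-FC accuracies centered on the reported means;
# the spread (SD = 0.04) is an assumption, not a published value
rng = np.random.default_rng(0)
nn = rng.normal(0.838, 0.04, 10)
svm = rng.normal(0.830, 0.04, 10)
rf = rng.normal(0.795, 0.04, 10)

# Shapiro-Wilk checks the Gaussian assumption before applying ANOVA
gaussian = all(shapiro(g)[1] > 0.05 for g in (nn, svm, rf))

# One-way ANOVA across the three per-fold accuracy distributions
stat, pvalue = f_oneway(nn, svm, rf)
```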
The selected model was the NN classifier, as it presented the highest 10-FC validation accuracy. Even though no statistical differences were found among the NN, SVM, and RF classifiers, the former presented the lowest variance or generalization error (8.0% versus 10%–14%).
Model Validation and Diagnosis Impact
The ML model based on NN rendered classification accuracies of 95.0% for the training set, 87.0% for the testing set, and 83.8% for the 10-FC validation (Table 2). For the validation, 287 new samples were used, and the obtained performance metrics corresponded to a global accuracy of 89.9%, precision of 91.0%, recall of 90.0%, and an F1 score of 90.0%. The distribution of the validation-set cases grouped into the recommended diagnostic categories (normal controls, benign, neoplastic, and spurious) yielded a global weighted accuracy of 95.8%.
Table 3 details the confusion matrix of the validation data set into the 4 aforementioned diagnostic categories. The rows display the true diagnosis and the columns the predicted diagnosis by the algorithm.14,20 All normal controls were correctly classified within the normal control category, while 93.0%, 96.0%, and 92.3% of benign, neoplastic, and spurious lymphocytosis, respectively, were correctly classified by the NN model. The overall misclassification was below 5.0%.
Finally, Table 4 presents a detailed confusion matrix with the distribution of the 287 cases of the validation data set into the 10 different categories. The bold values on the diagonal correspond to the true-positive rates (TPRs) for each cell class. The absolute numbers of cases are shown in parentheses.
Notably, both normal controls and postsplenectomy patients had TPRs of 100%. Moreover, all but 3 of the categories presented TPRs above 75.0%. Among the benign categories, after postsplenectomy patients, VI showed a TPR of 95.1%, while PPBL showed a lower outcome (77.8%), with misclassification into the SMZL and normal control categories. Among the neoplastic categories, CLL and MBL CLL-like presented the higher TPRs, of 88.6% and 76.5%, respectively. On the other hand, with the exception of SMZL cases, when misclassification occurred it took place only within the neoplastic category. Finally, the outcome for spurious lymphocytosis showed a 95.0% TPR for anemia with erythrocyte lysis-resistance, while the lower outcome for patients with liver disease (66.7%) reflected misclassification mostly into the other spurious category.
In further detail, of the 287 cases of the validation set, 86 were normal controls, and the remaining 201 were distributed as follows: 75 corresponded to neoplastic lymphocytosis, 100 to benign lymphocytosis, and 26 to spurious lymphocytosis. The error analysis revealed that the normal control and SMZL categories absorbed most of the incorrectly classified cases from other categories. Still, only 5 of the 201 pathologic cases (2.5%) were incorrectly classified into the normal control category, and 5 of the 212 nonneoplastic cases (2.4%) were incorrectly classified as SMZL. The highest misclassification rates corresponded to 3 PPBL cases (16.7%) and 1 liver disease case (16.7%) incorrectly classified as SMZL.
Proof of Concept
From the 100 cases in the POC data set, 36 were diagnosed as benign, 46 as neoplastic, and 18 as spurious lymphocytosis. The comparison between the observers' consensus diagnosis and that obtained by the ML model presented an overall concordance of 76%. Considering the diagnostic categories, Table 5 shows the accuracy obtained for each. True-positive rates were above 88.0% in the 3 categories, and the weighted global accuracy corresponded to 91.0%.
DISCUSSION
The diagnostic scenario for B-cell neoplasms has changed significantly from the Rappaport classification to the latest WHO classification of tumors.22,26 The latter established a starting point where the diagnosis of lymphoid malignancies is based on objective, standardized, and reproducible markers established by consensus of a panel of experts. However, some entities remain within a gray zone owing to morphologic or immunophenotypic overlap.26
Although stand-alone morphologic classification has some limitations, it still serves as the starting point for the diagnostic approach and triggers other ancillary confirmatory techniques. Our proposed ML model is also based on morphologic features, as CPD are automated data on the physical properties of the different leukocyte subpopulations, using a combination of impedance (for cell size), conductivity (for nuclear density, granularity, and nucleus-cytoplasm ratio), and scatter measures (for structure, shape, and reflectivity of the cell).10–12 Nevertheless, these measurements are objective, reproducible, and importantly, standardizable.
The ultimate aim of this work was to evaluate the diagnostic impact of an ML model on the lymphocytosis diagnosis approach into the different categories. For this purpose, we evaluated the performance of the models, but also their feasibility for their actual use in the clinical laboratory, for which the input data must be standardizable and subject to rigorous quality controls (QCs) in a quality regulation framework.13
In terms of performance, the ultimate goal in the fine-tuning of an ML model is the achievement of the Bayes optimal error, generally equated to human-level performance.27 Considering the results obtained from EQA schemes as a measure of human-level performance, they were comparable with those achieved by our ML model in the detection of VI (64.3%–86% versus 95.1%) and neoplastic CLL and FL (CLL: 77%–93.1% versus 88.6%; FL: 56.7%–57.9% versus 50%) categories.6–9 Error analysis of our ML model highlighted that misclassification among the categories was as expected given the morphologic overlap among them. Such similarities included those observed between the CLL and MBL CLL-like categories, as the differentiation between them is based on the size of the clonal population,22 and those observed between FL and SMZL with CLL. The overlap between FL and CLL could be explained by the atypical CLL morphology (associated with trisomy 12) with more than 15% of cells with cleaved nuclei resembling follicular lymphoma cells, and by the presence of lymphoplasmacytic morphology, as found in SMZL.28,29 Similar misclassification rates for FL as CLL were also found in the results of EQA schemes.8 Besides, the resembling morphologic patterns of PPBL and SMZL explain the observed overlap between these 2 categories.16,23 Of note, it is important to highlight that 100% of MBL CLL-like cases were classified within the neoplastic category, even with mean ALCs as low as 4.8 × 10³/μL and with 26.7% of the cases lacking the atypical lymphoid flag.
Secondly, we evaluated certain aspects of the feasibility of the model. Our proposed ML model uses numerical data from the Beckman Coulter DxH 800 analyzer.11 Although CPD have no specific QC yet, the VCS technology that provides them has already been used for other reported parameters and hence is subjected to QC. This makes their inclusion possible, with specific analytic studies and proper software modifications, as in the case of the Monocyte Distribution Width, previously evaluated and cleared as a hematology sepsis biomarker by the US Food and Drug Administration (FDA).30
Moreover, the diagnosis of neoplastic categories was based on flow cytometry, a “high quality” reference standard. In summary, the used input data are objective, standard, potentially subject to QC, and previously validated by the FDA for other applications. Such characteristics are encouraging in terms of feasibility for its application in the clinical laboratory.
Finally, the diagnostic impact of the ML model was assessed through validation studies (with an independent validation data set)31 and by means of an independent, prospective POC.25 The ultimate goal of the lymphocytosis diagnosis approach is to classify the case as benign or neoplastic2,3,5; moreover, we included the spurious category, as it was previously reported that the erythrocyte-lysis resistance phenomenon may lead to spurious lymphocytosis when using the DxH 800 analyzer.17
The model validation revealed a global, weighted validation accuracy for the different diagnostic categories of 95.8%, and a 91.0% accuracy in the POC. An accuracy above 90% is remarkable in view of the difficulty of the diagnostic problem, achieving, or even outperforming, the human-level performance reported by the EQA schemes.6–9
The main limitations affecting any ML model are related to data and to the interpretability/explainability of the model. In our model, data limitations could affect the mantle cell lymphoma category because of the unavailability of the reference standard for its confirmation, but also because of the type of features included to feed the algorithm, which are based on morphologic parameters, with known gray areas, especially to differentiate among B-cell neoplasms. In addition, this model was built with data from a single institution, and the exchange of data between several institutions is mandatory for further implementation.32 Regarding model interpretability, as Aristotle noted: “the ability to explain how results are produced can be less important than the ability to produce such results and empirically verify their accuracy.”33 Thus, the empirical verification of an ML model presupposes a real-world prospective evaluation to assess its performance,31,34–36 using truly independent data sets to capture the variety of real-world conditions.33,34,37 Moreover, an important part of the explainability of a model goes through a data transparency policy,32 as ML models involve not only algorithms but also data, and the same algorithm could lead to different outcomes when trained with different data.38 For such reasons, the characteristics of the included groups were fully detailed for the readership to determine whether the data set was appropriate.
To conclude, the proposed ML model not only achieves comparable human-level performance but also meets specific requirements for feasibility, such as high-quality labeled data in terms of a reference standard (flow cytometry) and standardized, objective features subject to calibration and QC procedures. Of note, the application of the model has a high benefit-cost ratio, as the required data are obtained from the complete blood count with differential. The diagnostic impact, with high accuracies in both the model validation and the performed POC, provides encouraging possibilities for taking the model to the next level, training it with more categories and cases and with a multicentric perspective, for its deployment in real-world application on a daily basis. Importantly, the application of ML requires a structure of regulatory and legal frameworks to ensure safe delivery of medical agents or other medical devices,35 as artificial-intelligence–based technologies are defined as “Software as a Medical Device” by both the FDA and the International Medical Device Regulators Forum.32 Hopefully, after proper verification and regulatory processes, an ultimate version of this model could serve as a diagnostic support tool integrated into the laboratory workflow,13 where the specialist's responsibility would be to manage the information extracted by the ML model in the clinical context of the patient,39 as a physician-in-the-loop or in symbiosis with ML models, on the path to achieving high-performance medicine.31
The authors have no relevant financial interest in the products or companies described in this article.