ABSTRACT
Considering the nonspecific nature of gastrointestinal complaints and the broad differentials of gastrointestinal symptomatology, imaging plays a vital role in the formulation of diagnoses. As a result, artificial intelligence (AI) tools have emerged to assist radiologists in the interpretation of gastrointestinal imaging and to mitigate diagnostic errors. Among the main subtypes of AI applied in this field is deep learning (DL), a subfield of machine learning (ML) that uses artificial neural networks to analyze data and has proven to be superior to traditional ML methods in radiologic imaging analysis. In this review, we discuss DL applications in gastrointestinal imaging across different modalities, including x-ray imaging, ultrasonography, computed tomography, magnetic resonance imaging, and positron emission tomography. Moreover, we outline the challenges and ethical considerations facing the growing role of AI in clinical practice.
INTRODUCTION
Gastrointestinal (GI) diseases are a wide range of disorders that affect the digestive tract, including the esophagus, stomach, small bowel, and colon, as well as the accessory organs, including the liver, gallbladder, and pancreas.[1] Patients with these disorders can present with nonspecific symptoms such as abdominal pain, nausea, vomiting, diarrhea, and constipation, which can lead to diagnostic challenges. Radiologic imaging techniques such as ultrasonography, x-ray imaging, computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET) play an essential role in diagnosing and managing various GI disorders.[2] Artificial intelligence (AI) was first defined in 1956, and significant advancements in healthcare have since been made possible with its use.[3] AI refers to computer systems that can perform human tasks by making decisions learned from previous experience.[3] AI systems have many applications in healthcare, including assisting in patient diagnosis, predicting patients’ prognoses, transcribing notes, and organizing patient data.[4] Machine learning (ML) and deep learning (DL) are two AI methods widely used in healthcare settings.[5] ML emphasizes the use of machines to perform intelligent tasks by learning from their previous mistakes, while DL is a subfield of ML that uses artificial neural network architectures to process data and extract features.[6] A convolutional neural network (CNN) is a type of DL architecture that uses convolutional layers with shared weights for image classification and analysis.[5] The use of AI in diagnostic radiologic imaging has evolved significantly over the current decade.[7] ML has demonstrated strong performance in classifying and analyzing radiologic images.[7] DL has also been applied to image analysis by using neural networks and has shown greater performance than traditional ML models.[7] AI-based radiologic imaging can improve diagnostic efficiency, reduce interpretation errors, and optimize the imaging workflow.[8] The use of AI in radiologic imaging of GI conditions has been reported widely in the literature, including organ segmentation, automated lesion detection, and computer-aided diagnosis.[9]
AI has significantly advanced GI imaging and played a crucial role in developing predictive models, leading to improvements in the detection, diagnosis, and prognosis of many diseases.[10,11] The radiologic techniques that have benefited from AI include ultrasonography, x-ray imaging, CT, MRI, and PET-CT. Furthermore, the automated analysis of GI images commonly involves ML, natural language processing (NLP), and CNN.[9] The rise of AI in healthcare, especially in diagnostic radiology, has opened unparalleled possibilities for improving the standard and effectiveness of patient care. However, this rapid advancement brings numerous challenges, including the need for sufficient data quality and quantity, incorporation into clinical routines, software security, and numerous ethical and legal issues.[10,12] Although there is extensive discussion in the literature regarding the ethical aspects and challenges associated with the application of AI, a universally accepted and comprehensive framework for the ethical development and implementation of AI in healthcare is yet to be established.[13] To fully leverage the potential of AI, it is essential to invest in building the necessary infrastructure that fosters data sharing, adheres to uniform information governance policies, and encourages innovation while ensuring the confidentiality of patient information. Additionally, it is crucial to establish the groundwork for educating and preparing the workforce for an AI-driven healthcare system in the future.[14] In this review, we aim to explore the current role of DL in radiologic imaging of GI diseases.
METHODS
We searched through Google Scholar and PubMed databases and included articles that discussed DL application in radiologic imaging of different GI conditions. Our focus was on studies that fit within the scope of our review purpose from 2018 onward, and we cited older studies only if they had an impact on the role of DL in GI imaging. Our search terms in the field of DL in radiologic imaging included gastrointestinal, esophageal, gastric, pancreatic, hepatobiliary, small bowel, and colorectal diseases.
Throughout this review we will discuss articles that report a Dice similarity coefficient (DSC), or simply Dice score, which measures segmentation performance and spatial overlap between two regions while accounting for both false positives and false negatives. The DSC ranges from 0 to 1, where 0 means no overlap and 1 means perfect agreement between the proposed method and the ground truth.[15] We will also discuss radiomics, which is the process of quantifying medical images into high-dimensional data related to predictive targets and extracting valuable information to enhance clinical decision-making.[16] In addition, the receiver operating characteristic curve is a graphical representation used to evaluate the performance of a classification model or diagnostic test. A commonly reported statistic is the area under the curve (AUC), which quantifies the overall ability of the model to discriminate between positive and negative cases. In practice, AUC values range from 0.5 to 1.0: an AUC of 0.5 means the test performs no better than random chance in distinguishing, for example, between benign and malignant tumors, whereas an AUC of 1.0 signifies perfect discrimination.[17] Another commonly reported statistic, the C-index, or concordance index, is primarily used in survival analysis and binary classification to evaluate the discriminatory power of a predictive model. It measures how well the model can distinguish between different outcomes or risk levels.[18]
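To make these two metrics concrete, both can be computed directly from predictions and reference data. The following minimal NumPy sketch (our own illustration, not code from any cited study) implements the DSC from binary masks and the AUC via its rank-based (Mann-Whitney) formulation:

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient: 2|A∩B| / (|A| + |B|)."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    inter = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    # Two empty masks agree perfectly by convention.
    return 2.0 * inter / denom if denom else 1.0

def auc_score(labels, scores):
    """AUC as the probability that a random positive case is scored
    above a random negative case (ties count half)."""
    labels = np.asarray(labels, bool)
    scores = np.asarray(scores, float)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

For example, two identical masks give a Dice score of 1.0, two half-overlapping masks give 0.5, and a classifier that ranks every malignant case above every benign case gives an AUC of 1.0.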
ORGAN SEGMENTATION
GI organ segmentation on radiologic images is a critical but time-consuming task for radiologists to perform manually. Consequently, multiple semi-automatic and fully automatic DL methods for segmenting different GI organs on CT and MRI images have emerged over the past few years.
In a study reported by Gibson et al,[19] the authors proposed a DL-based segmentation model on abdominal CT images that can segment the pancreas, esophagus, stomach, liver, spleen, gallbladder, left kidney, and duodenum more accurately than previously reported DL methods. In another study, Xue et al[20] developed a cascade multitask 3D convolutional network to address the pancreatic segmentation challenges on CT images that outperformed state-of-the-art methods. In a study from China, researchers developed a 2.5D CNN to automatically segment the pancreas on 3D CT images, with a DSC of 86.21 ± 4.37%, sensitivity of 87.49 ± 6.38%, and specificity of 85.11 ± 6.49%.[21] In another study reported by Zheng et al,[22] the authors proposed a 2D DL-based iterative pancreas segmentation workflow to delineate the uncertain regions on pancreatic MRI images. This method improved pancreatic segmentation by addressing its main challenges, especially the areas critical for pancreatic cancer diagnosis, which include the pancreatic head, tail, boundaries, and disconnected topology. Furthermore, Panda et al[23] developed a highly accurate and precise 3D CNN model for fully automated segmentation of the pancreas on a large CT scan dataset. A limitation of this study was that the proposed segmentation method was for the morphologically normal pancreas rather than the diseased pancreas.
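The "2.5D" approach mentioned above generally refers to feeding a 2D network each axial slice together with its neighboring slices as extra input channels, so some through-plane context is retained at 2D computational cost. A minimal sketch of how such inputs might be assembled (an illustration of the general idea only; `make_25d_stacks` and its parameters are our own naming, not the cited authors' code):

```python
import numpy as np

def make_25d_stacks(volume, n_adjacent=1):
    """Build 2.5D network inputs from a (depth, H, W) CT volume:
    each axial slice plus its n_adjacent neighbors on either side
    become the input channels. Edge slices are padded by repetition."""
    padded = np.pad(volume, ((n_adjacent, n_adjacent), (0, 0), (0, 0)),
                    mode="edge")
    k = 2 * n_adjacent + 1  # channels per stack
    return np.stack([padded[i:i + k] for i in range(volume.shape[0])])
```

For a 5-slice volume with `n_adjacent=1`, this yields five 3-channel stacks, where stack *i* contains slices *i-1*, *i*, and *i+1*.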
In 2020, Kim et al[24] proposed a CNN model for abdominal multiorgan segmentation, including the liver, stomach, duodenum, and kidneys, from 3D-patch–based CT images. The average segmentation time without automation was 22.6 minutes, compared to 7.1 minutes with automation using the CNN model. The proposed CNN also showed comparable accuracy to radiographers, especially for the liver and kidneys. However, the CNN model showed a relatively weak performance for the stomach and duodenum owing to their low-intensity gradients. One of the limitations of the study was using only the time reduction parameter to show the efficiency of CNN auto-segmentation.[24]
Furthermore, DL models have also been developed to automatically segment different GI organs on MRI images. Nainamalai et al[25] assessed the applicability of two automated DL algorithms for liver parenchyma segmentation on contrast-enhanced T1-weighted MR images to generate patient-specific models that aid clinicians in planning surgeries. They achieved a Dice score of 0.9696 and concluded that around one-third of the segmentations produced by the DL algorithms required no manual correction.
Fu et al[26] performed a retrospective study on 3D MRI images and proposed a CNN model to automatically segment the liver, kidneys, stomach, duodenum, and bowel, which showed good accuracy. In another study, Chen et al[27] developed a CNN model with a high accuracy for fully automated multiorgan segmentation on abdominal MRI images, including the liver, spleen, pancreas, left and right kidneys, stomach, duodenum, small intestine, spinal cord, and vertebral bodies. One limitation of the study is the small testing dataset with only 20 cases.
LESION DETECTION AND DIAGNOSIS
The integration of DL models in imaging modalities has shown significant promise in enhancing diagnostic accuracy for various GI conditions and cancers. Studies have demonstrated that DL models can often outperform traditional methods in certain aspects.
A study from Japan assessed the performance of a CNN-based DL identification system for esophageal cancer from CT images, which was able to detect the cancer with higher accuracy than radiologists.[28] The CNN-based system demonstrated a diagnostic accuracy of 84.2%, a specificity of 90%, and a sensitivity of 71.7%. Another study by Lin et al[29] compared the diagnostic efficacy of a DL-based model built on noncontrast CT imaging with the performance of radiologists in detecting esophageal cancer. They retrospectively trained the DL model on noncontrast CT images from a group of patients with pathologically confirmed esophageal cancer and from another group of healthy individuals with endoscopy-proven esophageal cancer–free status. The model achieved an accuracy of 88.2%, which was higher than that of junior radiologists but lower than that of senior radiologists.[29] In this study, a complex case featuring a 77-year-old man with T1 stage esophageal cancer was overlooked by the radiologists but was successfully identified by the DL model. However, a significant limitation of this study was that most patients with early-stage esophageal cancer who were enrolled in the study were identified by neither the DL model nor the radiologists. Further studies are needed to determine whether implementing this model will improve patient care.
DL has been applied in different imaging modalities to aid in the detection of hepatocellular carcinoma (HCC). Xu et al[30] developed an AI pipeline with an integrated CNN segmentation model using a large-scale ultrasound dataset, which was able to detect liver masses with an AUC of 0.990 (95% CI, 0.986–0.992), a sensitivity of 95.1%, and a specificity of 96.1%. However, owing to the limited number of pathology results, this model was tested on only three subtype classifications for benign and malignant masses. Besides ultrasonographic imaging of the liver, Ghoniem[31] proposed a DL algorithm for the detection of liver masses on CT images. This method achieved Dice scores of 0.968 and 0.97 on two different datasets, outperforming previous segmentation methods. Considering the different primary and metastatic liver masses, the differentiation of liver tumors is a fertile ground for the application of AI.[32] Hu et al[33] developed a contrast-enhanced ultrasound (CEUS)–based model that achieved higher accuracy in the differentiation of liver masses than resident physicians and was found to be on par with the performance of hepatic CEUS experts. DL has also been used in CT imaging to differentiate benign and malignant liver masses; Midya et al[34] developed a deep neural network to differentiate liver tumors using single portal venous phase contrast-enhanced CT images. This model identified four types of liver masses: intrahepatic cholangiocarcinoma, HCC, colorectal liver metastasis, and benign masses, with an accuracy of 96.27%.[34] Despite the promising results, this study is limited by the small sample size of benign tumors and by selection bias toward patients who had undergone surgical intervention for liver masses. Another recent study evaluated the use of DL to differentiate the primary source of liver metastases.
The DL radiomics-based model in this study outperformed ML models in identifying the source of liver metastasis, with an AUC of 0.907 and 0.809 for digestive and nondigestive tract malignancies, respectively.[35] One more potential benefit of DL is exposing the patients to less radiation. In one study, three-phase CT images yielded a similar diagnostic accuracy to a four-phase protocol in distinguishing HCC from other focal liver lesions.[36]
In addition to the detection and differentiation of malignancies, DL has a promising future in grading the stages of liver fibrosis. Lee et al[37] developed a deep convolutional neural network–based model for the classification of liver fibrosis, using ultrasonography. The model achieved an accuracy of 83.5% and 76.4% on the internal and external datasets, respectively.
In terms of the application of DL in biliary tract imaging, Obaid et al[38] used four different CNN models for the detection of different gallbladder pathologies (including cholecystitis, gallstones, perforation, and gallbladder wall thickening) on ultrasound imaging. Of these four models, MobileNet achieved the highest accuracy, 98.35%. DL has also been used to differentiate intrahepatic cholangiocarcinoma from HCC; Ponnoprat et al[39] achieved an accuracy of 88% in differentiating intrahepatic cholangiocarcinoma and HCC on multiphase abdominal CT, using a machine learning model that integrates a CNN.
Another application of DL is the automated detection of pancreatic cancers on CT images. A retrospective study from Taiwan, which included two local datasets and one dataset from the United States, used a CNN model to distinguish cancerous from noncancerous pancreatic tissues on CT images.[40] The model showed acceptable accuracy and higher sensitivity (98.3%) than radiologists (92.9%). Additionally, the CNN missed 3 (1.7%) of 176 pancreatic cancers, whereas radiologists missed 12 (7%) of 168, of which 11 (92%) were correctly classified by the CNN.[40] However, the CNN-based model's higher sensitivity should be interpreted cautiously, as radiologists had additional clinical information when interpreting CT images, whereas the CNN model had access only to the CT images. Additionally, the study used a small dataset of Asian participants from one institution, so it may not cover the full range of pancreatic cancer CT-imaging manifestations. Zhu et al[41] also used a DL model to analyze CT images from 303 patients with pancreatic cancer and 136 controls and achieved 94.1% sensitivity and 98.5% specificity in detecting pancreatic cancer. One study from the Netherlands developed a DL model to automatically detect pancreatic ductal adenocarcinoma (PDAC) on CT images.[42] The proposed model achieved a maximum AUC of 0.914 in the external test set and 0.876 for the subgroup of tumors smaller than 2 cm, suggesting that it can assist radiologists in diagnosing PDAC. However, the model was trained with a relatively low number of patients and included only tumors in the pancreatic head.
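The sensitivities quoted above follow directly from the miss counts, since sensitivity is the fraction of true positives detected. A quick check with the counts reported for the Taiwanese study (our own arithmetic):

```python
def sensitivity(missed, total_positives):
    """Sensitivity (recall) = detected positives / all positives,
    computed here from the number of missed cancers."""
    return (total_positives - missed) / total_positives

cnn_sens = sensitivity(3, 176)           # CNN: 3 of 176 cancers missed -> ~98.3%
radiologist_sens = sensitivity(12, 168)  # radiologists: 12 of 168 missed -> ~92.9%
```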
In a recent study in 2023 that included 6239 patients across 10 centers, the DL model showed an AUC of 0.986–0.996 for lesion detection on noncontrast CT images, outperforming the mean radiologist performance by 34.1% in sensitivity and 6.3% in specificity for PDAC identification.[43] Moreover, for lesion detection in a real-world multiscenario validation consisting of 20,530 patients, the DL model showed a sensitivity of 92.9% and specificity of 99.9%. The DL model was trained by using multicenter data but mainly from the East Asian population and hospitals. Therefore, further validation in external real-world centers, international cohorts, and prospective studies is necessary.
Another application of DL in radiologic imaging is detecting small-bowel obstruction (SBO). Kim et al[44] developed a DL model to identify SBO on plain abdominal radiographs with high accuracy, achieving an AUC of 0.961, sensitivity of 91%, and specificity of 93%. However, the clinical relevance of these findings is limited because abdominal radiographs are less sensitive than CT scans for diagnosing SBO. A single abdominal x-ray image should never be the sole method to diagnose SBO and must be correlated with the clinical context. Furthermore, Cheng et al[45,46] used CNN to analyze abdominal radiographs for diagnosing SBO. In their first study, 2210 abdominal radiographs were evaluated by using a CNN diagnostic model and showed a sensitivity of 83.8% and specificity of 68.1%.[45] In their second study, abdominal radiographs were increased to 7768, which resulted in the increase of diagnostic sensitivity and specificity to 99.4% and 99.9%, respectively.[46] These two studies suggested that expanding the number of radiographs has significantly improved the accuracy of SBO detection by CNNs.[45,46] Another recent retrospective cohort study from South Korea developed a 3D CNN model to detect high-risk patients with acute SBO, based on abdominal CT images.[47] The study analyzed 578 cases from 250 normal subjects, 209 high-grade SBOs, and 119 low-grade SBOs, based on over 38,000 CT images. The 3D CNN model detected high-risk acute SBO with high efficiency and accuracy. However, the study was done in a single tertiary center with a potential selection bias. Additionally, the study findings are relatively weak in terms of usability and generalizability because the model lacks external validation.
DL has also been studied in the detection of colorectal polyps and cancers. One study from Germany that used DL was able to differentiate between benign (regular mucosa or hyperplastic) and premalignant (adenoma) colorectal polyps detected on CT colonography, with histopathology as the reference standard, achieving an AUC of 0.83.[48] Limitations of this study were the small sample size and its applicability only to polyps detected clearly on CT colonography, resulting in a potential selection bias. Optical coherence tomography (OCT) has been used to diagnose retinal diseases, but recent research suggests it could also detect colorectal cancer (CRC). A study from Washington University in Saint Louis reported the first study to differentiate normal colorectal tissues from neoplastic ones by using DL-based pattern recognition OCT.[49] The proposed CNN model achieved 100% sensitivity and 99.7% specificity with an AUC of 0.998, suggesting a promising potential for real-time diagnosis of CRCs. Although this study showed promising results, the DL model was tested on a limited number of tumors that had received radiation and chemotherapy treatments. Table 1 summarizes multiple studies on DL application in the detection of different GI lesions on radiologic images.
PREDICTION OF RESPONSE TO THERAPY
AI is being increasingly used in radiologic imaging to predict the response of various GI conditions to therapy. DL models can analyze large volumes of radiologic imaging data to identify patterns and features that may not be discernible to the human eye. These DL models can predict how patients with various GI conditions, such as esophageal, gastric, pancreatic, and colorectal cancers, respond to different therapies.
Advancements in pattern recognition, medical image analysis, and AI have laid the groundwork for the rapid evolution of radiomics, particularly in applications to esophageal carcinoma.[16] Several studies have incorporated PET and CT radiomics features with deep convolutional neural networks to predict treatment response in patients with esophageal carcinoma.[17,18] Several limitations of these current studies need to be highlighted. A significant challenge in this research area is the creation of a comprehensive public database with ample annotated medical imaging data to train numerous parameters in neural networks. Such a database would greatly enhance the ability to provide clinically relevant features, leading to improved model performance. Additionally, the limited number of tumor volumes available for these studies could result in overfitting, which might impair the algorithm’s ability to generalize to new, unseen test examples. Another group of researchers developed a pretreatment CT-based DL model for predicting the response to chemoradiation therapy in esophageal carcinoma.[23] This model was able to accurately and consistently stratify patients by their objective response to treatment, demonstrating its potential to guide clinical decisions and bridge the gap between precision medicine and biological imaging data.[23] A major challenge for this study was that the inner workings of the DL algorithm remain largely opaque, functioning as a “black box.” The processes of feature extraction and selection are automated and not transparent. This presents a broader challenge for physicians, who may find it difficult to intuitively and clearly interpret the results produced by the model. Consequently, understanding the reasons behind these results and determining whether specific features have a positive or negative impact can be quite challenging.
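Radiomics pipelines such as those referenced above quantify a segmented image region into handcrafted features before (or alongside) DL modeling. A minimal sketch of first-order feature extraction from a region of interest (our own illustration of the general process; production toolkits such as PyRadiomics compute far richer, standardized feature sets):

```python
import numpy as np

def first_order_radiomics(image, mask):
    """Compute a few illustrative first-order radiomics features
    (intensity statistics) over the voxels inside a binary ROI mask."""
    roi = np.asarray(image, float)[np.asarray(mask, bool)]
    counts, _ = np.histogram(roi, bins=32)
    p = counts[counts > 0] / roi.size  # bin probabilities for entropy
    return {
        "mean": roi.mean(),
        "std": roi.std(),
        "skewness": ((roi - roi.mean()) ** 3).mean() / roi.std() ** 3,
        "entropy": float(-(p * np.log2(p)).sum()),
    }
```

In a full pipeline, hundreds of such features (shape, texture, wavelet-filtered statistics) would be extracted per tumor and then fed to a predictive model, which is why large annotated databases matter so much for avoiding overfitting.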
Researchers from China developed a DL model, Nomo-LDLM-2F, which uses multilesion and time series CT images to predict the benefits of anti-HER2 targeted therapy in stage IV gastric cancer.[50] The study found that patients with a low Nomo-LDLM-2F score derived significant survival benefits from anti-HER2 targeted therapy as compared to those with a high score (all p < 0.05). Thus, the Nomo-LDLM-2F score, based on these imaging techniques, shows promise for effectively predicting overall survival in patients with HER2-positive stage IV gastric cancer who are undergoing this targeted therapy.[50] Another study aimed to identify the diagnostic accuracy of an AI model and compare its performance against clinical methods and imaging methods (IMs) through a systematic review of head-to-head comparative studies.[51] Results from nine studies involving 3313 patients showed that the AI model had a pooled sensitivity of 0.75 and specificity of 0.77. The AI model was found to have higher sensitivity but lower specificity than the IM in most subgroups. The study concluded that AI is a viable tool for predicting the response of patients with gastric cancer to neoadjuvant chemotherapy. Particularly, CT-based DL models were effective in extracting tumor features and predicting responses.[51] However, in this review, there were three times as many male participants as female participants, with all articles originating from Asia. This uneven representation may have led to a model that is better suited to a specific population cohort, potentially introducing selection bias. Additionally, differences in baseline characteristics across the articles—such as cutoff values, cohort size and type, and gold standards—required extensive subgroup analysis to assess their impact on the stability of the conclusions. 
Moreover, none of the included articles reported the cutoff values, and most did not explain how these values were determined, which undermines the reliability of the diagnostic conclusions.
DL models have also been studied to predict the response of pancreatic tumors to neoadjuvant therapy. Watson et al[52] reported the first study to examine the ability of a DL model to predict pancreatic adenocarcinoma response to neoadjuvant therapy, based on pretreatment and posttreatment CT images. The pure CNN model achieved an AUC of 0.738, which increased to 0.785 when they incorporated a hybrid model that accounted for a 10% decrease in carbohydrate antigen (CA) 19-9 levels. Interestingly, the pure CA 19-9 model yielded an AUC of 0.564 for response prediction, suggesting that CA 19-9 alone performs essentially no better than random guesswork. However, this single-institution study included only patients who underwent resection, resulting in a potential selection bias. In 2023, Shao et al[53] reported the first study to evaluate the feasibility of DL models in predicting the efficacy of neoadjuvant chemotherapy in pancreatic cancer, based on contrast-enhanced ultrasound videos. The DL models showed high performance with AUCs of 0.892 and 0.908. They were also able to accurately classify over 40% of the original videos that were misclassified by the naked eye.[53]
Another application of DL models is detecting CRC response to therapy. Zhang et al[54] developed a DL model to predict rectal cancer response to neoadjuvant chemoradiotherapy from diffusion kurtosis and T2-weighted MRI images. The study found that radiologists had a higher error rate than the DL model for predicting pathologic complete response (26.9% and 24.8% for raters 1 and 2, respectively, versus 2.2% for the DL model). In addition, this study found that radiologists had lower error rates (13% and 14% for raters 1 and 2, respectively) with the assistance of the DL model. The main limitation of the study is that the DL model was based on data from a single center, so multicenter investigation with a large dataset is needed to generalize and validate the DL model. In another study, Shi et al[55] used radiomics and a CNN based on pretreatment and mid-radiation follow-up MRI images to predict rectal cancer pathologic response to neoadjuvant radiation therapy. The study included 51 patients: 45 with pretreatment MRI, 41 with mid-radiation therapy MRI, and 35 with both sets. Using radiomics to predict pathologic complete response (pCR) versus non-pCR, the AUC was 0.8 for pretreatment, 0.82 for mid-radiation, and 0.86 for both MRI sets. Using the CNN to predict pCR versus non-pCR, the AUC was 0.59 for pretreatment, 0.74 for mid-radiation, and 0.83 for both MRI sets. The CNN model’s prediction accuracy was lower than that of radiomics owing to a case number insufficient for training.[55] In another study that analyzed 1028 patients with metastatic CRC, Lu et al[56] developed a DL model that characterized tumor morphologic and size changes on CT images in response to anti–vascular endothelial growth factor therapies. Table 2 summarizes multiple studies on DL applications in the prediction of GI lesions’ response to therapy using radiologic imaging.
PREDICTION OF PROGNOSIS AND SURVIVAL
DL models have shown great promise in predicting the prognosis and survival of various GI cancers, including esophageal, gastric, pancreatic, and colorectal. In addition, DL models have been studied to predict the severity of GI conditions such as Crohn disease and acute pancreatitis.
A study from Taiwan trained a 3D CNN model with PET scans to predict esophageal cancer prognosis with acceptable accuracy (AUC = 0.738).[57] One limitation of this study is the inherent opacity of DL networks. The processes of defining, extracting, and selecting features are automated and occur implicitly, making the imaging characteristics they assess quite obscure. This lack of transparency contrasts sharply with the clearly defined radiomic features developed by experts. For instance, morphometric measures such as core muscle size from cross-sectional imaging have been linked to sarcopenia, which independently predicts clinical outcomes in various GI cancers. In contrast, CNN classification might rely on features related to tumors or nontumors. Another multicenter retrospective study involved 411 patients with pathologically confirmed esophageal squamous cell carcinoma (ESCC) from two hospitals.[58] Researchers developed radiomics signatures for each feature set and combined these with clinical risk factors to create multiple radiomics models. The results showed that the integrated radiomics model outperformed traditional clinical factors and existing methods, achieving C-statistics of 0.875, 0.874, and 0.840 in the development, internal validation, and external validation cohorts, respectively. Nomogram and decision curve analysis confirmed the model’s clinical utility, indicating it can effectively predict lymph node metastasis in patients with ESCC and assist in treatment planning.[58] This study was limited because the authors used 2D features extracted from the maximum tumor cross-section rather than 3D features. Additionally, previous studies have demonstrated that genetic events, such as ZNF750 mutations, are linked to metastasis in patients with ESCC. In the future, when genetic data become available, incorporating these gene markers could enhance the predictive value of the model.
Survival prediction for esophageal cancer is crucial for personalized treatment planning, but traditional methods relying on handcrafted features from medical images are limited by incomplete medical knowledge, resulting in poor predictions. To overcome this, a novel DL-based framework has been proposed, featuring a 3D coordinate attention convolutional autoencoder (CACA) and an uncertainty-based jointly optimizing Cox model (UOCM). The CACA captures latent representations and encodes 3D spatial characteristics with precise positional information, whereas the UOCM jointly optimizes the autoencoder and survival prediction tasks to model interactions between patient features and clinical outcomes, predicting reliable hazard ratios. Extensive experiments on a dataset of 285 patients with esophageal cancer demonstrated that this method achieved a C-index of 0.72, outperforming existing state-of-the-art methods.[59]
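Because several of the survival models above are compared by C-index, the metric is worth making concrete: it is the fraction of comparable patient pairs whose predicted risks are ordered consistently with their observed survival times. A minimal sketch of Harrell's C-index for right-censored data (our own illustration, independent of the cited frameworks):

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Harrell's C-index. A pair (i, j) is comparable when the patient
    with the shorter time had an observed event (events[i] is True);
    the pair is concordant when that patient also has the higher
    predicted risk. Ties in risk count half."""
    times = np.asarray(times, float)
    events = np.asarray(events, bool)
    risk = np.asarray(risk_scores, float)
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")
```

A model that assigns strictly higher risk to every earlier-failing patient scores 1.0; random risk assignment scores about 0.5, which frames results such as the 0.72 reported above.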
Because current staging methods fall short in accurately predicting the risk of relapse in patients with gastric cancer, Jiang et al[60] devised a DL-based prognostic tool using preoperative CT images to predict disease-free and overall survival and the potential benefits of adjuvant chemotherapy. The introduced AI model demonstrated enhanced prognostic precision, offering a valuable means to identify patients who are most likely to benefit from adjuvant chemotherapy. This study had several limitations. First, as a retrospective study, it is subject to inherent biases and unknown confounders, despite including many patients from two centers and adjusting for common clinicopathologic factors. Second, the use of adjuvant chemotherapy was not randomized; patients might have declined chemotherapy owing to personal preferences, concerns about side effects, or financial reasons. To address potential selection bias, a propensity score matching strategy was used. Finally, the model was developed and validated by using data from East Asian patients, and its generalizability to Western populations has yet to be established. Huang et al[61] similarly developed a CNN model for the diagnosis of occult peritoneal metastasis in patients with advanced gastric cancer, using preoperative CT imaging. This DL model exhibited a diagnostic accuracy surpassing that of a traditional clinical model constructed through logistic regression for comparison.[61] In this study, however, retrospective datasets were used to develop the CNN model, and a relatively small number of clinical factors were examined. Other factors, such as serologic tumor markers, which are not available from CT scans alone, were not examined and may account for gaps in the model’s performance.
Another domain where DL was used is assessing the prognosis of HCC. For example, CEUS-based DL radiomics models have been developed to predict the prognosis of HCC that had been treated with radiofrequency ablation (RFA) or surgical resection (SR).[62] When applied to predict 2-year progression-free survival, these models achieved AUCs of 0.820 and 0.815 in training and validation cohorts for RFA, as well as 0.863 and 0.828 for SR.[62]
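The AUC values cited throughout this review can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. A minimal illustrative computation of that probability (not code from any cited study):

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank interpretation: the probability that a random
    positive outranks a random negative, counting ties as half-wins."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this reading, the CEUS-based models' AUCs of roughly 0.82 to 0.86 mean the models rank a progressing patient above a progression-free one about four times out of five.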
In 2023, Vezakis et al[63] developed a 3D CNN model to predict the prognosis of PDAC. The researchers trained the model using an external dataset to generate automated pancreas and tumor segmentations. They extracted radiomics features from CT images of 40 patients with PDAC. These features, along with the TNM system staging parameters and the patient’s age, were combined to predict survival. The results were promising, with a mean C-index of 0.731 for survival modeling and a mean accuracy of 0.76 in predicting 2-year survival. A potential limitation of this study is that the fully automated model may result in suboptimal radiomics feature extraction due to segmentation inaccuracies. Another limitation is the small dataset from a single institution. In another study, Chang et al[64] also proposed a 3D CNN to predict the presence of lymph node metastasis and the postoperative positive margin status, based on 881 preoperative CT scans of patients with PDAC. The study showed an accuracy of 90% for per-patient analysis and 75% for per-scan analysis for lymph node metastasis prediction, whereas for postoperative margin prediction, the study showed an accuracy of 81% for per-patient analysis and 76% for per-scan analysis. The major limitation of the study is that it included a small sample size of 110 patients and most were Caucasian, which makes the study less representative of the general PDAC population. DL models have also been studied for predicting the severity of acute pancreatitis. One recent study from China developed a combined clinical and imaging DL model that was able to accurately predict the severity of mild (AUC = 0.820) and severe (AUC = 0.920) acute pancreatitis by using nonenhanced CT images in a cohort of 783 patients.[65]
Meng et al[66] developed a DL model to classify bowel fibrosis in patients with Crohn disease, using CT enterography images. The study included 312 bowel segments of 235 patients with Crohn disease and compared the performance of the DL model to that of a radiomics model and two radiologists. The DL model accurately distinguished none to mild from moderate to severe bowel fibrosis, with an AUC of 0.811 in the total test cohort. In addition, the DL model's performance was superior to that of the radiologists (AUCs of 0.579 and 0.646) and not inferior to that of the radiomics model (AUC of 0.813), with a much shorter total processing time than the radiomics model. However, most of the enrolled patients had moderate to severe bowel fibrosis, raising the potential of selection bias.
Accurate identification of lymph nodes on MRI is crucial for proper CRC staging, but it can be time-consuming. Several DL models have been studied for staging CRCs. Two studies from China reported the use of a faster region-based CNN (Faster R-CNN) model for identifying metastatic lymph nodes of rectal cancers in pelvic MRI images.[67,68] Their CNN model provided lymph node staging similar to that of radiologists, but the model’s average diagnostic time was 20 seconds compared to 600 seconds by radiologists. These results indicate that the Faster R-CNN model enables accurate and rapid diagnosis of lymph node metastasis in rectal cancer. Another study from China developed an automated lymph node detection and segmentation DL model based on multiparametric MRI images.[69] The study demonstrated that this model can accurately identify and segment lymph nodes from MRI images with higher performance than that of junior radiologists. In addition, it showed that the model’s average lymph node detection and segmentation time was 1.3 seconds compared to 200 seconds by radiologists.
Accurate prediction of circumferential resection margins (CRMs) is also essential for predicting the prognosis of CRC, as CRM invasion is associated with poor survival. In a single-center retrospective study, Wang et al[70] reported the use of the Faster R-CNN model to identify rectal cancer–positive CRMs on high-resolution MRI images. The study showed that the Faster R-CNN model can automatically recognize an image in 0.2 seconds, with high accuracy and feasibility. However, as it is a single-center study with insufficient data, the DL model's accuracy could not be validly assessed, including its ability to identify lymph nodes or extramural vascular invasion. Peritoneal carcinomatosis is considered a terminal stage in CRC with poor outcomes, so early diagnosis is essential to start treatment. In another study, a CT-based DL algorithm was developed and showed great potential for predicting synchronous peritoneal carcinomatosis in patients with CRC, with high accuracy, sensitivity, and specificity.[71] The limitations of the study included the small sample size and potential bias, as all the data were collected from a single center in East Asia. Table 3 summarizes multiple studies on DL application in the prediction of GI lesions’ prognosis and survival using radiologic images.
MUTATION STATUS
AI models have been studied to determine the Kirsten rat sarcoma viral oncogene (KRAS) mutation, DNA mismatch repair (MMR), and microsatellite instability (MSI) status in patients with CRC without the need for invasive biopsies. He et al[72] conducted a study to determine if a CT-based DL model could predict KRAS mutation status in colorectal cancer. The DL model had the potential to noninvasively estimate the presence of KRAS mutations, with an AUC of 0.9. In another recent study, Cao et al[73] developed a DL model to predict the MMR status in patients with CRC, based on pretreatment CT images, which showed promising results with an AUC of 0.986 in the internal validation cohort and an AUC of 0.915 in the external validation cohort. Zhang et al[74] proposed MRI-based DL models to predict MSI in patients with rectal cancer. The imaging-based DL models were compared to the performance of a clinical model that recognizes MSI status using only clinical factors. The pure imaging-based DL model and the combined DL model correctly classified 75% and 85.4% of MSI status, with AUC values of 0.820 and 0.868, whereas the clinical model classified 37.5% of MSI status with an AUC value of 0.573. Table 4 summarizes two studies on DL applications in the prediction of CRC mutation status, using radiologic images.
CONCLUSION
DL has enhanced the diagnostic applications of radiologic imaging in various GI conditions, including organ segmentation, lesion detection, prognosis prediction, and assessment of response to therapy. For lesion detection, DL has been studied to detect esophageal, gastric, hepatic, pancreatic, and colorectal cancers. Additionally, DL models demonstrated potential in differentiating benign from malignant liver masses and regular mucosa or hyperplastic polyps from adenomas. DL models also help predict various GI lesions’ responses to different types of treatment, including surgery, radiotherapy, chemotherapy, and immunotherapy. DL has also shown promising results in predicting the prognosis, patient survival, and severity of various GI conditions such as pancreatic and colorectal cancers. It can also help predict mutation status in CRC, based on radiologic imaging. Among the main recurring limitations to DL application in imaging were the need for large datasets to validate results, the potential for selection bias, and the fact that training DL and ML models relies on annotated data to define the ground truth, which leaves the imaging characteristics they assess relatively opaque. Finally, the legal and ethical implications of using DL models in clinical decision-making remain a concern. Radiologists must take responsibility for clinical decisions and cannot delegate this responsibility to AI developers. Despite some DL applications outperforming radiologists in certain tasks, the ultimate legal accountability for any diagnostic errors made by these applications rests with radiologists and clinicians. Further research is necessary to address potential challenges, errors, biases, and ethical considerations associated with the implementation of DL in radiologic imaging of GI diseases.
References
Competing Interests
Source of Support: None; Conflict of Interest: None.