Learning-Based Stopping Power Mapping on Dual-Energy CT for Proton Radiation Therapy

Abstract Purpose Dual-energy computed tomography (DECT) has been used to derive relative stopping power (RSP) maps by obtaining the energy dependence of photon interactions. The DECT-derived RSP maps could potentially be compromised by image noise levels and the severity of artifacts when using physics-based mapping techniques. This work presents a noise-robust learning-based method to predict RSP maps from DECT for proton radiation therapy. Materials and Methods The proposed method uses a residual attention cycle-consistent generative adversarial network to bring DECT-to-RSP mapping close to a 1-to-1 mapping by introducing an inverse RSP-to-DECT mapping. To evaluate the proposed method, we retrospectively investigated 20 head-and-neck cancer patients with DECT scans in proton radiation therapy simulation. Ground truth RSP values were assigned by calculation based on chemical compositions and acted as learning targets in the training process for DECT datasets; they were evaluated against results from the proposed method using a leave-one-out cross-validation strategy. Results The predicted RSP maps showed an average normalized mean square error of 2.83% across the whole body volume and an average mean error less than 3% in all volumes of interest. With additional simulated noise added in DECT datasets, the proposed method still maintained a comparable performance, while the physics-based stoichiometric method suffered degraded inaccuracy from increased noise level. The average differences from ground truth in dose volume histogram metrics for clinical target volumes were less than 0.2 Gy for D95% and Dmax with no statistical significance. Maximum difference in dose volume histogram metrics of organs at risk was around 1 Gy on average. Conclusion These results strongly indicate the high accuracy of RSP maps predicted by our machine-learning–based method and show its potential feasibility for proton treatment planning and dose calculation.


Introduction
Proton radiation therapy has been one of the emerging treatment modalities that may have better clinical outcomes for a wide range of patients due to favorable dosimetric properties related to the Bragg peak and virtually no exit dose compared with photon

Materials and Methods
Workflow Figure 1 outlines the schematic workflow of our prediction method. The DECT is able to provide 2 CT image datasets acquired with low-and high-energy spectra. For a given series of low and high DECT images and their corresponding RSP maps, the RSP maps were used as the regression target. The use of both energy images gives the model higher fidelity since the CT values at 2 energy levels demonstrate 2 different energy dependences of material attenuation, which indicates material information. During training, 96 3 96 3 32 voxel patches were extracted from low-and high-energy images by a sliding window with an overlap of 80 3 80 3 22 between any 2 neighboring patches. Furthermore, data augmentations were used to enlarge the training data variation. Then, prepared patches were fed into a 3-dimensional (3D) deep learning framework as a 2-channel input and mapped to a 1-channel output patch, which was correlated to training target, that is, RSP image. To enforce a 1-to-1 mapping, a cycle-GAN-based framework was used to introduce an inverse mapping, which took the RSP map as 1-channel input and mapped it to low-and high-energy CT images as 2-channel output. In order to learn the specific differences among these 3 datasets, 6 residual blocks were used as short skip connections in the end-to-end U-Net-based generator architecture. To further highlight significant features that can fully represent these 3 image datasets, 3 attention gates were used for the other 3 long skip connections as shown in the generator architecture in Figure 2. To judge the realism of the predicted maps, a fully convolutional network architecture-based discriminator was used to discriminate the true image from a predicted image. After training, the paired patches of low and high DECT images were extracted from a new arrival patient's DECT dataset and are fed into the trained networks to obtain the predicted RSP patches. Finally, by using patch fusion, the RSP map of a new arrival patient's DECT was predicted.
Network Architecture Figure 2 shows the network architecture of both generators and discriminators used in the proposed method. Both generators share same architecture with different trainable weights. Both discriminators (discriminator of RSP and discriminator of DECT) share same architecture with different trainable weights. We used a traditional fully convolutional network to implement the discriminators [34]. The generator architecture (of both DECT-to-RSP and RSP-to-DECT) is an end-to-end U-Net, including encoding and decoding paths. The encoding path is composed of 3 convolution layers with a stride size of 2 to reduce the feature maps size and several further convolution layers with stride size of 1. The decoding path is composed of 3 deconvolution layers to enlarge the feature map size and match the size of input image (end-to-end), several further convolution layers with stride size of 1, and a tanh layer to perform the regression. To collect features extracted from encoding path and decoding path with same feature map size, 1 short skip connection and 3 long skip connections were used to concatenate equal-sized feature map that is extracted from encoding and decoding convolution layers, respectively. The short connection included 6 residual blocks. The aim of using residual block is to lead the network focusing on the difference between source and target images. The details of implementation can be shown in Harm et al [37]. As shown in Figure 2, the long skip connection was implemented by concatenating these feature maps extracted from the layer of encoding path with same-sized feature maps extracted from the layer of decoding path. Attention gate could capture the most relevant semantic (segment) information without enlarging the receptive field [31]. Since in this work, the target RSP image is close to a semantic image, we proposed integrating attention gate into the long skip connection to highlight the semantic features from feature maps extracted from the previous layer of encoding path. The details of the implementation of attention gate can be found in our previous work [31].

Loss Function
By using the specific design of loss functions for generators and discriminators, the learnable parameters of the proposed network can be learned iteratively via Adam gradient descent minimization of these losses. Because generators and discriminators have different goals in this work (the generator's goal is to generate a synthetic image that can fool the discriminators; the discriminator's goal is to discriminate synthetic from real), in each iteration, the generators and discriminators were optimized in an alternative manner. For generator loss, in order to fool the discriminators, adversarial loss was used. Adversarial loss relies on the output of the discriminators, that is, the distribution of feeding synthetic RSP image (generated from DECT-to-RSP generator G DECT-RSP ) into the discriminator of RSP and the distribution of feeding synthetic DECT image (generated from RSP-to-DECT generator G DECT-RSP ) into the discriminator of DECT. For clarity, we present only formulation for G DECT-RSP .
where I DECT is the DECT 2-channel image and G DECT-RSP (I DECT ) the output of the DECT-to-RSP generator, that is, the predicted RSP. D RSP is the RSP discriminator, which is designed to return a distribution indicating whether a pixel region is real (from RSP) or fake (from predicted RSP). The function SCE (Á,1) is the sigmoid cross entropy between the discriminator map of the predicted RSP and a unit mask.
In addition, to guide the generators to learn a mapping that could be closed to a 1-to-1 mapping, cycle consistency loss was used. In this work, mean absolute error (MAE) and gradient difference error (GDE) were used as a compound loss to calculate the cycle consistency loss of generator [34]. The MAE loss focuses on the accuracy of voxel intensity level. The GDE losses focus on the accuracy of image structure, where k is a balancing parameter that controls the MAE and GDE loss for cycle consistency. G RSP-DECT (G DECT-RSP (I DECT )) is the output of first feeding I DECT into the generator G DECT-RSP and then feeding the output into the generator G RSP-DECT ; namely, the output of this term denotes the cycle RSP. Finally, the optimization of generator is obtained by The optimization of discriminator is obtained by

Data Acquisition
In this study, we retrospectively analyzed the images from 20 patients with squamous cell carcinoma in the head and neck region. The 20 patients included 13 men and 7 women ranging in age from 25 to 89 years. The locations of tumors in these 20 patients included larynx, buccal mucosa, tongue, etc, and 12 of them underwent excisions. Each patient had CT simulation by TBCT in DECT mode with a 110-second delay after 100 mL Omnipaque 300 iodine contrast injected at 2.5 mL/s. Their targets and organs-at-risk (OARs) were delineated by physicians. Institutional review board approval was obtained with no informed consent required for this Health Insurance Portability and Accountability Act-compliant retrospective analysis.
The DECT images were acquired by a Siemens SOMATOM Definition Edge twin-beam CT scanner (Erlangen, Germany) at 120 kVp with the patient in treatment position (pitch: 0.45, rotation time: 0.5 seconds, scan time: around 30 seconds, CTDIvol: around 20 mGy, reconstruction algorithm: SAFIRE, reconstruction kernel: Q30f, tube current ranges from 500 to 650 mA, and metal artifact correction was in use). The 120 kVp x-rays were split into high and low spectra by 0.6 mm tin and 0.05 mm gold filters, yielding high-energy and low-energy scans for reconstruction, respectively. Composed images were also reconstructed from raw projection dataset by disregarding spectral differences. All the above images were reconstructed by Siemens Syngo CT VA48A with spacing 0.98 3 0.98 3 1.5 mm 3 and 512 3 512 pixels of each slice.
The ground truth RSP map of each patient was created by manually classifying the composed DECT images into different materials and assigning corresponding RSPs. In this study, we differentiated 7 types of material as listed in Table 1, each of which has RSP calculated using the chemical compositions published in Su et al [9] and ICRP Publication 23 [36]. Equations and details of calculation were done as previously reported [10]. Note that although the learning target is bulk-assigned RSP maps, the predicted RSP maps are still patient-specific with continuous values.
The cohort of 20 patients was used to evaluate our method using leave-one-out cross-validation. For each test patient, the model was trained by the remaining 19 patients. The model is initialized and retrained for next test patient by training another group of 19 patients. The training datasets and testing datasets were separated and independent during each study. For our training, we used data augmentation and 3D patch-based method to increase training data variation. Flipping, rotation, scaling, and rigid warping were used to enlarge the data size by 72 times. Patch size was set to 96 3 96 3 32.
For comparison, a physics-based dual-energy stoichiometric method was implemented to generate RSP maps as well. We used the Gammex RMI 467 electron density phantom (Sun Nuclear, Melbourne, FL) with the chemical compositions of inserted materials specified by the manufacturer. The RSP value of each rod was calculated from the known chemical compositions using the Bethe equations as reported [10]. The rods were randomly placed throughout the phantom for 5 different scans so an average HU value, relatively independent of positioning, could be obtained. The same CT scanning Table 1. Ground-truth relative stopping power of different materials calculated based on chemical compositions published in ICRP Publication 23 [36]. protocol as used for patient scans was used during the phantom scan. The HU was measured on each rod at both high-and low-energy images and calibrated with the calculated RSP values using equations reported [9]. The calibration was then applied on patient DECT images to generate physics-based RSP maps.

Relative stopping power
To investigate the performance of the proposed method at different noise levels, we simulated additional noise on the original DECT image datasets. On each pixel, random noise was added with a probability distribution of Gaussian with mean ¼ 0 and standard deviation ¼ 0.5%, 1%, 2%, and 5% of the value of that pixel þ 1000 HU, respectively.
To further demonstrate the feasibility of dose calculation using the predicted RSP by the proposed method in proton treatment planning, we compared the difference between dose maps calculated on the ground truth RSP and the predicted RSP using the same plan parameters. Of the 20 patients, 19 had clinical proton treatment plans using pencil beam scanning and optimized with multi-field optimization technique. All plans had 4 to 5 beams and were robustly optimized with 3-mm setup uncertainty in all directions, which was covered by the 3.5% range uncertainty. The average prescription dose to clinical target volumes (CTVs) was 60.2 Gy among all patients. The structures and original treatment plans were duplicated onto the RSP maps of ground truth or prediction for dose calculation using the same algorithm (proton convolution superposition) in Eclipse 13.7 (Varian Medical Systems, Palo Alto, California). For each plan, we visually checked the similarity of the dose distributions calculated on ground-truth RSP and predicted RSP. Quantitatively, clinically relevant dose volune histogram (DVH) metrics were extracted for comparison of dose to CTVs and relevant OARs. For plans with multiple CTVs, the DVH of each CTV was counted separately.
In addition to the patient study, we also included a phantom study to evaluate our method in a controlled condition. We scanned an anthropomorphic phantom (CIRS Adult Male 701) with same scanning protocol for patients with head and neck cancer. Half of the slices (about 150 slices) were trained, and the rest were tested in an alternating pattern. The training slices and testing slices were alternating. The same data augmentation and 3D patch-based method was used to increase training data variation. The phantom mainly contained 3 materials, that is, soft tissue, lung, and bone, and the ground-truth RSP of each material was calculated based on the material information found in the manual.

Image-Quality Metrics
In this study, we evaluated the accuracy of generated RSP maps by comparing it with ground-truth RSP maps. We quantitatively characterized the overall accuracy by normalized mean square error (NMSE) within the patient body. The NMSE measures the average error among all pixels, which can be described as where RSP and RSP 0 are generated and ground-truth RSP maps, respectively, and jj Á jj 2 is the L2 norm. To quantify the accuracy of each material, we calculated the mean error (ME) and NMSE on each material VOI except air. ME quantifies the error of averaged RSP in certain materials, which is defined as, where i indicates the index of the ith pixel in the VOIs of the jth material. Figure 3 shows the anthropomorphic phantom results of our proposed method as well as ground truth. Most RSP differences were in lung and at the boundary of 2 materials. In such case with simple geometry and piece-wise constant RSP distribution, the proposed method achieved NMSE of 0.60% within the patient body, and ME/NMSE of -2.20%/4.08%, 0.31%/0.55%, and -0.10%/0.43% for lung, soft tissue, and bone, respectively.

Results
In Figure 4, the quality of RSP maps is shown using a side-by-side comparison with ground-truth and physics-based results from 1 patient without additional simulated noise. By the proposed method, the RSP maps maintained comparable resolution, contrast, and most of the details as the ground truth. Errors were observed due to misclassification between adipose and muscle, as well as between spongiosa bone and cortical bone, in addition to boundaries between different materials (column g). Compared with the results from the proposed method, physics-based results showed a larger error in lungs (e3). It also showed degraded quality caused by noise and artifacts. For example, a large discrepancy in (e3) can be seen around the corresponding region of streaking artifacts in (a3) and (b3).
Similarly, an exemplary result with additional 5% simulated noise is also shown in Figure 5. RSP maps generated by the proposed method and their corresponding profiles ( Figure 6, C2) remained close to the results without additional noise, while those generated by the physics method were highly affected by the noise and demonstrated severe inaccuracies (column b). The averages of measured NMSE within the patient body at different noise levels among all 20 patients are summarized in Table 2. The proposed method maintained a comparable accuracy when noise increased from 0% to 5%, while the physicsbased method showed severe degradation in performance.
The averages of measured NMSE and ME within different VOIs at different noise levels among all 20 patients are summarized in Table 3. With several exceptions of lower but comparable NMSEs in adipose and muscle at the lowest noise level, overall, our method outperformed the physics-based method in NMSE and ME. When the noise level increased, physicsbased results were more affected in accuracy, while the accuracy of the proposed method was successfully maintained. Thus, the advantage of the proposed method over the physics-based method was more prominent at higher noise levels, which indicates greater robustness to noise.   Figure 7 compares the calculated proton dose maps at selected axial, sagittal, and coronal planes of a patient as an example in the case of no additional noise simulated. The dose maps calculated on the ground truth qualitatively appeared to be more similar on those of the predicted RSP maps by the proposed method than by the physics-based method. Most dose errors of the proposed method were around the distal end of the beams, while that of the physics-based method happened at all the high dose-gradient regions to a much wider and severe extent. The gamma index maps (3%/3 mm) also showed more failed areas in physics-based results than our results. The gamma passing rates (threshold ¼ 10%) for the physics-based method were 79.67%, 83.03%, and 91.43% for the axial, coronal, and sagittal planes at iso-center, and 97.81%, 97.75%, and 98.92% for the proposed method at the corresponding planes.
The differences in DVH metrics of clinically relevant OARs between ground truth and predicted RSP among all patients are summarized in Table 4. For the proposed method, the mean differences were less than 0.2 Gy for CTVs with no statistical significance. A 2-tailed paired t-test was performed to compare the mean values of each DVH metric between ground truth and predicted RSP maps. It was shown that DVH metrics had no significant difference in all OARs except esophagus and brainstem with overestimation less than 1.2 Gy. Compared with the results by the physics-based method, the results by our method had less variation in most of the DVH metrics.

Discussion
In this study, we proposed a novel machine-learning based method to predict RSP maps from DECT for proton radiation therapy. As an alternative to the current physics-based method, our proposed method aims to provide accurate RSP values with more resistance to noise. We evaluated the accuracy of predicted RSP maps using our method in the context of patients with head-and-neck cancer. The RSP maps showed an average NMSE of 2.83% across the whole body volume and average ME less than 3% in all VOIs. With additional simulated noise added in DECT datasets, the proposed method still maintained comparable performance, while the physics-based method suffered from degraded inaccuracy with increased noise level. Based on the statistical analysis of comparative DVH metrics of dose maps calculated on ground truth and predicted RSP maps among 19 pencil beam scanning proton treatment plans, we showed that the average differences in DVH metrics for CTVs were less than 0.2 Gy for D95% and Dmax with no statistical significance. Maximum difference in DVH metrics of OARs was around 1 Gy on average. These results strongly indicated the high accuracy of RSP maps predicted by our machine-learning-based method and showed potential feasibility for proton treatment planning and dose calculation.
In this study, we proposed a method to generate RSP maps by learning from DECT image datasets and the corresponding ground-truth RSP. Note that the patients' true RSP values were unavailable. The ground truth of this study was then assumed by assigning calculated RSP values to manually segmented regions of materials on patient DECT images, where the calculation was based on the known chemical composition. Such a material assignment method is used in current proton and photon studies when the required material information cannot be readily or accurately derived from CT simulation images [9,38,39]. Although the learning target was bulk-assigned RSP maps, the predicted RSP maps were still patient-specific with continuous values, preserved fine image textures and contrasts. Potential error may have been caused by the inaccuracy in segmentation and inter-and intrapatient variation in composition, which could be a limitation in the implementation of this study. Another limitation is that the training and testing images in the anthropomorphic phantom study, although from different slices, were still very similar in terms of geometry and material.  (1) and (2) are the profiles of results in Figure 3 (without simulated noise) and 4 (5% additional simulated noise), respectively. Testing on a different anthropomorphic phantom with different materials and reasonable geometry variation would be helpful in investigating the fundamental correlation between learning algorithm and physical artifacts in DECT images under a well-controlled condition. However, it is worth noting that this study did not aim to evaluate the absolute error of the predicted RSP from patients' true RSP values which were unavailable. Instead, we demonstrated the performance of the proposed method by evaluating its discrepancy of the prediction with its training dataset. Such performance should remain at a similar level when the training dataset is selected differently. With the prediction performance maintained, a better knowledge of RSP in the training stage, such as more and finer types of materials segmented and assigned on a patient, differentiation of the same type of tissue among diverse demographics [40], or estimation from animal tissue models [41,42], would directly lead the predicted results to be closer to physical reality. Machine-learning-based methods are relatively new for DECT RSP map prediction. Only a few existing studies have relied on patient data, rather than phantom data. The RSP prediction accuracy by our method was competitive to others. Su et al [9] investigated the performance of historical centroid, random forest, and artificial neutral networks in solving this problem. Models were trained on tissue substitutes and tested on patient data. Results showed that MEs were around 5% for all VOIs in the abdominal region compared with calculated RSP values as ground truth. Our study moved a step further in integrating a  deep attention strategy and cycle-GAN into the mapping from DECT to RSP, directly training on patient datasets, and demonstrating the dosimetric feasibility.
In this study, we used a twin-beam scanner for DECT acquisition. Note that the proposed method does not specify the scan scheme; thus, it is applicable to other DECT modalities. We presented it in the context of TBCT in order to demonstrate the noise robustness of the proposed method. TBCT has inferior energy separation compared with other DECT modalities, which leads to a higher sensitivity to artifacts and noise on DECT image datasets [22]. It potentially explains why the results by the physics-based method demonstrated larger error and higher noise level than previously reported [15], where DECT images were acquired by 2 sequential scans at 2 different energy levels, which have a larger separation between the 2 energy spectra than the twin-beam scan scheme used in this study. However, TBCT has unique advantages over other DECT modalities for radiation therapy simulation in good temporal coherence, full field of view, and low hardware complexity and cost. Thus, our method overcomes the drawback of TBCT in RSP map generation by its insensitivity to noise, thus facilitating the clinical use of TBCT in radiation therapy workflow. Moreover, the recently developed multispectral CT, which allows concurrent identification of multiple materials with increased accuracy [43], is also compatible with the proposed method by adding more input channels for CT images at multiple energy levels.
We investigated the performance of the proposed method at different noise levels and compared with the physics-based method. The DECT images without any simulated noise represents the most likely noise level in clinical practice. With simulated noise, the DECT images may be much noisier than usual, while it may still happen when the patient is large or a low-mAs scanning protocol is preferred. In such cases, the advantage in noise robustness of the proposed method is desired. In addition to the noise robustness investigated in this study, deep learning-based methods may have an advantages over physics-based methods in other aspects. For example, a major struggle in physics-based methods is the systematic uncertainty in reconstructed images, which includes, but is not limited to, beam-hardening artifact, scatter artifact, CT number variations on reconstruction algorithms, and patient size [44][45][46][47]. Deep learning-based methods, on the other hand, are expected to be less affected by these uncertainties, as long as the training datasets are representative and the corresponding training target (RSP maps) reflects the true RSP distributions. It can be attributed to the data-driven approach used in deep learning that focuses on images features, unlike the sole dependence on pixel values in the physics-based method. Given these potential merits, deep learning-based methods may overcome the current obstacles in physics-based proton RSP estimation, while the specific improvement needs further investigations.
In the present study, we found that the dose difference in columns (c) and (f) of Figure 7 mostly happened at the distal end of the beam. It is consistent with current studies about range uncertainties of proton beams caused by the error in RSP maps [39]. The overestimation in dose could have resulted from the underestimation of muscle RSP with less compensation from the overestimation of adipose RSP. Future studies could include investigations in determining the range deviations to the ground truth from the systematic prediction error on each material.
Deep learning-based methods usually involve large computational costs to train a model. We implemented the proposed algorithm with Python 3.7 (Python Software Foundation, Wilmington, DE) and TensorFlow (Company, City, State) as in-house software on a NVIDIA Tesla V100 GPU (Santa Clara, CA) with 32 GB of memory. The Adam gradient optimizer with a learning rate of 2e-4 was used for optimization. In the present study, the training stage required~30 GB and~16 hours for the training datasets of 19 patients and~2 minutes for each patient in the testing stage.
We limited our study to the head-and-neck region. Patients with head-and-neck cancer have high anatomical complexity and variability between patients. The tumor shape, size, and location can vary greatly for different patients, and it is common to see the tumor changing the exterior body shape, which is challenging for the learning-based method. A comprehensive evaluation with a larger population of patients with diverse anatomical abnormalities should be included in future studies during the model training. Different testing and training datasets from other centers would also be necessary to validate the clinical utility of our method. Moreover, the proposed method can be applied to other treatment sites of clinical importance for proton therapy, which would be of great interest for expanding this work to the clinic.