Introduction.—Due to early detection and improved therapies, the prevalence of long-term breast cancer survivors is increasing. This has increased the need for more inclusive underwriting of individuals with a history of breast cancer. Herein, we developed an algorithm-based method aimed at facilitating underwriting across multiple parameters in breast cancer survivors.
Methods.—Variables and data were extracted from the SEER database and analyzed using 4 machine learning-based algorithms (Logistic Regression, GA2M, Random Forest, and XGBoost), which were compared with Kaplan-Meier survival estimates. The performance of these algorithms was compared using multiple metrics (Log Loss, AUC, and SMR). In situ (non-invasive) and metastatic breast cancers were excluded from this analysis.
Results.—Parameters including the pathological subtype, pTNM staging (T: tumor size; N: number of nodes; M: presence or absence of metastases), Scarff-Bloom-Richardson grading, and the expression of estrogen and progesterone hormone receptors were selected to predict the individual outcome at any time point from diagnosis. While all models had similar performance in terms of statistical metrics (AUC, Log Loss, and SMR), logistic regression was the only model that respected all business constraints and was intelligible for medical and underwriting users.
Conclusion.—This study provides insight into developing algorithms for underwriter-friendly calculators that give more accurate risk estimates and can be used to rationalize insurance pricing for breast cancer survivors. It supports the development of more inclusive underwriting based on models that can encompass the heterogeneity of malignancies such as breast cancer.
According to the global estimate of the World Health Organization,1 2.3 million women were diagnosed with breast cancer in 2020, with 685,000 deaths. As of the end of 2020, 7.8 million women who had been diagnosed with breast cancer in the past 5 years were alive, making it the world’s most prevalent cancer.2 While the increased incidence is poorly explained, the improvement in survival observed since the 1980s is primarily explained by early detection programs combined with improved multimodality cancer treatments.3
As the number of breast cancer survivors is increasing, it is now estimated that 1 in 8 women is likely to develop breast cancer over her lifetime. Post-cancer survivorship programs including prevention, medical follow-up, psychological support, and professional and social reintegration have been developed to restore normal life in breast cancer survivors.3 Many groups across the world have started to advocate for changing the social perception of cancer survivors, including a more inclusive approach to accessing insurance coverage, such as for real estate loans.4 From the insurance standpoint, breast cancer stands as a highly heterogeneous tumor compared to many other malignancies, with multiple parameters used to estimate the risk of relapse and survival at diagnosis.5 This high heterogeneity increases the complexity of survival risk estimation and is particularly important in insurance, both to ensure data-driven risk estimation and to justify ratings in cases of disagreement with clients or of legal proceedings.6
At the time of diagnosis, the pathological subtype, TNM staging (T: tumor size; N: number of nodes; M: presence or absence of metastases), Scarff-Bloom-Richardson grading, the expression of estrogen and progesterone hormone receptors, HER2-neu overexpression, the Ki67 rate or mitotic index, genomic signatures, and the presence of microvascular or perineural invasion have been used for several years by physicians in multivariate analyses to estimate patient prognosis and prescribe appropriate therapies.7 For a better understanding of the TNM classification, it is important to remember that cTNM refers to the clinical evaluation performed using physical and radiological exams, pTNM refers to the classification performed on pathological specimens obtained from surgical resection, and ypTNM refers to post-operative pathological specimens obtained from surgery after neo-adjuvant chemotherapy. For several years, multivariate algorithms have been routinely used by physicians through web-based calculator interfaces using machine learning.8 For example, the UK Breast Cancer Predict web-based algorithm (https://breast.predict.nhs.uk/) has been used to estimate patient risk and justify the use of systemic therapies such as chemotherapy, hormonotherapy, and monoclonal antibodies such as trastuzumab, which is directed against the HER2-neu receptor and restricted to patients with HER2 2+ with positive FISH or HER2 3+ cancer cells.
In the medical literature focusing on breast cancer patient outcomes, it has been demonstrated that the risk of relapse and death varies over time.9 Conditional survival has been shown to be a useful concept for estimating the probability of surviving further years, given that a patient has already survived for several years after the diagnosis of a chronic disease, such as malignancies in which a large number of patients have long-term survival.10-12 From the perspective of insurability, it may be used to address the prognosis at the time of insurance inception. However, conditional survival does not distinguish individuals who are free from cancer from those who have already relapsed and are not eligible for insurance.
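As a minimal illustration of the conditional survival concept, the probability of surviving t further years given survival to year s can be written as CS(t | s) = S(s + t) / S(s), where S is the cumulative survival function. The sketch below assumes a hypothetical survival curve; the function name and all numerical values are illustrative and are not taken from SEER or from the model presented in this paper.

```python
def conditional_survival(surv, s, t):
    """Probability of surviving t more years given survival to year s.

    surv[k] is the cumulative survival probability k years after diagnosis,
    with surv[0] == 1.0.
    """
    if s + t >= len(surv):
        raise ValueError("survival curve too short for the requested horizon")
    if surv[s] <= 0:
        raise ValueError("no survivors at year s")
    return surv[s + t] / surv[s]


# Hypothetical cumulative survival at years 0..10 (illustrative values only).
example_curve = [1.00, 0.97, 0.93, 0.90, 0.88, 0.86, 0.85, 0.84, 0.83, 0.82, 0.81]

# 5-year survival estimated at diagnosis versus after 5 years already survived.
five_year_at_diagnosis = conditional_survival(example_curve, 0, 5)
five_year_after_five = conditional_survival(example_curve, 5, 5)
```

With these illustrative values, the 5-year estimate made after 5 years of survival is higher than the estimate made at diagnosis, which is the behavior that conditional survival is meant to capture.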
This paper does not aim to describe and select a specific model to be readily used by the reader, but rather to provide insight into methodologies that can be used to reach such a model. The logistic regression model presented here is therefore taken as an example to support the stepwise development of this methodology for various medical applications, with the understanding that individuals working in the field may be developing other models.
Methods
Database Selection
The SEER (Surveillance, Epidemiology, and End Results) database was used (https://seer.cancer.gov/data/). This database has been collecting data since 1973 and stands as the world’s largest database specialized in cancer, recognized by the entire scientific community.13 More than 400,000 observations are added every year. SEER has recorded more than 1.6 million observations of breast cancer since its creation, and each observation is represented by 133 variables. These include biometric variables (age at diagnosis, ethnicity, marital status, etc), medical variables (tumor size, tumor stage, etc), and therapeutic variables (surgery, chemotherapy, etc).
The reasons that led us to use this database included its size (the largest available cancer database in the world, covering 30% of the cancer population in the United States), its representativeness (this database is representative of the American population in terms of socio-professional criteria), its reliability (partnerships with several laboratories and government agencies ensure the reliability and accuracy of the data), and its popularity (this database is used by physicians, researchers, scientists, and statisticians to conduct their work and publish figures on cancer incidence, prevalence, and mortality).
Survival in in situ (non-invasive) breast carcinomas was not considered in this study, and risk estimates for metastatic breast cancer were not calculated because, while anyone can apply for insurance, only those in remission (ie, with no metastatic relapse) are generally considered candidates for insurance.
We chose, somewhat arbitrarily, to look primarily at deaths occurring within 10-15 years of follow-up, as this duration is frequently used in insurance as a reference for cancer risk evaluation.
Constraints to Apply to Select Variables
Variables were selected based on the following constraints: (1) Business constraints: Variables must be easily available to insurance companies, meaning that they must be included in the insured’s medical file at the time of application. For practical reasons, the number of variables that the underwriter must enter in the calculator must be kept reasonably limited. (2) Commercial constraints: The consistency of prices is a fundamental aspect for underwriters. Rates must be consistent with medical knowledge, meaning, for example, that the extra premium must be lower for an individual with a small tumor than for one with a larger tumor. (3) Medical constraints: The selected variables must be prognostic variables that are clearly identified by oncologists and recognized as such in the medical literature. (4) Statistical constraints: The chosen variables must have a high “Feature Importance,” meaning that they must have a significant impact on prediction.
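The commercial constraint above (a smaller tumor must never be rated higher than a larger one, all else being equal) can be read as a monotonicity check on model predictions. The following is a minimal sketch under stated assumptions: the `predict` callable, the `tumor_size` key, the T1-T3 categories, and the toy risk values are all hypothetical and are not part of the calculator described in this paper.

```python
def is_monotone_non_decreasing(values):
    """True if each value is at least as large as the previous one."""
    return all(a <= b for a, b in zip(values, values[1:]))


def check_size_consistency(predict, profile, ordered_sizes):
    """Check that predicted risk never decreases as tumor size increases.

    predict: hypothetical callable mapping a profile dict to a death probability.
    profile: dict holding the other characteristics, kept fixed.
    ordered_sizes: tumor size categories sorted from smallest to largest.
    """
    risks = [predict({**profile, "tumor_size": size}) for size in ordered_sizes]
    return is_monotone_non_decreasing(risks)


# Toy predictors for illustration only (not fitted on any data).
def consistent_model(profile):
    return {"T1": 0.02, "T2": 0.05, "T3": 0.09}[profile["tumor_size"]]


def inconsistent_model(profile):
    return {"T1": 0.05, "T2": 0.03, "T3": 0.09}[profile["tumor_size"]]


sizes = ["T1", "T2", "T3"]
ok = check_size_consistency(consistent_model, {"grade": 2}, sizes)
not_ok = check_size_consistency(inconsistent_model, {"grade": 2}, sizes)
```

In practice such a check would presumably be repeated over every combination of the other variables, so that a single violation is enough to flag a model as inconsistent for pricing.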
The prognostic value of the selected variables was further evaluated using the Kaplan-Meier survival estimate.
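For readers less familiar with the Kaplan-Meier estimate, a minimal sketch of the product-limit calculation is given below. It is a generic textbook version written with the standard library only, not the implementation used in this study, and the toy follow-up data are invented for illustration.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct time where a death occurred.

    durations: follow-up time for each subject.
    events: 1 if death was observed at that time, 0 if the subject was censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    # Process the distinct follow-up times in increasing order.
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving this time.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored subjects leave the risk set after time t.
        at_risk -= leaving
    return curve


# Toy data (illustrative only): follow-up in years, 1 = death, 0 = censored.
toy_durations = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 0]
km_curve = kaplan_meier(toy_durations, toy_events)
```

Comparing such curves across the categories of a candidate variable (for example, by tumor grade) is one simple way to visualize whether that variable separates survival outcomes.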
The outcome of metastatic breast cancers was used in the method to evaluate the probability of death following metastatic relapse but was not used for rating, primarily because the estimated life expectancy in metastatic breast cancer was too short to make any relevant insurance offer to those patients.
Risk of Relapse Model
A person in cancer remission (ie, after treatment) may relapse into this disease.
A person with breast cancer (or in remission) can only die of breast cancer if he or she is metastatic (or has a metastatic relapse).
The longer the remission period, the lower the risk of relapse.
Importantly, “metastatic relapse” is not updated in the database. Thus, in the SEER database, a non-metastatic person who dies specifically from breast cancer is considered to have necessarily relapsed with metastasis during the observation period.
It is also important to note that only people in remission are included in this analysis, as only individuals in remission are able to apply for insurance. Thus, the statistical tool must account for the fact that insureds are exposed to the risk of relapse. In addition, it is considered that the longer the remission period, the lower the risk of extra mortality.
We assumed that individuals recorded in the SEER database as having died specifically from breast cancer must have experienced a relapse at some point before death. This point was determined using the survival estimate of those with metastatic cancer at diagnosis. From the estimated point of relapse, these individuals were removed from the analyses at subsequent time points. We estimated the risk of death of patients with metastatic breast cancer according to the number of metastatic lymph nodes whenever available, considering that lymph node involvement is the most important prognostic parameter.
The models selected for analysis were intended to account for medical findings and the limitations of the database, that is, to consider both this risk of relapse and the conditional survival (the probability of being alive given y years without relapse).
Step 1: To determine the probability $q_x$ of “specific breast cancer death” for non-metastatic observations and the probability of dying from “all causes” for metastatic observations, $x$ years after the diagnosis, with $x \in \{1, \dots, 15\}$ the time since diagnosis.
Step 2: To plot the survival curve $S_x$ for non-metastatic observations and the corresponding survival curve for metastatic observations, $x$ years after the diagnosis, using the predictions from step 1 with $S_{x+1} = S_x \times (1 - q_{x+1})$ and $S_0 = 100\%$.
This value represents the excess mortality between the year x and x+1 of the contract of an individual subscribing after y years of remission. However, these extra mortalities are not the final extra premium that the insured should pay because they are not constant.
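The recursion used in Step 2 to build a survival curve from yearly death probabilities can be sketched as follows. This is a minimal illustration only: the function name is ours, and the yearly probabilities are hypothetical values, not outputs of the models described here.

```python
def survival_curve(q):
    """Build S_0..S_n from yearly death probabilities.

    q[x - 1] holds q_x, the probability of death in year x after diagnosis,
    so that S_{x+1} = S_x * (1 - q_{x+1}) with S_0 = 1 (ie, 100%).
    """
    curve = [1.0]
    for q_x in q:
        if not 0.0 <= q_x <= 1.0:
            raise ValueError("death probabilities must lie in [0, 1]")
        curve.append(curve[-1] * (1.0 - q_x))
    return curve


# Hypothetical yearly death probabilities q_1..q_3 (illustrative values only).
toy_q = [0.10, 0.05, 0.02]
toy_s = survival_curve(toy_q)  # S_0, S_1, S_2, S_3
```

The same function would apply to either the non-metastatic or the metastatic curve, since only the input probabilities from Step 1 differ.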
Presentation of Algorithms and Metrics
The first objective was to determine the probability of death of an individual with breast cancer using various methods of machine learning.14,15 This probability is thought to depend on the selected prognostic variables. The following machine learning processes were applied: Logistic Regression, GA2M, Random Forest, and XGBoost.
Logistic Regression: A parametric model of the GLM family that associates each variable with a vector of coefficients. The probability of death of each individual is then deduced through its logit link function.
Random Forest: This algorithm is based on decision trees. However, instead of using a single decision tree (CART), Random Forest builds several trees with re-sampling (Bagging) to limit overfitting, then aggregates the predictions of all trees into a single prediction.
XGBoost: Also based on decision trees, it uses the Boosting method to make its predictions. This means that it aggregates different predictors (or trees) sequentially, each one correcting the errors of the previous ones.
GA2M: This is also a parametric model, from the GAM family. It models each variable with a parametric function, and its particularity is that it also considers the interactions between each pair of variables, under a logit link function.
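To make the logistic regression description above more concrete, the sketch below shows how a death probability is obtained from a linear combination of variables through the inverse of the logit link, p = 1 / (1 + exp(-(b0 + sum of b_j x_j))). The intercept, coefficient values, and variable names are hypothetical and chosen only for illustration; they are not the fitted coefficients of the model presented in this paper.

```python
import math


def logistic_probability(intercept, coefficients, features):
    """Death probability from a logistic model via the inverse logit link."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical intercept and coefficients (illustrative, not fitted on SEER).
toy_intercept = -3.0
toy_coefficients = {"tumor_size_cm": 0.30, "positive_nodes": 0.20, "grade_3": 0.50}

small_tumor = logistic_probability(
    toy_intercept,
    toy_coefficients,
    {"tumor_size_cm": 1.0, "positive_nodes": 0, "grade_3": 0},
)
large_tumor = logistic_probability(
    toy_intercept,
    toy_coefficients,
    {"tumor_size_cm": 4.0, "positive_nodes": 0, "grade_3": 0},
)
```

Because each variable contributes through a single signed coefficient, a positive coefficient on tumor size guarantees that the predicted risk increases with size, which is one reason such a model is easy to explain to medical and underwriting users.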
The performance of these algorithms was compared using metrics such as Log Loss (evaluating the accuracy of predicted probabilities), AUC (evaluating the accuracy of classifications), SMR (evaluating the calibration of death probabilities), Loss Ratio (evaluating the profitability of pricing), and consistency tests. The GG-plot, GGsurv-plot, and SurvMiner packages were used as analytical tools, and the models were built in Python. As we were looking at various models, we did not specifically consider the final odds ratios because they were not applicable in all cases.
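The three statistical metrics named above can be sketched in a few lines of standard-library Python. These are generic textbook definitions given for illustration, not the code used in this study. In particular, computing the SMR as observed deaths divided by the sum of predicted death probabilities is our assumption about how the expected number of deaths would be obtained, and the toy labels and predictions are invented.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def auc(y_true, p_pred):
    """Probability that a random death is scored above a random survivor.

    Ties between a death and a survivor count as one half.
    """
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def smr(y_true, p_pred):
    """Standardized mortality ratio: observed deaths over expected deaths."""
    return sum(y_true) / sum(p_pred)


# Toy outcomes (1 = death) and predicted death probabilities, illustrative only.
toy_y = [0, 0, 1, 1]
toy_p = [0.1, 0.4, 0.35, 0.8]

toy_log_loss = log_loss(toy_y, toy_p)
toy_auc = auc(toy_y, toy_p)
toy_smr = smr(toy_y, toy_p)
```

An SMR close to 1 indicates that the predicted number of deaths matches the observed number, whereas AUC only reflects how well individuals are ranked, so the two metrics answer different questions.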
Results
Selection of Variables Associated with a Risk of Relapse and Death
Selection among Machine Learning Algorithms
Moreover, predictions from XGBoost, Logistic Regression, and GA2M overlapped for non-metastatic observations, showing that they predicted individual outcomes in the same way. For metastatic observations, predictions were close to the Kaplan-Meier estimator but did not overlap with each other. While all models had similar performance in terms of statistical metrics (AUC, Log Loss, and SMR), logistic regression was the only model that respected all business constraints and was intelligible for medical and underwriting users.
Setting up a Breast Cancer Calculator
Based on the above-mentioned model, we set up a breast cancer calculator using easy-to-retrieve information from patients with a history of breast cancer. From the selected variables, the model estimates the 15-year conditional disease-free survival rate at any time from diagnosis and can therefore be used, for individuals with a history of breast cancer, to estimate the individual risk of death and to define individual risk-adjusted pricing.
Specific Medical Situations
Two medical situations were shown to require particular attention when using the breast cancer calculator. One is the risk estimation of synchronous multifocal lesions occurring in the same breast. In this situation, the calculation should consider the risk of each separate breast lesion and take the lesion with the poorest prognosis as the overall risk estimate. The other particular situation is the use of neoadjuvant chemotherapy for locally advanced or poor-prognosis breast cancers, which is expected to downsize the breast tumor (ypTNM); this downstaging is likely to lead to underestimation of the risk. In this latter situation, the parameters to consider should not be taken from the surgical pathological report but from the clinical information (cTNM staging) and the biopsy performed prior to any chemotherapy.
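The two rules above can be expressed as simple pre-processing steps applied before the calculator is used. The sketch below is only an illustration of that logic: the function names, the record keys (`neoadjuvant_chemo`, `cTNM`, `pTNM`), and the example values are hypothetical and do not describe the actual calculator interface.

```python
def overall_multifocal_risk(lesion_risks):
    """Overall risk for synchronous multifocal lesions.

    Each lesion is rated separately and the poorest prognosis
    (highest estimated risk) is retained.
    """
    if not lesion_risks:
        raise ValueError("at least one lesion is required")
    return max(lesion_risks)


def select_staging(record):
    """Choose which staging to enter in the calculator.

    After neoadjuvant chemotherapy, the clinical staging (cTNM) recorded
    before treatment is used instead of the post-operative pathological
    staging, which may be downstaged.
    """
    if record.get("neoadjuvant_chemo"):
        return record["cTNM"]
    return record["pTNM"]


# Illustrative inputs only.
toy_overall = overall_multifocal_risk([0.04, 0.11, 0.07])
toy_stage_neo = select_staging(
    {"neoadjuvant_chemo": True, "cTNM": "cT3N1M0", "ypTNM": "ypT1N0M0"}
)
toy_stage_std = select_staging({"neoadjuvant_chemo": False, "pTNM": "pT1N0M0"})
```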
Discussion
This project aimed to provide a global method to assist underwriters worldwide in accurately determining risk-derived ratings for individuals with a history of breast cancer, using data-driven evidence based on machine-learning algorithms. Data were obtained from the SEER database, which is the largest regularly updated dataset of patients with cancer. As statistical analyses of breast cancer had to be consistent with the most advanced medical knowledge, we ensured that the process used in this project, as well as the risk evaluation, fulfilled criteria acceptable to multiple parties by validating each step within our team, which included at least one of the following experts: a statistician, an actuary, an underwriter, a programmer, and a medical oncologist.
In this study, we redefined the parameters predicting survival in individuals with a history of breast cancer at any time point from diagnosis, provided they are still disease-free at the time of insurance inception. As expected, the duration since diagnosis was associated with survival estimates: the longer the disease-free survival duration, the lower the risk of subsequent death. Furthermore, the age at contract signature and the duration of the contract were also important factors in estimating insurability.
Although variables primarily used for diagnosis were evaluated, not all medical parameters used at diagnosis by treating physicians for prognostication were relevant in our model years after diagnosis. Medical parameters that are evaluable at the time of diagnosis and remained relevant for our calculator were the histological subtype, pTNM classification, SBR grading, and hormone receptor expression. In our model, the overexpression of HER2 receptors, which is acknowledged as a poor prognostic factor at diagnosis, was not associated with subsequent poor outcomes, possibly because of the systematic use of adjuvant antibody-based therapy with trastuzumab (Herceptin TM) targeting HER2 3+ positive cancer cells for 1 year.
Interestingly, prognostic estimates using disease-free conditional survival at the time of insurance inception were shown to be consistently better than the prognosis at the time of diagnosis. This reduction in the risk of death over time has been described elsewhere and is mainly related to the harvesting effect, in which the early death of patients with a poor prognosis leads to an apparent improvement in the estimated survival of patients who are still disease-free. In our study, this led to an improved estimate of risk and rating over time, allowing better rating offers for long-term disease-free breast cancer survivors. As alluded to above, the duration since diagnosis appears to be an important parameter that has been the subject of political debate in many countries, such as France, where the right to be forgotten after 10 years of disease-free survival has been set by law for breast cancer survivors. Independently, our model was shown to better predict the survival of breast cancer survivors, allowing a more accurate estimation of risk that may ultimately lead to better ratings.
Our method was set to propose individualized risk estimation to derive justifiable pricing based on 7 easy-to-recover characteristics, which are: age, duration since diagnosis, size of tumor, number of lymph nodes affected, grade of tumor, hormone receptors, and histology. These variables were selected with respect to medical constraints (medical relevance of these variables), underwriting constraints (accessibility of these variables at the time of underwriting), and statistical constraints (important variable for all models). Thus, our method appears consistent with the expected heterogeneity of breast cancer risk with more than 20,000 combinations of variables, making it possible to derive adjusted rating for each specific case.
The main challenge was the choice of the prediction model on which to base the pricing. While all models had similar performance in terms of statistical metrics (AUC, Log Loss, and SMR), logistic regression was the only model that respected all business constraints and was intelligible for users. In fact, coherence of pricing will probably be the most stringent constraint for a model to be usable by underwriters.
Conclusion
The development of more inclusive underwriting requires a comprehensive knowledge of the heterogeneity of breast cancer. Recent advances in Machine Learning and the availability of rich databases specialized in breast cancer were shown to allow significant improvement in data-driven risk predictions. In this study, we were able to develop a method for an accurate estimation of individual risks, which further led to the global development of an insurance-specific and underwriter-friendly calculator with justifiable pricing for people in remission from breast cancer.