Association Analysis of Obesity/Overweight and Breast Cancer Using Data Mining Techniques
Growing evidence has shown that some overweight factors could be implicated in tumorigenesis, higher recurrence and mortality. However, the association of various overweight factors with breast cancer has not been extensively explored. The goal of this research was to explore and evaluate the association of various overweight/obesity factors with breast cancer, based on an obesity breast cancer data set.
Material and Methods:
Several studies show a significantly stronger association between overweight and higher breast cancer incidence, but the role of some overweight factors such as BMI, insulin resistance, Homeostasis Model Assessment (HOMA), leptin, adiponectin, glucose and MCP-1 is still debatable. Therefore, several clinical and biochemical overweight factors were analyzed in this work, including age, Body Mass Index (BMI), glucose, insulin, Homeostasis Model Assessment (HOMA), leptin, adiponectin, resistin and monocyte chemoattractant protein-1 (MCP-1). Data mining algorithms including k-means, Apriori and a hierarchical clustering method (HCM) were applied using Orange version 3.22, an open-source data mining tool.
The Apriori algorithm generated a list of frequent item sets and some strong rules from the dataset and found that insulin, HOMA and leptin are items that were often seen together in BC patients and are related to cancer progression. The K-means algorithm divided the samples into three clusters, and its results showed that the pair <Adiponectin, MCP.1> has the highest effect on the separation of clusters. In addition, hierarchical clustering (HCM) with average linkage and without pruning was carried out and classified BC patients into 1–32 clusters in order to identify BC patients with similar characteristics.
In the developed world, the majority of the adult population falls in the overweight and obese categories as determined by body mass index (BMI: 25-29.9 and ≥30, respectively). Unfortunately, factors such as nutrition, economics and changes in technology are increasing the prevalence of obesity. Some malignancies are directly related to obesity, such as breast cancer. Breast cancer is the leading cause of cancer-related mortality worldwide [2, 3]. However, the American Institute for Cancer Research highlighted that being overweight/obese decreased the risk of breast cancer. The overweight factors and breast cancer have been the subject of many studies, but the role of some clinical and biochemical factors such as BMI, insulin resistance, Homeostasis Model Assessment (HOMA), leptin, adiponectin, glucose and MCP-1 is still debatable. Growing evidence has shown that these alterations could be implicated in tumorigenesis, higher recurrence and mortality [6, 7].
In recent years, data mining techniques have been applied to solve various problems, especially in health care. Data mining algorithms have been greatly improved in order to support doctors in making decisions by analyzing data and discovering patterns in existing datasets. Analyzing the obesity factors in order to discover new patterns can be helpful for clinicians. We therefore hypothesize that the breast cancer distribution in overweight and obese women may differ from that in women of normal weight, and that this puts them at higher risk for breast cancer. Furthermore, the association of the above-mentioned overweight factors with breast cancer has not been extensively explored. In this study, we intended to determine the relation between these characteristics in a group of overweight/obese women with breast cancer.
MATERIAL AND METHODS
For the experiments we used an obesity breast cancer data file. This dataset contains 64 patients and includes both continuous and categorical features; it was clean, with no missing values or duplicate records. The experiments were performed with Orange version 3.22, an open-source data mining tool, in which the data mining algorithms and different evaluation methods were applied. We aim to identify groups of obese BC patients with similar characteristics by using a clustering method and to compare these groups. We used a data preprocessing method, and the clustering method applied was K-means. In addition, two different association mining algorithms were used in this work. The data set attributes are described in Table 1.
The preprocessed data are shown in Fig 1. A complete overview of the data mining process followed in the current project is displayed as a schematic model (Fig 2). These pathways are described separately in the methods section.
Table 1. The obese breast cancer data set attributes
In the preprocessing step, considering the total number of features (F) in the dataset, we discretized the features into different groups by using the equal-width discretization method, as displayed in Fig 3.
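Equal-width discretization splits the observed range of a feature into a fixed number of intervals of the same width. The following is a minimal sketch in plain Python, not the Orange implementation used in the study; the function name `equal_width_bins`, the number of bins and the sample values are assumptions chosen for illustration only.

```python
def equal_width_bins(values, n_bins):
    """Return (edges, labels): interior cut points and a bin index per value."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # A constant feature falls entirely into one bin.
        return [], [0] * len(values)
    width = (hi - lo) / n_bins
    # Interior cut points: lo + width, lo + 2 * width, ...
    edges = [lo + i * width for i in range(1, n_bins)]
    labels = []
    for v in values:
        # Index of the interval containing v; the maximum goes in the last bin.
        labels.append(min(int((v - lo) / width), n_bins - 1))
    return edges, labels


# Hypothetical insulin-like values, not taken from the study's data set.
sample = [2.0, 4.0, 6.0, 10.0, 20.0, 30.0]
edges, labels = equal_width_bins(sample, n_bins=2)
print(edges)   # [16.0]
print(labels)  # [0, 0, 0, 0, 1, 1]
```

Each resulting bin can then be treated as a categorical item (for example "Insulin in bin 0") when mining association rules.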
There are 64 patients in this dataset. After preprocessing, the analysis follows separate pathways for K-means clustering and association rule mining (Fig 2). Fig 4 shows the working model of this research work.
First pathway: Association Rule Mining implementation
Association rule mining (ARM) is a method to find the associations and/or relationships among items in large databases. It can therefore be used to detect relations among the inputs of a system and later eliminate some unnecessary inputs.
The Apriori algorithm is a classic algorithm, and most association rule algorithms are variations of it. The Apriori algorithm works iteratively: it first finds the set of large 1-item sets, then the set of large 2-item sets, and so on. The number of scans over the transaction database equals the length of the maximal item set. Apriori is based on the following fact: every subset of a large (frequent) item set must itself be large. This simple but powerful observation leads to the generation of a smaller candidate set using the set of large item sets found in the previous iteration. The algorithm can be summarized as follows: candidate k-item sets are generated from the large (k-1)-item sets, candidates containing a subset that is not large are discarded, the support of the remaining candidates is counted in one scan of the database, and the process repeats until no new large item sets are found.
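The level-wise search described above can be illustrated with a short, self-contained sketch. It is a simplified textbook version of Apriori written in plain Python, not the Orange library routine applied in this study; the function name `apriori`, the item labels and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(items): support} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # One scan over the transactions per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent single items.
    frequent = frequent_among({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-item sets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = frequent_among(candidates)
        result.update(frequent)
        k += 1
    return result


# Toy discretized records (hypothetical, not the study data).
toy = [
    {"Insulin=low", "HOMA=low", "Leptin=low"},
    {"Insulin=low", "HOMA=low"},
    {"Insulin=low", "HOMA=low", "Leptin=low"},
    {"Insulin=high", "HOMA=high"},
]
frequent_sets = apriori(toy, min_support=0.5)
print(frequent_sets[frozenset({"Insulin=low", "HOMA=low"})])  # 0.75
```

In this toy example the items "Insulin=high" and "HOMA=high" fall below the support threshold at level 1, so no larger candidate containing them is ever generated.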
We applied the Apriori algorithm through a pre-defined library of Orange Canvas. The algorithm generated a list of frequent item sets and some strong rules from the dataset. The strongest rule states that every patient with an insulin value of less than 16.439 has a HOMA value of less than 6.64, with support and confidence of 75% and 100%, respectively. The results are shown in Tables 3 and 4. Table 2 shows rules with a minimum support value of 0.30.
In Table 2 and Fig 5, we list some meaningful and valuable association rules found between overweight factors in BC patients. For example, in 48 of the 64 samples (support = 75%, confidence = 100%), every patient with Insulin<=16.439 has HOMA<=6.64. This means that insulin and HOMA are two items that were often seen together in BC patients. We also found that insulin, HOMA and leptin, with support = 60% and confidence = 95%, form an item set related to breast cancer development and progression. We can report a positive association between obesity (patients with pre-diabetes) and cancer; regardless of these factors, diabetes may be an independent risk factor for cancer.
This indicates that the method implemented in the current study can find meaningful and valuable association rules that provide useful suggestions for relevant treatment.
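To make the support and confidence figures concrete, the sketch below computes both measures for a single rule. The records are synthetic and were built only to mirror the reported counts (48 of 64 samples satisfying both conditions); the labels given to the remaining 16 records and the helper name `rule_metrics` are assumptions for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Records containing the antecedent.
    n_ante = sum(1 for t in transactions if antecedent <= set(t))
    # Records containing both antecedent and consequent.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Synthetic records mirroring the reported counts, not the actual data set.
records = (
    [{"Insulin<=16.439", "HOMA<=6.64"}] * 48
    + [{"Insulin>16.439", "HOMA>6.64"}] * 16
)
support, confidence = rule_metrics(records, {"Insulin<=16.439"}, {"HOMA<=6.64"})
print(support, confidence)  # 0.75 1.0
```

Support is the share of all records in which both sides of the rule hold (48/64), while confidence is the share of records matching the left-hand side that also match the right-hand side (48/48).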
Second pathway: Implementation using K-means method
The steps of the k-means algorithm were as follows:
Input: breast cancer dataset
(a) Define the number of clusters and their initial centroids (in this case k=3).
(b) Create k clusters by associating every data point with the nearest mean.
(c) The centroid of each cluster becomes the new mean.
(d) Repeat steps (b) and (c) until convergence has been reached.
Frequent item sets of patients diagnosed with cancer are presented in Fig 5. The results obtained by applying this algorithm are shown in Fig 6.
The results showed that the pair <Adiponectin, MCP.1> has the highest effect on the separation of clusters.
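The k-means steps listed above can be sketched in a few lines of plain Python. This is an illustrative version, not the Orange widget used in the study; the function names, the choice to pass the initial centroids explicitly and the two-dimensional toy points are all assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, init_centroids, max_iter=100):
    """Plain k-means; returns (centroids, labels)."""
    # Step (a): k and the starting centroids are supplied by the caller.
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Step (b): associate every point with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # Step (d): stop once assignments no longer change.
        labels = new_labels
        # Step (c): the mean of each cluster becomes its new centroid.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical 2-D points standing in for a pair of features.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
centroids, labels = kmeans(points, init_centroids=[points[0], points[3], points[6]])
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Because k-means converges to a local optimum, the result depends on the starting centroids; in practice several initializations are usually compared.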
Third pathway: Implementation using HCM (Hierarchical Clustering Method)
This research work used the HCM algorithm for screening obese breast cancer patients in the Orange environment. In this algorithm, clusters are created from previously established clusters. We carried out hierarchical clustering with average linkage and without pruning and classified BC patients into 1–32 clusters. Applying this algorithm generated the outcome shown in Fig 7.
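As an illustration of agglomerative clustering with average linkage, the sketch below repeatedly merges the two clusters with the smallest mean pairwise distance. It is a naive stand-alone version in plain Python rather than the Orange implementation used in the study; the function names, the stopping rule based on a target number of clusters and the toy points are assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def avg_dist(c1, c2):
        # Average linkage: mean pairwise distance between two clusters.
        total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        _, i, j = min(
            (avg_dist(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical 2-D points, not patient data.
pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(average_linkage(pts, 3))  # [[0, 1], [2, 3], [4]]
print(average_linkage(pts, 2))  # [[0, 1, 2, 3], [4]]
```

Cutting the merge sequence at different numbers of clusters corresponds to cutting the dendrogram at different heights.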
In previous years, a lot of similar work has been done using data mining techniques in the healthcare sector. The results of this work demonstrate the ability of the applied algorithms and confirm the results reported by the original creators of the dataset, who obtained 95% confidence intervals for sensitivity, specificity and AUC of (82%, 88%), (84%, 90%) and (0.87, 0.91), respectively (using SVM), and by Alickovic and Subasi, who used Random Forest and a Genetic Algorithm and obtained the highest accuracy of 99.48% for the diagnosis of breast cancer. In addition, other authors applied different classification models to find the best biomarker for breast cancer; the artificial neural network performed comparatively better than the KNN algorithm and reached an accuracy of 80%.
Data mining techniques including the Apriori, K-means and HCM algorithms were applied to illustrate how these algorithms can be used on such datasets. Identifying which overweight factors are robustly linked to breast cancer is critical for designing effective interventions, and good results were obtained for this purpose. We believe that a strict comparison cannot be made with the results obtained in other studies, because the algorithms and aims of the studies are different.
Literature Review Summary Table
The attractiveness of these algorithms lies in the ability to involve them in the health sector in order to support the decision making of clinicians, so that the process of diagnosis and treatment can be carried out with reliable results.
In this study we applied data mining algorithms to evaluate the overweight/obesity factors in a breast cancer data set. The outcome of the research shows that the k-means and Apriori algorithms are helpful for comparing overweight factors in BC patients and providing useful suggestions for relevant treatment. In addition, the HCM algorithm was convenient for identifying groups of overweight breast cancer patients with similar characteristics. The results of the original papers [5, 13] on this data set are consistent with those found in the present study. This issue should be considered in the design of future studies to confirm the presented results. Thus, our data need to be analyzed further and will require a larger sample in order to prove causality and provide further insights into the role of such parameters of obesity in breast malignancy.
The authors agree on this final form of the manuscript and attest that all authors contributed to the final draft of the manuscript.
CONFLICTS OF INTEREST
The authors declare no conflicts of interest regarding the publication of this study.
No financial interests related to the material of this manuscript have been declared.