Prediction of breast cancer risk: Performance analysis of data mining techniques
Early detection makes breast cancer one of the most curable types of cancer, and early detection together with accurate examination extends the survival of patients. Risk factors have an important effect on breast cancer. Data mining techniques have a growing reputation in the medical field because of their high predictive capability and useful classification. These methods can help practitioners develop tools for detecting breast cancer at an early stage.
Material and Methods:
The database used in this paper was provided by Motamed Cancer Institute, ACECR, Tehran, Iran. It contains 7834 records of clinical and risk factor data of breast cancer patients. There were 4008 patients (52.4%) with breast cancer (malignant) and the remaining 3617 patients (47.6%) without breast cancer (benign). Support vector machine, multi-layer perceptron, decision tree, K nearest neighbor, random forest, and naïve Bayesian models were developed using 20 fields (risk factors) of the database, because the available database features were limited. 10-fold cross-validation was used to evaluate the models. Ultimately, the models were compared based on sensitivity, specificity and accuracy.
Naïve Bayesian and artificial neural network were the better models for the prediction of breast cancer risk. Naïve Bayesian had an accuracy of 93%, specificity of 93.32%, sensitivity of 95.56%, and ROC of 0.95; the artificial neural network had an accuracy of 93.23%, specificity of 91.98%, sensitivity of 92.69%, and ROC of 0.8.
Interestingly, the different artificial intelligence algorithms used in this study yielded similar accuracy; hence these techniques could be used as alternative predictive tools in breast cancer risk studies. The significant prognostic factors affecting breast cancer risk that were identified and validated in this study are useful and could be translated into decision support tools in the clinical domain.
Worldwide, more than one million new cases of female breast cancer are diagnosed every year. Breast cancer poses a serious threat to life and is the second leading cause of death in women today. It is the most common cause of female death in industrialized countries, the second most common cause in the world and the third most common in developing countries [1-3]. Roughly 10–15% of patients with breast cancer die of metastasis or recurrence, and early diagnosis can improve prognosis. The death rate of women due to breast cancer can be reduced if the disease is detected at a relatively early stage. Early prediction of breast cancer plays a critical role in successful treatment and in saving the lives of thousands of patients every year. Data mining has become an essential methodology for computing applications in the domain of medicine [4-6]. With the help of the latest efficient early screening methods, the majority of such cancers are diagnosed when the disease is still at a localized stage [7, 8]. The use of artificial intelligence (AI) techniques in healthcare analysis is growing steadily [9-11]. Although huge amounts of clinical data related to patients are being collected and stored by healthcare organizations, only a small subset of the predictive factors has been used in predicting outcomes [2, 12-14]. Most of the current approaches focus on applying statistical methods to a small set of attributes recommended by domain experts for disease diagnostics [15-18]. These conventional approaches typically make unrealistic assumptions, e.g., normality, independence or linearity of relationships, which may not always hold in practical data.
On the other hand, advanced statistical approaches may address some of the above shortcomings; however, they are computationally expensive and may not be suitable for huge datasets. Generally, in the medical world, there are two stages of decision making. (1) Differential diagnosis (DD): in this stage, all patient data, including medical history, symptoms, and the results of various tests such as blood tests, are viewed by experts as the input data. These data are processed by experts based on their medical knowledge for disease diagnosis. In some cases, several diseases share similar symptoms; medical experts must therefore assign subjective weights to each of the inputs and form patterns, match these patterns with the patterns of different diseases, and finally select the closest match and diagnose the specific disease. (2) Final or provisional diagnosis (FD): in this stage, the necessary recommendations and treatments begin according to the identified disease. In this step, a physician, using clinical knowledge and reasoning, continues with examinations, records the results of regular visits or tests, and decides on the final treatments and prognosis. Data mining offers various techniques (for example, classification, clustering, regression, association rules and so forth) and algorithms (for example, decision trees, genetic algorithms, the nearest neighbor method and so on) for analyzing huge amounts of raw or multi-dimensional data. In other words, data mining has capabilities for intelligent data analysis that extract hidden knowledge from large databases of clinical or medical data collected from medical centers or hospitals. This knowledge provides useful information to improve decision support, prevention, diagnosis and treatment in the medical world.
Further, data mining is able to identify association rules or establish relationships between different features such as a patient's personal data, disease symptoms and so on [22, 23]. This study attempts to address the results of several published research works on data mining applications in the prediction, diagnosis or treatment of breast cancer [24, 25]. The recent breakthroughs in machine learning and data mining techniques have opened a new door for healthcare diagnostics and prediction. In this paper, we utilize these advancements for predicting breast cancer risk based on a labeled dataset [26, 27]. In particular, we utilize the clinical data as well as the associated symptoms of patients to construct predictive models that classify patients into two breast cancer categories, benign or malignant [28, 29]. The models are constructed using the industry-standard software package RapidMiner (ver. 20.9). The proposed approaches offer significant advantages over the conventional approaches, with high classification accuracy. The paper is organized as follows. We describe our dataset, methods and background on data mining techniques. We discuss our data exploration and predictive model construction for predicting breast cancer. We present and discuss our results.
MATERIAL AND METHODS
The present study is of a fundamental analytical type. The database used in this paper was provided by Motamed Cancer Institute, ACECR, Tehran, Iran, and contained the data of patients who were referred to this center from 2010 to 2020. This study aimed to compare the performance of six data mining algorithms, namely support vector machine (SVM), multi-layer perceptron (MLP), decision tree (DT), K nearest neighbor (KNN), random forest (RF), and naïve Bayesian (NB), for the prediction of breast cancer risk. The database contained 7834 breast cancer records. First, all columns that were not related to disease risk factors were removed from the database. Records missing 50 percent of their data were omitted, provided that three important risk factors (personal history of breast cancer, family history of breast cancer, and hormone therapy) were also missing. The database did not contain complete gene mutation data; the amount of missing data in these fields was very large, so they were not included in the study, because imputing the missing values would create fictitious data for use in the models. Eventually, the number of database records reached 7687. Remaining missing values were then filled in with the median imputation method in SPSS (ver. 26); after this step, the number of records in the database reached 7600. At this stage, the number of malignant records was 1900 (25%) and the number of benign records was 5700 (75%), so the SMOTE technique was used to balance the classes in the database. This technique is capable of balancing imbalanced databases [32, 33].
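The class balancing step can be illustrated with a minimal sketch of SMOTE-style oversampling. It is not the implementation used in the study, which is not described in detail; the function name `smote_sketch`, its parameters, and the toy data are assumptions made for illustration. The idea is that each synthetic minority record is placed at a random position on the line segment between an existing minority record and one of its nearest minority-class neighbors.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation.

    Simplified, hypothetical variant: the base sample is drawn at random
    for every synthetic point instead of cycling through all samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # New point lies at a random position on the segment base -> neighbor
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy data (made-up values): 4 minority records versus 12 majority records.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
majority_size = 12
new_points = smote_sketch(minority, n_new=majority_size - len(minority), k=3)
balanced_minority = minority + new_points
```

Because the synthetic points are interpolated, they stay inside the region already covered by the minority class rather than duplicating existing records.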
The total number of records increased to 13460; 6999 records (52.4%) were related to malignant breast cancer patients and 6460 records (47.6%) were related to benign patients. After balancing, the risk factors were selected and prioritized with the CART decision tree algorithm. The models were constructed using seventeen risk factors for breast cancer in RapidMiner (ver. 20.9) and compared on sensitivity, specificity and accuracy. The 10-fold cross-validation method was used for evaluation: the data were divided into ten parts, with 9 parts used for training and 1 part for testing.
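The 9-to-1 split used in 10-fold cross-validation can be sketched as follows. This is an illustrative sketch only, not RapidMiner's implementation; the function name `k_fold_indices`, the seed, and the record count of 50 are assumptions. Every record is used for testing exactly once and for training in the remaining nine rounds.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Toy example: 50 records split into 10 folds of 5 test records each.
splits = list(k_fold_indices(n_records=50, k=10))
```

In practice, a model would be trained on `train_idx` and scored on `test_idx` in each round, and the ten scores would be averaged.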
Breast Cancer Risk Factors Considered in the Study
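In order to evaluate the performance of a classification model, several approaches can be used. We employ the confusion matrix for the evaluation of the classification models. Based on the confusion matrix, with TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, the accuracy of a classification model is calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)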
The accuracy of a classifier is the probability of correctly predicting the class of an unlabeled instance and it can be estimated in several ways.
Sensitivity describes how well the classifier catches all of the positive cases. Sensitivity is calculated by dividing the number of true-positive results by the total number of actual positives (true positives plus false negatives).
Specificity describes how well the classifier identifies negative cases as negatives. Specificity is calculated by dividing the number of true-negative results by the total number of actual negatives (true negatives plus false positives).
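The three indicators can be computed directly from the four cells of the confusion matrix. The following is a small sketch using the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the function name `confusion_metrics` and the counts are made up for illustration and are not results from this study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        # Share of all cases that were classified correctly
        "accuracy": (tp + tn) / total,
        # Share of actual positives (TP + FN) that were detected
        "sensitivity": tp / (tp + fn),
        # Share of actual negatives (TN + FP) that were correctly rejected
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts: 100 malignant cases (90 detected), 100 benign cases (80 rejected).
metrics = confusion_metrics(tp=90, fp=20, tn=80, fn=10)
```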
One way to evaluate the performance of a binary classifier is the receiver operating characteristic, or ROC for short. The performance of binary classification algorithms is usually measured by indicators such as sensitivity (also called recall) and specificity. In the ROC diagram, both of these indicators are combined and displayed as a single curve, which plots sensitivity against 1 − specificity over all decision thresholds. ROC curves are often used to evaluate and compare the performance of classification algorithms, and they have received particular attention in the field of supervised machine learning.
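As an illustration of how a ROC value can be obtained, the sketch below builds the ROC points by sweeping the decision threshold over the predicted scores and then integrates the curve with the trapezoidal rule. The function names `roc_points` and `auc` and the four example scores are assumptions for illustration; this is not the procedure used by the software in this study.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a case is predicted positive if score >= threshold
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        # x = 1 - specificity, y = sensitivity
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example: 1 = malignant, 0 = benign; scores are hypothetical model outputs.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auc(roc_points(labels, scores))
```

An area of 1.0 corresponds to perfect separation of the two classes, while 0.5 corresponds to random guessing.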
A total of 7834 records were extracted from the reports and included in the dataset. After preprocessing, the dataset contained 1900 records of women who had breast cancer (BREAST CANCER=1) and 5700 records of women who did not have breast cancer (BREAST CANCER=0).
The NB model had the highest sensitivity and ROC, with values of 95.56% and 0.95 respectively. The ANN classifier had the best accuracy (93.23%), with a sensitivity of 92.69%. The performance comparison of the breast cancer prediction models is shown in Table 1, and the comparison of the classifier ROC curves is shown in Fig 1.
In this study, classifier models have been proposed to help clinicians decide whether or not to refer a woman for biopsy. Although the NB model has been shown to have the best ROC and sensitivity (84.06%), it seems that more sensitive models such as random forest and ANN, with a sensitivity of about 92% in this case, might be recommended to avoid misdiagnosing breast cancer. Treatment choices are especially difficult in early-stage breast cancer patients with conflicting prognostic risk features, particularly node-negative patients, for whom it is still unclear whether or not to pursue adjuvant treatment with chemotherapy or endocrine therapy.
Performance comparison of the breast cancer prediction models
The use of data mining techniques in a decision support system for predicting breast cancer helps physicians make optimal, accurate and timely decisions, and reduces the overall cost of treatment.
Researchers have used a wide variety of machine learning methods for predicting susceptibility, diagnosis, recurrence and survivability. Several studies reported a higher accuracy for predicting breast cancer, which could be due to the difference in the chosen datasets. This suggests that more records can help create more efficient models. Kate et al. used 1.4 million records for breast cancer prediction and reported that logistic regression provided higher performance. Diaz et al. proposed a CAD system for breast cancer diagnosis. The best accuracy reported for the proposed system was 89.3%. The researchers used naive Bayesian, random forest, and SVM classifiers. Another study by Kaushik et al. utilized data mining techniques for the prediction of breast tissue biopsy results. The researchers proposed a model with an accuracy of 83.5% and a ROC of 0.907. In another study, breast cancer prediction was examined using various classification algorithms such as RepTree, J48, and random forest. The dataset used in that study was extracted from the SEER repository, with 762,691 samples and 134 features. Data cleaning and dimensionality reduction techniques were applied, and seven features were selected for the final classification. Finally, different data mining techniques were compared and analyzed using the WEKA software. The results revealed that, among the decision tree algorithms, RepTree performed better in breast cancer prediction.
Using more features can help create better results. In another study, cancer patient data were collected from the Wisconsin dataset of the UCI machine learning repository. This dataset contains 35 features, which were selected using feature selection methods and evaluated using classification algorithms. According to the results, the naive Bayesian and decision tree algorithms provided higher accuracy.
This suggests that the use of modified, non-proprietary, publicly accessible databases can be effective in creating models with better results. In this study, the dataset was obtained in a specific registry format. Kaushik et al. used naïve Bayesian, RBF network, and decision tree techniques to predict breast cancer on the Wisconsin dataset provided by the University of California Irvine machine learning repository. They achieved the best accuracy, 97.36%, with the naïve Bayesian classifier. Pritom et al. used decision tree, naïve Bayesian, KNN, and SVM techniques to predict breast cancer risk; they achieved an accuracy of 97.13% with the SVM classifier. Another study, which classified breast tissue based on mammography reports from public datasets, reached an accuracy of 99%. Li et al. worked on two separate databases, BCCD and WBCD, and reported that random forest had the higher AUC (0.99).
In Iran, several studies have been conducted to predict breast cancer, as described below. In one study, researchers attempted to predict the survival time of patients diagnosed with breast cancer. Data were gathered from 5673 patients at Shiraz University of Medical Sciences. The proposed algorithm was logistic regression. The study achieved the highest level of sensitivity and detection criteria while using the smallest set of attributes, such as age, tumor size, the ratio of lymph node involvement, invasion and so on. The results demonstrate that, although the effect of age on the patient is a controversial matter, aging leads to a 60-month decrease in survival. Lotfnezhad Afshar et al. used 899 records from the Omid treatment and research center, Urmia, Iran, and reported that, across the evaluation criteria, C5 had the higher sensitivity (92.21%). Sohrabi et al. used data mining models; the results showed that MLP had the higher sensitivity, 90.88%. Tanha et al. used risk factor features of breast cancer patients from the database of Mahdieh clinic, Kermanshah, Iran, which contains 2555 records. They used 3 data mining models, and the results showed that J48 had the higher recall, 0.94. Zand used the SEER public database with 3 data mining models for analysis; the results showed that the accuracy of C4.5 was 86.7%. Mehri Dehnavi used a hybrid method for breast cancer prediction, proposed for identifying more predictive gene signatures from microarray datasets with 1338 records; a neuro-fuzzy classifier was used for the analysis, and the results showed that the model has a good ability to analyze the data. Lotfnezhad Afshar et al. used 3 models to analyze SEER data for patient survival prediction; the results showed that SVM had the higher sensitivity, 97.7%.
Like any other study, this study has several limitations. One of the limitations is that, because of the limited availability of data related to breast cancer risk factors, some data on more influential factors, such as types of gene mutations, were not available in this collection. Mutations of two genes known as BRCA1 and BRCA2 have long been known to result in higher risks of breast and ovarian cancer in women. Researchers have also recently found that men with certain mutations of these two genes may have an increased risk of early-onset prostate cancer. Although the database contained the results of some gene mutation tests, their use did not lead to the construction of valid models because of their missing values, so completing the data related to patients' gene mutation tests could be an important point in data collection. When different numbers of cross-validation folds are used for a particular set of inputs, the performance of a decision tree at different sizes is somewhat different. For the same set of observations measured on a set of inputs, the error rate will not be exactly the same when estimated by three, five, or ten folds at particular sizes of decision tree. The results have shown how data mining models can help physicians in Libyan hospitals better understand cancer risk factors in order to make an accurate diagnosis. Finally, recent research has shown that breast cancer risk prediction can be improved by combining demographic risk factors, germline genetic variants, and mammography features.
The value of mammography abnormality features provides evidence that this intermediate phenotype not only incorporates genetic and environmental risk factors but also carries information that improves disease prediction.
All authors contributed to the literature review, design, data collection and analysis, and drafting of the manuscript, and all authors read and approved the final manuscript.
CONFLICTS OF INTEREST
The authors declare no conflicts of interest regarding the publication of this study.
No financial interests related to the material of this manuscript have been declared.