1. Medical Informatics Research Center, Institute for Future Studies in Health, Kerman University of Medical Sciences, Kerman, Iran.
2. Department of Computer Engineering, Faculty of Engineering, Islamic Azad University, Bardsir Branch, Bardsir, Iran.
Nowadays, medical sciences and physicians face a huge amount of data. Diabetes is one of the most costly endocrine diseases in the world. Since the disease is not always easy to diagnose, the physician should examine the patient's test results and the decisions made in the past for patients with similar conditions in order to make an appropriate decision. Because of the large number of patients and the multiple tests performed on each patient, an automated tool for exploring the records of previous patients is needed.
One of the most important methods for extracting knowledge from data is data mining. Given the high number of diabetic patients, timely diagnosis and treatment of this disease can reduce the risk of death and the associated medical costs. Different systems have been proposed so far for the diagnosis and prediction of diabetes; in this study, fuzzy logic based systems are used to increase accuracy and efficiency. In the proposed model, the data are first grouped into separate clusters by fuzzy clustering, and a radial basis function (RBF) neural network then predicts diabetes mellitus for each patient. An adaptive neuro-fuzzy inference system (ANFIS) has also been used to diagnose diabetes.
In this paper different classification techniques have been used in MATLAB software to diagnose diabetes mellitus and to classify patients as diabetic and nondiabetic. The dataset used is extracted from the UCI database. The accuracy of the proposed method is 97.14% which is significantly higher than other models of diabetes diagnosis.
The application of two fuzzy models has significantly improved the accuracy of diagnosis of diabetes compared to other models proposed in this field.
Received: 2019 September 4; Revision Received: 2019 October 12; Accepted: 2019 November 3
Intense competition in the scientific, social, economic, political and military spheres has greatly increased the importance of fast access to information. There is therefore a clear need, on the one hand, for systems that can quickly discover information of interest to users with minimal human intervention and, on the other hand, for analytical methods that scale to large volumes of data.
Currently, data mining is the most important technology for the efficient, accurate and rapid use of large volumes of data, and its importance is increasing. Data mining is the bridge between statistics, computer science, artificial intelligence, pattern recognition and machine learning. Identifying the factors that influence the success and effectiveness of medicines and medical practices is a topic that has attracted the attention of researchers in medicine, statistics and computer science in recent years.
Each of these disciplines, using a set of laws and relationships related to their field, examines and predicts the factors affecting the effectiveness of medications and medical practices on various patients; however, among different methods, the use of soft computing and relatively new sciences such as data mining have shown better results and have the power to provide interpretations and results that can be milestones in medical knowledge.
The incidence of diabetes has more than doubled worldwide in the last ten years, with about 200 million people suffering from the disease, and its prevalence is increasing by about six percent annually. Given that diabetes is a chronic disease that causes irreparable damage to vital organs, intelligent data mining tools that improve methods for detecting and controlling the disease are a great help to doctors. Research shows that 80 percent of type 2 diabetes cases can be prevented or delayed by early identification of at-risk people. Diabetes is the fourth leading cause of death in most developed countries [1]. According to World Diabetes Federation statistics for 2012, more than 371 million people worldwide have diabetes, a number that increases every year, and more than half of those with diabetes are unaware of their illness. The cost of caring for diabetic patients exceeds $470 billion, and the disease is reported to cause 50 million deaths. Also, according to the 2013 census, more than 6 million Iranians have diabetes [2].
Given the prevalence of type 2 diabetes around the world, the use of new methods in biomedical research has become increasingly popular. Data mining represents a significant advance in the analytical tools available and is considered a valid, sensitive and reliable method for discovering patterns and the relationships between them [3]. Today, data mining tools are used extensively to understand marketing patterns and customer behavior, to review patient data, and to identify fraud. Increased diagnostic accuracy, lower costs and reduced demand on human resources have been reported as advantages of data mining in medical analysis [4, 5].
Using intelligent data analysis and data mining methods, this study seeks to discover part of the hidden knowledge in data collected on diabetes and to improve its diagnosis. The article is organized in five sections. The first section introduces the area, and the second covers materials and methods. The third section presents the proposed method, and the fourth and fifth sections are devoted to the evaluation of the method and the conclusion of the work.
In order to elucidate the structure of the proposed method, it is necessary to first mention the basic materials and methods, so in this section we describe these in order to develop the proposed model.
Clustering
Clustering is a branch of unsupervised learning: an automatic process in which the samples are divided into clusters whose members are similar to each other. A cluster is therefore a set of objects that are similar to each other and dissimilar to the objects in other clusters. Different criteria can be used to measure similarity; for example, with a distance criterion, objects that are closer to one another are placed in the same cluster, which is known as distance-based clustering. For example, in Fig. 1, each small circle represents a sample (object) plotted in the coordinate system in which the samples are represented.
Subtractive (reduced) fuzzy clustering
The fuzzy clustering algorithm is one of the most important methods used for unsupervised pattern recognition. Following the definition of fuzzy sets by Zadeh [6], the first step in this direction was taken by Ruspini in 1969 [7]. In recent years this method has been used in a wide range of areas, including pattern classification, fuzzy control and machine vision.
In classical clustering, each input sample belongs to one and only one cluster and cannot be a member of two or more clusters. For example, in Fig. 1 each sample is a member of exactly one cluster; in other words, the clusters do not overlap. Now consider a case where a sample is equally similar to two or more clusters. In classical clustering, a decision must be made about which cluster this sample belongs to. The main difference between classical clustering and fuzzy clustering is that in fuzzy clustering one sample can belong to more than one cluster. To illustrate, consider Fig. 2. If the input samples are as shown in that figure, the data can clearly be divided into several clusters, but some of the samples in the middle could belong to more than one cluster. In classical clustering it is therefore necessary to decide whether such a sample belongs to the right cluster or the left cluster, whereas in fuzzy clustering the sample belongs to the right cluster with a membership degree of 0.5 and to the left cluster with the same degree. Another difference is that the input samples on the right of Fig. 2 can also belong to the left cluster with a very low degree of membership, and the same holds for the samples on the left. There are various techniques for fuzzy clustering, of which the FCM algorithm is one of the most common and fastest, but its main limitation is the need to specify the number of clusters in advance [8].
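The idea of partial membership can be illustrated with a minimal Python sketch. It assumes the standard FCM membership formula with fuzzifier m = 2; the function name `fcm_memberships`, the two cluster centers and the two sample points are made up for illustration and are not taken from the paper.

```python
import numpy as np

def fcm_memberships(points, centers, m=2.0):
    """Membership degree of each point in each cluster (standard FCM formula).

    u[i, j] = d[i, j]^(-2/(m-1)) / sum_k d[i, k]^(-2/(m-1)),
    where d[i, j] is the Euclidean distance from point i to center j.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise distances, shape (n_points, n_clusters).
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)  # avoid division by zero when a point sits on a center
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Hypothetical left and right cluster centers.
centers = [[0.0, 0.0], [4.0, 0.0]]
# First point lies midway between the centers, second lies close to the right one.
u = fcm_memberships([[2.0, 0.0], [3.5, 0.0]], centers)
print(u)
```

The midway point receives equal membership in both clusters, while the point near the right center keeps a small but non-zero membership in the left cluster, which is the behavior described above.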
The subtractive (reduced) clustering algorithm does not require the number of clusters to be specified: it estimates the initial centers and the number of clusters and refines them as the algorithm proceeds [9]. In this method, each data point is treated as a potential cluster center, and its density is calculated from the surrounding data. For this purpose, the initial potential of each data point is defined as follows:
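A hedged reconstruction of this definition, assuming the paper follows Chiu's cluster estimation method [9]; the number of data points n and the constant 4 come from that formulation, not from the text above:

$$p_{i} = \sum_{j=1}^{n} \exp\left(-\frac{4\,\lVert x_{i} - x_{j}\rVert^{2}}{r_{a}^{2}}\right)$$

Here r_{a} is a positive radius that defines the neighborhood of a point, so a data point with many close neighbors receives a high potential.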
Given the above relationship, the data point with the highest density is selected as the first cluster center. Assuming x_{1}^{*} is the first cluster center with potential p_{1}^{*}, the density of the other data points is recalculated as follows:
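Under the same assumption (Chiu's formulation [9]), the revised potential of each point after the first center has been chosen is usually written as:

$$p_{i} \leftarrow p_{i} - p_{1}^{*} \exp\left(-\frac{4\,\lVert x_{i} - x_{1}^{*}\rVert^{2}}{r_{b}^{2}}\right)$$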
In the above relation, r_{b} is a constant that is usually greater than r_{a}. As a result, the density of points near the first cluster center is reduced, and the data point with the next highest density is selected as the center of the next cluster. The density of the remaining points is updated in the same way, based on their distance from the newly selected center. The general formula for updating the potential of each point is therefore as follows:
In the above relation, x_{i}^{*} is the center of the i^{th} cluster and p_{i}^{*} is the potential of this point. The search for new cluster centers continues until a sufficient number of clusters has been found. The algorithm terminates when the potential of all remaining points is less than a given fraction of the potential of the first cluster center [10]:
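The procedure described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB implementation: it assumes Chiu's exponential potential [9], a ratio r_b = 1.5 r_a, and a stopping fraction of 0.15, none of which are stated in the text. The function name `subtractive_clustering` and the toy data are hypothetical.

```python
import numpy as np

def subtractive_clustering(X, r_a=0.5, r_b=None, eps=0.15):
    """Estimate cluster centers without fixing their number in advance.

    r_a: neighborhood radius for the initial potential.
    r_b: radius used when reducing potentials (assumed 1.5 * r_a if not given).
    eps: stop when the best remaining potential falls below eps * first potential.
    """
    X = np.asarray(X, dtype=float)
    if r_b is None:
        r_b = 1.5 * r_a  # assumption; the text only says r_b is usually larger than r_a
    # Squared Euclidean distance between every pair of points.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Initial potential: points with many close neighbors score high.
    potential = np.exp(-4.0 * sq_dist / r_a**2).sum(axis=1)
    first_peak = potential.max()
    centers = []
    while True:
        k = int(np.argmax(potential))
        peak = potential[k]
        # Termination: remaining potentials are a small fraction of the first one.
        if centers and peak < eps * first_peak:
            break
        centers.append(X[k].copy())
        # Reduce the potential of points near the newly chosen center.
        potential = potential - peak * np.exp(-4.0 * sq_dist[k] / r_b**2)
    return np.array(centers)

# Toy data: two tight groups of five points each (values are illustrative).
blob = np.array([[0.0, 0.0], [0.05, 0.0], [0.0, 0.05], [0.05, 0.05], [0.025, 0.025]])
X = np.vstack([blob, blob + 1.0])
centers = subtractive_clustering(X, r_a=0.5)
print(centers)
```

In practice the features are usually scaled to a common range before clustering, so that a single radius r_a is meaningful across all dimensions.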
RBF Neural Network
Radial basis function (RBF) networks are a well-known type of neural network. RBF networks may need more neurons than standard feed-forward backpropagation networks, but their advantage is that they can be designed in a shorter time. These networks perform remarkably well when many training vectors are available [11]. A radial basis network with R inputs is shown in Fig. 3.
The net input of a radial basis neuron differs from that of a perceptron neuron. For a radial basis neuron, the input to the transfer function is the vector distance between the weight vector W and the input vector p, multiplied by the bias b. The transfer function of a radial basis neuron is as follows:
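Assuming the standard radial basis transfer function (called radbas in MATLAB), which is consistent with the numerical example given for this neuron, the function can be written as:

$$a = \mathrm{radbas}(n) = e^{-n^{2}}, \qquad n = \lVert W - p \rVert\, b$$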
This function is shown in Fig. 4.
This function has a maximum value of 1 when its input is zero. As the distance between W and p decreases, the output of the radial transfer function increases. The radial basis neuron therefore acts as a detector of how closely p matches W: when W and p are equal, the radial transfer function produces an output of 1. The bias b allows the sensitivity of the radial basis neuron to be adjusted. For example, if a neuron has a bias of 0.1, its output is 0.5 for any input p at a distance of 8.326 from W, because 8.326 multiplied by 0.1 is 0.8326, and the output of the radial basis function for 0.8326 is 0.5. If the bias is 0.2, the output is 0.5 at a distance of 4.163, which means that the sensitivity of the neuron has increased.
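The arithmetic in this example can be checked with a short Python sketch. It assumes the transfer function exp(-n^2); the helper names `radbas` and `rbf_neuron` and the two-dimensional vectors are chosen only for illustration.

```python
import math

def radbas(n):
    """Radial basis transfer function, assumed here to be exp(-n^2)."""
    return math.exp(-n * n)

def rbf_neuron(w, p, b):
    """Output of one radial basis neuron: radbas(distance(w, p) * b)."""
    return radbas(math.dist(w, p) * b)

# Bias 0.1 with distance 8.326, and bias 0.2 with distance 4.163:
# in both cases the net input is 0.8326.
out_a = rbf_neuron([0.0, 0.0], [8.326, 0.0], b=0.1)
out_b = rbf_neuron([0.0, 0.0], [4.163, 0.0], b=0.2)
# Identical weight and input vectors give the maximum output.
out_match = rbf_neuron([1.0, 2.0], [1.0, 2.0], b=0.1)
print(round(out_a, 3), round(out_b, 3), out_match)
```

Both biased neurons produce an output of about 0.5, but the neuron with the larger bias reaches that level at half the distance, which is why a larger bias corresponds to a narrower, more selective response.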
ANFIS Network
Because of their special structure, neural networks have a great ability to learn, adapt and generalize, but in many cases they require long training times. Fuzzy systems, on the other hand, draw on linguistic concepts and heuristic rules and do not require training, but choosing the best membership functions is one of their major drawbacks, and the absence of training deprives them of the ability to adapt. These complementary advantages and disadvantages have led researchers to combine the two. The purpose is to obtain the benefits of both while compensating for the shortcomings of each through the features of the other. Neural networks that use fuzzy inputs and outputs and perform the addition or multiplication of generalized fuzzy rules are known as fuzzy neural networks, and fuzzy systems that are trained using neural networks are known as neuro-fuzzy systems. Neural networks (especially multilayer perceptrons) are themselves a special case of adaptive networks, so if the membership functions of a fuzzy system are trained and adapted by an adaptive network, the result is an adaptive neuro-fuzzy inference system (ANFIS). In an adaptive fuzzy inference system, the need to select membership functions purely empirically is eliminated, although experience reduces training time [12].
In this paper, we combine the fuzzy clustering algorithm and the RBF neural network to improve the separation of healthy individuals from diabetic patients. To do this, we first divide the data into a certain number of clusters using the fuzzy clustering algorithm. Each cluster represents part of the data space. Within each cluster, the data are ordered by their distance from the cluster center. The data in each cluster are then divided into three sets: training, validation and test. The number of clusters is obtained by the subtractive (reduced) clustering algorithm. The steps of applying this algorithm to the data are as follows:
1. The target data, including diabetic patients, are clustered by the fuzzy clustering algorithm.
2. Thus the centers of the clusters Cj (j = 1, 2, ..., k) and the degree of membership μij of each data point i in cluster j are obtained.
3. If μij > μil for all l = 1, 2, ..., k with l ≠ j, then sample Si is placed in cluster j. (All data are thereby partitioned into k clusters.)
4. The data in cluster j are sorted in descending order by their degree of membership relative to the center Cj. The sorted data are then divided by percentage into three groups (training, validation, and testing).
5. Repeat step 3 until all data are clustered.
6. The training set is used for network training.
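Steps 3 and 4 above can be sketched in Python as follows. This is an illustration under stated assumptions: the split percentages (70/15/15) and the choice to cut the sorted list into contiguous slices are not given in the text, and the function name `split_by_membership` and the membership matrix are hypothetical.

```python
import numpy as np

def split_by_membership(U, fractions=(0.7, 0.15, 0.15)):
    """Assign samples to clusters and split each cluster into train/val/test.

    U: membership matrix of shape (n_samples, n_clusters).
    fractions: assumed training/validation/test proportions.
    """
    U = np.asarray(U, dtype=float)
    # Step 3: each sample goes to the cluster where its membership is highest.
    labels = U.argmax(axis=1)
    splits = {"train": [], "val": [], "test": []}
    for j in range(U.shape[1]):
        members = np.flatnonzero(labels == j)
        # Step 4: sort cluster members by descending membership in cluster j.
        order = members[np.argsort(-U[members, j], kind="stable")]
        n_train = int(round(fractions[0] * len(order)))
        n_val = int(round(fractions[1] * len(order)))
        # Assumption: the sorted list is cut into contiguous slices.
        splits["train"].extend(order[:n_train].tolist())
        splits["val"].extend(order[n_train:n_train + n_val].tolist())
        splits["test"].extend(order[n_train + n_val:].tolist())
    return labels, splits

# Hypothetical memberships of eight samples in two clusters.
U = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.7, 0.3],
              [0.2, 0.8], [0.1, 0.9], [0.45, 0.55], [0.3, 0.7]])
labels, splits = split_by_membership(U)
print(labels, splits)
```

Splitting inside each cluster, rather than over the whole dataset, means every region of the data space contributes samples to the training set.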
Assessment
The proposed model identifies people with diabetes using the fuzzy clustering algorithm with an RBF neural network, as well as the ANFIS neuro-fuzzy inference system. Each network (two separate models) is combined with the fuzzy clustering algorithm. All of the above algorithms were implemented in the MATLAB R2015b environment. To evaluate the model, classification criteria were calculated for the Pima dataset and compared with other studies.
In binary classification, each classifier tries to correctly classify the input data into two classes of positive (p) and negative (n). Table 1 shows the outputs of the binary classifiers and their specific terms.
Classification of the proposed model | Real categorization: With Error | Real categorization: Without Error |
---|---|---|
With Error | TP | FP |
Without Error | FN | TN |
The parameters used in Table 1 are defined as follows:
TP: The number of classes with errors that are correctly predicted as error-prone.
FP: The number of error-free classes that are mistakenly predicted as error-prone.
TN: The number of error-free classes that are correctly predicted as error-free.
FN: The number of error-prone classes that are mistakenly predicted as error-free.
Accuracy
The accuracy criterion is the ratio of the number of correctly predicted classes (both error-prone and error-free) to the total number of classes [13]. The accuracy (or success rate) measures the overall correctness of the predictions and is obtained from relation 6.
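In terms of the entries of Table 1, and using the standard definition of accuracy, relation 6 can be written as:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$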
Sensitivity
This criterion shows the sensitivity of the prediction model and is defined as the percentage of error-prone classes that are correctly predicted. The sensitivity criterion is calculated according to the following relation:
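Using the standard definition and the notation of Table 1, a hedged reconstruction of this relation is:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$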
Specificity
Like the sensitivity criterion, this criterion measures the accuracy of the prediction model for one class: it is the percentage of error-free classes that are correctly predicted as error-free.
Precision
The precision criterion is the proportion of classes predicted by the model as error-prone that are actually error-prone. The best value is 1. The higher the precision, the fewer the false positives (FP).
Recall
Recall is the proportion of error-prone classes that are correctly identified as error-prone. The best value is 1. A high recall means fewer false negatives (FN) [13].
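The five criteria can be computed directly from the four counts of Table 1. The sketch below uses the standard definitions (specificity = TN / (TN + FP), precision = TP / (TP + FP)); the function name `classification_metrics` and the counts in the example are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification criteria from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of positives that are detected
        "specificity": tn / (tn + fp),   # share of negatives that are detected
        "precision": tp / (tp + fp),     # share of positive predictions that are right
        "recall": tp / (tp + fn),        # same quantity as sensitivity
    }

# Illustrative counts only.
m = classification_metrics(tp=40, fp=5, tn=50, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Note that recall and sensitivity are the same quantity under these definitions, so reporting both adds no new information.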
Database
The Pima Indians Diabetes Data Set from the UCI Machine Learning Repository is used as the standard dataset for system implementation and testing.
A summary of the specifications of this database is presented in Table 2.
Specification | Value |
---|---|
Database characteristics | Multivariate |
Number of samples | 768 |
Subject area | Life |
Feature characteristics | Integer and real numbers |
Number of features | 8 |
Date | 1990 |
Related tasks | Classification |
Missing data | Yes |
Number of visits | 115087 |
This collection was developed by the National Institute of Diabetes and Digestive and Kidney Diseases and contains the following data:
The mean and standard deviation of the data in the Pima database are as shown in Table 3.
Feature number | Average | Standard deviation |
---|---|---|
1 | 3.8 | 3.4 |
2 | 120.9 | 32 |
3 | 69.1 | 19.4 |
4 | 20.5 | 16 |
5 | 79.8 | 115.2 |
6 | 32 | 7.9 |
7 | 0.5 | 0.3 |
8 | 32.2 | 11.8 |
For the Pima dataset, the baseline settings of the RBF network and the number of clusters are given in Table 4.
With these settings, we evaluated the model on the Pima dataset against the quality assessment criteria introduced in the previous section; the results are presented in Fig. 5.
As is clear from the figure, the hybrid model accurately separates healthy individuals from diabetic patients.
Evaluation of ANFIS Inference System
There are two methods for generating the fuzzy inference system: grid partitioning and subtractive clustering. The major difference between the two methods is how the fuzzy membership functions are selected. In the grid partitioning method, the type and number of input membership functions are specified by the user, whereas in the subtractive clustering method they are determined by the fuzzy inference system itself, according to the characteristics of the input data and the clusters contained in it.
In this study, we used the Gaussian membership function (gaussmf) with 8 input variables, three membership functions for each input, and 600 training iterations to generate the desired output (Fig. 6). Different combinations of the number and type of membership functions were investigated, and the best-performing model was used for training the ANFIS system. The hybrid method is also used to train the model, which is a combination of neural networks and fuzzy systems.
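A Gaussian membership function can be sketched in Python as follows. The form exp(-(x - c)^2 / (2 sigma^2)) is the usual definition of gaussmf; the placement of the three centers, the width, and the feature range 0 to 200 are assumptions made for illustration, and the helper `three_gaussian_mfs` is hypothetical. The rule count assumes grid partitioning, where every combination of membership functions forms one rule; the paper does not state the number of rules.

```python
import math

def gaussmf(x, sigma, c):
    """Gaussian membership function with width sigma and center c."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def three_gaussian_mfs(lo, hi):
    """Three evenly spaced Gaussian membership functions over [lo, hi]."""
    centers = [lo, (lo + hi) / 2.0, hi]
    sigma = (hi - lo) / 4.0  # assumed width, chosen so neighboring functions overlap
    return [(sigma, c) for c in centers]

# Illustrative input range; "low", "medium" and "high" fuzzy sets for one feature.
mfs = three_gaussian_mfs(0.0, 200.0)
degrees = [gaussmf(100.0, sigma, c) for sigma, c in mfs]
print(degrees)

# With grid partitioning, 8 inputs and 3 functions per input give 3^8 rules.
n_rules_grid = 3 ** 8
print(n_rules_grid)
```

The rapid growth of the rule count with the number of inputs is one reason subtractive clustering is often considered as an alternative way of generating the rule base.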
In this section, the ANFIS neuro-fuzzy inference system was used to identify susceptible individuals in the Pima database. Fig. 7 shows the results of the criteria for accuracy, sensitivity, specificity and recall for this system.
Comparison of the proposed model and other algorithms
In this section, we evaluate and compare the quality of the methods used in this paper with other algorithms for the diagnosis of diabetes.
Algorithm | Accuracy (%) | Sensitivity (%) | Specificity (%) | Note |
---|---|---|---|---|
Modified feed forward neural network | 80.07 | 74 | 84.37 | - |
LDA+ANFIS | 84.61 | 83.33 | 85.18 | - |
MLPNN | 91.53 | 92.42 | 91.19 | - |
Mixture of experts | 97.93 | 97.73 | 98.01 | - |
Fuzzy Clustering+RBF | 96.1 | 95.24 | 96.86 | Proposed Model |
ANFIS | 97.14 | 96.54 | 97.56 | Proposed Model |
As shown in Table 5, fuzzy logic-based methods are highly effective in the diagnosis of diabetes.
The health industry continually generates large amounts of data, and people who work with these data have found that there is a wide gap between collecting them and interpreting them. Data mining in medicine and biology is an important part of biomedical informatics and is one of the most widely applied areas of computer science in hospitals, clinics, laboratories and research centers [14].
Until now, various machine learning approaches have been used to diagnose diseases, serving as decision support systems alongside the expertise of physicians. Fig. 8 illustrates some of the methods used in this field. In recent years, data mining methods have proved helpful in the care of diabetic patients. In the following, a number of well-regarded studies on the applications of data mining in diabetes are reviewed and summarized.
In 2006, Su et al. were able to diagnose diabetes accurately from 3D body images using four data mining methods: artificial neural networks, decision trees, logistic regression, and dependence rules. In this study, three-dimensional and two-dimensional photographs were taken of the bodies of different individuals (diabetic patients and healthy subjects). Variables such as abdominal surface area, leg circumference and hand volume were extracted from these photographs. These variables were fed to the four algorithms, and the researchers were able to predict with 89% accuracy, based on the three-dimensional and two-dimensional body images, whether a person had type 2 diabetes [15].
One advantage of this diagnostic method is that people with suspected type 2 diabetes do not need blood tests. The researchers also derived rules that could help diagnose type 2 diabetes, for example: "If abdominal volume = x and leg volume = y and breast area = z then 90% of patients have type 2 diabetes". In 2009, Purnami et al. used support vector machines to diagnose type 2 diabetes with 93% accuracy [16]. Their data came from 768 patients, and eight variables, including blood pressure and insulin level, were used for prediction. One year later, in 2010, Barakat et al. used support vector machines to improve diagnostic accuracy to 94% [17]. Their data came from 4682 individuals and included variables such as gender, BMI, blood pressure, cholesterol and blood glucose.
In 2010, Thai researchers used a decision tree to diagnose metabolic syndrome with more than 90% accuracy, using data from 5638 individuals. At the end of the study, they also discovered rules that could help diagnose patients more accurately.
The researchers were also able to derive a set of rules for the diagnosis of type 2 diabetes, which could in the future replace medical diagnostic guidelines and assist physicians in the diagnosis of diabetes [18].
In 2001, Richards et al. examined the relationship between the first observations recorded in hospital and early death in diabetic patients. They reviewed the records of 21,000 patients. Using association rules, they discovered rules that were not initially accepted by physicians; once these rules had been confirmed, physicians welcomed them [19].
One article that has been cited more than 35 times is that of Breault et al. They used the decision tree method and obtained notable results, including the finding that age is the most important variable associated with blood glucose control [20].
Icelandic researchers, using data mining techniques, especially the decision tree, concluded that the HbA1c level is the only important variable related to improved blood glucose control in response to patient education [21]. Huang et al. also investigated the variables affecting blood glucose control using data mining methods. They used the decision tree and naive Bayes to analyze the data. They concluded that variables such as age, time since diagnosis, need for insulin treatment, blood glucose and diet were the most important factors affecting blood glucose control [22].
Fig. 9 shows some of the machine learning methods used to diagnose diabetes, along with the accuracy reported in each paper and its year of publication.
In recent years, data mining methods have been widely used in medicine and health care for the diagnosis and prevention of disease, the selection of treatment methods, and the prediction of mortality and medical costs. Diabetes is one of the most common diseases today, and early diagnosis and proper treatment can significantly reduce the problems it causes. In this article, we developed models for the diagnosis of diabetes that classify patients as diabetic or non-diabetic using different classification techniques in MATLAB. The dataset used in this study was extracted from the UCI database and has been used in most studies on diabetes diagnosis. In the first method used in this study, the fuzzy clustering algorithm and the RBF neural network are combined to separate healthy individuals from people with diabetes. In the second model, the ANFIS neuro-fuzzy inference system is used to classify the data. The accuracy of this method is 97.14%, which is significantly higher than the previous combination method and other models of diabetes diagnosis. Other suggestions that may be useful in this regard are as follows:
Reduce the size of the problem by using feature reduction and feature selection techniques.
1. Nazarzadeh M, Bidel Z, Sanjari MA. Meta analysis of diabetes mellitus and risk of hip fractures: Small study effect. Osteoporos Int 2016; 27(1): 229–30.
2. Janahmadi Z, Nekooeian AA, Mozafari M. Hydroalcoholic extract of Allium eriophyllum leaves attenuates cardiac impairment in rats with simultaneous type 2 diabetes and renal hypertension. Res Pharm Sci 2015; 10(2): 125–33.
3. Al JAA. Decision tree discovery for the diagnosis of type II diabetes. International Conference on Innovations in Information Technology (IIT); IEEE; 2011.
4. Khajehei M, Etemady F. Data mining and medical research studies. Modelling and Simulation (CIMSiM); IEEE; 2010.
5. Jayalakshmi T. A novel classification method for diagnosis of diabetes mellitus using artificial neural networks. International Conference on Data Storage and Data Engineering (DSDE); IEEE; 2010.
6. Zadeh LA. Fuzzy sets. Information and Control 1965; 8(3): 338–53.
7. Ruspini ER. A new approach to clustering. Information and Control 1969; 15(1): 22–32.
8. Dunn JC. A fuzzy relative of the ISODATA process and its uses in detecting compact well-separated clusters. Journal of Cybernetics and Systems 1973; 3(3): 32–57.
9. Chiu SL. A cluster estimation method with extension to fuzzy model identification. Conference on Fuzzy Systems; IEEE; 1994.
10. Hathaway RJ, Bezdek JC, Hu Y. Generalized fuzzy c-means clustering strategies using LP norm distances. IEEE Trans on Fuzzy Systems 2000; 8(5): 576–82.
11. Asrardel M. Prediction of combustion dynamics in an experimental turbulent swirl stabilized combustor with secondary fuel injection. [MSc Thesis] University of Tehran; 2015.
12. Rezakazemi M, Mosavi A, Shirazian S. ANFIS pattern for molecular membranes separation optimization. Journal of Molecular Liquids 2019; 274: 270–6.
13. Tan PN, Steinbach M, Karpatne A, Kumar V. Introduction to data mining. 2nd Ed. India: Pearson Publication; 2018.
14. Berka P, Rauch J, Zighed DA. Data mining and medical knowledge management: Cases and applications. Hershey: Idea Group Inc (IGI); 2009.
15. Su CT, Yang CH, Hsu KH, Chiu WK. Data mining for the diagnosis of type II diabetes from three dimensional body surface anthropometrical scanning data. Computers and Mathematics with Applications 2006; 51(6-7): 1075–92.
16. Purnami SW, Embong A, Zain JM, Rahayu SP. A new smooth support vector machine and its applications in diabetes disease diagnosis. Journal of Computer Science 2009; 5(12): 1003–8.
17. Barakat NH, Bradley AP, Barakat MN. Intelligible support vector machines for diagnosis of diabetes mellitus. IEEE Trans Inf Technol Biomed 2010; 14(4): 1114–20.
18. Patil BM, Joshi RC, Toshniwal D. Association rule for classification of type-2 diabetic patients. Second International Conference on Machine Learning and Computing; IEEE; 2010.
19. Richards G, Rayward-Smith VJ, Sönksen PH, Carey S, Weng C. Data mining for indicators of early mortality in database of clinical records. Artif Intell Med 2001; 22(3): 215–31.
20. Breault JL, Goodall CR, Fos PJ. Data mining a diabetic data warehouse. Artif Intell Med 2002; 26(1-2): 37–54.
21. Sigurdardottir AK, Jonsdottir H, Benediktsson R. Outcomes of educational interventions in type 2 diabetes: WEKA data-mining analysis. Patient Educ Couns 2007; 67(1-2): 21–31.
22. Huang Y, McCullagh P, Black N, Harper R. Feature selection and classification model construction on type 2 diabetic patients’ data. Artif Intell Med 2007; 41(3): 251–62.