Diabetes Diagnosis Using Machine Learning
Diabetes is a disease associated with high levels of glucose in the blood. It causes many kinds of complications, which in turn lead to a high rate of hospital readmission among patients with diabetes. The aim of this study is to diagnose diabetes with machine learning techniques.
Material and Methods:
The dataset used in this article contains several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, BMI, insulin level and age. The main objective of the machine learning models is to classify diabetes status.
Six classifiers were adopted and their performance was compared on accuracy, F1-score, recall, precision and AUC. Adaboost achieved the highest accuracy, 83%.
This paper compares the performance of different classifier models for diagnosing diabetes. The models considered are logistic regression, decision tree, support vector machine (SVM), xgboost, random forest and Adaboost. Across the comparisons, Adaboost, logistic regression, SVM and random forest generally scored high, with only small differences among them.
Diabetes is a chronic disease, commonly referred to by health professionals as diabetes mellitus (DM). It describes a set of metabolic diseases in which the person has high blood sugar, either because insulin production is inadequate, because the body's cells do not respond properly to insulin, or both. This increases the concentration of glucose in the blood. The majority of cases of diabetes can be broadly classified into two categories, type 1 and type 2, although some cases are difficult to classify. Many complications occur if diabetes remains untreated. It is therefore not only a disease in itself but also a cause of other conditions such as heart attack, blindness and kidney disease. Diabetes has become one of the major causes of illness and death in most countries. According to the International Diabetes Federation report, the number of people with diabetes is expected to rise to more than 642 million by 2040, so early screening and diagnosis are of great significance for detecting and treating diabetes in time. The analysis of diabetes data is challenging because most medical data are nonlinear, non-normal, correlation structured, and complex in nature. Applying machine learning methods in diabetes mellitus research is a key approach to utilizing the large volumes of available diabetes-related data for extracting knowledge. It also helps to diagnose diabetes accurately. The purpose of this study was to compare the performance of logistic regression (LR), decision tree (DT), support vector machine (SVM), xgboost, random forest (RF) and adaboost models for diabetes mellitus classification. These algorithms were chosen to allow a comparison between different families of methods, namely ensemble learning and linear classifiers.
MATERIAL AND METHODS
The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, BMI, insulin level, age, and so on.
Optimizing the performance of the classification model through feature selection is an important step. Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing data (especially high-dimensional data) for various data mining and machine-learning problems. The objectives of feature selection include building simpler and more comprehensible models, improving data mining performance, and preparing clean, understandable data. Principal Component Analysis (PCA) was used for data reduction; it can greatly reduce model training time while preserving the information carried by the data, and it also eliminates data noise and redundancy. In this study, PCA-based feature selection was applied, and the best feature subset contained 7 features: Age, Skin Thickness, Glucose, Blood Pressure, Diabetes Pedigree Function, Insulin and Pregnancies.
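As an illustration of the data reduction step, the sketch below standardizes a small table and extracts its first principal component by power iteration. The text does not state how PCA was configured or how components were mapped back to the retained features, so the feature values, the function names (standardize, covariance, first_component) and the use of power iteration are all assumptions made for this example, not the authors' implementation.

```python
import math

def standardize(rows):
    """Center each column to mean 0 and scale it to unit sample standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=200):
    """Power iteration: dominant eigenvector and eigenvalue of a covariance matrix.

    Assumes the starting vector is not orthogonal to the dominant eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

# Hypothetical records (glucose, BMI, age); NOT taken from the study's dataset.
rows = [
    [85.0, 22.0, 25.0],
    [100.0, 26.5, 31.0],
    [120.0, 30.1, 40.0],
    [140.0, 33.0, 45.0],
    [165.0, 36.2, 52.0],
]

z = standardize(rows)
cov = covariance(z)
component, eigenvalue = first_component(cov)

# Share of total variance captured by the first component.
explained_ratio = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Projection of each standardized record onto the first component.
scores = [sum(r[j] * component[j] for j in range(len(component))) for r in z]
print(round(explained_ratio, 3))
```

The absolute values in component indicate how strongly each original variable contributes to the direction of greatest variance, which is one common way to rank variables after PCA.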
Machine learning models
The main objective of the machine learning models is to classify diabetes status. An overview of the proposed machine learning models is shown in Fig 1.
The training/test paradigm used for all the machine learning models is shown in Fig 2. The first step is to divide the dataset into two separate sets: 80% for training and 20% for testing. In the second step, the most significant risk factors for diabetes are selected using PCA feature selection. We adopted six classifiers: logistic regression, decision tree, support vector machine (SVM), xgboost, random forest (RF) and adaboost. The next step is to estimate the classifier coefficients on the training set, after which the trained classifiers are applied to the test set to classify the patients into two categories, diabetic vs. control. Finally, the performance of the classifiers is evaluated using five performance parameters, namely accuracy, F1-score, recall, precision and AUC.
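The 80/20 partition in the first step can be sketched with a seeded shuffle of record indices. The record count of 768 is an assumption based on the commonly distributed version of this dataset (the text does not state it), and the function name and seed are hypothetical.

```python
import math
import random

def train_test_split(n_samples, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) after a reproducible shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round the test size up so that the test set is never empty.
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# Assumed dataset size; not stated in the text.
train_idx, test_idx = train_test_split(768)
print(len(train_idx), len(test_idx))
```

Because the split comes first, any scaling or PCA would be fitted on the training indices only and then applied unchanged to the test indices, which keeps test information out of model fitting.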
In this section, we show the performance of machine learning classification techniques for diabetes classification. We analyze several popular classification techniques: logistic regression, decision tree, support vector machine (SVM), xgboost, random forest and adaboost. The high-risk factors were selected using PCA feature selection. The six classifiers were then compared on accuracy, F1-score, recall, precision and AUC. Related work is presented after the comparison.
Comparison of the Efficiency of Algorithms
Fig 3 compares the performance of the 6 machine learning algorithms on accuracy. Adaboost has the highest accuracy, followed by random forest and logistic regression, which share the same accuracy. SVM comes next with an accuracy of 82.46%. Decision tree and xgboost have accuracies below 80%, and xgboost has the lowest accuracy.
Fig 4 compares the performance of the 6 machine learning algorithms on F1-score. Adaboost has the highest F1-score, followed by logistic regression and SVM, both with an F1-score above 70%. The F1-score of decision tree is below 70%, and xgboost has the lowest F1-score among the algorithms.
Fig 5 compares the performance of the 6 machine learning algorithms on recall. Decision tree has the highest recall. Adaboost, SVM and logistic regression have equal recall. These four algorithms have a recall above 60%. Xgboost and random forest have a recall below 60%, and xgboost has the lowest recall among the algorithms.
Fig 6 compares the performance of the 6 machine learning algorithms on precision. Random forest has the highest precision, followed by adaboost, logistic regression and SVM, in that order, all with a precision above 80%. Both decision tree and xgboost have a precision below 80%, and decision tree has the lowest precision among the algorithms.
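The four metrics compared in Figs 3 to 6 can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch; the counts are hypothetical and do not correspond to any classifier reported here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # correct among predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # found among actual positives
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Since precision penalizes false positives and recall penalizes false negatives, a classifier that labels more patients as diabetic can rank first on recall and last on precision at the same time.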
The receiver operating characteristic (ROC) curve is a graphical plot created by plotting sensitivity against 1 - specificity. The area under the curve (AUC), computed from the ROC curve, is an indicator of classifier performance. The value of the AUC lies between 0 and 1.
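The AUC is equal to the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which gives a direct way to compute it without drawing the curve. The sketch below uses this pairwise definition; the labels and scores are made up for illustration and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = diabetic, 0 = control) and classifier scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.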
Fig 7 shows the ROC curves of the six classifiers. SVM has the highest AUC, followed by adaboost, logistic regression and random forest, in that order, whose AUC values are close to one another. The AUC of decision tree and xgboost is markedly lower than that of these algorithms, and xgboost has the lowest AUC.
The models considered for comparison are logistic regression, decision tree, support vector machine (SVM), xgboost, random forest and adaboost. Accuracy, F1-score, recall, precision and AUC are considered for the comparison. Xgboost scored lowest in most comparisons; the exception is precision, where it was not the lowest but was close to it. Decision tree occupied different positions across the comparisons: although it had the lowest precision, it had the highest recall, and its AUC, F1-score and accuracy were close to the lowest values. Altogether, adaboost, logistic regression, SVM and random forest generally scored high, with only small differences among them (Fig 8).
These are the results of searches in the Scopus, Google Scholar and PubMed databases. The diagnosis of diabetes has been studied by Warke et al. The primary aim of that project was to analyze the diabetes dataset using support vector machine, Naïve Bayes, logistic regression, and K-nearest neighbors algorithms, in order to develop a prediction engine. The secondary aim was to develop a web application offering this feature. The project compared the Naïve Bayes classifier with other linear classifiers, such as support vector machines, logistic regression, and K-nearest neighbors. The result was that the Naïve Bayes classifier predicted the chance of diabetes more accurately than the other classifiers.
That study compared only some kinds of linear algorithms, whereas the current study compares several kinds of algorithms, including linear classifiers and ensemble learning.
Maniruzzaman et al. published a study named “Classification and prediction of diabetes disease using machine learning paradigm”. Its main target was to develop a machine learning (ML)-based system to predict diabetic patients. Logistic regression (LR) was used to identify the risk factors for diabetes based on p-value and odds ratio (OR). To predict diabetic patients, the authors adopted four classifiers: adaboost, Naïve Bayes (NB), decision tree (DT), and random forest (RF). Three partition protocols (K2, K5, and K10) were also chosen, and each protocol was repeated for 20 trials. The classifiers' performance was evaluated using accuracy (ACC) and the area under the curve (AUC). The outcome was that the combination of LR and the RF-based classifier performed better, which will be very useful for predicting diabetic patients.
“Diabetes diagnosis via XCS classifier system” is the title of a study published by Moshtaghi et al. The study aimed to use novel concepts of artificial intelligence to design an expert clinical system capable of diagnosing diabetes automatically and at the right time. The expert system developed in that paper is a learning system, an improved version of extended classifier systems (XCS). The system started to learn from a real collected dataset, and its performance was then examined on 268 other patients. The results of that examination were compared with some conventional data mining methods, and the comparison indicated that the proposed method predicts more accurately than the other techniques. The methods applied were XCSR, AD Tree, SVM, C4.5, K Star and Dempster-Shafer, and in the test phase the results obtained from the improved XCSR algorithm were compared with four other algorithms in the following table.
Diabetes mellitus is commonly known as diabetes. It is a group of metabolic disorders characterized by high blood sugar. Diagnosis of diabetes is an important real-world medical problem, and detecting diabetes is a necessary step before treatment. In this paper, the performance of different classifier models for diagnosing diabetes was compared.
The authors agree on this final form of the manuscript and attest that all authors contributed to the final draft of the manuscript.
CONFLICTS OF INTEREST
The authors declare no conflicts of interest regarding the publication of this study.
No financial interests related to the material of this manuscript have been declared.