Combining Random Forest and Neural Networks Algorithms to Diagnose Heart Disease
Heart disease is one of the leading causes of death in today's society. So far no definitive method has been found to predict it, and several factors contribute to its development. Therefore, the aim of this study was to provide a data mining model for predicting heart disease.
Material and Methods:
This study used standard data from UCI, comprising four databases: Cleveland, Hungarian, Swiss and Long Beach VA. The data include 13 independent variables and one dependent variable. The data contain missing values, which were handled with the EM algorithm. Finally, a proposed algorithm that combines two algorithms, random forest and an artificial neural network, was applied to the data.
In this study, the data were divided into training and test sets using the 10-fold method. Three indicators, sensitivity, specificity and accuracy, were used to evaluate the algorithms. The accuracy of the prediction algorithm on the Cleveland, Hungarian, Switzerland and Long Beach VA data reached 87.65%, 94.37%, 93.45% and 85%, respectively. The proposed algorithm was then compared with similar articles in this field, and it was found to be more accurate than similar methods.
The heart is a vital organ of the human body, and heart problems increase the risk of death. The heart acts like a pump and is responsible for pumping blood through the arteries of the body. Factors such as lifestyle, stress, unhealthy nutrition, physical inactivity, a family history of heart disease and so on can cause heart disease. Diagnosing heart disease in its early stages is very important, and early diagnosis plays an important role in a person's health and life. Heart disease has negative effects on a person's life and can make daily activities impossible. Mistakes in diagnosing the disease exacerbate these devastating effects and sometimes lead to death. Methods for preventing and detecting heart disease can be of great help to physicians in diagnosing it.
Data mining is very important for medical data. The purpose of the different stages of data mining is to discover knowledge and achieve results that can be used in the real world to improve efficiency. The extracted knowledge can serve as a decision-support system. Designing assistive tools that help physicians diagnose the type of disease or choose the right treatment can be a great help in saving lives [3, 4]. Some related articles from the field are summarized below.
In 2019, in a paper on a predictive model for cardiovascular disease forecasting, Perpra evaluated seven data mining algorithms using RapidMiner software. The study reached 91.67 percent accuracy, and the support vector machine algorithm was found to have the best output among the seven algorithms.
In 2019, Saklin et al. presented a paper entitled "Dimensions reduction for the diagnosis of heart disease with the help of a support vector machine", which uses standard UCI data. The dimensionality of the data was first reduced, and then a classification model was built using a support vector machine with a radial basis kernel. The model reached 81.19 percent accuracy on the test data.
In 2019, Reddy et al. presented a paper on predicting heart patients that combines two algorithms, a genetic algorithm and a fuzzy logic algorithm. The model is intended to support the treatment of heart patients. UCI standard data were used for evaluation, and the algorithm reached 90% accuracy on the test data.
In 2019, Prakash et al., in an article on optimal criteria feature selection (OCFS) for predicting and analyzing heart disease, designed a model that removes irrelevant features. For diagnosis and prognosis, the method was compared with RFS-IE and MRPS.
In 2016, Wei-Jia et al. published an article on identifying the causes of heart disease using two algorithms: particle swarm optimization (PSO) and a support vector machine (SVM) optimized by association rules (AR). The output of their proposed algorithm has a total classification accuracy of 98.96%, and the time spent on classification is 94.94 seconds.
In 2018, Yahyaie and colleagues published an article on using the Internet of Things (IoT) to provide a new model for remote heart attack prediction, which reached 89.5% accuracy when using ECG data.
The proposed model combines two algorithms, random forest and an artificial neural network, to reach the desired accuracy in diagnosing heart disease. With this model, physicians can be provided with a decision-support system for diagnosing heart disease that can detect the disease in its early stages with high accuracy. The following sections describe the standard data, the proposed model, the model evaluation, and the discussion and conclusion.
MATERIAL AND METHODS
The study used data from the Heart Disease collection in the UCI Machine Learning Repository of the University of California, which includes four databases: Cleveland with 303 samples, Hungarian with 294 samples, Switzerland with 123 samples and Long Beach VA with 200 samples. The main heart disease data set includes 14 variables, the last of which is the dependent variable. The dependent variable takes values from 0 to 4, with 0 indicating no disease and values 1 to 4 indicating the degree of heart disease. Because the features are few and all of them are useful, dimensionality reduction was not used in this research. Table 1 shows the number of samples in each class for the data sets used in this study, and Table 2 shows the data specifications.
Table 1. Summary of standard data set specifications
|Database|Class 0|Class 1|Class 2|Class 3|Class 4|Total|
|Long Beach VA|51|56|41|42|10|200|
Table 2. Standard heart disease variables
|14|Diagnosis of heart disease|
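As a small illustration of the dependent variable described above, the five classes (0 to 4) can be collapsed into a binary disease / no-disease label. The sketch below uses the Long Beach VA class counts listed in Table 1; the function name `to_binary` is hypothetical and is not taken from the study.

```python
from collections import Counter

def to_binary(label: int) -> int:
    """Map the 0-4 label to 0 (no disease) or 1 (any degree of disease)."""
    if not 0 <= label <= 4:
        raise ValueError(f"label must be in 0..4, got {label}")
    return 0 if label == 0 else 1

# Class counts for Long Beach VA as listed in Table 1 (classes 0 to 4).
long_beach_counts = {0: 51, 1: 56, 2: 41, 3: 42, 4: 10}

# Expand the counts into one label per sample, then collapse to two classes.
labels = [cls for cls, n in long_beach_counts.items() for _ in range(n)]
binary_counts = Counter(to_binary(y) for y in labels)

print(len(labels), dict(binary_counts))
```

For this database, 51 samples map to "no disease" and the remaining 149 map to "disease".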
In this study, a predictive model for the diagnosis of heart disease was obtained using random forest and artificial neural network algorithms. First, a model is built with the random forest algorithm; its steps are described in Fig 1. Next, a model is built with the artificial neural network, as shown in Fig 2. Then, by combining the two algorithms, random forest and artificial neural network, a model with acceptable accuracy for the diagnosis of heart disease is presented. The study uses four databases: Cleveland, Hungarian, Switzerland and Long Beach VA. A total of 14 features were used to evaluate these algorithms, and the results are discussed in the evaluation section. The following sections describe the pre-processing, then the implementation of the random forest algorithm, the artificial neural network algorithm, and finally the proposed hybrid algorithm that combines the two.
The study databases had missing values (Table 3), which had to be handled. The Cleveland database has 4242 data values and six missing values, the Swiss database has 1722 data values and 273 missing values, the Long Beach VA database has 2800 data values and 698 missing values, and the Hungarian database has 4116 data values and 782 missing values. The missing values were imputed using the EM method because the data are missing completely at random, and the EM method gives better output for this kind of missing data. SPSS version 16.0 software was used to find the missing values.
Table 3. Total number of existing and missing records of four databases
|Row|Database name|Records number|Missing value number|Missing value percent|
|4|Long Beach VA|294|782|19|
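The data-value counts quoted in the text are consistent with the number of samples multiplied by the 14 variables. The following sketch, which assumes only the sample and missing-value counts given in the text, reproduces those totals and derives the percentage of missing values; the helper name `missing_summary` is hypothetical.

```python
N_VARIABLES = 14  # 13 independent variables plus the dependent variable

# (samples, missing values) per database, as quoted in the text.
databases = {
    "Cleveland": (303, 6),
    "Hungarian": (294, 782),
    "Switzerland": (123, 273),
    "Long Beach VA": (200, 698),
}

def missing_summary(samples: int, missing: int, n_vars: int = N_VARIABLES):
    """Return (total data values, missing percentage) for one database."""
    cells = samples * n_vars
    return cells, 100.0 * missing / cells

summary = {name: missing_summary(*vals) for name, vals in databases.items()}
for name, (cells, pct) in summary.items():
    print(f"{name}: {cells} values, {pct:.2f}% missing")
```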
Random forest algorithm
Initially, for all four databases, the data are divided into training and test sets using the K-fold method, with K = 10 in this study. The random forest algorithm is then executed on the training data and produces a predictive model, which is evaluated on both the test data and the training data. Because the data are divided randomly into training and test sets, each run produces a different result, so the algorithm is executed 100 times and the average of these 100 runs is reported. The algorithm is run on each data set in two modes: disease versus no disease, and disease type. In the random forest algorithm, the most important parameter is the number of decision trees. Fig 1 shows the flowchart of the classification algorithm; for this step, the classification algorithm in the figure is the random forest.
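The evaluation protocol described above (10-fold splitting, repeated and averaged) can be sketched as follows. This is a minimal, dependency-free illustration and not the study's implementation: the function names, the seeding scheme and the `fit_and_score` callback are assumptions, and the classifier itself is left out.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=10, seed=None):
    """Shuffle 0..n_samples-1 and split into k folds whose sizes differ by at most one."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cv(fit_and_score, n_samples, k=10, repeats=100):
    """Average the test score over `repeats` independent k-fold runs.

    `fit_and_score(train_idx, test_idx)` is a hypothetical callback that trains
    a classifier on the training indices and returns its score on the test indices.
    """
    run_means = []
    for run in range(repeats):
        folds = k_fold_indices(n_samples, k, seed=run)
        scores = []
        for i, test_idx in enumerate(folds):
            # Every fold except fold i forms the training set.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(train_idx, test_idx))
        run_means.append(mean(scores))
    return mean(run_means)

# Dummy callback used only to exercise the splitting logic: it returns the
# fraction of samples that ended up in the training set.
def train_fraction(train_idx, test_idx):
    return len(train_idx) / (len(train_idx) + len(test_idx))

folds = k_fold_indices(303, k=10, seed=0)  # 303 = Cleveland sample count
avg = repeated_cv(train_fraction, 303, k=10, repeats=5)
```

In practice the callback would train the random forest (or any other classifier) on `train_idx` and return its accuracy on `test_idx`.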
Artificial neural network algorithm
For all four databases, the data are divided into training and test sets using the K-fold method. The artificial neural network algorithm is then executed on the training data and produces a predictive model, which is evaluated on both the test data and the training data. Because the data are divided randomly into training and test sets, each run produces a different result, so the algorithm is executed 100 times and the average of these 100 runs is reported. The algorithm is run on each data set in two modes: disease versus no disease, and disease type. In the neural network algorithm, the most important parameter is the number of neurons. Fig 1 shows the flowchart of the classification algorithm; for this step, the classification algorithm in the figure is the artificial neural network.
In this step, the proposed algorithm is described. First, the optimal number of decision trees is determined by the random forest algorithm. Then, for each decision tree, the features and records to be selected are determined. These are passed to the artificial neural network algorithm, and each decision tree is replaced by a neural network with the optimal number of neurons. The label is then determined by a majority vote. Finally, the results are evaluated. Fig 2 shows the structure of the proposed algorithm.
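The structure described above (a random selection of records and features for each ensemble member, followed by a majority vote) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the class names are hypothetical, bootstrap sampling and the square-root feature count are assumed choices, and `NearestCentroid` is only a stand-in base learner used to keep the sketch dependency-free. In the proposed algorithm each member would be a neural network.

```python
import random
from collections import Counter

class NearestCentroid:
    """Stand-in base learner (NOT a neural network), used only for illustration."""
    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict_one(self, row):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(row, c))
        return min(self.centroids, key=lambda label: dist(self.centroids[label]))

class ForestOfLearners:
    """Random-forest-style ensemble: each member is trained on a bootstrap sample
    of the records and a random subset of the features; the final label is the
    majority vote of all members."""
    def __init__(self, make_learner, n_members=60, n_features=None, seed=0):
        self.make_learner = make_learner
        self.n_members = n_members
        self.n_features = n_features
        self.rng = random.Random(seed)
        self.members = []

    def fit(self, X, y):
        n, d = len(X), len(X[0])
        m = self.n_features or max(1, round(d ** 0.5))  # assumed default
        self.members = []
        for _ in range(self.n_members):
            rows = [self.rng.randrange(n) for _ in range(n)]  # bootstrap records
            feats = sorted(self.rng.sample(range(d), m))      # random feature subset
            Xs = [[X[i][j] for j in feats] for i in rows]
            ys = [y[i] for i in rows]
            self.members.append((feats, self.make_learner().fit(Xs, ys)))
        return self

    def predict(self, X):
        out = []
        for row in X:
            votes = Counter(
                learner.predict_one([row[j] for j in feats])
                for feats, learner in self.members
            )
            out.append(votes.most_common(1)[0][0])  # majority vote
        return out

# Toy, well-separated data made up for illustration (not from the study).
X = [[c + 0.1 * ((i + j) % 3) for j in range(4)] for c in (0.0, 5.0) for i in range(10)]
y = [0] * 10 + [1] * 10
model = ForestOfLearners(NearestCentroid, n_members=60).fit(X, y)
predictions = model.predict(X)
```

Because the base learner is passed in as a factory, any classifier exposing `fit` and `predict_one` could replace the stand-in, including a small neural network.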
First, the random forest algorithm was executed with different numbers of trees, and the best number was found to be 60. The artificial neural network algorithm was then executed, and the best number of neurons was found to be 10. The proposed algorithm was then run. Three indicators, sensitivity, specificity and accuracy, are used to evaluate the algorithms. The data are divided into training and test parts using the K-fold method, with K equal to 10 so that different selections of training and test data are covered. The proposed algorithm was run on both the training data and the test data, and sensitivity, specificity and accuracy were calculated; the output is shown in Table 4. The highest accuracy across these four databases is 99.07% for training data and 94.27% for test data.
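The three indicators follow the standard confusion-matrix definitions: sensitivity is TP / (TP + FN), specificity is TN / (TN + FP), and accuracy is (TP + TN) divided by the number of samples. A minimal sketch for the binary (disease versus no disease) case is shown below; the function name and the example labels are made up for illustration and are not results from the study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return sensitivity, specificity and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("at least one sample is required")
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "accuracy": (tp + tn) / len(pairs),
    }

# Made-up labels for illustration only (1 = disease, 0 = no disease).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```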
Evaluation of the proposed algorithm
Comparison of the proposed hybrid algorithm with similar work
The proposed algorithm is compared with three algorithms from other articles, and the results on the test data are shown in Table 5. According to the table, one of the compared algorithms has the best output for the Cleveland database with 98.96% accuracy, the proposed algorithm has the best output for the Switzerland database with 93.54% accuracy, another compared algorithm has the best output for the Long Beach VA database with 89.83% accuracy, and the proposed hybrid algorithm has the best output for the Hungarian database with 94.37% accuracy.
In this study, two different algorithms were used to provide better results for predicting and diagnosing heart disease: the random forest algorithm and the artificial neural network. The proposed algorithm is a combination of the two. These algorithms were chosen because of the high accuracy of both random forest and artificial neural network algorithms. As shown in Table 4, the proposed hybrid algorithm has high accuracy in all four databases.
The proposed algorithm was run on all four databases. For the Cleveland data, 11 of 303 records were misclassified, an error of 3%. For the Hungarian data, 12 of 294 records were misclassified, an error of 4%. For the Swiss data, 2 of 123 records were misclassified, an error of 1%, and for the Long Beach VA data, 5 of 200 records were misclassified, an error of 2%. In total, the predictive model has an error of less than 4%, which indicates the suitability of the proposed algorithm.
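The error percentages above are simple ratios of misclassified records to total records. The sketch below recomputes them. The Cleveland, Hungarian and Long Beach VA pairs are quoted in the text; the Switzerland pair is an assumption (2 errors out of the 123 samples listed for that database).

```python
# (misclassified, total records) per database. The Switzerland entry is an
# assumption: 2 errors out of the 123 samples listed for that database.
errors = {
    "Cleveland": (11, 303),
    "Hungarian": (12, 294),
    "Switzerland": (2, 123),
    "Long Beach VA": (5, 200),
}

error_pct = {name: 100.0 * wrong / total for name, (wrong, total) in errors.items()}
for name, pct in error_pct.items():
    print(f"{name}: {pct:.2f}% error")
```

The whole-number percentages quoted in the text correspond to these ratios with the fractional part dropped.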
The proposed algorithm was also compared on all four databases with two similar articles. On the Swiss and Hungarian databases it was more than 3% better in terms of accuracy, which indicates that the proposed algorithm is appropriate. For the other two databases, however, the proposed model did not perform better.
Finally, it should be noted that combining the random forest and artificial neural network algorithms greatly increases the accuracy of the proposed model and also increases the level of confidence and trust in the model.
The authors agree on this final form of the manuscript and attest that all authors contributed to the final draft of the manuscript.
CONFLICTS OF INTEREST
The authors declare no conflicts of interest regarding the publication of this study.
No financial interests related to the material of this manuscript have been declared.