Machine Learning Based Methods for Handling Imbalanced Data in Hepatitis Diagnosis
Abstract
Introduction:
Hepatitis C virus (HCV) is the leading cause of mortality from liver disease, and diagnosis systems are useful tools for better disease control and management. The aim of this study was to design an HCV prediction system and to classify disease severity using data mining methods.
Material and Methods:
This applied study uses the hepatitis C dataset from the UCI library. The study was conducted in four steps: data pre-processing, data mining, evaluation, and system design. In data pre-processing, data balancing techniques were applied. Then, three data mining algorithms (multi-layer perceptron, Bayesian network, and decision tree) were implemented, and the 10-fold cross-validation method was used to evaluate them. Finally, a user interface was designed in the MATLAB programming language (version 2016) based on the best algorithm.
Results:
The results showed that the over-sampling method improved the performance measures of the data mining algorithms in disease prediction: in the O-dataset, the accuracy of the best method (random forest) was 99.9%. The random forest on the O-dataset also had the best performance measures in terms of sensitivity, accuracy and F-measure (99.9%), with 100% specificity.
INTRODUCTION
Hepatitis C virus (HCV) is a single-stranded RNA virus [1]. It is one of the most important causes of chronic liver disease, which over the long term can lead to cirrhosis and liver cell cancer [2]. The virus was first identified as the leading cause of non-A and non-B hepatitis in April 1989 [3], and about 71 million people worldwide now live with chronic hepatitis C. About 30% of infected people (15% to 45%) recover within six months without treatment. The remaining 70% (55% to 85%) develop chronic HCV infection. For this group, the risk of cirrhosis is between 15% and 30% within 20 years [4]. HCV is the leading cause of mortality from liver disease, with 333,000 deaths in 1990, 499,000 in 2010, and 704,000 in 2013 [4-6].
Various studies report an incidence of this disease ranging from 0.5% to 2.8%. In high-income countries, the prevalence of chronic hepatitis C is less than 2% [7, 8]. Countries with a high prevalence of HCV (more than 5%) include Egypt, Gabon, Uzbekistan, Cameroon, Mongolia, Pakistan, Nigeria and Georgia, which are low- and middle-income countries [9].
HCV infection does not confer long-term immunity, and many cases of re-infection have been reported. As a result, in areas with a high prevalence of HCV, hybrid HCV genotypes are observed that result from more than one HCV infection. This is a major obstacle to developing a vaccine for this disease. Other challenges related to the control of the disease are [9]:
Inadequate surveillance data
Coverage of prevention programs is limited
Few people know their hepatitis status and have access to treatment
Medicines and diagnostics are unaffordable for most
A public health approach to hepatitis is lacking
Leadership and commitment are uneven
Hence, the WHO has recently developed the first global health sector strategy for viral hepatitis. This strategy covers all five types of hepatitis virus (hepatitis A, B, C, D and E) but focuses more on hepatitis C. According to the document, the number of infected people and the number of HCV deaths should decrease by 70% and 60%, respectively, by 2030 compared with 2010 [10].
Predicting chronic diseases plays a vital role in health informatics. Diagnosing chronic diseases is very important because these diseases affect a person for a long time. The most common chronic diseases are diabetes, stroke, cardiovascular disease, cancer, hepatitis C and osteoarthritis. Early detection of chronic diseases improves prevention and increases the effectiveness of the treatment process [11].
Classification is a data mining technique that uses training data to develop a model; the resulting model is then applied to test data to determine its predictive power. Various classification algorithms have been used to predict chronic diseases, and their results have been very promising [12]. One of the challenges in using data mining techniques is the unbalanced dataset. A dataset is called unbalanced if one class (the minority class) has far fewer samples than the other class or classes (the majority class) [13]. As a result, algorithms achieve good accuracy on the majority class and low accuracy on the minority class [14]. Unbalanced datasets cause problems in many areas of research [15-17]. In medical datasets built for disease diagnosis, the data are usually unbalanced because there are fewer patients than healthy people, and the imbalance becomes more severe for rare diseases. To solve this problem, researchers either increase the number of minority class samples or decrease the number of majority class samples [18]. The corresponding techniques are over-sampling of the minority class and under-sampling of the majority class [19].
Chronic disease diagnosis systems are valuable tools for better disease control and management [20]. Therefore, the aim of this study was to design an HCV prediction system and to classify disease severity using data mining methods. To design this system, an unbalanced dataset was used, the results of three different data mining algorithms were compared, and the best result was used as the basis for designing the system.
MATERIAL AND METHODS
This applied study uses the hepatitis C dataset from the UCI library (https://archive.ics.uci.edu/ml/datasets/HCV data). The dataset contains 615 records, of which 75 belong to people with HCV (Hepatitis: 24; Fibrosis: 21; Cirrhosis: 30). Of the remaining 540 records, 7 were recorded as suspected cases and were removed from the analysis, leaving 533 blood donor records. Descriptive statistics for the resulting 608 records (237 female and 371 male) are shown in Table 1.
Table 1
Descriptive statistics of records in the dataset
This study was conducted in four steps, which are explained below.
Step 1: Data pre-processing
Because the dataset contains missing data, the missing attribute values were imputed. The ratio of the three minority classes combined to the majority class is about 1 to 7 (75 to 533 records), so the dataset needs to be balanced. For this purpose, random over-sampling and random under-sampling methods [19, 21] were used.
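The two balancing methods named above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB code used in the study. It assumes that every class is resampled to the size of the largest class (over-sampling) or the smallest class (under-sampling); the study states only that minority and majority counts were made equal, so the exact target sizes, the function names and the random seed are assumptions.

```python
import random
from collections import Counter, defaultdict

def _group_by_class(records, labels):
    """Collect the records of each class into its own list."""
    groups = defaultdict(list)
    for record, label in zip(records, labels):
        groups[label].append(record)
    return groups

def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen records until every class reaches the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(records, labels)
    target = max(len(group) for group in groups.values())
    out_records, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_records.extend(group + extra)
        out_labels.extend([label] * target)
    return out_records, out_labels

def random_undersample(records, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(records, labels)
    target = min(len(group) for group in groups.values())
    out_records, out_labels = [], []
    for label, group in groups.items():
        out_records.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_records, out_labels

# Class counts taken from the dataset description; record ids are placeholders.
labels = (["Blood donor"] * 533 + ["Hepatitis"] * 24
          + ["Fibrosis"] * 21 + ["Cirrhosis"] * 30)
records = list(range(len(labels)))

o_records, o_labels = random_oversample(records, labels)
u_records, u_labels = random_undersample(records, labels)
print(Counter(o_labels))  # every class has 533 records
print(Counter(u_labels))  # every class has 21 records
```

Under these assumptions the over-sampled set keeps every original record and adds duplicates, while the under-sampled set discards most of the majority class.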
Step 2: Data Mining
After preparing the dataset, three data mining algorithms were implemented: multi-layer perceptron (MLP), Bayesian network, and decision tree. Each algorithm was applied to the original and the balanced datasets, and the evaluation results were compared. All methods were implemented in the MATLAB programming language (version 2016).
Step 3: Evaluation of data mining algorithms
The 10-fold cross-validation method was used to evaluate the data mining algorithms. For each algorithm, the performance measures (accuracy, precision, sensitivity, specificity, and F-measure) were calculated; their definitions are presented in Table 2.
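The fold construction behind 10-fold cross-validation can be sketched as follows. This is an illustrative Python version, not the study's MATLAB code; shuffling before splitting, the seed and the function name are assumptions, and the study does not say whether the folds were stratified by class.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# 608 is the number of records kept for analysis in this study.
splits = list(k_fold_indices(608, k=10))
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)
```

In each round a model would be trained on the nine training folds and scored on the held-out fold, and the ten scores would then be averaged.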
Table 2
The performance evaluation measures
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Sensitivity (Recall) = TP / (TP + FN)
Specificity = TN / (TN + FP)
F-measure = (2 * Precision * Recall) / (Precision + Recall)
* TP: true positive; TN: true negative; FP: false positive; FN: false negative
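The formulas in Table 2 translate directly into code. The sketch below is illustrative Python, not the study's implementation; the confusion counts in the example are hypothetical and chosen only to sum to 608, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def performance_measures(tp, tn, fp, fn):
    """Compute the Table 2 measures from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }

# Hypothetical counts: 75 patients (60 detected), 533 healthy (13 false alarms).
measures = performance_measures(tp=60, tn=520, fp=13, fn=15)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```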
Step 4: System design
After the best algorithm for diagnosing HCV and its severity was selected, the system user interface was designed based on that algorithm.
RESULTS
There were 31 missing values, which were replaced with the mean for continuous attributes and the median for discrete attributes. Then, two datasets with an equal minority-to-majority ratio were created using random over-sampling and random under-sampling. The over-sampled dataset is called O-Dataset and the under-sampled dataset is called U-Dataset. In O-Dataset, the number of minority class samples was increased to equal the number of majority class samples; in U-Dataset, the number of majority class samples was decreased to equal the number of minority class samples.
The 10-fold cross-validation method was used for evaluation. The accuracy of the data mining algorithms on the three datasets (original, O-Dataset, and U-Dataset) is shown in Fig 1.
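The imputation rule described above (mean for continuous attributes, median for discrete ones) can be sketched as follows. This is illustrative Python; the column values are hypothetical, and representing a missing value as None is an assumption.

```python
from statistics import mean, median

def impute_column(values, continuous=True):
    """Replace None entries with the column mean (continuous) or median (discrete)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if continuous else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns, not taken from the dataset.
continuous_column = [38.5, None, 46.9, 43.2]
discrete_column = [1, 2, None, 2]

print(impute_column(continuous_column, continuous=True))
print(impute_column(discrete_column, continuous=False))
```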
Fig 1
Comparison of data mining algorithms on three datasets: original, O-Dataset and U-Dataset based on accuracy measure
According to the results, all three algorithms achieve higher accuracy on the O-Dataset than on the original dataset and the U-Dataset. However, because the data are unbalanced and accuracy is reported over all classes together, accuracy alone cannot establish the superiority of one method over another. Fig 2 to 4 show other performance measures for the three data mining algorithms and the three datasets. In these figures, the performance measures are reported for each class separately. Table 3 also shows the average of these measures for each dataset.
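The reason overall accuracy can mislead on unbalanced data can be shown with a small sketch. It uses the class counts of the original dataset and a hypothetical classifier that always predicts the majority class; it is not one of the study's algorithms. Per-class sensitivity is computed one-vs-rest, and the average is weighted by class size, which is an assumption about how the weighted averages were formed.

```python
from collections import Counter

def one_vs_rest_counts(y_true, y_pred, positive):
    """Confusion counts treating `positive` as the positive class and all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def per_class_sensitivity(y_true, y_pred):
    """Sensitivity of each class, computed one-vs-rest."""
    result = {}
    for label in sorted(set(y_true)):
        tp, tn, fp, fn = one_vs_rest_counts(y_true, y_pred, label)
        result[label] = tp / (tp + fn)
    return result

def weighted_average(per_class, y_true):
    """Average a per-class measure, weighting each class by its number of records."""
    support = Counter(y_true)
    return sum(per_class[c] * support[c] for c in per_class) / len(y_true)

y_true = (["Blood donor"] * 533 + ["Hepatitis"] * 24
          + ["Fibrosis"] * 21 + ["Cirrhosis"] * 30)
y_pred = ["Blood donor"] * len(y_true)  # hypothetical majority-class predictor

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sensitivity = per_class_sensitivity(y_true, y_pred)
print(f"accuracy: {accuracy:.3f}")
print(sensitivity)
print(f"weighted sensitivity: {weighted_average(sensitivity, y_true):.3f}")
```

This predictor reaches about 88% accuracy while detecting none of the three patient classes, which is why per-class measures are needed alongside accuracy.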
Table 3
Weighted average performance measures in the three datasets
According to Fig 2, the sensitivity, specificity and accuracy for the original dataset in the three patient groups are much lower than in the healthy group (majority class), and as a result the F-measure, which is derived from precision and sensitivity, is low. These results are repeated in the U-Dataset (Fig 4) and show that under-sampling was not effective in resolving the imbalance. In contrast, Fig 3 shows that the over-sampling method is very effective: all performance measures improved substantially in all groups (patients and healthy individuals). Among the data mining methods, the random forest method performed better in the U-dataset and O-dataset, whereas the MLP method was better in the original dataset. Table 3 confirms the findings in Fig 2 to 4. Because the best results were obtained for random forest in the O-dataset, it became the basis for the development of the hepatitis prediction system. The system user interface simulated in the MATLAB environment is shown in Fig 5.
DISCUSSION
In this study, a hepatitis C prediction system was designed using demographic data and blood tests. Because of the class imbalance in the dataset, random under-sampling and random over-sampling methods were used. The results showed that the over-sampling method improved the performance measures of the data mining algorithms in disease prediction: in the O-dataset, the accuracy of the best method (random forest) was 99.9%, about 6% higher than in the original dataset. However, because of the data imbalance, the algorithms were also compared using other performance measures. According to the results, the random forest on the O-dataset had the best performance measures in terms of sensitivity, accuracy and F-measure (99.9%), with 100% specificity.
Because the dataset was only recently published in the UCI library, only two related studies have used it. In the study by Hoffmann et al. [22], only the three groups of patients were considered and the healthy individuals' data were removed, so the data imbalance problem did not arise. By adding the enhanced liver fibrosis (ELF) score to the dataset, they examined two settings: the dataset with ELF and the dataset without ELF. The C Tree and rpart algorithms were run in R and evaluated using the leave-one-out method. rpart had the best result for the ELF dataset, with 73.33% accuracy. In the study by Chawathe [23], a large number of classification algorithms were compared in terms of accuracy, F-measure, AUC, classification time, training time and model size. In addition to examining the performance of the algorithms on the original dataset, the study identified three important features using 7 feature selection methods: ALT, AST and CHE. All the algorithms were then compared on the dataset restricted to these three features. The results showed that, for the original dataset, Bayes Net with an unlimited number of parents per node (BNt-u) performed best on accuracy and F-measure, and random forest performed best on AUC. For the dataset with the three important features, random forest was the best method. Although the exact values of the measures are not specified, according to the graphs the accuracy and F-measure of all algorithms in both datasets are less than 93% and 98%, respectively.
AUTHORS' CONTRIBUTION
The authors agree on the final form of the manuscript and attest that all authors contributed to its final draft.
CONFLICTS OF INTEREST
The authors declare no conflicts of interest regarding the publication of this study.
FINANCIAL DISCLOSURE
No financial interests related to the material of this manuscript have been declared.