Machine Learning Models for Diagnostic Classification of Hepatitis C Tests
Hepatitis C is a chronic infection caused by the hepatitis C virus, a blood-borne virus; infection therefore occurs through exposure to small quantities of blood. The World Health Organization (WHO) estimates that it has affected 71 million people worldwide. The infection places a heavy cost on individuals, groups and governments because no vaccine is yet available. The disease is likely to continue to affect more people because its long asymptomatic phase makes early detection difficult.
Material and Methods:
In this study, we present machine learning models that automatically classify hepatitis C diagnostic tests. We also rank the test features to show how much each contributes to the classification, which supports decision making in the healthcare industry. The synthetic minority oversampling technique (SMOTE) was used to address the class imbalance in the dataset.
The models were evaluated using the Matthews correlation coefficient, F-measure, Precision-Recall curve and Receiver Operating Characteristic Area Under Curve. Applying SMOTE improved the performance of the predictive models. Random forest (RF) performed best, with a Matthews correlation coefficient of 0.99, an F-measure of 0.99, a Precision-Recall curve area of 1.00 and a Receiver Operating Characteristic Area Under Curve of 0.99.
Illnesses or conditions whose effects are persistent or permanent are known as chronic illnesses. These diseases have adverse effects on human life and consume a large portion of the budgets of individuals, governments and groups [2, 3]. Hepatitis C is a chronic disease; it is a liver infection caused by the hepatitis C virus (HCV) [4, 5]. HCV is responsible for both acute and chronic hepatitis, ranging in severity from a mild illness that lasts a few days to a severe, lifelong illness.
Hepatitis C is now a major global concern: WHO estimates that 71 million people worldwide have chronic hepatitis C virus infection [4, 6]. WHO also affirms that a significant number of infected individuals will develop cirrhosis or liver cancer. Because HCV is a blood-borne virus, the most common route of infection is exposure to small quantities of blood. This may occur through injection drug use, unsafe injection procedures, unsafe health care, transfusion of unscreened blood and blood products, sexual practices, and other practices that lead to exposure to blood. It is important to note that there is not yet an effective vaccine against HCV [4, 6], although work on one is ongoing. For now, clearing the infection depends on the immune system, and treatment is necessary when the infection becomes chronic. Diagnosis of HCV is therefore of great importance, but the major challenge is that new HCV infections are usually asymptomatic. This phase can last for decades, after which secondary symptoms associated with liver damage begin to appear. Another challenge is the diagnostic classification of hepatitis tests, which is carried out after a person has been diagnosed with HCV infection and helps determine the degree of liver damage. We propose automated methods for analyzing laboratory test results to support timely decisions.
Over the past few decades, machine learning has gained attention as a research area in virtually all areas of human activity, and the health sector is no exception. Machine learning is a branch of artificial intelligence concerned with the study of algorithms and statistical models that computer systems use to perform tasks without explicit instructions, relying instead on patterns and inference [7-9]. The healthcare industry has accumulated a large amount of data over the years; through machine learning and data analytics, this clinical data can provide insight for diagnosis, disease prevention, prediction and policy making [11, 12].
Data analytics is evolving in the healthcare industry and has been of great help in diagnostic tasks. This work therefore aims to create machine learning models for the diagnostic classification of Hepatitis C tests and to rank the test features, which will aid decisions such as the choice of treatment procedure and the management of the infection. The second section discusses the methodology used in this research, the third section presents the results, followed by a discussion of the results, and the fourth section draws the conclusion.
MATERIAL AND METHODS
This section describes the stages of the proposed methodology: data description, feature ranking and the classification task.
The dataset used in this research was obtained from the University of California, Irvine (UCI) Machine Learning Repository, specifically the HCV dataset. The dataset consists of the Hepatitis C test records of 615 patients: 238 women and 377 men, aged 19 to 77 years. It contains 13 features, which present the test information of each patient, as shown in Table 1.
The categories of the target class are described in Table 2 below.
| Code | Hepatitis C category | Frequencies |
| 0s | Suspect blood donor | 7 |
The first step in data mining is data cleaning, which is part of data pre-processing [15, 16]. In this step, the id feature was removed from the dataset because it contributes nothing to the classification; it serves only to identify the patients. The dataset was also found to be highly skewed (imbalanced), so the Synthetic Minority Oversampling Technique (SMOTE), an over-sampling method, was used to alleviate the class imbalance problem.
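The core idea of SMOTE can be illustrated with a minimal sketch. This is not the implementation used in the study; the function name `smote`, its parameters and the default of five neighbors are assumptions made for illustration. Each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbors.

```python
import math
import random


def smote(minority, n_new, k=5, seed=1):
    """Create n_new synthetic samples from a list of minority-class feature vectors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic
```

Because every synthetic point lies between two existing minority samples, the new records stay inside the region already occupied by that class instead of simply duplicating rows.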
For the classification task, we selected five of the most widely used classifiers in machine learning: Decision Tree (DT) [18, 19], K-Nearest Neighbors (KNN) [7, 20], Random Forest (RF), Naïve Bayes (NB) and Logistic Regression (LR).
The decision tree is one of the most widely used classification algorithms. It divides the data into subcategories based on a sequence of questions. The process begins at the primary node, known as the root of the tree, which contains all samples and is chosen as the attribute with the highest information gain (the largest reduction in entropy). Each node is then split into child nodes, in either binary or multi-way form, until leaf nodes are reached. The resulting decision model is a tree structure made up of decision nodes (split nodes with a condition) and leaf nodes.
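The entropy-based choice of a split can be sketched as follows. The helper names `entropy` and `information_gain` are hypothetical, and the sketch uses plain information gain; J48, the C4.5 implementation used later in this study, relies on the closely related gain ratio.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy reduction obtained by splitting `labels` into `groups`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder
```

A split that separates the classes perfectly removes all of the entropy at the node, whereas a split that leaves each group with the original class mix has a gain of zero.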
Logistic Regression (LR)
This is a predictive model used to describe the association between one binary dependent variable and one or more independent variables. LR does not assume a linear relationship between the dependent and independent variables; instead, it models the natural logarithm of the odds of the outcome as a linear function of the predictors. The resulting logistic curve is bounded between 0 and 1.
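The relationship between the logistic curve and the natural logarithm of the odds can be made concrete with a short sketch for the binary case. The function names and the example weights are illustrative assumptions, not values fitted in this study.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, features):
    """P(y = 1 | x): the sigmoid of a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def log_odds(p):
    """Inverse of the sigmoid: natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))
```

Applying `log_odds` to the output of `sigmoid` recovers the linear predictor, which is why the model is linear in the log-odds even though the probability curve itself is S-shaped.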
K-Nearest Neighbor (KNN)
This is a classification algorithm that classifies a new sample based on a similarity or distance measure. Three commonly used distance measures are the Euclidean, Manhattan and Minkowski distances. In the training step, KNN stores the feature vectors and class labels of the training samples. In the classification step, the user defines a value ‘k’, and the unknown sample is assigned to the class that is most common among its k nearest training samples.
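A minimal sketch of the two KNN steps is given below. The names `minkowski` and `knn_predict` are hypothetical; the Minkowski distance is used because it reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_x, train_y, query, k=3, p=2):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # "Training" is just storing train_x and train_y; all work happens at query time.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

In practice the features would normally be scaled first, since laboratory values measured on very different ranges would otherwise dominate the distance.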
Random Forest (RF)
It is a supervised learning algorithm that can be used for both classification and regression tasks. Random forest [23, 24] is based on the bagging technique, combined with random selection of features. The difference between a decision tree and a random forest is that, in the forest, the features considered when finding the root node and splitting each node are chosen at random. The first step is to load the data, which consists of ‘n’ features. Each tree is then trained on a bootstrap sample of the data, with m of the n features selected at random as split candidates; the samples left out of the bootstrap sample are used to estimate the unbiased out-of-bag (OOB) error. Node ‘d’ is computed using the best split among the selected features and is then split into sub-nodes. These steps are repeated to build the required number of trees, after which the votes of the trees for each target class are counted and the class with the most votes is the final prediction.
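The three randomized ingredients of this procedure (bootstrap sampling with out-of-bag samples, random feature selection and majority voting) can be sketched independently of the tree-building code. The helper names below are hypothetical and the tree induction itself is omitted.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return them and the out-of-bag indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(n_features, m, rng):
    """Pick m of the n features to consider as split candidates."""
    return sorted(rng.sample(range(n_features), m))


def majority_vote(tree_predictions):
    """Combine the class predicted by each tree into the forest's prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

The out-of-bag indices are the records a given tree never saw during training, so evaluating each tree on them yields an error estimate without a separate validation set.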
Naïve Bayes (NB)
This is a classification algorithm based on Bayes’ theorem with an assumption of independence between the predictors. It takes the dataset as input, estimates the probability of each class given the input features using Bayes’ theorem, and predicts the most probable class label for an unknown sample. It is a powerful classification algorithm suitable for large datasets.
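As an illustration of how the independence assumption simplifies the computation, the sketch below implements a small Gaussian Naïve Bayes classifier. The choice of Gaussian likelihoods for the numeric test values is an assumption made here, and the function names are hypothetical; the posterior is computed in log space as the log prior plus the sum of per-feature log likelihoods.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps the variance positive for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest posterior (computed in log space)."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        # Independence assumption: per-feature log likelihoods are simply added.
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_posterior(*model[label]))
```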
The decision tree (J48), random forest, Naïve Bayes, logistic regression and K-nearest neighbors algorithms were implemented using the Waikato Environment for Knowledge Analysis (WEKA), a reliable open-source software package for knowledge analysis developed at the University of Waikato, New Zealand. Cross-validation with 10 folds was used as the test mode, and the class attribute was set as the target to be predicted. For validation purposes, this process was carried out 5 times, with the random seed changed from 1 to 5.
After SMOTE was applied, the dataset increased to 2181 records; the details are given in Table 3 below.
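The evaluation protocol (10-fold cross-validation repeated with seeds 1 to 5) can be sketched outside WEKA as follows. This is an analogous illustration, not WEKA's own code: the function names are hypothetical, and the round-robin stratification used to keep class proportions similar across folds is an assumption of this sketch.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=1):
    """Assign sample indices to folds so each fold keeps roughly the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def repeated_cv(labels, n_folds=10, seeds=(1, 2, 3, 4, 5)):
    """Yield (seed, train_indices, test_indices) for every fold of every repetition."""
    for seed in seeds:
        folds = stratified_folds(labels, n_folds, seed)
        for k, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield seed, train, test
```

Five repetitions of ten folds give fifty train/test splits per classifier, and the reported score for each metric is the average over the repetitions.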
Description of the category after applying SMOTE
| Code | Hepatitis C category | Frequencies |
| 0s | Suspect blood donor | 448 |
This section first describes the results obtained for the Hepatitis C test classification before applying SMOTE, followed by the results obtained after SMOTE was applied. The algorithms were executed as stated in the previous section. The performance measures used were the Matthews correlation coefficient (MCC), Receiver Operating Characteristic Area Under Curve (ROC AUC), Precision-Recall Area Under Curve (PR AUC), F-Measure and Accuracy. These are derived from the confusion matrix, which shows how well a classification has performed by reporting the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).
Table 4 and Table 5 below give the average classification performance in terms of F-Measure, MCC, ROC AUC, PR AUC and accuracy after the process was repeated five times with random seed values from 1 to 5.
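The threshold-based measures can be computed directly from the four confusion-matrix counts, as the following sketch shows for a single class treated as positive against the rest. The function name `binary_metrics` is hypothetical, and how the per-class values are averaged over the categories is not shown; ROC AUC and PR AUC are omitted because they require ranked prediction scores, not just the counts.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, F-measure and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # MCC uses all four cells, so it stays informative when classes are imbalanced.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }
```

Accuracy alone can look high on a skewed dataset simply because the majority class dominates, which is why MCC and the F-measure are reported alongside it.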
Summary of performance measure of the classification before applying SMOTE.
Summary of performance measure of the classification after applying SMOTE.
| Method | F-Measure | MCC | ROC AUC | PRC AUC | Accuracy |
Based on the results obtained, the Random Forest algorithm performed best among the algorithms considered in this study in terms of F-measure, MCC, ROC AUC, PRC AUC and Accuracy under cross-validation. Furthermore, looking at ROC AUC and PRC AUC, which provide more robust performance estimates when comparing classifiers on imbalanced datasets, the algorithms performed better after SMOTE was applied to the dataset.
The results of this study are encouraging and promising: these models can be used by the healthcare industry to classify Hepatitis C diagnostic tests, as they are neither expensive nor time consuming to use. The feature ranking also shows that age and gender contribute little or nothing to the classification of the Hepatitis C tests.
Because the dataset was only recently published in the UCI repository, few related studies have used it. One is the study by Chawathe, in which the author proposed various models using different algorithms coupled with feature selection. The results were compared on accuracy, F-Measure, ROC AUC, classification time and training time, and were presented graphically; the ROC AUC was below 0.98. Another is the study by Orooji and Kermani, in which the authors proposed various models using three algorithms coupled with techniques for handling imbalanced datasets. Their results also showed that Random Forest had the best performance of the three algorithms, with 99.9% accuracy and an F-measure of 9.99. The authors provided neither the PRC AUC nor the ROC AUC.
This paper has presented machine learning models for the automatic classification of Hepatitis C tests at an early stage. The models will aid decision making and can be extended to other diagnostic tasks. The study has also shown the importance of machine learning for decision making in the healthcare industry. In conclusion, the public should be aware of the importance of healthy living practices, such as avoiding exposure to blood by using personal objects such as razor blades and needles, and following standard medical procedures.
In future research, it would be interesting to find out whether blood type and genotype could be included in the dataset and to examine the roles they play in the classification of Hepatitis C diagnoses.
The authors would like to thank Rev. Sunday Oladimeji and Rev. Bayo Afolaranmi for proofreading and language editing of the manuscript.
OOO – Conception, design and drafting of the work.
AO – Interpretation of the data and revision of the work.
OO – Acquisition and analysis of the data and revision of the work.
The authors agree on this final form of the manuscript and attest that all authors contributed to the final draft of the manuscript.
CONFLICTS OF INTEREST
The authors declare no conflicts of interest regarding the publication of this study.
No financial interests related to the material of this manuscript have been declared.