
Machine Learning Based Methods for Handling Imbalanced Data in Hepatitis Diagnosis

Azam Orooji, Farzaneh Kermani



Introduction: Hepatitis C virus (HCV) is the leading cause of mortality from liver disease, and diagnostic systems are useful tools for better disease control and management. This study aimed to design a system that predicts HCV disease and classifies its severity using data mining methods.

Method: This applied study used the hepatitis C dataset from the UCI Machine Learning Repository. It was conducted in four steps: data preprocessing, data mining, evaluation, and system design. During preprocessing, data balancing techniques were applied. Three data mining algorithms (multi-layer perceptron, Bayesian network, and decision tree) were then implemented and evaluated with 10-fold cross-validation. Finally, a user interface based on the best-performing algorithm was designed in MATLAB (version 2016).
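The balancing and evaluation steps can be illustrated with a minimal Python sketch. It assumes plain random over-sampling (duplicating minority-class records), which may differ from the balancing technique actually used in the study, and it runs on hypothetical toy data rather than the UCI dataset. The function names `random_oversample` and `k_fold_indices` are illustrative only.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (assumed balancing method)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw the missing number of samples with replacement.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical toy data: 18 majority-class and 2 minority-class records.
samples = [[float(i)] for i in range(20)]
labels = ["donor"] * 18 + ["hepatitis"] * 2

balanced_samples, balanced_labels = random_oversample(samples, labels)
class_counts = Counter(balanced_labels)
folds = list(k_fold_indices(len(balanced_samples), k=10))
```

The sketch over-samples before splitting, mirroring the preprocessing order described above. Over-sampling only the training indices of each fold is a common alternative that keeps duplicated records out of the test fold.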

Results: The over-sampling method improved the performance measures of the data mining algorithms in disease prediction: on the O-dataset, the best method (random forest) achieved 99.9% accuracy, sensitivity, and F-measure, along with 100% specificity.
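For reference, the four performance measures named above can be computed from the counts of a binary confusion matrix. The following Python sketch uses the standard definitions. The counts are hypothetical and are not taken from the study, and the function assumes every denominator is non-zero.

```python
def performance_measures(tp, fp, tn, fn):
    """Compute standard binary classification measures from confusion-matrix
    counts. Assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    # F-measure: harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only.
measures = performance_measures(tp=45, fp=5, tn=40, fn=10)
```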

Conclusion: Because the presented approach outperformed all methods suggested in previous studies, the proposed system can be used effectively to diagnose HCV and determine its severity.



DOI: http://dx.doi.org/10.30699/fhi.v10i1.259

