Big Data Analysis in Healthcare: Apache Hadoop, Apache Spark and Apache Flink

Elham Nazari, Mohammad Hasan Shahriari, Hamed Tabesh

Abstract

Introduction: The volume of health care data is increasing. Correct analysis of such data will improve the quality of care and reduce costs. This kind of data has distinctive features, such as high volume, variety, and high-speed production, which make it impossible to analyze with ordinary hardware and software platforms. Choosing the right platform for managing this kind of data is therefore very important.
The purpose of this study is to introduce and compare Apache Hadoop MapReduce, the most popular and most widely used platform for processing big data, with Apache Spark and Apache Flink, two platforms that have recently gained great prominence.

Materials and Methods: This study is a survey based on searches of the ProQuest, PubMed, Google Scholar, Science Direct, Scopus, IranMedex, Irandoc, Magiran, ParsMedline and Scientific Information Database (SID) databases, as well as web reviews and specialized books, using related, standard keywords. In total, 80 articles related to the subject of the study were reviewed.

Results: The findings showed that each of the studied platforms can be characterized by features such as data processing, support for different languages, processing speed, computational model, memory management, optimization, latency, fault tolerance, scalability, performance, compatibility, and security. Overall, the findings showed that Apache Hadoop offers simplicity, error detection, and cluster-based scalability management, but because it relies on batch processing, it is slow for complex analyses and does not support stream processing. Apache Spark is a distributed computational platform that can process a big data set in memory with a very fast response time. Apache Flink allows users to store data in memory and load it multiple times, and it provides a complex fault tolerance mechanism that continuously retrieves the state of the data flow.
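The contrast between batch and stream processing drawn in the Results can be sketched in plain Python. This is only an illustration of the two computational models, not the Hadoop, Spark, or Flink API; the records, the diagnosis codes, and all function names below are hypothetical and were chosen for the example.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical (patient_id, diagnosis_code) records; illustrative data only.
RECORDS = [
    ("p1", "E11"), ("p2", "I10"), ("p3", "E11"),
    ("p4", "J45"), ("p5", "I10"), ("p6", "E11"),
]


def map_phase(record):
    """Emit one (key, 1) pair per record, as a MapReduce-style mapper would."""
    _patient_id, code = record
    yield (code, 1)


def shuffle(pairs):
    """Group emitted values by key (the step between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def batch_count(records):
    """Batch model: the whole data set is available before processing starts."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


def stream_count(records):
    """Stream model: update running state per record and yield a snapshot."""
    state = defaultdict(int)
    for _patient_id, code in records:
        state[code] += 1
        yield dict(state)  # a copy of the state after each incoming record


batch_result = batch_count(RECORDS)
stream_snapshots = list(stream_count(RECORDS))

print(batch_result)
print(stream_snapshots[-1])
```

In the batch version a result exists only after every record has been mapped, grouped, and reduced, whereas the stream version exposes an up-to-date result after each record. The per-record snapshots loosely mirror the idea of continuously tracked state, under the simplifying assumption of a single process with no failures.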

Conclusion: The application of big data analysis and processing platforms varies according to need. In other words, the technologies are complementary: each is applicable in a particular field, and they cannot be separated from one another. Depending on the purpose and the expected outcome, either a platform must be selected for the analysis or custom tools must be designed on top of these platforms.

Keywords

Big Data Analysis; Apache Hadoop; Apache Spark; Apache Flink; Healthcare

DOI: https://doi.org/10.30699/fhi.v8i1.180
