Big Data Analysis in Healthcare: Apache Hadoop, Apache Spark and Apache Flink
Health care data is increasing. The correct analysis of such data will improve the quality of care and reduce costs. This kind of data has certain features, such as high volume, variety and high-speed production, that make it impossible to analyze with ordinary hardware and software platforms. Choosing the right platform for managing this kind of data is therefore very important.
The purpose of this study is to introduce and compare the most popular and most widely used platforms for processing Big Data: Apache Hadoop MapReduce, and the Apache Spark and Apache Flink platforms, which have recently gained great prominence.
Materials and Methods:
This study is a survey based on a subject search of the ProQuest, PubMed, Google Scholar, Science Direct, Scopus, IranMedex, Irandoc, Magiran, ParsMedline and Scientific Information Database (SID) databases, as well as web reviews and specialized books, using related and standard keywords. Finally, 80 articles related to the subject of the study were reviewed.
The findings showed that the studied platforms differ in features such as data processing model, supported languages, processing speed, computational model, memory management, optimization, latency, fault tolerance, scalability, performance, compatibility, security and so on. Overall, Apache Hadoop offers simplicity, failure detection and cluster-based scalability management, but because its processing is batch-based, it is slow for complex analyses and does not support stream processing. Apache Spark is a distributed computational platform that can process a Big Data set in memory with a very fast response time. Apache Flink allows users to store data in memory and load it multiple times, and provides a sophisticated fault tolerance mechanism that continuously recovers the state of the data stream.
The suitability of Big Data analysis and processing platforms varies according to the needs. In other words, these technologies are complementary: each is applicable in a particular field and cannot be separated from the others. Depending on the purpose and the expected results, a platform must be selected for the analysis, or custom tools must be designed on top of these platforms.
With the development of new technologies, health care data is increasing. In 2012 this data was estimated at about 200 petabytes (PB), and it is estimated to reach 250,000 PB by 2020. Analysis of these data is very important for acquiring knowledge, extracting useful information and discovering hidden patterns, and in the area of health care it will improve the quality of services, reduce costs and reduce errors [2, 3]. This kind of data has many features, including high volume, variety, scalability, complexity, high-speed production and uncertainty, which make it impossible to analyze with common data mining techniques and typical software and hardware [3-7].
Big Data analysis is a process that organizes the data collected from various sources and then analyzes the data sets in order to discover facts and meaningful patterns.
Large-scale data analysis has many uses in the field of health: for example, early diagnosis of diseases such as breast cancer, processing of medical images and signals to provide high-quality diagnosis, monitoring patient symptoms, tracking chronic diseases such as diabetes, preventing the incidence of contagious diseases, education through social networks, genetic data analysis and personalized (precision) medicine. Examples of this type of data are omics data, including genomics, transcriptomics, proteomics and pharmacogenomics, biomedical data, web data, and data in various electronic health records (EHRs) and hospital information systems (HISs). Data contained in the EHR and HIS are rich, including demographic characteristics, test results, diagnoses and information on each individual [9-17]. Given its importance, the analysis of health data has received much attention, leading scholars and scientists to create new structures, methodologies and approaches for managing, controlling and processing Big Data.
In recent years, many tools have been introduced for Big Data analysis. After defining Big Data and its features, we intend to introduce the tools provided by the Apache Software Foundation and then compare them with each other. Some of the tools available for Big Data analysis are Apache Hadoop, Spark and Flink; the focus of these tools is on batch processing or stream processing. Batch processing tools are mostly based on the Apache Hadoop infrastructure, such as Apache Mahout. Data stream analysis programs are often used for real-time analysis; Spark and Flink are examples of stream analysis platforms. Interactive analysis allows users to interact directly in real time to conduct their analysis; for example, Hive and Drill are cluster platforms that support interactive analysis. These tools help researchers develop Big Data projects.
MATERIALS AND METHODS
Big Data is a term used for data whose volume exceeds 10^18 bytes (one exabyte); storage, management, sharing, analysis and visualization of this type of data are difficult due to its characteristics [23, 24]. The analysis of this type of data includes these steps:
Information extraction and cleaning
Modeling and analysis
Interpretation and deployment.
Big Data is defined by its attributes. These features are, in fact, the challenges that Big Data analysis has to address and manage. At the emergence of this type of data, three features were proposed; later studies described 8 characteristics; in 2017, 42 characteristics were proposed; and the number is expected to reach 120 by the year 2021. Some important features are described below:
Volume: refers to the production of high-volume data that requires a lot of storage space.
Velocity: refers to the high and often unpredictable rate at which data is produced and must be processed.
Variety: refers to the diversity of data and its various formats. Data may fall into three categories: structured, semi-structured and unstructured, and can include different types such as images, videos and audio.
Veracity: refers to bias, noise and abnormality in Big Data. Extreme noise and incomplete, inaccurate or redundant data, in other words data quality, imply this characteristic.
Vagueness: refers to ambiguity in the meaning or interpretation of the data.
Apache Hadoop is a suite of open source software that facilitates solving Big Data problems through the use of a large number of computers. Many people regard Hadoop and MapReduce as the same thing, but this is not true. In fact, Hadoop uses the MapReduce software model to provide a framework for storing and processing Big Data. Although Hadoop was originally designed for computing on low- and mid-range commodity systems, it has gradually come to be used on high-end hardware as well.
In recent years, many projects have been developed to complement or extend Hadoop; the term "Hadoop Ecosystem" is used to refer to these projects and related products. The Hadoop Ecosystem is shown in Fig 1.
To fully understand Hadoop, you need to look at both the core project itself and the ecosystem around it. The Hadoop project consists of four main components:
1) Hadoop Distributed File System (HDFS): A file system for storing huge volumes of data across a cluster of systems. HDFS has a master-slave architecture. It provides high throughput and fault tolerance by holding multiple copies (by default, three) of each data block.
2) MapReduce data processing engine: A distributed programming and processing model.
3) Resource management layer, YARN (also known as MapReduce version 2): A model that schedules and places jobs across the cluster, providing a separation between the infrastructure and the programming model.
4) Common libraries used in different parts of Hadoop that are also used elsewhere. Some of these libraries, including compression codecs, I/O utilities and error detection, are implemented in Java [19-25].
The Hadoop Ecosystem consists of several projects around the main components mentioned above. These projects are designed to help researchers and experts in all stages of the analysis and machine learning workflow. The general structure of the ecosystem consists of three layers: the storage layer, the processing layer and the management layer [26-40].
MapReduce is a programming model created by Google for Big Data processing based on the "divide and conquer" method [41, 42]. This method is implemented in two steps: Map and Reduce. The steps in performing the MapReduce model are shown in Fig 2.
MapReduce programming enables a large amount of data to be processed in parallel. In this model, each program is a sequence of MapReduce operations consisting of a Map stage and a Reduce stage that process a large number of independent data items. These two main operations are applied to (key, value) pairs:
Map step: The master node takes the input and divides it into smaller sub-problems, then distributes them among the worker nodes. A worker node may repeat this in turn, producing a multi-level tree structure. Each worker processes its sub-problems and sends the results back to the master node.
Reduce step: The master node receives the responses and combines the results to produce the output. Operations such as filtering, summarizing or converting may also be applied to the results.
The Map function takes an ordered (key, value) pair of data and converts it into a list of ordered pairs:
Map (k1, v1) -> list (k2, v2)
Then, the MapReduce framework collects all pairs with the same key from all the lists and groups them together, creating one group per generated key. The Reduce function is then applied to each group:
Reduce (k2, list (v2)) -> list (v3)
The MapReduce framework thus converts a list of (key, value) pairs into a list of values. Each worker should be able to process its list of (key, value) pairs in main memory.
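The Map and Reduce steps described above can be illustrated with a minimal single-machine sketch in Python. This is an illustration of the model only, not the Hadoop API; the word-count example, function names and sample input are our own:

```python
from collections import defaultdict

# Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line of text.
def map_fn(_line_no, line):
    return [(word, 1) for word in line.split()]

# Reduce(k2, list(v2)) -> list(v3): sum the counts collected for one word.
def reduce_fn(word, counts):
    return [(word, sum(counts))]

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input (key, value) pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle phase: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: apply reduce_fn to each (key, list-of-values) group.
    output = []
    for key, values in sorted(groups.items()):
        output.extend(reduce_fn(key, values))
    return output

lines = [(0, "big data in health care"), (1, "big data analysis")]
print(map_reduce(lines, map_fn, reduce_fn))
# [('analysis', 1), ('big', 2), ('care', 1), ('data', 2), ('health', 1), ('in', 1)]
```

In a real cluster the Map and Reduce phases run in parallel across many nodes and the shuffle moves data over the network; the sketch keeps only the dataflow of the model.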
One of the important features of MapReduce is that failed nodes are detected automatically and their work is reassigned, so the complexity of fault tolerance is hidden from the programmer.
Disadvantages of MapReduce are:
Its rigid one-input, two-stage dataflow: it does not directly support tasks with different data flows.
The user must repeatedly hand-code common operations such as joins, filtering and aggregation, which wastes time, introduces program errors, reduces readability and impedes optimization.
Even for typical operations such as filtering and projection, custom code must be written, which causes problems with reuse and maintenance.
The opaque nature of the Map and Reduce functions hinders the system's ability to optimize.
Apache Spark is an open source framework that was presented at the University of California, Berkeley in 2009. Its multiple advantages have made this processing engine powerful and useful for Big Data processing and distinguish it from other tools such as Hadoop and Storm [18, 20, 52-55].
All features provided by Apache Spark are built on top of its core. Spark has APIs for Java, Python and Scala. The Spark core is where the resilient distributed dataset (RDD), Spark's original programming abstraction, is defined. Its key features are:
1. It is responsible for the essential I/O capabilities.
2. It plays an important role in scheduling and monitoring the Spark cluster.
3. It recovers from errors by recomputing in memory, avoiding the complexity of MapReduce's disk-based recovery.
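Spark's in-memory recovery relies on lineage: each RDD remembers the transformations that produced it, so a lost partition can be recomputed from its parent rather than restored from replicated disk copies. The idea can be sketched with a toy single-machine class (hypothetical code illustrating the concept, not the Spark API):

```python
class Dataset:
    """Toy lineage-tracking dataset: remembers how it was derived."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data (only for the root dataset)
        self.parent = parent        # parent dataset in the lineage chain
        self.transform = transform  # function applied to the parent's rows
        self.cache = None           # in-memory materialized result

    def map(self, fn):
        # Lazy transformation: record the lineage, compute nothing yet.
        return Dataset(parent=self, transform=fn)

    def compute(self):
        # If the cached copy is lost, recompute it by replaying the lineage.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = [self.transform(row) for row in self.parent.compute()]
        return self.cache

base = Dataset(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.compute())   # materializes the result in memory: [2, 4, 6]
doubled.cache = None       # simulate losing the in-memory partition
print(doubled.compute())   # recovered by replaying the lineage: [2, 4, 6]
```

Real Spark partitions the data across the cluster and recomputes only lost partitions, but the recovery principle is the same: the transformation chain, not a disk replica, is the unit of fault tolerance.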
Advantages of Apache Spark are:
Vendors such as Intel, IBM, Databricks, Cloudera and MapR have officially announced their support for Apache Spark as a standard engine for Big Data analysis.
It meets the need for large-scale processing of varied data (text data, graph data, etc.) and also manages data sources properly.
It is reported to be 10 to 100 times faster than Hadoop thanks to in-memory processing, and it also outperforms MapReduce when executing programs on disk.
It supports various programming languages, from Python to Scala and Java. The system provides a set of over 80 high-level operators and can be used interactively to query data within the shell.
It is suitable for iterative processing, interactive processing and event processing.
The main question is whether, with the arrival of Spark, we must abandon Hadoop, and how the two differ. In answer, Spark and Hadoop cannot be considered completely separate from each other; with the advent of new tools, Spark has integrated with Hadoop and overcome its problems [50, 51]. Spark does not use MapReduce as its execution engine, but it is well integrated with Hadoop: it can run on YARN and work with the Hadoop HDFS data format. Spark has no built-in security mechanism and needs to be connected to the security mechanisms in YARN, such as Kerberos. As a result, Spark can be more powerful in combination with Hadoop [20, 54, 60].
Another top-level Apache project is Flink, which has much the same mission as Spark in the Hadoop Ecosystem and has been introduced as an alternative to the Hadoop MapReduce model. Apache Flink is an open source stream processor. Flink provides speed, efficiency and precision for mass processing, and can handle both batch processing and stream processing.
Many of the concepts of the two tools are similar, but Flink is a more versatile and lighter-weight option. For example, data streams in both Spark and Flink guarantee that each record is processed exactly once, so any duplicates that may occur are deleted. Compared to other processing systems, such as Storm, they have very high operational capability, and both have low fault-tolerance overhead.
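The exactly-once guarantee mentioned above can be approximated on the consumer side by remembering which record identifiers have already been processed, so that records replayed after a failure are not processed twice. The following is a simplified sketch of that idea (our own illustration; real Spark and Flink achieve exactly-once semantics through checkpointing of operator state, not this naive scheme):

```python
def process_stream(records):
    """Deduplicate a replayed stream by record id, processing each record once."""
    seen_ids = set()
    results = []
    for record_id, value in records:
        if record_id in seen_ids:
            continue                 # duplicate from a replay: skip it
        seen_ids.add(record_id)
        results.append(value * 2)    # stand-in for real per-record processing
    return results

# A failure caused records 2 and 3 to be replayed; each is still processed once.
stream = [(1, 10), (2, 20), (3, 30), (2, 20), (3, 30), (4, 40)]
print(process_stream(stream))        # [20, 40, 60, 80]
```

The sketch keeps all seen ids in memory forever; production systems bound that state with checkpoints or time windows, which is where most of the engineering effort in exactly-once processing lies.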
Like Spark, Flink supports structured query language (SQL), graph processing, machine learning and stream processing. It also works with relational database management systems (RDBMSs) such as SQL Server, as well as NoSQL stores such as MongoDB. Flink combines MapReduce's disk-based approach with Spark's in-memory approach. Some of Flink's advantages over Spark are the following:
It has better performance for iterative processes: iterative processing runs on a node independently of the rest of the cluster, which increases speed.
It can run classic MapReduce processing and also integrates with Apache Tez.
Thanks to its micro-batch architecture, Spark provides near-real-time streaming, while Apache Flink provides true real-time streaming due to its pure-stream architecture based on the Kappa architecture [48, 51, 59].
Table 1 shows the differences between Hadoop, Spark and Flink, examined more precisely:
Considering the necessity of analyzing health data, which is increasing day by day, and the importance of choosing the appropriate software platform, this study examined and compared the three platforms Apache Hadoop, Apache Spark and Apache Flink. The results showed that the suitability of data analysis and processing platforms varies depending on the needs; in other words, these technologies are complementary, each applicable in a particular field, and cannot be separated from one another. For example, when volume was the main concern, Hadoop first implemented MapReduce, which processes large volumes in parallel at high speed. Then, with the advancement of technology, a variety of requirements arose and other tools came into being, each addressing different aspects. If near-real-time processing is needed, Spark is an appropriate option, though not a necessity in every case; for true per-event stream processing, Flink should be used. Thus the use of each technology depends on the need, and these tools can be combined to achieve the desired purpose. The results of this study can help researchers and those seeking Big Data analytics in the field of health and medical care to choose the appropriate platform.
Big Data analysis improves health care services and reduces costs. The results of well-conducted studies and projects in health care based on Big Data analysis illustrate this fact. According to one report, these analyses will save $340 to $450 billion across various prevention, diagnosis and treatment departments [67, 68].
One of the most famous recently implemented projects is IBM Watson. Watson helps physicians identify symptoms and factors associated with patient diagnosis and treatment and make better decisions.
In the field of health care, about 80% of data is complex and unstructured (MRI images, medical notes, etc.), and different platforms are needed to analyze it. Hadoop helps researchers and doctors gain insights from data in ways that were never possible before.
It finds correlations in data with many variables, a very difficult task for humans, which can be effective in the discovery and prevention of diseases and the treatment of chronic diseases. One MapReduce demonstration shows a program that removes duplicate CT scan images from a database of 100 million images. Wearable technologies are anticipated to grow by 26.54% in elderly and intensive care during 2016-2020, creating a major change in health care. The collected data can be stored in Hadoop and analyzed using MapReduce and Spark, saving costs.
Hadoop is well placed to upgrade hospital services, especially when bedside sensors continuously check blood pressure, cholesterol and so on; with an RDBMS it is not feasible to store the data produced over the long run, whereas with Hadoop it can be saved.
More than 40 percent of people have admitted that high insurance costs are due to the large number of fraudulent claims, which cost more than one billion dollars. Insurance companies use such analytic environments to reduce these frauds, drawing on historical and real-time data on medical claims, wages, and so on.
At a Texas hospital, Hadoop was applied to electronic medical records (EMRs) to identify patients who needed additional care and treatment within their 30-day post-discharge period; with the help of the Hadoop platform, the readmission rate was reduced from 26% to 21%, that is, by 5 percentage points.
96% of US hospitals have EHRs, while in Asia and India this proportion is very low. The EHR is a rich source for data analysis and for better planning and management of patient care. In India, a hospital called AIIMS uses Big Data analysis to improve the quality of its services.
In addition to using the platforms' built-in capabilities, custom tools can be added to these platforms to carry out the required analyses. In one study, Medoop, a Hadoop-based medical platform, was proposed; this system exploits Hadoop's scalability, high reliability and high throughput. In another study, the Hadoop Image Processing Interface (HIPI) tool was developed for MapReduce-based image activities. The SparkSeq tool, built on Spark, was proposed for cloud-based analysis of genomic data and gene expression, for example to discover transcribed sequences for a type of cancer; another Spark-based tool, Thunder, was developed for the analysis of large-scale neural data [73, 74]. One study used Apache Spark to analyze functional MRI data and avoid frequent disk writes. Flink has been used to monitor electrocardiogram (ECG) data, magnetic resonance imaging (MRI) readings, wearable sensors and other cyber-physical systems; it is also useful in analyzing genomic data and has been reported to have high fault tolerance [76, 77].
It is suggested that future studies introduce other platforms and compare their different capabilities for managing Big Data in the field of health. It is also suggested that, according to the target, custom tools be designed and incorporated into existing platforms.
CONFLICTS OF INTEREST
The authors declare no conflicts of interest regarding the publication of this study.
No financial interests related to the material of this manuscript have been declared.