dis 3 BD - Paper Answers

See attached

As outlined within this week's topic, there are several benefits as well as challenges associated with the use of Big Data Analytics in the e-Healthcare industry. Identify the challenges associated with each of the categories below:
Data Gathering
Storage and Integration
Data Analysis
Knowledge Discovery and Information Interpretation
Please make your initial post and two response posts substantive. A substantive post will do at least TWO of the following:
Ask an interesting, thoughtful question pertaining to the topic
Answer a question (in detail) posted by another student or the instructor
Provide extensive additional information on the topic
Explain, define, or analyze the topic in detail
Share an applicable personal experience
Provide an outside source (for example, an article from the UC Library) that applies to the topic, along with additional information about the topic or the source (please cite properly in APA)
Make an argument concerning the topic.
At least one scholarly source should be used in the initial discussion thread. Be sure to use information from your readings and other sources from the UC Library. Use proper citations and references in your post.

Attached references:
Ayani, S., Moulaei, K., Darwish Khanehsari, S., Jahanbakhsh, M., & Sadeghi, F. (2019). A systematic review of big data potential to make synergies between sciences for achieving sustainable health: Challenges and solutions. Applied Medical Informatics, 41(2), 53–64. Retrieved from http://search.ebscohost.com/login.aspx?direct=true&AuthType=shib&db=a9h&AN=138949499&site=eds-live
Dash, S., Shakyawar, S. K., Sharma, M., & Kaushik, S. (2019). Big data in healthcare: Management, analysis and future prospects. Journal of Big Data, 6, 54. doi:10.1186/s40537-019-0217-0

Applied Medical Informatics

Review Vol. 41, No. 2 /2019, pp: 53-64

A Systematic Review of Big Data Potential to Make Synergies
between Sciences for Achieving Sustainable Health: Challenges
and Solutions

Shirin AYANI1, Khadijeh MOULAEI2,*, Sarah DARWISH KHANEHSARI1, Maryam
JAHANBAKHSH3, Faezeh SADEGHI4

1 Rayavaran Medical Informatics Company, Smart Hospital and Telemedicine Research Center,
Tehran, Iran.
2 Iran University of Medical Sciences, School of Management and Medical Information, Khadijeh
Moulaei, No.2, Corner of Hamsayegan Street, Valiasr Ave, Tehran, Iran.
3 Isfahan University of Medical Sciences, Isfahan, Iran.
4 Faran (Mehr Danesh) Non-governmental Institute of Virtual Higher Education, Department of
English Language, Tehran, Iran.
E-mail: Moulaei.kh@tak.iums.ac.ir

* Author to whom correspondence should be addressed; Tel: +982188783115; Fax: +982188661654

Received: January 9, 2019 /Accepted: May 16, 2019/ Published online: June 30, 2019

Abstract
The importance of the healthcare industry, benefiting from the synergies between sciences, adds to
the necessity of discovering knowledge, which is achievable with big data analytics tools. The purpose
of this article is to examine the challenges and provide solutions for using big data in the healthcare
industry. The methods of this article are derived from PRISMA guidelines and its models. A variety
of databases and search engines including PubMed, Scopus, Elsevier, IEEE, Springer, Web of
Science, ProQuest, and Google Scholar were searched according to credible keywords. The results of
the present study showed that the problems associated with the use of big data in the healthcare
industry could be classified in four groups including “data gathering, storage and integration”, “data
analysis”, “knowledge discovery and information interpretation”, and “infrastructure”. Although the
results point to a high frequency of challenges in the “data gathering, storage and integration” group,
the greatest weight of problems, due to their importance, appears to be visible in the “infrastructure”
group. Considering the numerous benefits of using big data, it is imperative to identify the challenges
and resolve them accurately. It is expected that all the barriers can be removed soon. Big data analytics
tools will be able to offer the best possible strategies based on human individual and social conditions
in the context of artificial intelligence methods.

Keywords: Big data; Data analysis; Data integration; Internet Of Things (IOT); Medical informatics;
Biological informatics

Introduction

At the end of the 1990s, in order to make the right decisions and gain a better understanding of
market behaviors, the role of gathering data, integrating and interpreting business information was
emphasized by the researchers. For this purpose, the term “Big Data” was introduced by Michael
Cox and David Ellsworth in 1997 [1]. Big data refers to a set of data whose volume is beyond
the capabilities of current databases and technologies. Therefore, in order to analyze these data,
databases with capacities higher than the terabyte and exabyte scale are needed [2].

The factors that distinguish big data from other data include volume (scale and size of data in storage),
velocity (the speed at which the data is generated, refreshed, and streamed),
variety (multiple different forms of the data), veracity (the degree of uncertainty in the data, which
determines how far it can be trusted), and value (deriving business value and insights from the data) [3-6].

Additionally, the outcomes of big data analyses contribute to the identification of unknown
patterns that show the causal relationships between different events in a wide range of real-world
information, and they end in knowledge production [7].

Before the concept of big data was introduced, the emergence of the information
revolution had already made it possible to collect data associated with healthcare activities in related
centers [8], and healthcare providers who used health information systems such as hospital
information systems (HISs) started generating massive data [9]. While the information systems were
used, specialists’ level of expectation went up, and another need was formed: how to understand the
multidimensional causal relationships associated with individual and social health. Simultaneously,
big data was introduced to the world and health researchers showed interest in this field [10-12]. Big
data in the healthcare industry refers to a set of data related to diagnosis and treatment of diseases,
contagious diseases, nutritional status, climate, political status, security of a country (especially war
conditions), cultural status, social system, regional and vernacular status, metabolism and
micronutrients (ions), genetics and cells, the economic status, the insurance companies’ bills and other
things [1, 10, 13-15]. These data are collected with equipment and tools such as the Internet, smartphones,
social media, sensors, and the databases of scientific societies and hospitals, as well as clinical and
hospital information systems [14, 16]. Once the data have been gathered, advanced analytics tools can
discover unknown patterns in them. The applications of these tools include individual and social
disease management, changes in habits and pathogenic conditions, prevention, diagnosis and
treatment of diseases especially rare diseases, forecasting, individualization of health services, support
and supervision of health social services [14]. One of the valuable advantages of analyzing data in the
healthcare industry is knowledge discovery beyond researchers’ imagination, which leads to
successful medical decision making by healthcare providers and to the production of clinical decision
support systems [3, 17, 18]. This analysis relies on particular computing technologies,
which require specific hardware structures and operating platforms. At present, the operating
platforms, hardware structures, and advanced technologies needed for big data are reasonably
accessible. Examples include Extensible Markup Language (XML), Web Services, Database Management
Systems, Hadoop, SAP HANA, and analytical software [1, 14, 19, 20]. On the other hand,
because of the considerable data volume, it is impossible to store and transfer data using traditional
methods and technologies. Today, SQL, MySQL, and Oracle databases are widely used for
implementing information systems, while for storing large data, Apache and NoSQL databases are
required [1].

Big data is formed in three stages: collecting, processing, and visualizing data
[15, 21]. Each stage is accompanied by significant challenges that hinder the successful operational
implementation of big data. Therefore, the purpose of this study was to survey the applications of big
data in the healthcare industry, with a view to increasing synergy and achieving sustainable health, and
to survey the challenges and provide suitable solutions.

Material and Method

The methods of this article are derived from PRISMA guidelines and its models (for further
information see www.prisma-statement.org).

Information Resources

A variety of databases and search engines including PubMed, Scopus, Elsevier, IEEE, Springer,
Web of Science, ProQuest, and Google Scholar were searched according to credible keywords and the
pre-specified search strategy mentioned below. The databases were searched from May 24, 2018, to
July 30, 2018.

Keywords and Search Strategy

The keywords of this research used in the search strategy are as follows: Big Data, Data Sets, Big
Data Analytics, Big Data Analytics Tools, Administrative Data, Structured Data, Unstructured Data,
Business Process Analytics, Real-time Analytics, Information Technology Management, Health Care,
Health Care Industries, E-Health Solutions, Social Health, Clinical Registries, Bioinformatics, Health
Informatics, Medical Informatics, Sensor Informatics, Challenge, Solution, Problems.

The applied search strategy was [Big data* AND (Business Process OR Data sets OR Real-time
OR Administrative Data OR Unstructured Data OR Structured Data OR Information Technology
Management OR Resource-based Theory) AND (Solutions* OR Challenges* OR Problems*) AND
(Healthcare OR E-Health OR Social Health OR Public Health OR Clinical Registries OR Medical
OR *Informatics OR Bioinformatics)].
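
The Boolean structure of this search string (four groups joined by AND, with alternatives inside each group joined by OR) can be made explicit in code. The following is a minimal Python sketch and not the authors' tooling: the function name `matches_strategy`, the term lists as written, and the use of case-insensitive substring matching in place of database wildcards are all assumptions made for illustration. Real bibliographic databases apply their own wildcard and field rules.

```python
# Hypothetical screening helper: checks a title or abstract against the
# four AND-ed groups of the search strategy using substring matching.
CONTEXT_TERMS = [
    "business process", "data sets", "real-time", "administrative data",
    "unstructured data", "structured data",
    "information technology management", "resource-based theory",
]
ISSUE_TERMS = ["solution", "challenge", "problem"]  # stems stand in for the * wildcard
HEALTH_TERMS = [
    "healthcare", "e-health", "social health", "public health",
    "clinical registries", "medical", "informatics", "bioinformatics",
]


def matches_strategy(text: str) -> bool:
    """Return True when the text satisfies every AND-ed group."""
    lowered = text.lower()
    return (
        "big data" in lowered
        and any(term in lowered for term in CONTEXT_TERMS)
        and any(term in lowered for term in ISSUE_TERMS)
        and any(term in lowered for term in HEALTH_TERMS)
    )
```

A record is kept only if every group contributes at least one matching term, which is why adding terms inside a group widens the search while adding a new group narrows it.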

Inclusion, Exclusion and Data Extraction

The inclusion criteria were as follows:
• Full-text resources were available.
• The articles were published in the last 10 years.
• The articles were published in scientific and high-ranking journals.
• Big data challenges and their potential solutions in the healthcare industry were suggested.

The exclusion criteria were as follows:
• The challenges were unrelated to the big data in the healthcare industry.
• The definitions were not clear and related to the challenges and their solutions.
• Concerning challenges and their solutions, the articles were not comprehensive.

First, all the challenges introduced in the selected articles were extracted. The challenges were
categorized into four groups: “data gathering, storage and integration”, “data analysis”, “knowledge
discovery and information interpretation” and “infrastructure”.

Then, to find or propose solutions for each challenge, the necessary examinations were carried
out and solutions were put in different groups along with their related challenges.

Results

The search results are shown in Figure 1, and the results of these studies are illustrated in Table
1. In this table, the problems were grouped, and solutions to each problem were specified.

Data Gathering, Storage and Integration

Over the years, the volume of data generated by healthcare
organizations has increased significantly. These data are collected from various sources as well as by
various tools and technologies such as information systems, cell phones, wireless sensors, RFID (radio
frequency identification) and so on [22, 25]. Big data in the healthcare industry is therefore built from
heterogeneous sources and different formats [28]. For this reason, problems such as noise,
confounding factors, and inconsistencies are common in the gathered data [25].
Also, during data gathering, some valuable data may be ignored or removed for reasons such as
inadequate storage space or the need to gather data from various sources [22].

Ladha and his colleagues argued that lost data might cause the creation of invalid patterns.
Therefore, three potential constraints in gathering data should be taken into account. First, some data
may be missing or artificial. Second, in some cases, some found data could lead to the definition of
ambiguous or contradictory variables. Third, some variables can act as confounding factors during
the analysis. The error rate associated with each of these three constraints indicates how ineffective
big data may be and the risk involved in using it for scientific research [11].

It is noteworthy that, to avoid redundancy in the gathered data, it is necessary to identify and
provide methods that prevent data overabundance and redundant data storage [35].
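
The missing-data and redundancy problems described above can be illustrated with a small audit routine. This is a minimal Python sketch under stated assumptions, not a method taken from the reviewed studies: the function name `audit_records`, the dictionary-based record layout, and the field names in the usage note are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def audit_records(
    records: List[Record], required: List[str]
) -> Tuple[int, int, List[Record]]:
    """Return (records_with_missing_fields, exact_duplicates, unique_records)."""
    seen = set()
    unique: List[Record] = []
    missing = 0
    duplicates = 0
    for rec in records:
        # A required field counts as missing when it is absent, None, or empty.
        if any(rec.get(field) in (None, "") for field in required):
            missing += 1
        # Exact duplicates are detected by comparing all key/value pairs.
        key = tuple(sorted(rec.items(), key=lambda item: item[0]))
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        unique.append(rec)
    return missing, duplicates, unique
```

For example, calling `audit_records` on three records in which one is an exact copy of another and one has no `glucose` value, with `required=["patient_id", "glucose"]`, reports one record with missing fields and one duplicate, and returns the two distinct records. Only exact duplicates are caught; near-duplicates from heterogeneous sources would need record-linkage techniques that this sketch does not attempt.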

Figure 1. Literature search criteria with inclusion and exclusion criteria

Data Analysis

During big data analysis, errors that occurred during data gathering and errors that were
hidden in the databases are identified and corrected as far as possible. Current information
systems in the healthcare industry are not integrated, so gathering data in a big data repository
allows these errors to be identified [39]. Because of this, and because practical methods for
accurate and rapid processing of massive volumes of data are lacking, big data analysis is a critical issue
[32]. However, since analyzing these health data leads to worthwhile outcomes, many
researchers are trying to overcome these challenges [40, 41]. For example, Mathew and Pillai
described the SAP HANA platform as highly useful in data analysis. SAP HANA utilizes some data
mining methods for analyzing complex and large volumes of data [3]. It is also imperative to use
techniques such as networks, graphs, and charts for data analysis [42-44]. Note that if no pattern is
extracted, it is necessary to re-formulate and repeat the analysis step [14]. The system analyst’s skill in
recognizing patterns, defining rules, modeling processes, detecting errors, and setting
error thresholds is important [45]. Evaluating the results of processing big data in order to validate
the acquired patterns is also a significant challenge and must be done by professional multidisciplinary
teams [1].

Another critical point is the need for a considerable amount of temporary memory that is accessible
on specific hardware platforms while analyzing the data [13].

Table 1. Big data challenges and their solutions in the healthcare industry

ID | Challenge Group | Challenge | Solution | Reference
1 | Data gathering, storage and integration | Difficulty in gathering and integrating data | Creating distributed databases which are interrelated and choosing a scientific manner to gather data from health centers nationally | [2,22,23]
1 | Data gathering, storage and integration | Lack of time priority in gathering data | Having data gathering patterns so that data interconnections are considered | [11,18]
1 | Data gathering, storage and integration | Data ambiguity | Using various scientific techniques to clarify data and code them and defining flexible formats for high-clarity data gathering | [13]
1 | Data gathering, storage and integration | Heterogeneity of different sources of gathered data | Using semantic networks and ontological interpretations | [10,24,25]
1 | Data gathering, storage and integration | Artificial nature of some data elements due to the loss of other relevant data | Using Loss data analysis and providing clear definitions of variables | [11,26,27]
1 | Data gathering, storage and integration | Existence of noisy data | Using preprocessing techniques (like PCA) | [26]
1 | Data gathering, storage and integration | Massive data | Centralized data monitoring, preventing redundancy and filtering unnecessary data | [17,28,29]
2 | Data analysis | Difficulty in analyzing big data | Using advanced computing technologies and processing parallelism | [30,31]
2 | Data analysis | Difficulty in extracting patterns and models | Using advanced data mining techniques to obtain valuable models and patterns | [13,23]
2 | Data analysis | Uncertainty about the accuracy of extracted information, patterns, and models | Evaluating is done to confirm the accuracy and value of the model | [13,23]
3 | Knowledge discovery and information interpretation | Interpretation of patterns | Information interpretation after getting help from experts in multidisciplinary or interdisciplinary fields | [1]
3 | Knowledge discovery and information interpretation | Difficulty in representing knowledge | Defining flexible formats to represent the interpreted information. The multidisciplinary or interdisciplinary specialists should document extracted knowledge | [23]
3 | Knowledge discovery and information interpretation | Studying the generalizability of knowledge and the accuracy of explicit knowledge | The validity of explicit knowledge should be confirmed statistically and epidemiologically | [26]
4 | Infrastructure | Absence of specific rules and standards | Defining a set of rules in the form of specific frameworks to achieve standards and to apply the rules in a correct way | [13,32,33]
4 | Infrastructure | Immaturity of required infrastructure | Identifying and implementing modern technologies and then complying the existing infrastructure with them | [25,27,36]
4 | Infrastructure | Lack of a stakeholder | Making collective business policies for applicability of big data | [32,34]
4 | Infrastructure | Data security (availability, confidentiality, and integrity) | Using advanced security standards and related technologies and then continuing the monitoring over data accuracy and quality | [23,35,36]
4 | Infrastructure | Lack of proper bandwidth for data transfer | Using high bandwidth or special data transfer protocols | [26,37]
4 | Infrastructure | Lack of some important information systems for storing data digitally as in Electronic Medical Records | Applying information systems like transaction processing systems, registration systems, and decision support systems | [38]

Knowledge Discovery and Information Interpretation

One of the major challenges of big data is the interpretation of information and patterns after the
analysis. At this point, the leading question is how new knowledge can be obtained from the
aggregation of information; then how it can be documented, displayed, and verified [46].

Knowledge discovery during data analysis based on the Internet of Things is one of the biggest
challenges that professionals encounter. Devices and software based on the Internet of Things
generate substantial data streams, which result in the mass production of data. Researchers can use
Artificial Intelligence and Machine Learning to interpret this information. Machine Learning
algorithms and Intelligent Computing are considered achievable solutions for analyzing big data
based on the Internet of Things [22, 47]. On the other hand, the simple representation of the
knowledge extracted from big data is a serious issue. If it is not possible to demonstrate the novel
knowledge, it is impossible to develop and apply it. Therefore, to reduce the complexity of the
discovered patterns and create relations between them, the opinions of experts in different subject areas
and from other perspectives are needed [35]. Identifying invalid patterns and accrediting extracted
knowledge also requires multidisciplinary expert teams. Since a pattern may carry over from one
specialized field to others, it is not easy to perceive the relationships between them and to evaluate the
results [35, 48].

Infrastructure

Infrastructure refers to the use of a combination of hardware, software, and services that should
be robust, supportive, and scalable [49]. The infrastructure of big data refers to everything that can
support its lifecycle on a large scale over time. In the infrastructure of big data, it is essential to
ensure data security (including confidentiality, integrity, and availability) [50, 51]. Many
challenges in healthcare stand in the way of a secure infrastructure; for example, the speed of the data
stream depends on high bandwidth [32]. High-bandwidth communication channels help in gathering
and managing data and in protecting its security [52].
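
Of the three security properties named above, integrity is the easiest to illustrate in a few lines. The following minimal Python sketch assumes that a record can be serialized to canonical JSON and that a digest was stored when the record was first written; the function names `record_digest` and `is_unchanged` are hypothetical and do not come from the reviewed studies.

```python
import hashlib
import hmac
import json


def record_digest(record: dict) -> str:
    """Compute a SHA-256 digest over a canonical JSON form of the record."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unchanged(record: dict, expected_digest: str) -> bool:
    """Compare the current digest with the stored one in constant time."""
    return hmac.compare_digest(record_digest(record), expected_digest)
```

A plain hash like this only detects accidental or unauthorized modification when the stored digest itself is protected; it does nothing for confidentiality or availability, which need encryption, access control, and redundancy respectively.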

On the other hand, storing and retrieving data from clinical and hospital information systems is a
very complex, time-consuming, expensive endeavour, which requires a robust infrastructure.
Databases and computing systems such as MongoDB, Hadoop, and MapReduce can provide proper
infrastructure for applying big data [53]. Given the characteristics of health data, the MongoDB
database can provide high performance, accessibility, and scalability for big data, and can be used to
create data repositories successfully [54, 55].

The conceptual framework of big data analysis in the healthcare industry differs from
traditional data analysis frameworks: processing should be distributed and performed across
the nodes of the network. Instead of running on a single machine, the processing is broken down and
carried out by multiple machines, whose analytics can be performed in parallel with the help of
MapReduce [56, 57]. Besides, open source platforms such as Hadoop, which can operate in the cloud,
support the use of big data analysis in the healthcare industry [13, 19].
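
The data flow of MapReduce can be sketched in plain Python. This is an illustration only: it runs in a single process, whereas Hadoop would ship each partition to a different node, and the function names and the idea of counting diagnosis codes are assumptions chosen for the example rather than anything specified in the reviewed studies.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_partition(diagnoses: Iterable[str]) -> Counter:
    """Map step: count diagnosis codes within one partition (one node's share)."""
    return Counter(diagnoses)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_diagnoses(partitions: List[List[str]]) -> Counter:
    """Apply the map step to every partition, then fold the partial results."""
    partials = map(map_partition, partitions)
    return reduce(reduce_counts, partials, Counter())
```

Because each partition is counted independently and the merge is associative, the map calls could run on separate machines in any order and still yield the same totals, which is the property that makes the approach suit distributed health data.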

Another challenge is the absence of national and international laws and standards, which,
if available, could guarantee the success of the work [33]. On the other hand, the absence of
information systems that record events prevents the required data from being gathered successfully.
Therefore, it seems necessary to develop these systems, especially Electronic Medical Records [58].

Ultimately, the study as a whole shows that although challenges are most frequent in the “data
gathering, storage and integration” group, the greatest weight of problems, due to their importance,
appears to lie in the “infrastructure” group.

The study found that many researchers are trying to discover health-related causal relationships
from big data by the use of artificial intelligence techniques. The discovery of these relationships is
expected to significantly increase the human ability to control the risk factors associated with
individual and social health.

Discussion

Different studies have pointed to a variety of challenges of big data, and they have been classified
from different perspectives. Acharjya and Kauser stated that for surveying these challenges, it is
essential to know different types of computational complexity, information security, and
computational methods of data analysis [22]. In a study, Andreu Perez and his colleagues also
emphasized the challenges of privacy, security, data ownership, stewardship and data governance [2].
Huffman declared that vendors, healthcare providers, and government officials must carefully
consider the big data challenges and design appropriate strategies. This will cause progress in all
science areas, especially in healthcare [59]. Nasser and Tariq divided big data challenges based on
their “life cycles” into three distinct categories: “data”, “process” and “management”.
They associated the challenges of the “data” category with data volume, variety, velocity, veracity,
volatility, quality, discovery, and dogmatism. They described the challenges of the “process”
category as how to capture data, how to integrate data, how to transform data, how to select the
right model for analysis and how to provide the results. Finally, privacy, security, governance, and
ethical aspects were grouped into the “management” category [29]. Philip and his colleagues argued
that challenges and opportunities come together and the challenges associated with big data will bring
many attractive opportunities in the future. They categorized big data challenges into data capture,
storing, searching, sharing, analyzing and visualizing categories [35].

In this study, the researchers found that the categorizations in other studies are too general.
Moreover, no study precisely addressed the challenges and proposed related solutions.
Therefore, to achieve a thorough understanding of big data implementation problems,
the challenges were identified first. Then, after complementary studies of each of the
challenges, a suitable solution was found or proposed. Note that for many of these
challenges, no precise executive solution has been suggested [1, 2, 11, 14, 17, 32]. Thus, most of the
solutions represent ideas and approaches that are expected to be achievable with current or
future technologies and tools.

The results of the present study showed that the problems associated with the use of big data in
the healthcare industry could be classified in four groups including “data gathering, storage and
integration”, “data analysis”, “knowledge discovery and information interpretation”, and
“infrastructure”. Although the results point to a high frequency of challenges in the “data gathering,
storage and integration” group, the greatest weight of problems, due to their importance, appears to
be visible in the “infrastructure” group. Considering the numerous benefits of using big data, it is
imperative to identify the challenges and resolve them accurately.

In connection with the importance of fixing the problems of using Big Data and benefiting from
their advantages, the following items can be mentioned:

Advances in human knowledge have revealed the interconnection of sciences and resulted in
the creation of multidisciplinary sciences. It is anticipated that in the future, multidisciplinary sciences
will revolutionize the world, and they can lead to the emergence of the knowledge age [60].
Multidisciplinary sciences focus on the flow of data between different scientific areas. These
data originate from influential factors that arise in one scientific field and affect others
[61]. In this regard, the following examples can be mentioned: differences in lifestyle and social
conditions and their effects on diseases or their treatments, differences in geographical conditions
and their effects on treatment and care. Some other items include wars and their long-term effects
on the health, psychological conditions in raising children and their outcomes in adolescence and
middle age as well as controlling chronic diseases in particular conditions, activation of a defective
gene and its role in the development of the disease with regard to environmental and nutritional
status and so on [62-67]. In such situations, knowledge discovery and understanding the effects of
the factors are very complicated. Hence, achieving the great human ideal, which is understanding
unknowns and discovering facts, can be fulfilled by the use of big data [6, 10]. Big data analytics tools
are valuable for identifying the relationships between sciences through knowledge discovery. Such
discovery represents the actual movements of a scientific field at one point in the world and its effects
on other areas of science in other parts of the world [6, 35, 68].

The relation between the sciences can be achieved by technological convergence that results in
the synergy of sciences, as in NBIC, which combines four different sciences (Nanotechnology,
Biotechnology, Information Technology, and Cognitive Science) [69, 70]. Technological
convergence is important for several reasons, including developing a person’s perception of reality and
creating robust technology platforms for health improvement and disease diagnosis and treatment
[10, 71, 72]. To achieve technological convergence, the potential of big data
to depict the connections between different sciences should not be ignored.

On the other hand, big data is capable of detecting and displaying multilevel molecular and genetic
interactions. Epigenetic knowledge can open another perspective for researchers and help them
explain epigenetic rules and their effects on gene expression. Furthermore, big data can help to
understand the functions of body organs “as one of the best extraordinary biological systems in the
world” [73, 74].

Another important point is that although clinical trials are the gold standard in medical science
for determining causality, the challenges of clinical trials provide an opportunity to use big data. Clinical
trials are morally questionable, expensive, and time-consuming. Bias and subjective attitudes can
accompany them, and in many cases they are not technically feasible. By contrast, the use of big data
gives researchers the opportunity to design and develop research hypotheses after considering disease
patterns and their commonalities, and to identify causal relationships, despite the challenges of clinical
trials [11].

From another perspective, big data analytics tools can be considered an appropriate
simulator for some health information systems that monitor and control influential factors. For
example, the Pharmacovigilance information system has been implemented to monitor drug effects
and determine adverse reactions [75, 76]. To run this information system, governments, some
organizations, and specialized users are taking action in some areas of the world. If big data fed
by the required distributed subsystems exists, it is possible to simulate Pharmacovigilance results.

Finally, the existence of big data is essential for pure and applied research. Most studies have
confined themselves to examining the existing big data challenges, and some solutions have only been
suggested theoretically [3, 11, 17, 32, 38]. Considering the huge benefits of big data, it is imperative to
identify the challenges and resolve them accurately. It is expected that with modern technologies, all the
barriers can be removed in the near future and big data analytics tools, utilizing artificial intelligence, will
be able to offer the best possible strategies based on social and individual conditions.

Conflict of Interest

The authors declare that they have no conflict of interest.

Acknowledgments

The researchers of this study would like to thank Dr. Rafat Bayat and the educational and research
staff of Faran (Mehr Danesh) Non-governmental Institute of Virtual Higher Education for the
financial and spiritual support of this research.

References

1. Cox M, Ellsworth D (Eds.). Application-controlled demand paging for out-of-core visualization.
VIS ’97 Proceedings of the 8th conference on Visualization ’97 (Cat No 97CB36155); 1997: IEEE.
97-010, July 1997.

2. Andreu-Perez J, Poon CCY, Merrifield RD, Wong STC, Yang GZ. Big Data for Health. IEEE
Journal of Biomedical and Health Informatics 2015;19(4):1193-1208.

3. Mathew PS, Pillai AS (Eds.). Big Data solutions in Healthcare: Problems and perspectives. 2015
International Conference on Innovations in Information, Embedded and Communication
Systems (ICIIECS); 2015, 19-20 March.

4. Lee I. Big data: Dimensions, evolution, impacts, and challenges. Business Horizons
2017;60(3):293-303.

5. Gandomi A, Haider M. Beyond the hype: Big data concepts, methods, and analytics. International
Journal of Information Management 2015;35(2):137-144.

6. Sagiroglu S, Sinanc D, editors. Big data: A review. Collaboration Technologies and Systems (CTS),
2013 International Conference; 2013: IEEE. 20-24 May 2013.

7. Chen C, Ma J, Susilo Y, Liu Y, Wang M. The promises of big data and small data for travel
behavior (aka human mobility) analysis. Transportation Research Part C: Emerging Technologies
2016;68:285-299.

8. Gill P, Stewart K, Treasure E, Chadwick B. Methods of data collection in qualitative research:
interviews and focus groups. British Dental Journal 2008;204(6):291-295.

9. Li J-S, Zhang Y-F, Tian Y. Medical Big Data Analysis in Hospital Information System. Big Data
on Real-World Applications. InTech; 2016. Available at:
https://www.intechopen.com/books/big-data-on-real-world-applications/medical-big-data-
analysis-in-hospital-information-system (accessed December 10, 2018)

10. Wang Y, Hajli N. Exploring the path to big data analytics success in healthcare. Journal of
Business Research 2017;70:287-299.

11. Ladha KS, Arora VS, Dutton RP, Hyder JA. Potential and Pitfalls for Big Data in Health Research.
Advances in Anesthesia 2015;33(1):97-111.

12. Riabacke M, Danielson M, Ekenberg L. State-of-the-art prescriptive criteria weight elicitation.
Advances in Decision Sciences 2012;2(3):1-24.

13. Wassan JT. Big Data Paradigm for Healthcare Sector. Maitreyi College, University of Delhi, India:
Maitreyi College; 2016.

14. Huang T, Lan L, Fang X, An P, Min J, Wang F. Promises and Challenges of Big Data Computing
in Health Sciences. Big Data Research 2015;2(1):2-11.

15. Koufi V, Malamateniou F, Vassilacopoulos G. A Big Data-driven Model for the Optimization of
Healthcare Processes. Stud Health Technol Inform. 2015;2(10):697-701.

16. Bhatt CM, Dey N, Ashour A. Internet of Things and Big Data Technologies for Next Generation
Healthcare. Springer; 2017.

17. Liu W, Park EK, editors. Big Data as an e-Health Service. 2014 International Conference on
Computing, Networking and Communications (ICNC); 2014 3-6 Feb 2014.

18. Belle A, Thiagarajan R, Soroushmehr S, Navidi F, Beard DA, Najarian K. Big data analytics in
healthcare. BioMed Research International. 2015;4(12):20-30.

19. Zikopoulos P, Eaton C. Understanding big data: Analytics for enterprise class hadoop and
streaming data: McGraw-Hill Osborne Media; 2011.

20. Färber F, May N, Lehner W, Große P, Müller I, Rauhe H, et al. The SAP HANA Database–An
Architecture Overview. IEEE Data Eng Bull.2012;35(1):28-33.

21. Janke AT, Overbeek DL, Kocher KE, Levy PD. Exploring the potential of predictive analytics
and big data in emergency care. Annals of Emergency Medicine 2016;67(2):227-236.

22. Acharjya DP, Kauser Ahmed P. A Survey on Big Data Analytics: Challenges, Open Research
Issues and Tools. (IJACSA) International Journal of Advanced Computer Science and
Applications 2016;7(2):512-8.

23. Cortés R, Bonnaire X, Marin O, Sens P. Stream Processing of Healthcare Sensor Data: Studying
User Traces to Identify Challenges from a Big Data Perspective. Procedia Computer Science
2015;52(1):1004-1009.

24. Khan N, Yaqoob I, Hashem IAT, Inayat Z, Mahmoud Ali WK, Alam M, et al. Big Data: Survey,
Technologies, Opportunities, and Challenges. The Scientific World Journal 2014;2014(2014):18-
24.

25. Fan J, Han F, Liu H. Challenges of Big Data analysis. National Science Review 2014;1(2):293-314.

26. Merelli I, Pérez-Sánchez H, Gesing S, D’Agostino D. Managing, analysing, and integrating big
data in medical bioinformatics: open problems and future perspectives. BioMed Research
International 2014;8(6):1-13.

27. Groenwold RH, Donders AR, Roes KC, Harrell FE, Jr., Moons KG. Dealing with missing
outcome data in randomized trials and observational studies. Am J Epidemiol. 2012;175(3):210-
7.

28. Tole AA. Big data challenges. Database Systems Journal 2013;4(3):31-40.

29. Nasser T, Tariq R. Big data challenges. Computer Engineering & Information Technology
2015;4(3):1-3.

30. Marx V. Biology: The big challenges of big data. Nature 2013;498(7453):255-260.

31. Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big data analytics: a survey. Journal of Big Data
2015;2(1):21-30.

32. Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf
Sci Syst. 2014;2(3):1-10.

33. Schultz T. Turning healthcare challenges into big data opportunities: A use-case review across the
pharmaceutical development lifecycle. Bulletin of the Association for Information Science and
Technology 2013;39(5):34-40.

34. Chaudhari N, Srivastava S (Eds.). Big data security issues and challenges. 2016 International
Conference on Computing, Communication and Automation (ICCCA); 29-30 April 2016.

35. Philip Chen CL, Zhang C-Y. Data-intensive applications, challenges, techniques and technologies:
A survey on Big Data. Information Sciences 2014;275:314-47.

36. Terzi DS, Terzi R, Sagiroglu S (Eds.). A survey on security and privacy issues in big data. 2015
10th International Conference for Internet Technology and Secured Transactions (ICITST);14-
16 Dec 2015.

37. Javadi B, Zhang B, Taufer M, editors. Bandwidth Modeling in Large Distributed Systems for Big
Data Applications. 2014 15th International Conference on Parallel and Distributed Computing,
Applications and Technologies 2014, pp. 9-11.

38. Kuziemsky CE, Monkman H, Petersen C, Weber J, Borycki EM, Adams S, et al. Big Data in
Healthcare – Defining the Digital Persona through User Contexts from the Micro to the Macro:
Contribution of the IMIA Organizational and Social Issues WG. Yearbook of Medical
Informatics 2014;9(1):82-9.

39. Bologa A-R, Bologa R, Florea A. Big data and specific analysis methods for insurance fraud
detection. Database Systems Journal 2013;4(4):30-39.

40. Kam HJ, Kim JA, Cho I, Kim Y, Park RW. Integration of heterogeneous clinical decision support
systems and their knowledge sets: feasibility study with drug-drug interaction alerts. AMIA Annual
Symposium Proceedings 2011;2011(1):664-73.

41. Zhang Z, Sarcevic A, An Y. A prototype system for heterogeneous data management and medical
devices integration in trauma resuscitation. iConference Proceedings 2013;3(6):1-5.

42. Lee J, Kwon YS, Färber F, Muehle M, Lee C, Bensberg C, et al., editors. SAP HANA distributed
in-memory database system: Transaction, session, and metadata management. Conference on
Data Engineering (ICDE), 29th International Conference, 8-12 April 2013

43. Morgen C, editor Empowering SAS. Users on the SAP HANA Platform. Proceedings of the SAS
Global Forum 2014. International Conference, 8-12 April 2014

44. Rudolf M, Paradies M, Bornhövd C, Lehner W, editors. The Graph Story of the SAP HANA
Database. BTW 2013;4:403-20.

45. Abdullah N, Ismail SA, Sophiayati S, Sam SM. Data quality in big data: a review. International
Journal of Advances in Soft Computing & Its Applications. 2015;7(3):16-2.

46. Das T, Kumar PM. Big data analytics: A framework for unstructured data analysis. International
Journal of Engineering Science & Technology 2013;5(1):153.

47. Chen X-Y, Jin Z-G. Research on Key Technology and Applications for Internet of Things.
Physics Procedia 2012;33:561-566.

48. Subasinghe K, Kodithuwakku S (Eds.). A big data analytic identity management expert system for
social media networks. Conference on Electrical and Computer Engineering (WIECON-ECE),
2015 International WIE, 19-20 Dec 2015.

49. Schoenborn B. Big Data Analytics Infrastructure for DUMMIES. United States of America: John
Wiley & Sons, Inc; 2014.

50. Demchenko Y, Ngo C, Membrey P. Architecture framework and components for the big data
ecosystem. Journal of System and Network Engineering 2013;4(7):1-31.

51. Moreno J, Serrano MA, Fernández-Medina E. Main Issues in Big Data Security. Future Internet
2016;8(3):44.

52. Wasan SK, Bhatnagar V, Kaur H. The impact of data mining techniques on medical diagnostics.
Data Science Journal 2006;5(2):119-26.

53. Nandimath J, Banerjee E, Patil A, Kakade P, Vaidya S, Chaturvedi D, editors. Big data analysis
using Apache Hadoop. Information Reuse and Integration (IRI), 2013 IEEE 14th International
Conference on; 20 July 2013.

54. Prasad S, Sha MSN (Eds.). NextGen data persistence pattern in healthcare: Polyglot persistence.
2013 Fourth International Conference on Computing, Communications and Networking
Technologies (ICCCNT), 4-6 July 2013.

55. Yu WD, Kollipara M, Penmetsa R, Elliadka S, editors. A distributed storage solution for cloud
based e-Healthcare Information System. 2013 IEEE 15th International Conference on e-Health
Networking, Applications and Services (Healthcom 2013), 9-12 Oct 2013.

56. Chen Y, Alspaugh S, Katz R. Interactive analytical processing in big data systems: A cross-industry
study of mapreduce workloads. Proceedings of the VLDB Endowment. 2012;5(12):1802-1813.

57. Wu X, Zhu X, Wu G-Q, Ding W. Data mining with big data. IEEE Transactions on Knowledge
and Data Engineering 2014;26(1):97-107.

58. Heiss H-U, Wagner R. Adaptive load control in transaction processing systems: Universität
Karlsruhe, Fakultät für Informatik; 1991.

59. Hoffman S. Medical Big Data and Big Data Quality Problems. 2014 ;289 (2014): 2015-2018.

60. Guba EG, Lincoln YS. Fourth generation evaluation. Sage; 1989.

61. Dierker S, Bergmann U, Corlett J, Falcone R, Galayda J, Gibson M, et al. Science and Technology
of Future Light Sources. Brookhaven National Laboratory; 2008.

62. McGinnis JM, Foege WH. Actual causes of death in the United States. JAMA 1993;270(18):2207-2212.

63. Paffenbarger Jr RS, Hyde R, Wing AL, Hsieh C-c. Physical activity, all-cause mortality, and
longevity of college alumni. New England Journal of Medicine 1986;314(10):605-613.

64. Wang H, Naghavi M, Allen C, Barber RM, Bhutta ZA, Carter A, et al. GBD 2015 Mortality and
Causes of Death Collaborators. Global, regional, and national life expectancy, all-cause mortality,
and cause-specific mortality for 249 causes of death, 1980-2015: a systematic analysis for the
Global Burden of Disease Study 2015. Lancet 2016;388(10053):1459-1544.

65. Short A. War and disease: War epidemics in the nineteenth and twentieth centuries. ADF Health
1949;11(1):15-18.

66. Briggs D. Environmental pollution and the global burden of disease. British Medical Bulletin.
2003;68(1):1-24.

67. Genes and human disease: World Health Organization(WHO); 2019 [cited 2017 SEP2017].
Available from: http://www.who.int/genomics/public/geneticdiseases/en/index2.html.

68. Agarwal R, Dhar V. Big data, data science, and analytics: The opportunity and challenge for IS
research. INFORMS; 201;25(3):443-448.

69. Porter AL, Youtie J. Where does nanotechnology belong in the map of science? Nature
Nanotechnology 2009;4(9):534-6.

70. Grunwald A, Orwat C. Technology Assessment of Information and Communication
Technologies. Encyclopedia of Information Science and Technology, Fourth Edition: IGI
Global 2018;2(1) 4267-4277.

71. Chen H, Chiang RHL, Storey VC. Business intelligence and analytics: from big data to big impact.
MIS Quart. 2012; 36 (4):1165-1188.

72. Murdoch TB, Detsky AS. The inevitable application of big data to health care. JAMA
2013;309(13):1351-1352.

73. Qin Y, Yalamanchili HK, Qin J, Yan B, Wang J. The Current Status and Challenges in
Computational Analysis of Genomic Big Data. Big Data Research 2015;2(1):12-18.

74. Chui CK, Mhaskar H, Zhuang X. Representation of functions on big data associated with directed
graphs. Applied and Computational Harmonic Analysis 2018;44(1):165-188.

75. Härmark L, Van Grootheest A. Pharmacovigilance: methods, recent developments and future
perspectives. European journal of clinical pharmacology. 2008;64(8):743-752.

76. Organization WH. The importance of pharmacovigilance: Geneva: World Health Organization;
2002 [accessed July 20, 2017]. Available from:
http://apps.who.int/medicinedocs/en/d/Js4893e/

Copyright of Applied Medical Informatics is the property of SRIMA Publishing House and its
content may not be copied or emailed to multiple sites or posted to a listserv without the
copyright holder’s express written permission. However, users may print, download, or email
articles for individual use.

Big data in healthcare: management, analysis and future prospects
Sabyasachi Dash1†, Sushil Kumar Shakyawar2,3†, Mohit Sharma4,5 and Sandeep Kaushik6*

SURVEY PAPER

Dash et al. J Big Data (2019) 6:54
https://doi.org/10.1186/s40537-019-0217-0

*Correspondence: sandeep.kaushik.nii2012@gmail.com; skaushik@i3bs.uminho.pt
†Sabyasachi Dash and Sushil Kumar Shakyawar contributed equally to this work
6 3B’s Research Group, Headquarters of the European Institute of Excellence on Tissue Engineering
and Regenerative Medicine, AvePark – Parque de Ciência e Tecnologia, Zona Industrial da Gandra,
Barco, 4805-017 Guimarães, Portugal
Full list of author information is available at the end of the article

Abstract
‘Big data’ is massive amounts of information that can work wonders. It has become a topic of
special interest for the past two decades because of the great potential that is hidden in it.
Various public and private sector industries generate, store, and analyze big data with an aim to
improve the services they provide. In the healthcare industry, various sources for big data include
hospital records, medical records of patients, results of medical examinations, and devices that are
a part of the internet of things. Biomedical research also generates a significant portion of big
data relevant to public healthcare. This data requires proper management and analysis in order to
derive meaningful information. Otherwise, seeking solutions by analyzing big data quickly becomes
comparable to finding a needle in a haystack. There are various challenges associated with each
step of handling big data which can only be surpassed by using high-end computing solutions for big
data analysis. That is why, to provide relevant solutions for improving public health, healthcare
providers are required to be fully equipped with appropriate infrastructure to systematically
generate and analyze big data. Efficient management, analysis, and interpretation of big data can
change the game by opening new avenues for modern healthcare. That is exactly why various
industries, including the healthcare industry, are taking vigorous steps to convert this potential
into better services and financial advantages. With a strong integration of biomedical and
healthcare data, modern healthcare organizations can possibly revolutionize medical therapies and
personalized medicine.

Keywords: Healthcare, Biomedical research, Big data analytics, Internet of things,
Personalized medicine, Quantum computing

Open Access

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and
indicate if changes were made.

Introduction
Information has been the key to better organization and new developments. The more information we
have, the more optimally we can organize ourselves to deliver the best outcomes. That is why data
collection is an important part of every organization. We can also use this data for the prediction
of current trends of certain parameters and future events. As we are becoming more and more aware
of this, we have started producing and collecting more data about almost everything by introducing
technological developments in this direction. Today, we are facing a situation wherein we are
flooded with tons of data from every aspect of our life such as social activities, science, work,
health, etc. In a way, we can compare the present situation to a data deluge. The technological
advances have helped us in generating more and more data, even to a level where it has become
unmanageable with currently available technologies. This has led to the creation of the term ‘big
data’ to describe data that is large and unmanageable. In order to meet our present and future
social needs, we need to develop new strategies to organize this data and derive meaningful
information. One such special social need is healthcare. Like every other industry, healthcare
organizations are producing data at a tremendous rate that presents many advantages and challenges
at the same time. In this review, we discuss the basics of big data including its management,
analysis and future prospects, especially in the healthcare sector.

The data overload

Every day, people working with various organizations around the world are generating a massive
amount of data. The term “digital universe” quantitatively defines such massive amounts of data
created, replicated, and consumed in a single year. International Data Corporation (IDC) estimated
the approximate size of the digital universe in 2005 to be 130 exabytes (EB). The digital universe
in 2017 expanded to about 16,000 EB or 16 zettabytes (ZB). IDC predicted that the digital universe
would expand to 40,000 EB by the year 2020. To put this size in perspective, it corresponds to
about 5200 gigabytes (GB) of data for every individual. This exemplifies the phenomenal speed at
which the digital universe is expanding. The internet giants, like Google and Facebook, have been
collecting and storing massive amounts of data. For instance, depending on our preferences, Google
may store a variety of information including user location, advertisement preferences, list of
applications used, internet browsing history, contacts, bookmarks, emails, and other necessary
information associated with the user. Similarly, Facebook stores and analyzes more than 30
petabytes (PB) of user-generated data. Such large amounts of data constitute ‘big data’. Over the
past decade, big data has been successfully used by the IT industry to generate critical
information that can generate significant revenue.
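
As a rough check on the figures quoted above, the following Python sketch relates the projected
40,000 EB to the per-person share of about 5200 GB, and estimates the yearly growth implied by the
2005 and 2017 sizes. It is only a back-of-the-envelope illustration: the decimal unit conventions
(1 EB = 10^18 bytes, 1 GB = 10^9 bytes), the assumption of steady compound growth, and the function
names are ours and are not taken from IDC.

```python
# Back-of-the-envelope check of the digital-universe figures quoted in the text.
# Assumptions (not from the source): decimal SI units and steady compound growth.

EB = 10**18  # bytes in one exabyte (decimal convention, assumed)
GB = 10**9   # bytes in one gigabyte (decimal convention, assumed)


def implied_population(total_eb: float, per_person_gb: float) -> float:
    """People implied by sharing total_eb equally at per_person_gb each."""
    return (total_eb * EB) / (per_person_gb * GB)


def annual_growth_rate(start_eb: float, end_eb: float, years: int) -> float:
    """Compound annual growth rate between two sizes of the digital universe."""
    return (end_eb / start_eb) ** (1 / years) - 1


if __name__ == "__main__":
    people = implied_population(40_000, 5_200)
    growth = annual_growth_rate(130, 16_000, 2017 - 2005)
    print(f"Implied population: {people / 1e9:.1f} billion")
    print(f"Implied growth 2005-2017: {growth:.0%} per year")
```

Under these assumptions, the per-person figure implies roughly 7.7 billion people, which is in line
with the world population around 2020, and the 2005 and 2017 estimates imply growth of roughly 49%
per year.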

These observations have become so conspicuous that they have eventually led to the birth of a new
field of science termed ‘Data Science’. Data science deals with various aspects including data
management and analysis, to extract deeper insights for improving the functionality or services of
a system (for example, healthcare and transport systems). Additionally, with the availability of
some of the most creative and meaningful ways to visualize big data post-analysis, it has become
easier to understand the functioning of any complex system. As a large section of society is
becoming aware of, and involved in, generating big data, it has become necessary to define what big
data is. Therefore, in this review, we attempt to provide details on the impact of big data in the
transformation of the global healthcare sector and its impact on our daily lives.

Defining big data

As the name suggests, ‘big data’ represents large amounts of data that is unmanageable using
traditional software or internet-based platforms. It surpasses the traditionally used amount of
storage, processing and analytical power. Even though a number of definitions for big data exist,
the most popular and well-accepted definition was given by Douglas Laney. Laney observed that (big)
data was growing in three different dimensions, namely volume, velocity and variety (known as the
3 Vs) [1]. The ‘big’ part of big data is indicative of its large volume. In addition to volume, the
big data description also includes velocity and variety. Velocity indicates the speed or rate at
which data is collected and made accessible for further analysis, while variety refers to the
different types of organized and unorganized data that any firm or system can collect, such as
transaction-level data, video, audio, text or log files. These three Vs have become the standard
definition of big data. Although others have added several more Vs to this definition [2], the
most widely accepted fourth V remains ‘veracity’.

The term “big data” has become extremely popular across the globe in recent years. Almost every
sector of research, whether it relates to industry or academics, is generating and analyzing big
data for various purposes. The most challenging task regarding this huge heap of data, which can be
organized or unorganized, is its management. Given that big data is unmanageable using traditional
software, we need technically advanced applications and software that can utilize fast and
cost-efficient high-end computational power for such tasks. Implementation of artificial
intelligence (AI) algorithms and novel fusion algorithms would be necessary to make sense of this
large amount of data. Indeed, it would be a great feat to achieve automated decision-making by the
implementation of machine learning (ML) methods like neural networks and other AI techniques.
However, in the absence of appropriate software and hardware support, big data can be quite hazy.
We need to develop better techniques to handle this ‘endless sea’ of data and smart web
applications for efficient analysis to gain workable insights. With proper storage and analytical
tools in hand, the information and insights derived from big data can make the critical social
infrastructure components and services (like healthcare, safety or transportation) more aware,
interactive and efficient [3]. In addition, visualization of big data in a user-friendly manner
will be a critical factor for societal development.

Healthcare as a big‑data repository

Healthcare is a multi-dimensional system established with the sole aim of preventing, diagnosing,
and treating health-related issues or impairments in human beings. The major components of a
healthcare system are the health professionals (physicians or nurses), health facilities (clinics,
hospitals for delivering medicines and other diagnosis or treatment technologies), and a financing
institution supporting the former two. The health professionals belong to various health sectors
like dentistry, medicine, midwifery, nursing, psychology, physiotherapy, and many others.
Healthcare is required at several levels depending on the urgency of the situation. Professionals
provide it as the first point of consultation (primary care), as acute care requiring skilled
professionals (secondary care), as advanced medical investigation and treatment (tertiary care) and
as highly uncommon diagnostic or surgical procedures (quaternary care). At all these levels, the
health professionals are responsible for different kinds of information such as the patient’s
medical history (diagnosis and prescription related data), medical and clinical data (like data
from imaging and laboratory examinations), and other private or personal medical data. Previously,
the common practice was to store such medical records for a patient in the form of either
handwritten notes or typed reports [4]. Even the results from a medical examination were stored in
a paper file system. In fact, this practice is really old, with the oldest case reports existing on
a papyrus text from Egypt that dates back to 1600 BC [5]. In Stanley Reiser’s words, the clinical
case records “freeze the episode of illness as a story in which patient, family and the doctor are
a part of the plot” [6].

With the advent of computer systems and their potential, the digitization of all clinical exams and
medical records in healthcare systems has become a standard and widely adopted practice. In 2003, a
division of the National Academies of Sciences, Engineering, and Medicine known as the Institute of
Medicine chose the term “electronic health records” to represent records maintained for improving
the health care sector towards the benefit of patients and clinicians. Electronic health records
(EHR), as defined by Murphy, Hanken and Waters, are computerized medical records for patients: “any
information relating to the past, present or future physical/mental health or condition of an
individual which resides in electronic system(s) used to capture, transmit, receive, store,
retrieve, link and manipulate multimedia data for the primary purpose of providing healthcare and
health-related services” [7].

Electronic health records

It is important to note that the National Institutes of Health (NIH) recently announced the “All of
Us” initiative (https://allofus.nih.gov/) that aims to collect data from one million or more
patients, such as EHR, medical imaging, socio-behavioral, and environmental data, over the next few
years. EHRs have introduced many advantages for handling modern healthcare related data. Below, we
describe some of the characteristic advantages of using EHRs. The first advantage of EHRs is that
healthcare professionals have improved access to the entire medical history of a patient. The
information includes medical diagnoses, prescriptions, data related to known allergies,
demographics, clinical narratives, and the results obtained from various laboratory tests. The
recognition and treatment of medical conditions is thus time efficient due to a reduction in the
lag time of previous test results. With time we have observed a significant decrease in redundant
and additional examinations, lost orders and ambiguities caused by illegible handwriting, and
improved care coordination between multiple healthcare providers. Overcoming such logistical errors
has led to a reduction in the number of drug allergies by reducing errors in medication dose and
frequency. Healthcare professionals have also found that access through web-based and electronic
platforms significantly improves their medical practice, using automatic reminders and prompts
regarding vaccinations, abnormal laboratory results, cancer screening, and other periodic checkups.
Facilitating communication among multiple healthcare providers and patients would bring greater
continuity of care and timely interventions. EHRs can be linked to electronic authorization and
immediate insurance approvals owing to reduced paperwork. EHRs enable faster data retrieval and
facilitate reporting of key healthcare quality indicators to the organizations, and also improve
public health surveillance by immediate reporting of disease outbreaks. EHRs also provide relevant
data regarding the quality of care for the beneficiaries of employee health insurance programs and
can help control the increasing costs of health insurance benefits. Finally, EHRs can reduce or
entirely eliminate delays and confusion in the billing and claims management area. The EHRs and the
internet together help provide access to millions of items of health-related medical information
critical for patient life.

Digitization of healthcare and big data

Similar to EHR, an electronic medical record (EMR) stores the standard medical and clinical data
gathered from the patients. EHRs, EMRs, personal health records (PHR), medical practice management
software (MPM), and many other healthcare data components collectively have the potential to
improve the quality, service efficiency, and costs of healthcare along with the reduction of
medical errors. The big data in healthcare includes the healthcare payer-provider data (such as
EMRs, pharmacy prescription, and insurance records) along with the genomics-driven experiments
(such as genotyping, gene expression data) and other data acquired from the smart web of internet
of things (IoT) (Fig. 1). The adoption of EHRs was slow at the beginning of the 21st century;
however, it has grown substantially since 2009 [7, 8]. The management and usage of such healthcare
data has been increasingly dependent on information technology. The development and usage of
wellness monitoring devices and related software that can generate alerts and share the health
related data of a patient with the respective health care providers has gained momentum, especially
in establishing a real-time biomedical and health monitoring system. These devices are generating a
huge amount of data that can be analyzed to provide real-time clinical or medical care [9]. The use
of big data from healthcare shows promise for improving health outcomes and controlling costs.

Big data in biomedical research

A biological system, such as a human cell, exhibits molecular and physical events of complex
interplay. In order to understand interdependencies of various components and events of such a
complex system, a biomedical or biological experiment usually gathers data on a smaller and/or
simpler component. Consequently, it requires multiple simplified experiments to generate a wide map
of a given biological phenomenon of interest. This indicates that the more data we have, the better
we understand the biological processes. With this idea, modern techniques have evolved at a great
pace. For instance, one can imagine the amount of data generated since the integration of efficient
technologies like next-generation sequencing (NGS) and genome-wide association studies (GWAS) to
decode human genetics. NGS-based data provides information at depths that were previously
inaccessible and takes the experimental scenario to a completely new dimension. It has increased
the resolution at which we observe or record biological events associated with specific diseases in
real time. The idea that large amounts of data can provide us a good amount of information that
often remains unidentified or hidden in smaller experimental methods has ushered in the ‘-omics’
era. The ‘omics’ discipline has witnessed significant progress as, instead of studying a single
‘gene’, scientists can now study the whole ‘genome’ of an organism in ‘genomics’ studies within a
given amount of time. Similarly, instead of studying the expression or ‘transcription’ of a single
gene, we can now study the expression of all the genes or the entire ‘transcriptome’ of an organism
under ‘transcriptomics’ studies. Each of these individual experiments generates a large amount of
data with more depth of information than ever before. Yet, this depth and resolution might be
insufficient to provide all the details required to explain a particular mechanism or event.
Therefore, one usually finds oneself analyzing a large amount of data obtained from multiple
experiments to gain novel insights. This fact is supported by a continuous rise in the number of
publications regarding big data in healthcare (Fig. 2). Analysis of such big data from medical and
healthcare systems can be of immense help in providing novel strategies for healthcare. The latest
technological developments in data generation, collection and analysis have raised expectations
towards a revolution in the field of personalized medicine in the near future.

Fig. 1 Workflow of Big data Analytics. Data warehouses store massive amounts of data generated from
various sources. This data is processed using analytic pipelines to obtain smarter and affordable
healthcare options

Big data from omics studies

NGS has greatly simplified sequencing and decreased the cost of generating whole genome
sequence data. The cost of complete genome sequencing has fallen from millions to a
couple of thousand dollars [10]. NGS technology has increased the volume of biomedical
data that comes from genomic and transcriptomic studies. According to one estimate, the
number of human genomes sequenced by 2025 could be between 100 million and 2 billion
[11]. Combining genomic and transcriptomic data with proteomic and metabolomic data can
greatly enhance our knowledge about the individual profile of a patient, an approach
often described as “individual,

Fig. 2 Publications associated with big data in healthcare. The numbers of publications in PubMed
are plotted by year

personalized or precision health care”. Systematic and integrative analysis of omics
data in conjunction with healthcare analytics can help design better treatment
strategies towards precision and personalized medicine (Fig. 3). Genomics-driven
experiments, e.g., genotyping, gene expression, and NGS-based studies, are the major
source of big data in biomedical healthcare, along with EMRs, pharmacy prescription
information, and insurance records. Healthcare requires a strong integration of such
biomedical data from various sources to provide better treatments and patient care.
These prospects are so exciting that, even though genomic data from patients involves
many variables that must be accounted for, commercial organizations are already using
human genome data to help providers make personalized medical decisions. This might turn
out to be a game-changer in future medicine and health.

Internet of Things (IOT)

The healthcare industry has been slower than other industries to adapt to the big data
movement, so big data usage in the healthcare sector is still in its infancy. For
example, healthcare and biomedical big data have not yet converged to enhance healthcare
data with molecular pathology. Such convergence can help unravel various mechanisms of
action or other aspects of predictive biology. Therefore, to assess an individual’s
health status, biomolecular and clinical datasets need to be combined. One such source
of clinical data in healthcare is the ‘internet of things’ (IoT).

IoT is already a big player in a number of industries, including healthcare. Until
recently, objects of common use such as cars, watches, refrigerators and
health-monitoring devices did not usually produce or handle data and lacked internet
connectivity. However, furnishing such objects with computer chips and sensors that
enable data collection and transmission over the internet has opened new avenues. Device
technologies such as Radio Frequency IDentification (RFID) tags and readers,

Fig. 3 A framework for integrating omics data and health care analytics to promote personalized treatment

and Near Field Communication (NFC) devices, which can not only gather information but
also interact physically, are increasingly used as information and communication systems
[3]. This enables objects with RFID or NFC to communicate and function as a web of smart
things. The analysis of data collected from these chips or sensors may reveal critical
information that helps in improving lifestyle, establishing measures for energy
conservation, and improving transportation and healthcare. IoT has become a rising
movement in the field of healthcare. IoT devices create a continuous stream of data
while monitoring the health of people (or patients), which makes these devices a major
contributor to big data in healthcare. Such resources can interconnect various devices
to provide a reliable, effective and smart healthcare service to the elderly and to
patients with a chronic illness [12].

Advantages of IoT in healthcare

Using the web of IoT devices, a doctor can measure and monitor various parameters of
patients at their respective locations, for example, home or office. With early
intervention and treatment, a patient might not need hospitalization or even a visit to
the doctor, resulting in a significant reduction in healthcare expenses. Some examples
of IoT devices used in healthcare include fitness or health-tracking wearable devices,
biosensors, clinical devices for monitoring vital signs, and other types of devices or
clinical instruments. Such IoT devices generate a large amount of health-related data.
If we can integrate this data with other existing healthcare data like EMRs or PHRs, we
can predict a patient’s health status and its progression from a subclinical to a
pathological state [9]. In fact, big data generated from IoT has been quite advantageous
in several areas in offering better investigation and predictions. On a larger scale,
the data from such devices can help in personal health monitoring, modelling the spread
of a disease and finding ways to contain a particular disease outbreak.
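
As a rough illustration of the integration step described above, the following Python sketch joins a few wearable readings to EMR-style records by patient identifier. The field names, the sample values and the heart rate threshold are hypothetical and chosen only for this example; they do not come from the cited work or from any clinical standard.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical EMR-style records keyed by patient ID (illustrative values only).
emr = {
    "p01": {"age": 67, "diagnosis": "hypertension"},
    "p02": {"age": 34, "diagnosis": "none"},
}

# Hypothetical wearable readings: (patient ID, resting heart rate in bpm).
readings = [("p01", 92), ("p01", 96), ("p02", 61), ("p02", 63), ("p01", 94)]

def integrate(emr_records, device_readings, hr_threshold=90):
    """Attach the mean heart rate to each EMR record and flag high values."""
    by_patient = defaultdict(list)
    for patient_id, bpm in device_readings:
        by_patient[patient_id].append(bpm)

    merged = {}
    for patient_id, record in emr_records.items():
        values = by_patient.get(patient_id, [])
        avg = mean(values) if values else None
        merged[patient_id] = {
            **record,
            "mean_hr": avg,
            # The threshold is an assumption for the sketch, not clinical advice.
            "flag": avg is not None and avg > hr_threshold,
        }
    return merged

merged = integrate(emr, readings)
```

The join key is the only thing the two sources share, which is why consistent patient identifiers matter so much when device data is combined with EMRs or PHRs.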

Because of its specific nature, the analysis of data from IoT would require updated
operating software along with advanced hardware and software applications. We would need
to manage data inflow from IoT instruments in real time and analyze it by the minute.
Stakeholders in the healthcare system are trying to trim down costs and improve the
quality of care by applying advanced analytics to both internally and externally
generated data.
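
Analyzing device data "by the minute" can be pictured as grouping timestamped samples into one-minute windows. The sketch below is a minimal example under stated assumptions: the samples are hypothetical, and it processes a finished list, whereas a real deployment would consume an unbounded stream.

```python
from collections import defaultdict
from statistics import mean

def per_minute_means(samples):
    """Group (timestamp_seconds, value) samples into one-minute windows.

    Returns a dict mapping the minute index (timestamp // 60) to the mean
    value observed in that minute.
    """
    windows = defaultdict(list)
    for timestamp, value in samples:
        windows[timestamp // 60].append(value)
    return {minute: mean(values) for minute, values in sorted(windows.items())}

# Hypothetical heart-rate samples: (seconds since start, beats per minute).
stream = [(5, 70), (30, 74), (59, 72), (61, 80), (119, 84), (125, 90)]
summary = per_minute_means(stream)
```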

Mobile computing and mobile health (mHealth)

In today’s digital world, almost every individual seems keen on tracking their fitness
and health statistics using the built-in pedometer of their portable and wearable
devices such as smartphones, smartwatches, fitness dashboards or tablets. With an
increasingly mobile society in almost all aspects of life, the healthcare infrastructure
needs remodeling to accommodate mobile devices [13]. The practice of medicine and public
health using mobile devices, known as mHealth or mobile health, pervades different
degrees of health care, especially for chronic diseases such as diabetes and cancer
[14]. Healthcare organizations are increasingly using mobile health and wellness
services to implement novel and innovative ways to provide care and coordinate health as
well as wellness. Mobile platforms can improve healthcare by accelerating interactive
communication between patients and healthcare providers. In fact, Apple and Google have
developed

dedicated platforms like Apple’s ResearchKit and Google Fit for developing research
applications for fitness and health statistics [15]. These applications support seamless
interaction with various consumer devices and embedded sensors for data integration.
These apps give doctors direct access to the user’s overall health data, so both users
and their doctors know the real-time status of the user’s health. These apps and smart
devices also help by improving wellness planning and encouraging healthy lifestyles.
Users or patients can become advocates for their own health.

Nature of the big data in healthcare

EHRs can enable advanced analytics and help clinical decision-making by providing
enormous amounts of data. However, a large proportion of this data is currently
unstructured. Unstructured data is information that does not adhere to a pre-defined
model or organizational framework. One reason for this is simply that we can record it
in a myriad of formats. Another reason for opting for an unstructured format is that the
structured input options (drop-down menus, radio buttons, and check boxes) often fall
short of capturing data of a complex nature. For example, we cannot record non-standard
data regarding a patient’s clinical suspicions, socioeconomic data, patient preferences,
key lifestyle factors, and other related information in any way other than an
unstructured format. It is difficult to group such varied, yet critical, sources of
information into an intuitive or unified data format for further analysis using
algorithms to understand and improve patient care. Nonetheless, the healthcare industry
needs to utilize the full potential of these rich streams of information to enhance the
patient experience. In the healthcare sector, this could materialize as better
management, better care and low-cost treatments. We are miles away from realizing the
benefits of big data in a meaningful way and harnessing the insights that come from it.
In order to achieve these goals, we need to manage and analyze big data in a systematic
manner.
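
The contrast between structured and unstructured fields can be made concrete with a small sketch. The record layout, field names and note text below are hypothetical and written only for this example; the keyword match is a crude stand-in for real text analysis.

```python
from dataclasses import dataclass

@dataclass
class Encounter:
    # Structured fields: fixed names and constrained values.
    patient_id: str
    smoker: bool
    systolic_bp: int
    # Unstructured field: free text with no pre-defined model.
    note: str

encounters = [
    Encounter("p01", False, 150, "Pt reports financial stress; prefers evening appointments."),
    Encounter("p02", True, 118, "Suspect early COPD, patient declines spirometry for now."),
]

# Structured data can be filtered directly by field.
high_bp = [e.patient_id for e in encounters if e.systolic_bp >= 140]

# Free text can only be searched as a string unless it is parsed further.
mentions_suspicion = [e.patient_id for e in encounters if "suspect" in e.note.lower()]
```

The structured filter is exact, while the text search depends on the wording a clinician happened to use, which is the grouping difficulty the paragraph above describes.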

Management and analysis of big data

Big data is a huge amount of varied data generated at a rapid rate. The data gathered
from various sources is mostly required for optimizing consumer services rather than
consumer consumption. This is also true for big data from biomedical research and
healthcare. The major challenge with big data is how to handle this large volume of
information. To make it available to the scientific community, the data needs to be
stored in a file format that is easily accessible and readable for efficient analysis.
In the context of healthcare data, another major challenge is the implementation of
high-end computing tools, protocols and hardware in the clinical setting. Experts from
diverse backgrounds including biology, information technology, statistics, and
mathematics are required to work together to achieve this goal. The data collected using
the sensors can be made available on a storage cloud with pre-installed software tools
developed by analytic tool developers. These tools would have data mining and ML
functions developed by AI experts to convert the information stored as data into
knowledge. Upon implementation, this would enhance the efficiency of acquiring, storing,
analyzing, and visualizing big data from healthcare. The main task is to annotate,
integrate, and present this complex data in an appropriate manner for a better
understanding. In the absence

of such relevant information, the (healthcare) data remains quite opaque and may not
lead biomedical researchers any further. Finally, visualization tools developed by
computer graphics designers can efficiently display this newly gained knowledge.

Heterogeneity of data is another challenge in big data analysis. The huge size and
highly heterogeneous nature of big data in healthcare make it relatively uninformative
when handled with conventional technologies. The most common platforms for operating the
software frameworks that assist big data analysis are high-performance computing
clusters accessed via grid computing infrastructures. Cloud computing is one such system
that has virtualized storage technologies and provides reliable services. It offers high
reliability, scalability and autonomy along with ubiquitous access, dynamic resource
discovery and composability. Such platforms can act as a receiver of data from
ubiquitous sensors, as a computer to analyze and interpret the data, and as a provider
of easy-to-understand web-based visualization for the user. In IoT, big data processing
and analytics can be performed closer to the data source using the services of mobile
edge computing cloudlets and fog computing. Advanced algorithms are required to
implement ML and AI approaches for big data analysis on computing clusters. A
programming language suitable for working on big data (e.g. Python, R or other
languages) could be used to write such algorithms or software. Therefore, a good
knowledge of both biology and IT is required to handle big data from biomedical
research. This combination of skills is typical of bioinformaticians. The most common
platforms used for working with big data include Hadoop and Apache Spark. We briefly
introduce these platforms below.

Hadoop

Loading large amounts of (big) data into the memory of even the most powerful of
computing clusters is not an efficient way to work with big data. Therefore, the most
logical approach for analyzing huge volumes of complex big data is to distribute it and
process it in parallel on multiple nodes. However, the size of the data is usually so
large that thousands of computing machines are required to distribute and finish
processing in a reasonable amount of time. When working with hundreds or thousands of
nodes, one has to handle issues like how to parallelize the computation, distribute the
data, and handle failures. One of the most popular open-source distributed applications
for this purpose is Hadoop [16]. Hadoop implements the MapReduce algorithm for
processing and generating large datasets. MapReduce uses map and reduce primitives: the
map operation turns each logical ‘record’ in the input into a set of intermediate
key/value pairs, and the reduce operation combines all the values that share the same
key [17]. It efficiently parallelizes the computation, handles failures, and schedules
inter-machine communication across large-scale clusters of machines. Hadoop Distributed
File System (HDFS) is the file system component that provides scalable, efficient, and
replica-based storage of data at the various nodes that form a cluster [16]. Hadoop has
other tools that enhance the storage and processing components, and many large companies
like Yahoo, Facebook, and others have rapidly adopted it. Hadoop has enabled researchers
to use data sets otherwise impossible to handle. Many large projects, like the
determination of a correlation between air quality data and asthma admissions, drug
development using genomic and proteomic

data, and other such aspects of healthcare are implementing Hadoop. Therefore, with
the implementation of the Hadoop system, healthcare analytics will not be held back.
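
The map and reduce primitives described in this section can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop's API: the function names are hypothetical, the input data is invented, and the parallelism, fault handling and distributed storage that Hadoop provides are left out.

```python
from collections import defaultdict

def map_phase(record):
    """Map one logical record to intermediate (key, value) pairs."""
    region, admissions = record
    yield region, admissions

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values that share the same key."""
    return key, sum(values)

# Hypothetical input: (region, asthma admissions reported by one hospital).
records = [("north", 12), ("south", 7), ("north", 5), ("east", 3), ("south", 1)]

intermediate = [pair for record in records for pair in map_phase(record)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```

Because each call to `map_phase` depends on one record only and each call to `reduce_phase` depends on one key only, both phases can be spread across many nodes, which is the property a framework such as Hadoop exploits.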

Apache Spark

Apache Spark is an open-source alternative to Hadoop. It is a unified engine for
distributed data processing that includes higher-level libraries for supporting SQL
queries (Spark SQL), streaming data (Spark Streaming), machine learning (MLlib) and
graph processing (GraphX) [18]. These libraries help increase developer productivity
because the programming interface requires less coding effort and the libraries can be
seamlessly combined to create more complex types of computations. By implementing
Resilient Distributed Datasets (RDDs), Spark supports in-memory processing of data,
which can make it about 100× faster than Hadoop in multi-pass analytics (on smaller
datasets) [19, 20]. This holds especially when the data size is smaller than the
available memory [21]. It also indicates that processing really big data with Apache
Spark would require a large amount of memory. Since memory costs more than hard drive
storage, MapReduce is expected to be more cost effective than Apache Spark for large
datasets. Similarly, Apache Storm was developed to provide a real-time framework for
data stream processing. This platform supports most programming languages. Additionally,
it offers good horizontal scalability and built-in fault tolerance for big data
analysis.
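
The advantage of in-memory processing for multi-pass analytics can be illustrated with a toy model. The sketch below does not use Spark or its API; `CountingSource` is a hypothetical stand-in for slow storage, and the example only counts how often the data has to be loaded when it is reread on every pass versus kept in memory after the first load.

```python
class CountingSource:
    """Stands in for slow storage; counts how often the data is loaded."""

    def __init__(self, values):
        self._values = list(values)
        self.loads = 0

    def load(self):
        self.loads += 1
        return list(self._values)

def multi_pass_from_storage(source, passes):
    """Reload the data from storage on every pass."""
    total = 0
    for _ in range(passes):
        total += sum(source.load())
    return total

def multi_pass_in_memory(source, passes):
    """Load once, keep the data in memory, and reuse it on every pass."""
    cached = source.load()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total

disk_style = CountingSource(range(1000))
memory_style = CountingSource(range(1000))
result_a = multi_pass_from_storage(disk_style, passes=10)
result_b = multi_pass_in_memory(memory_style, passes=10)
```

Both variants return the same result, but the cached variant touches storage once instead of ten times. The trade-off noted above also shows up here: the cached copy has to fit in memory.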

Machine learning for information extraction, data analysis and predictions

In healthcare, patient data contains recorded signals, for instance electrocardiograms
(ECG), images, and videos. Healthcare providers have barely managed to convert such
healthcare data into EHRs. Efforts are underway to digitize patient histories from
pre-EHR era notes and to support the standardization process by turning static images
into machine-readable text. For example, optical character recognition (OCR) software
can recognize handwriting as well as computer fonts and so push digitization forward.
Such unstructured and structured healthcare datasets hold an untapped wealth of
information that can be harnessed using advanced AI programs to draw critical actionable
insights in the context of patient care. In fact, AI has emerged as the method of choice
for big data applications in medicine. It has quickly found its niche in the
decision-making process for the diagnosis of diseases. Healthcare professionals analyze
such data for targeted abnormalities using appropriate ML approaches. ML can filter out
structured information from such raw data.

Extracting information from EHR datasets

Emerging ML- or AI-based strategies are helping to refine the healthcare industry’s
information processing capabilities. For example, natural language processing (NLP) is
a rapidly developing area of machine learning that can identify key syntactic structures
in free text, help in speech recognition and extract the meaning behind a narrative. NLP
tools can help generate new documents, like a clinical visit summary, or dictate
clinical notes. The unique content and complexity of clinical documentation can be
challenging

for many NLP developers. Nonetheless, approaches such as NLP should allow us to extract
relevant information from healthcare data.
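
To give a feel for what extracting structured information from free text means, the sketch below pulls drug names and doses out of a short note. It uses a single regular expression, which is simple rule-based pattern matching rather than a real NLP pipeline; the pattern, the function name and the note are hypothetical and written only for this example.

```python
import re

# Pattern for "<drug name> <number> mg"; a deliberately narrow, hypothetical rule.
DOSE_PATTERN = re.compile(r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\s*mg\b", re.IGNORECASE)

def extract_medications(note):
    """Return a list of {"drug", "dose_mg"} dicts found in free text."""
    return [
        {"drug": name.lower(), "dose_mg": float(dose)}
        for name, dose in DOSE_PATTERN.findall(note)
    ]

# Hypothetical clinical note written for this example.
note = "Started metformin 500 mg twice daily. Continue Lisinopril 10mg. Follow up in 3 months."
structured = extract_medications(note)
```

A rule like this breaks as soon as the wording varies (abbreviations, other units, misspellings), which is one reason clinical documentation is hard for NLP developers.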

AI has also been used to provide predictive capabilities for healthcare big data. For
example, ML algorithms can turn the diagnostic reading of medical images into automated
decision-making. Although healthcare professionals are unlikely to be replaced by
machines in the near future, AI can definitely assist physicians in making better
clinical decisions or even replace human judgment in certain functional areas of
healthcare.

Image analytics

Some of the most widely used imaging techniques in healthcare include computed
tomography (CT), magnetic resonance imaging (MRI), X-ray, molecular imaging,
ultrasound, photo-acoustic imaging, functional MRI (fMRI), positron emission tomography
(PET), electroencephalography (EEG), and mammograms. These techniques capture
high-definition medical images (patient data) of large sizes. Healthcare professionals
like radiologists, doctors and others do an excellent job of analyzing medical data in
the form of these files for targeted abnormalities. However, it is also important to
acknowledge the lack of specialized professionals for many diseases. To compensate for
this dearth of professionals, efficient systems like the Picture Archiving and
Communication System (PACS) have been developed for storing medical images and reports
and giving convenient access to them [22]. PACSs are popular for delivering images to
local workstations, accomplished by protocols such as Digital Imaging and Communications
in Medicine (DICOM). However, data exchange with a PACS relies on using structured data
to retrieve medical images. This by nature misses out on the unstructured information
contained in some of the biomedical images. Moreover, additional information about a
patient’s health status that is present in these images or similar data can be missed. A
professional focused on diagnosing an unrelated condition might not observe it,
especially when the condition is still emerging. To help in such situations, image
analytics is making an impact on healthcare by actively extracting disease biomarkers
from biomedical images. This approach uses ML and pattern recognition techniques to draw
insights from massive volumes of clinical image data to transform the diagnosis,
treatment and monitoring of patients. It focuses on enhancing the diagnostic capability
of medical imaging for clinical decision-making.
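
As a very small illustration of deriving a feature from image data, the sketch below applies fixed intensity thresholding to a tiny grayscale grid. Thresholding is a far simpler technique than the ML and pattern recognition methods mentioned above and is used here only as a stand-in; the image values, the threshold and the function names are hypothetical.

```python
def segment(image, threshold):
    """Return a binary mask: 1 where pixel intensity exceeds the threshold."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]

def region_fraction(mask):
    """Fraction of pixels flagged by the mask, a crude image-derived feature."""
    flagged = sum(sum(row) for row in mask)
    total = sum(len(row) for row in mask)
    return flagged / total

# Hypothetical 4x4 grayscale image (0-255); the bright block mimics a lesion.
image = [
    [10, 12, 11, 9],
    [13, 200, 210, 12],
    [11, 205, 220, 10],
    [9, 10, 12, 11],
]
mask = segment(image, threshold=128)
fraction = region_fraction(mask)
```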

A number of software tools have been developed to perform medical image analysis and dig
out hidden information, offering functionalities such as generic processing,
registration, segmentation, visualization, reconstruction, simulation and diffusion. For
example, the Visualization Toolkit (VTK) is freely available software that allows
powerful processing and analysis of 3D images from medical tests [23], while SPM can
process and analyze 5 different types of brain images (MRI, fMRI, PET, CT-Scan and EEG)
[24]. Other software like GIMIAS, Elastix, and MITK supports all types of images.
Various other widely used tools and their features in this domain are listed in Table 1.
Such bioinformatics-based big data analysis may extract greater insights and value from
imaging data to boost and support precision medicine projects, clinical decision support
tools, and other modes of healthcare. For example, we can also use it to monitor new
targeted treatments for cancer.

Table 1 Bioinformatics tools for medical image processing and analysis

Tools/softwares compared (19): VTK, ITK, DTI-TK, ITK-Snap, FSL, SPM, NiftyReg, NiftySeg,
NiftySim, NiftyRec, ANTS, GIMIAS, elastix, MIA, MITK, Camino, OsiriX, MRIcron, IMOD

Feature | Number of tools marked (x) out of 19
Input image support: MRI | 15
Input image support: Ultrasound | 7
Input image support: X-ray | 6
Input image support: fMRI | 7
Input image support: PET | 4
Input image support: CT-Scan | 4
Input image support: EEG | 4
Input image support: Mammogram | 3
Graphical user interface | 19
Functions: Generic | 10
Functions: Registration | 12
Functions: Segmentation | 11
Functions: Visualization | 12
Functions: Reconstruction | 12
Functions: Simulation | 9
Functions: Diffusion | 8

The column positions of the individual marks are not recoverable from this copy, so
marks are given per row rather than per tool.

Tool web links: VTK (https://vtk.org/), ITK (https://itk.org/), DTI-TK
(http://dti-tk.sourceforge.net/pmwiki/pmwiki.php), ITK-Snap
(http://www.itksnap.org/pmwiki/pmwiki.php), FSL (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/),
SPM (https://www.fil.ion.ucl.ac.uk/spm/), NiftyReg (http://sourceforge.net/projects/niftyreg),
NiftySeg (http://sourceforge.net/projects/niftyseg), NiftySim
(http://sourceforge.net/projects/niftysim), NiftyRec (http://sourceforge.net/projects/niftyrec),
ANTS (http://picsl.upenn.edu/software/ants/), GIMIAS (http://www.gimias.org/), elastix
(http://elastix.isi.uu.nl/), MIA (http://mia.sourceforge.net/), MITK (http://www.mitk.org/wiki),
Camino (http://web4.cs.ucl.ac.uk/research/medic/camino/pmwiki/pmwiki.php?n=Main.HomePage),
IMOD (https://omictools.com/imod-tool), MRIcron (https://omictools.com/mricron-tool),
OsiriX (https://omictools.com/osirix-tool)

Big data from omics

Big data from “omics” studies is a new kind of challenge for bioinformaticians. Robust
algorithms are required to analyze such complex data from biological systems. The
ultimate goal is to convert this huge volume of data into an informative knowledge base.
The application of bioinformatics approaches to transform biomedical and genomics data
into predictive and preventive health is known as translational bioinformatics. It is at
the forefront of data-driven healthcare. Various kinds of quantitative data in
healthcare, for example from laboratory measurements, medication data and genomic
profiles, can be combined and used to identify new meta-data that can help precision
therapies [25]. This is why new technologies are required to help in analyzing this
digital wealth. In fact, highly ambitious multimillion-dollar projects like the “Big
Data Research and Development Initiative” have been launched that aim to enhance the
quality of big data tools and techniques for better organization, efficient access and
smart analysis of big data. Many advantages are anticipated from the processing of
‘omics’ data from the large-scale Human Genome Project and other population sequencing
projects. In population sequencing projects like 1000 Genomes, researchers will have
access to an enormous amount of raw data. Similarly, the Human Genome Project-based
Encyclopedia of DNA Elements (ENCODE) project aimed to determine all functional elements
in the human genome using bioinformatics approaches. Here, we list some of the widely
used bioinformatics-based tools for big data analytics on omics data.

1. SparkSeq is an efficient and cloud-ready platform based on the Apache Spark framework
and Hadoop library that is used for interactive analysis of genomic data with
nucleotide precision.

2. SAMQA identifies errors and ensures the quality of large-scale genomic data. This
tool was originally built for the National Institutes of Health Cancer Genome Atlas
project to identify and report errors, including sequence alignment/map (SAM) format
errors and empty reads.

3. ART can simulate profiles of read errors and read lengths for data obtained using
high throughput sequencing platforms including SOLiD and Illumina platforms.

4. DistMap is another toolkit used for distributed short-read mapping on a Hadoop
cluster that aims to cover a wider range of sequencing applications. For instance, one
of its supported mappers, BWA, can map 500 million read pairs in about 6 h,
approximately 13 times faster than a conventional single-node mapper.

5. SeqWare is a query engine based on the Apache HBase database system that enables
access to large-scale whole-genome datasets by integrating genome browsers and
tools.

6. CloudBurst is a parallel computing model utilized in genome mapping experiments
to improve the scalability of reading large sequencing data.

7. Hydra uses the Hadoop-distributed computing framework for processing large pep-
tide and spectra databases for proteomics datasets. This specific tool is capable of
performing 27 billion peptide scorings in less than 60 min on a Hadoop cluster.

8. BlueSNP is an R package based on the Hadoop platform used for genome-wide association
studies (GWAS) analysis, primarily aimed at the statistical readouts needed to obtain
significant associations between genotype and phenotype datasets. This tool is
estimated to analyze 1000 phenotypes on 10^6 SNPs in 10^4 individuals in about half
an hour.

9. Myrna, a cloud-based pipeline, provides information on differences in gene expression
levels, and includes read alignment, data normalization, and statistical modeling.
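
The performance figures quoted for DistMap and Hydra in the list above can be turned into rough throughput numbers with simple arithmetic. The sketch below only rearranges the quoted figures; the variable names are illustrative, and because the Hydra run is stated as "less than 60 min", its rate is a lower bound.

```python
# Figures as quoted in the list above.
distmap_read_pairs = 500_000_000      # read pairs mapped by BWA under DistMap
distmap_hours = 6                     # "about 6 h"
distmap_speedup = 13                  # "approximately 13 times faster"

hydra_scorings = 27_000_000_000       # peptide scorings
hydra_minutes = 60                    # "less than 60 min", so the rate is a lower bound

# Read pairs mapped per second on the cluster.
distmap_rate = distmap_read_pairs / (distmap_hours * 3600)

# Implied time for a single-node mapper to do the same work.
single_node_hours = distmap_hours * distmap_speedup

# Minimum peptide scorings per second implied by the Hydra figure.
hydra_rate = hydra_scorings / (hydra_minutes * 60)

print(f"DistMap: about {distmap_rate:,.0f} read pairs/s")
print(f"Single node: about {single_node_hours} h")
print(f"Hydra: at least {hydra_rate:,.0f} scorings/s")
```

In other words, a job that takes about a quarter of a day on the cluster would take more than three days on a single node.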

The past few years have witnessed a tremendous increase in disease-specific datasets
from omics platforms. For example, the ArrayExpress Archive of Functional Genomics data
repository contains information from approximately 30,000 experiments and more than one
million functional assays. The growing amount of data demands better and more efficient
bioinformatics-driven packages to analyze and interpret the information obtained. This
has also led to the birth of specific tools to analyze such massive amounts of data.
Below, we mention some of the most popular commercial platforms for big data analytics.

Commercial platforms for healthcare data analytics

In order to tackle big data challenges and perform smoother analytics, various companies
have implemented AI to analyze published results, textual data, and image data to obtain
meaningful outcomes. IBM Corporation is one of the biggest and most experienced players
providing healthcare analytics services commercially. IBM’s Watson Health is an AI
platform to share and analyze health data among hospitals, providers and researchers.
Similarly, Flatiron Health provides technology-oriented services in healthcare analytics
with a special focus on cancer research. Other big companies such as Oracle Corporation
and Google Inc. are also focusing on developing cloud-based storage and distributed
computing platforms. Interestingly, in recent years, several companies and start-ups
have also emerged to provide healthcare-based analytics and solutions. Some of the
vendors in the healthcare sector are listed in Table 2. Below we discuss a few of these
commercial solutions.

AYASDI

Ayasdi is one such big vendor. It focuses on ML-based methodologies and primarily
provides a machine intelligence platform along with an application framework with tried
and tested enterprise scalability. It provides various applications for healthcare
analytics, for example, to understand and manage clinical variation and to transform
clinical care costs. It is also capable of analyzing and managing how hospitals are
organized, conversations between doctors, risk-oriented treatment decisions by doctors,
and the care they deliver to patients. It also provides an application for the
assessment and management of population health, a proactive strategy that goes beyond
traditional risk analysis methodologies. It uses ML intelligence for predicting future
risk trajectories, identifying risk drivers, and providing solutions for best outcomes.
A strategic illustration of the company’s methodology for analytics is provided in
Fig. 4.

Linguamatics

Linguamatics offers an NLP-based platform that relies on an interactive text mining
algorithm (I2E). I2E can extract and analyze a wide array of information. Results are
obtained with this technique tenfold faster than with other tools, and it does not
require expert knowledge for data interpretation. This approach can provide information
on genetic relationships and facts from unstructured data. Classical ML requires
well-curated data as input to generate clean and filtered results. However, NLP, when
integrated into EHRs or clinical records, facilitates the extraction of clean and
structured information that often remains hidden in unstructured input data (Fig. 5).

IBM Watson

This is one of the unique ideas of the tech giant IBM that targets big data analytics in
almost every professional sector. This platform utilizes ML- and AI-based algorithms
Table 2 List of some of the big companies which provide services on big data analysis
in healthcare sector

Company | Description | Web link
IBM Watson Health | Provides services on sharing clinical and health related data among hospital, researchers, and provider for advance researches | https://www.ibm.com/watson/health/index-1.html
MedeAnalytics | Provides performance management solutions, health systems and plans, and health analytics along with long track record facility of patient data | https://medeanalytics.com/
Health Fidelity | Provides management solution for risks assessment in workflows of healthcare organization and methods for optimization and adjustment | https://healthfidelity.com/
Roam Analytics | Provides platforms for digging into big unstructured healthcare data for getting meaningful information | https://roamanalytics.com/
Flatiron Health | Provides applications for organizing and improving oncology data for better cancer treatment | https://flatiron.com/
Enlitic | Provides deep learning using large-scale data sets from clinical tests for healthcare diagnosis | https://www.enlitic.com/
Digital Reasoning Systems | Provides cognitive computing services and data analytic solutions for processing and organizing unstructured data into meaningful data | https://digitalreasoning.com/
Ayasdi | Provides AI accommodated platform for clinical variations, population health, risk management and other healthcare analytics | https://www.ayasdi.com/
Linguamatics | Provides text mining platform for digging important information from unstructured healthcare data | https://www.linguamatics.com/
Apixio | Provides cognitive computing platform for analyzing clinical data and pdf health records to generate deep information | https://www.apixio.com/
Roam Analytics | Provides natural language processing infrastructure for modern healthcare systems | https://roamanalytics.com/
Lumiata | Provides services for analytics and risk management for efficient outcomes in healthcare | https://www.lumiata.com
OptumHealth | Provides healthcare analytics, improve modern health system’s infrastructure and comprehensive and innovative solutions for the healthcare industry | https://www.optum.com/

Page 17 of 25Dash et al. J Big Data (2019) 6:54

Fig. 4 Illustration of application of “Intelligent Application Suite” provided by AYASDI for various analyses
such as clinical variation, population health, and risk management in healthcare sector

Fig. 5 Schematic representation for the working principle of NLP-based AI system used in massive data
retention and analysis in Linguamatics

Fig. 6 IBM Watson in healthcare data analytics. Schematic representation of the various functional modules
in IBM Watson’s big-data healthcare package. For instance, the drug discovery domain involves network
of highly coordinated data acquisition and analysis within the spectrum of curating database to building
meaningful pathways towards elucidating novel druggable targets

Page 18 of 25Dash et al. J Big Data (2019) 6:54

extensively to extract the maximum information from minimal input. IBM Wat-
son enforces the regimen of integrating a wide array of healthcare domains to provide
meaningful and structured data (Fig. 6). In an attempt to uncover novel drug targets
specifically in cancer disease model, IBM Watson and Pfizer have formed a produc-
tive collaboration to accelerate the discovery of novel immune-oncology combinations.
Combining Watson’s deep learning modules integrated with AI technologies allows the
researchers to interpret complex genomic data sets. IBM Watson has been used to pre-
dict specific types of cancer based on the gene expression profiles obtained from various
large data sets providing signs of multiple druggable targets. IBM Watson is also used in
drug discovery programs by integrating curated literature and forming network maps to
provide a detailed overview of the molecular landscape in a specific disease model.

In order to analyze diversified medical data, the healthcare domain describes analytics
in four categories: descriptive, diagnostic, predictive, and prescriptive analytics.
Descriptive analytics describes and comments on current medical situations, whereas
diagnostic analytics explains the reasons and factors behind the occurrence of certain
events, for example, choosing a treatment option for a patient based on clustering and
decision trees. Predictive analytics focuses on predicting future outcomes by determining
trends and probabilities. These methods are mainly built on machine learning techniques
and are helpful for understanding the complications that a patient can develop.
Prescriptive analytics performs analysis to propose an action towards optimal decision
making, for example, the decision to avoid a given treatment for a patient based on
observed side effects and predicted complications. Integrating big data into healthcare
analytics can be a major factor in improving the performance of current medical systems;
however, sophisticated strategies need to be developed. An architecture of best practices
for the different kinds of analytics in the healthcare domain is required for integrating
big data technologies to improve outcomes. However, there are many challenges associated
with the implementation of such strategies.
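
The four categories can be contrasted on a toy example. The following is a minimal sketch
only: the ward names, readmission counts, capacity threshold, and the choice of a
least-squares trend line are all assumptions made for illustration and are not taken from
any system discussed in this review.

```python
from statistics import mean

# Hypothetical monthly readmission counts for two wards (illustrative data only).
readmissions = {
    "ward_a": [12, 14, 15, 17, 18, 20],
    "ward_b": [9, 9, 8, 10, 9, 9],
}


def describe(series):
    """Descriptive: summarize what has already happened."""
    return {"mean": mean(series), "min": min(series), "max": max(series)}


def diagnose(data):
    """Diagnostic (very crude): identify which group drives the overall level."""
    return max(data, key=lambda ward: mean(data[ward]))


def predict_next(series):
    """Predictive: extrapolate a least-squares linear trend one step ahead."""
    n = len(series)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(series)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar + slope * (n - x_bar)


def prescribe(forecast, capacity):
    """Prescriptive: turn the forecast into a recommended action (assumed rule)."""
    return "review discharge protocol" if forecast > capacity else "no change"


summary = describe(readmissions["ward_b"])
driver = diagnose(readmissions)
forecast = predict_next(readmissions[driver])
action = prescribe(forecast, capacity=20)
```

Real diagnostic and predictive analytics would rely on far richer models (for example, the
clustering and decision trees mentioned above); the sketch only separates the four kinds
of question being asked.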

Challenges associated with healthcare big data

Methods for big data management and analysis are being continuously developed, especially
for real-time data streaming, capture, aggregation, analytics (using ML and predictive
methods), and visualization solutions that can help integrate EMRs better into healthcare.
For example, the adoption of federally tested and certified EHR programs in the healthcare
sector in the U.S.A. is nearly complete [7]. However, the availability of hundreds of EHR
products certified by the government, each with different clinical terminologies, technical
specifications, and functional capabilities, has led to difficulties in the interoperability
and sharing of data. Nonetheless, we can safely say that the healthcare industry has
entered a ‘post-EMR’ deployment phase. Now, the main objective is to gain actionable
insights from the vast amounts of data collected as EMRs. Here, we discuss some of these
challenges in brief.

Storage

Storing large volumes of data is one of the primary challenges, but many organizations
are comfortable with data storage on their own premises. It has several advantages, such
as control over security, access, and up-time. However, an on-site server network can be
expensive to scale and difficult to maintain. It appears that, with decreasing costs and
increasing reliability, cloud-based storage is a better option, one which most healthcare
organizations have opted for. Organizations must choose cloud partners that understand
the importance of healthcare-specific compliance and security issues. Additionally, cloud
storage offers lower up-front costs, nimble disaster recovery, and easier expansion.
Organizations can also take a hybrid approach to their data storage programs, which may be
the most flexible and workable approach for providers with varying data access and storage
needs.

Cleaning

The data needs to be cleansed or scrubbed after acquisition to ensure its accuracy,
correctness, consistency, relevancy, and purity. This cleaning process can be manual or
automated using logic rules to ensure high levels of accuracy and integrity. More
sophisticated and precise tools use machine-learning techniques to reduce time and expenses
and to stop foul data from derailing big data projects.
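
Automated cleaning with logic rules can be sketched as follows. The record layout, the
field names, the accepted sex codes, and the plausibility range for heart rate are all
hypothetical choices made for this illustration; a production pipeline would derive such
rules from clinical and data-governance requirements.

```python
from datetime import date

# Hypothetical raw records; field names and values are illustrative only.
RAW_RECORDS = [
    {"patient_id": "P001", "sex": "f", "birth_date": "1980-02-14", "heart_rate": "72"},
    {"patient_id": "P002", "sex": "Male", "birth_date": "2090-01-01", "heart_rate": "80"},
    {"patient_id": "P003", "sex": "M", "birth_date": "1975-07-30", "heart_rate": "-5"},
    {"patient_id": "P001", "sex": "f", "birth_date": "1980-02-14", "heart_rate": "72"},
]

SEX_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def clean(records, today):
    """Apply simple logic rules; return (clean_rows, [(rejected_row, reason), ...])."""
    cleaned, rejected, seen = [], [], set()
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:  # rule 1: drop exact duplicates
            rejected.append((rec, "duplicate"))
            continue
        seen.add(key)
        sex = SEX_CODES.get(rec["sex"].strip().lower())
        if sex is None:  # rule 2: coded values must normalize to a known code
            rejected.append((rec, "unknown sex code"))
            continue
        born = date.fromisoformat(rec["birth_date"])
        if born > today:  # rule 3: birth dates cannot lie in the future
            rejected.append((rec, "birth date in the future"))
            continue
        rate = int(rec["heart_rate"])
        if not 20 <= rate <= 300:  # rule 4: assumed plausibility range
            rejected.append((rec, "implausible heart rate"))
            continue
        cleaned.append({**rec, "sex": sex, "birth_date": born, "heart_rate": rate})
    return cleaned, rejected


cleaned, rejected = clean(RAW_RECORDS, today=date(2019, 6, 6))
reasons = [reason for _, reason in rejected]
```

Keeping the rejected rows together with a reason, rather than silently dropping them,
makes it possible to audit the rules and to correct records manually where needed.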

Unified format

Patients produce a huge volume of data that is not easy to capture in the traditional EHR
format, as it is complex and not easily manageable. Big data is especially difficult to
handle when it reaches healthcare providers without proper organization. A need arose to
codify all clinically relevant information for claims, billing, and clinical analytics.
Therefore, medical coding systems like Current Procedural Terminology (CPT) and
International Classification of Diseases (ICD) code sets were developed to represent the
core clinical concepts. However, these code sets have their own limitations.

Accuracy

Some studies have observed that the reporting of patient data into EMRs or EHRs is not
yet entirely accurate [26–29], probably because of poor EHR utility, complex workflows,
and a poor understanding of why it is important to capture big data well. All these
factors can contribute to quality issues for big data throughout its lifecycle. EHRs are
intended to improve the quality and communication of data in clinical workflows, though
reports indicate discrepancies in these contexts. Documentation quality might improve by
using self-report questionnaires from patients for their symptoms.

Image pre‑processing

Studies have observed various physical factors that can lead to altered data quality and
misinterpretations from existing medical records [30]. Medical images often suffer from
technical barriers that involve multiple types of noise and artifacts. Improper handling of
medical images can also tamper with them, for instance leading to a delineation of
anatomical structures such as veins that does not correspond to the real case. Reducing
noise, clearing artifacts, adjusting the contrast of acquired images, and adjusting image
quality after mishandling are some of the measures that can be implemented to address this.
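
Two of these measures, noise reduction and contrast adjustment, can be sketched on a tiny
grayscale image. The 3x3 median filter and the linear contrast stretch used here are
generic textbook operations chosen for illustration; they are not the specific methods of
the cited studies, and the pixel values are made up.

```python
from statistics import median_low


def median_filter(image):
    """3x3 median filter; border pixels use only the neighbours that exist."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out[r][c] = median_low(window)
    return out


def stretch_contrast(image, low=0, high=255):
    """Linearly rescale so the darkest pixel maps to `low` and the brightest to `high`."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [[low for _ in row] for row in image]
    scale = (high - low) / (hi - lo)
    return [[round(low + (p - lo) * scale) for p in row] for row in image]


# Hypothetical 4x4 image with one "salt" (255) and one "pepper" (0) noise pixel.
noisy = [
    [100, 102, 101, 103],
    [101, 255, 100, 102],
    [102, 101, 0, 101],
    [103, 100, 102, 104],
]
denoised = median_filter(noisy)
enhanced = stretch_contrast(denoised)
```

The median filter removes the two isolated outliers because each 3x3 neighbourhood is
dominated by ordinary tissue-like values; the contrast stretch then spreads the remaining
narrow intensity range over the full 0 to 255 scale.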

Security

There have been so many security breaches, hackings, phishing attacks, and ransomware
episodes that data security is a priority for healthcare organizations. After noticing an
array of vulnerabilities, a list of technical safeguards was developed for protected
health information (PHI). These rules, termed the HIPAA Security Rules, help guide
organizations with storage, transmission, authentication protocols, and controls over
access, integrity, and auditing. Common security measures, such as using up-to-date
anti-virus software, firewalls, encryption of sensitive data, and multi-factor
authentication, can save a lot of trouble.
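
As one narrow illustration of an integrity control, a keyed hash (HMAC) can be used to
detect whether a stored record has been altered. This sketch covers integrity only; it
does not encrypt anything and is not by itself sufficient for HIPAA compliance. The record
fields are hypothetical, and key management is assumed to be handled elsewhere.

```python
import hashlib
import hmac
import json
import secrets


def sign_record(record, key):
    """Return an HMAC-SHA256 tag over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record, tag, key):
    """Check the tag in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(sign_record(record, key), tag)


# Assumption: in practice the key would come from a key-management service.
key = secrets.token_bytes(32)

record = {"patient_id": "P001", "allergy": "penicillin"}  # hypothetical record
tag = sign_record(record, key)

tampered = {**record, "allergy": "none"}
intact_ok = verify_record(record, tag, key)
tampered_ok = verify_record(tampered, tag, key)
```

Confidentiality would additionally require encrypting the record with a vetted
cryptographic library, which is outside the scope of this sketch.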

Meta‑data

To have a successful data governance plan, it is essential to have complete, accurate,
and up-to-date metadata regarding all the stored data. The metadata would consist of
information such as the time of creation, the purpose of the data and the person
responsible for it, and previous usage (by whom, why, how, and when) for researchers and
data analysts. This would allow analysts to replicate previous queries and would help later
scientific studies and accurate benchmarking. This increases the usefulness of data and
prevents the creation of “data dumpsters” of low or no use.
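
The metadata items listed above can be captured in a small record type. The class and
field names below are hypothetical and simply mirror the items named in the text (time of
creation, purpose, responsible person, and a usage log of who, why, how, and when).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class UsageEvent:
    who: str        # person or system that used the data
    why: str        # stated purpose of the access
    how: str        # e.g. the query text, so that it can be replicated later
    when: datetime


@dataclass
class DatasetMetadata:
    name: str
    created_at: datetime
    purpose: str
    responsible_person: str
    usage_log: list = field(default_factory=list)

    def record_usage(self, who, why, how, when=None):
        """Append one usage event; defaults to the current UTC time."""
        event = UsageEvent(who, why, how, when or datetime.now(timezone.utc))
        self.usage_log.append(event)
        return event

    def previous_queries(self):
        """Queries that earlier analysts ran, in order, for replication."""
        return [event.how for event in self.usage_log]


# Illustrative values only.
meta = DatasetMetadata(
    name="lab_results_2019",
    created_at=datetime(2019, 1, 17, tzinfo=timezone.utc),
    purpose="glucose trend study",
    responsible_person="data steward A",
)
meta.record_usage(
    who="analyst B",
    why="benchmarking",
    how="SELECT AVG(glucose) FROM lab_results",
    when=datetime(2019, 6, 6, tzinfo=timezone.utc),
)
```

Storing the query text in the usage log is what allows a later analyst to replicate an
earlier analysis instead of guessing how a reported figure was obtained.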

Querying

Metadata would make it easier for organizations to query their data and get answers.
However, in the absence of proper interoperability between datasets, query tools may not
be able to access an entire repository of data. Also, the different components of a dataset
should be well interconnected or linked and easily accessible; otherwise, a complete
portrait of an individual patient’s health may not be generated. Medical coding systems like
ICD-10, SNOMED-CT, or LOINC must be implemented to reduce free-form concepts into a
shared ontology. If the accuracy, completeness, and standardization of the data are not in
question, then Structured Query Language (SQL) can be used to query large datasets and
relational databases.
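
A minimal sketch of such a query, using an in-memory SQLite database, is given below. The
table layout and patient rows are hypothetical; the ICD-10 codes E11.9 and I10 are used
only as sample coded values to show how a shared code set lets linked tables be queried
consistently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patients (patient_id TEXT PRIMARY KEY, birth_year INTEGER);
    CREATE TABLE diagnoses (
        patient_id   TEXT REFERENCES patients(patient_id),
        icd10_code   TEXT,
        diagnosed_on TEXT
    );
    INSERT INTO patients VALUES ('P001', 1980), ('P002', 1955), ('P003', 1990);
    INSERT INTO diagnoses VALUES
        ('P001', 'E11.9', '2018-03-01'),
        ('P002', 'E11.9', '2017-11-20'),
        ('P002', 'I10',   '2016-05-05'),
        ('P003', 'I10',   '2019-01-15');
    """
)

# How many distinct patients carry each coded diagnosis?
per_code = conn.execute(
    """
    SELECT d.icd10_code, COUNT(DISTINCT d.patient_id) AS n_patients
    FROM diagnoses AS d
    JOIN patients AS p ON p.patient_id = d.patient_id
    GROUP BY d.icd10_code
    ORDER BY d.icd10_code
    """
).fetchall()

# Which patients have both diagnoses? This only works because both records
# are linked through the same patient identifier and the same code set.
both = conn.execute(
    """
    SELECT patient_id
    FROM diagnoses
    WHERE icd10_code IN ('E11.9', 'I10')
    GROUP BY patient_id
    HAVING COUNT(DISTINCT icd10_code) = 2
    """
).fetchall()
```

If one source had stored the same conditions as free text, neither query would return a
complete answer, which is the interoperability problem described above.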

Visualization

A clean and engaging visualization of data with charts, heat maps, and histograms to
illustrate contrasting figures, together with correct labeling of information to reduce
potential confusion, can make it much easier for us to absorb information and use it
appropriately. Other examples include bar charts, pie charts, and scatterplots, each with
its own specific way of conveying the data.

Data sharing

Patients may or may not receive their care at multiple locations. In the former case,
sharing data with other healthcare organizations would be essential. During such sharing,
if the data is not interoperable, then data movement between disparate organizations could
be severely curtailed. This could be due to technical and organizational barriers. This
may leave clinicians without key information for making decisions regarding follow-ups and
treatment strategies for patients. Solutions like Fast Healthcare Interoperability
Resources (FHIR) and public APIs, CommonWell (a not-for-profit trade association), and
Carequality (a consensus-built, common interoperability framework) are making data
interoperability and sharing easy and secure. The biggest roadblock for data sharing is the
treatment of data as a commodity that can provide a competitive advantage. Therefore, both
providers and vendors sometimes intentionally interfere with the flow of information
between different EHR systems [31].
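
To illustrate what an interoperable exchange format looks like, the sketch below converts
a hypothetical local patient record into a dictionary shaped like a FHIR Patient resource
and serializes it as JSON. Only a handful of well-known Patient elements are used, the
identifier system URI and local field names are invented for the example, and the output
has not been validated against the FHIR specification.

```python
import json

# Assumed local coding of sex; FHIR uses male | female | other | unknown.
GENDER_MAP = {"M": "male", "F": "female"}


def to_fhir_patient(local):
    """Map a local record (hypothetical field names) to a FHIR-style Patient dict."""
    return {
        "resourceType": "Patient",
        "identifier": [
            {"system": "urn:example:hospital-a:mrn", "value": local["mrn"]}
        ],
        "name": [{"family": local["last_name"], "given": [local["first_name"]]}],
        "gender": GENDER_MAP.get(local["sex"], "unknown"),
        "birthDate": local["dob"],  # expected as YYYY-MM-DD
    }


local_record = {
    "mrn": "000123",
    "last_name": "Doe",
    "first_name": "Jane",
    "sex": "F",
    "dob": "1980-02-14",
}

patient = to_fhir_patient(local_record)
wire_format = json.dumps(patient, sort_keys=True)
received = json.loads(wire_format)
```

Because both sides agree on the element names and value sets, the receiving organization
can interpret the record without knowing anything about the sender's internal schema.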

The healthcare providers will need to overcome every challenge on this list and more
to develop a big data exchange ecosystem that provides trustworthy, timely, and mean-
ingful information by connecting all members of the care continuum. Time, commit-
ment, funding, and communication would be required before these challenges are
overcome.

Big data analytics for cutting costs

To develop a healthcare system based on big data that can exchange big data and provide us
with trustworthy, timely, and meaningful information, we need to overcome every challenge
mentioned above. Overcoming these challenges would require investment in terms of time,
funding, and commitment. However, like other technological advances, the success of these
ambitious steps is expected to ease the present burdens on healthcare, especially in terms
of costs. It is believed that the implementation of big data analytics by healthcare
organizations might lead to a saving of over 25% in annual costs in the coming years.
Better diagnosis and disease predictions by big data analytics can enable cost reduction by
decreasing the hospital readmission rate. Healthcare firms do not understand the variables
responsible for readmissions well enough. It would be easier for healthcare organizations
to improve their protocols for dealing with patients and prevent readmission by determining
these relationships well. Big data analytics can also help in optimizing staffing,
forecasting operating room demands, streamlining patient care, and improving the
pharmaceutical supply chain. All of these factors will lead to an ultimate reduction in
healthcare costs for the organizations.
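
Before the drivers of readmission can be studied, the readmission rate itself has to be
computed from admission records. The sketch below uses a simplified definition (the share
of stays that are followed by another admission of the same patient within 30 days of
discharge); official quality measures apply additional exclusions, and the stay data here
are invented.

```python
from datetime import date

# Hypothetical stays: (patient_id, admitted, discharged).
STAYS = [
    ("P001", date(2019, 1, 2), date(2019, 1, 5)),
    ("P001", date(2019, 1, 20), date(2019, 1, 22)),  # 15 days after discharge
    ("P002", date(2019, 2, 1), date(2019, 2, 3)),
    ("P003", date(2019, 2, 10), date(2019, 2, 12)),
    ("P003", date(2019, 4, 1), date(2019, 4, 2)),    # 48 days after discharge
]


def readmission_rate(stays, window_days=30):
    """Fraction of stays followed by a readmission within `window_days`."""
    by_patient = {}
    for pid, admitted, discharged in sorted(stays, key=lambda s: (s[0], s[1])):
        by_patient.setdefault(pid, []).append((admitted, discharged))

    total = readmitted = 0
    for visits in by_patient.values():
        for i, (_, discharged) in enumerate(visits):
            total += 1
            if i + 1 < len(visits):
                next_admission = visits[i + 1][0]
                if (next_admission - discharged).days <= window_days:
                    readmitted += 1
    return readmitted / total


rate = readmission_rate(STAYS)
```

With a rate defined in this way, candidate variables (ward, diagnosis, discharge
protocol) can be compared between readmitted and non-readmitted stays.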

Quantum mechanics and big data analysis

Big data sets can be staggering in size. Therefore, their analysis remains daunting even
with the most powerful modern computers. For most analyses, the bottleneck lies in the
computer’s ability to access its memory and not in the processor [32, 33]. The capacity,
bandwidth, or latency requirements of the memory hierarchy outweigh the computational
requirements so much that supercomputers are increasingly used for big data analysis
[34, 35]. An additional solution is the application of a quantum approach to big data
analysis.

Quantum computing and its advantages

Conventional digital computing uses binary digits to encode data, whereas quantum
computation uses quantum bits or qubits [36]. A qubit is a quantum version of the classical
binary bit that can represent a zero, a one, or any linear combination (called a
superposition) of those two states [37]. Therefore, unlike a classical bit, which is always
in one of two states, a qubit can also occupy superpositions of them. For certain problems,
this could allow quantum computers to work far faster than regular computers. For example,
a conventional analysis of a dataset with n points would require 2^n processing units,
whereas it would require just n quantum bits using a quantum computer. Quantum computers
use quantum mechanical phenomena like superposition and quantum entanglement to perform
computations [38, 39].
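
The relationship between n qubits and 2^n classical values can be made concrete with a
small state-vector simulation. This is a classical simulation written for illustration
only (it is not a quantum program, and the helper names are ours): it shows that a single
qubit in superposition is described by two complex amplitudes and that a register of n
qubits needs 2^n amplitudes to describe classically.

```python
import math

ZERO = [1 + 0j, 0j]  # amplitudes of the basis state |0>

# Hadamard gate: maps |0> to an equal superposition of |0> and |1>.
H = [
    [1 / math.sqrt(2), 1 / math.sqrt(2)],
    [1 / math.sqrt(2), -1 / math.sqrt(2)],
]


def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix with a single-qubit state vector."""
    return [sum(gate[r][c] * state[c] for c in range(2)) for r in range(2)]


def kron(a, b):
    """Tensor (Kronecker) product of two state vectors."""
    return [x * y for x in a for y in b]


def probabilities(state):
    """Measurement probabilities are the squared magnitudes of the amplitudes."""
    return [abs(amplitude) ** 2 for amplitude in state]


plus = apply_gate(H, ZERO)  # single qubit in equal superposition


def uniform_register(n):
    """State of n qubits, each in equal superposition: 2**n amplitudes."""
    state = [1 + 0j]
    for _ in range(n):
        state = kron(state, plus)
    return state


register = uniform_register(3)
```

For three qubits the classical description already holds eight amplitudes, and each
additional qubit doubles that number, which is why simulating large registers on
conventional hardware quickly becomes infeasible.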

Quantum algorithms can speed up big data analysis exponentially [40]. Some complex
problems, believed to be unsolvable using conventional computing, can be solved by quantum
approaches. For example, current encryption techniques such as RSA, public-key (PK) and
Data Encryption Standard (DES), which are thought to be impassable now, would be irrelevant
in the future because quantum computers are expected to get through them quickly [41].
Quantum approaches can dramatically reduce the information required for big data analysis.
For example, quantum theory can maximize the distinguishability between a multilayer
network using a minimum number of layers [42]. In addition, quantum approaches require a
relatively small dataset to obtain a maximally sensitive data analysis compared to
conventional (machine-learning) techniques. Therefore, quantum approaches can drastically
reduce the amount of computational power required to analyze big data. Even though quantum
computing is still in its infancy and presents many open challenges, it is being
implemented for healthcare data.

Applications in big data analysis

Quantum computing is picking up and seems to be a potential solution for big data analysis.
For example, the identification of rare events, such as the production of Higgs bosons at
the Large Hadron Collider (LHC), can now be performed using quantum approaches [43]. At the
LHC, huge amounts of collision data (1 PB/s) are generated that need to be filtered and
analyzed. One such approach, quantum annealing for ML (QAML), which implements a
combination of ML and quantum computing with a programmable quantum annealer, helps reduce
human intervention and increase the accuracy of assessing particle-collision data. In
another example, a quantum support vector machine was implemented for both the training and
classification stages to classify new data [44]. Such quantum approaches could find
applications in many areas of science [43]. Indeed, a recurrent quantum neural network
(RQNN) was implemented to increase signal separability in electroencephalogram (EEG)
signals [45]. Similarly, quantum annealing was applied to intensity-modulated radiotherapy
(IMRT) beamlet intensity optimization [46]. There are further applications of quantum
approaches in healthcare, e.g. quantum sensors and quantum microscopes [47].

Conclusions and future prospects
Nowadays, various biomedical and healthcare tools such as genomics, mobile biometric
sensors, and smartphone apps generate a large amount of data. Therefore, it is essential
for us to know about and assess what can be achieved using this data. For example, the
analysis of such data can provide further insights in terms of procedural, technical,
medical, and other types of improvements in healthcare. After a review of these healthcare
procedures, it appears that the realization of the full potential of patient-specific
medical specialty or personalized medicine is under way. The collective big data analysis
of EHRs, EMRs, and other medical data is continuously helping build a better prognostic
framework. The companies providing services for healthcare analytics and clinical
transformation are indeed contributing towards better and more effective outcomes. Common
goals of these companies include reducing the cost of analytics, developing effective
Clinical Decision Support (CDS) systems, providing platforms for better treatment
strategies, and identifying and preventing fraud associated with big data. However, almost
all of them face challenges on federal issues such as how private data is handled, shared,
and kept safe. The combined pool of data from healthcare organizations and biomedical
researchers has resulted in a better outlook, determination, and treatment of various
diseases. This has also helped in building a better and healthier personalized healthcare
framework. The modern healthcare fraternity has realized the potential of big data and has
therefore implemented big data analytics in healthcare and clinical practices. Computing
platforms ranging from supercomputers to quantum computers are helping to extract
meaningful information from big data in dramatically reduced time periods. With high hopes
of extracting new and actionable knowledge that can improve the present status of
healthcare services, researchers are plunging into biomedical big data despite the
infrastructure challenges. Clinical trials, the combined analysis of pharmacy and insurance
claims, and the discovery of biomarkers are part of a novel and creative way to analyze
healthcare big data.

Big data analytics bridges the gap between structured and unstructured data sources.
The shift to an integrated data environment is a well-known hurdle to overcome.
Interestingly, the principle of big data relies heavily on the idea that the more
information there is, the more insights one can gain from it and the better one can make
predictions of future events. It is rightfully projected by various reliable consulting
firms and healthcare companies that the big data healthcare market is poised to grow at an
exponential rate. Indeed, in a short span we have witnessed a spectrum of analytics
currently in use that have shown significant impacts on the decision making and performance
of the healthcare industry. The exponential growth of medical data from various domains has
forced computational experts to design innovative strategies to analyze and interpret such
enormous amounts of data within a given timeframe. The integration of computational systems
for signal processing by both research and practicing medical professionals has witnessed
growth. Thus, developing a detailed model of the human body by combining physiological data
and “-omics” techniques can be the next big target. This unique idea can enhance our
knowledge of disease conditions and possibly help in the development of novel diagnostic
tools. The continuous rise in available genomic data, including the inherent hidden errors
from experimental and analytical practices, needs further attention. However, there are
opportunities in each step of this extensive process to introduce systemic improvements
within healthcare research.

The high volume of medical data collected across heterogeneous platforms has posed a
challenge to data scientists for careful integration and implementation. It is therefore
suggested that a further revolution in healthcare is needed to bring together
bioinformatics, health informatics, and analytics to promote personalized and more
effective treatments. Furthermore, new strategies and technologies should be developed to
understand the nature (structured, semi-structured, unstructured), complexity (dimensions
and attributes), and volume of the data to derive meaningful information. The greatest
asset of big data lies in its limitless possibilities. The birth and integration of big
data within the past few years has brought substantial advancements in the healthcare
sector, ranging from medical data management to drug discovery programs for complex human
diseases including cancer and neurodegenerative disorders. To quote a simple example
supporting the stated idea, since the late 2000s the healthcare market has witnessed
advancements in the EHR system in the context of data collection, management, and
usability. We believe that big data will add to and bolster the existing pipeline of
healthcare advances instead of replacing skilled manpower, subject knowledge experts, and
intellectuals, a notion argued by many. One can clearly see the transition of the
healthcare market from a wider volume base to a personalized or individual-specific domain.
Therefore, it is essential for technologists and professionals to understand this evolving
situation. In the coming years, it can be projected that big data analytics will march
towards a predictive system. This would mean prediction of future outcomes in an
individual’s health state based on current or existing data (such as EHR-based and
Omics-based). Similarly, it can also be presumed that structured information obtained from
a certain geography might lead to the generation of population health information. Taken
together, big data will facilitate healthcare by introducing prediction of epidemics (in
relation to population health), providing early warnings of disease conditions, and helping
in the discovery of novel biomarkers and intelligent therapeutic intervention strategies
for an improved quality of life.
Acknowledgements
Not applicable.

Authors’ contributions
MS wrote the manuscript. SD and SKS further added significant discussion that highly improved the quality of the
manuscript. SK designed the content sequence, guided SD, SS and MS in writing and revising the manuscript and checked the
manuscript. All authors read and approved the final manuscript.

Funding
None.

Availability of data and materials
Not applicable.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Author details
1 Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York 10065, NY, USA. 2 Center
of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal. 3 SilicoLife Lda, Rua do
Canastreiro 15, 4715-387 Braga, Portugal. 4 Postgraduate School for Molecular Medicine, Warszawskiego Uniwersytetu
Medycznego, Warsaw, Poland. 5 Małopolska Centre for Biotechnology, Jagiellonian University, Kraków, Poland. 6 3B’s
Research Group, Headquarters of the European Institute of Excellence on Tissue Engineering and Regenerative Medicine,
AvePark – Parque de Ciência e Tecnologia, Zona Industrial da Gandra, Barco, 4805-017 Guimarães, Portugal.

Received: 17 January 2019 Accepted: 6 June 2019

References
1. Laney D. 3D data management: controlling data volume, velocity, and variety, Application delivery strategies. Stamford: META Group Inc; 2001.
2. Mauro AD, Greco M, Grimaldi M. A formal definition of big data based on its essential features. Libr Rev. 2016;65(3):122–35.
3. Gubbi J, et al. Internet of Things (IoT): a vision, architectural elements, and future directions. Future Gener Comput Syst. 2013;29(7):1645–60.
4. Doyle-Lindrud S. The evolution of the electronic health record. Clin J Oncol Nurs. 2015;19(2):153–4.
5. Gillum RF. From papyrus to the electronic tablet: a brief history of the clinical medical record with lessons for the digital age. Am J Med. 2013;126(10):853–7.
6. Reiser SJ. The clinical record in medicine part 1: learning from cases*. Ann Intern Med. 1991;114(10):902–7.
7. Reisman M. EHRs: the challenge of making electronic data usable and interoperable. Pharm Ther. 2017;42(9):572–5.
8. Murphy G, Hanken MA, Waters K. Electronic health records: changing the vision. Philadelphia: Saunders W B Co; 1999. p. 627.
9. Shameer K, et al. Translational bioinformatics in the era of real-time biomedical, health care and wellness data streams. Brief Bioinform. 2017;18(1):105–24.
10. Service RF. The race for the $1000 genome. Science. 2006;311(5767):1544–6.
11. Stephens ZD, et al. Big data: astronomical or genomical? PLoS Biol. 2015;13(7):e1002195.
12. Yin Y, et al. The internet of things in healthcare: an overview. J Ind Inf Integr. 2016;1:3–13.
13. Moore SK. Unhooking medicine [wireless networking]. IEEE Spectr. 2001;38(1):107–8, 110.
14. Nasi G, Cucciniello M, Guerrazzi C. The role of mobile technologies in health care processes: the case of cancer supportive care. J Med Internet Res. 2015;17(2):e26.
15. Apple, ResearchKit/ResearchKit: ResearchKit 1.5.3. 2017.
16. Shvachko K, et al. The hadoop distributed file system. In: Proceedings of the 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). New York: IEEE Computer Society; 2010. p. 1–10.
17. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13.
18. Zaharia M, et al. Apache Spark: a unified engine for big data processing. Commun ACM. 2016;59(11):56–65.
19. Gopalani S, Arora R. Comparing Apache Spark and Map Reduce with performance analysis using K-means; 2015.
20. Ahmed H, et al. Performance comparison of spark clusters configured conventionally and a cloud service. Procedia Comput Sci. 2016;82:99–106.
21. Saouabi M, Ezzati A. A comparative between hadoop mapreduce and apache Spark on HDFS. In: Proceedings of the 1st international conference on internet of things and machine learning. Liverpool: ACM; 2017. p. 1–4.
22. Strickland NH. PACS (picture archiving and communication systems): filmless radiology. Arch Dis Child. 2000;83(1):82–6.
23. Schroeder W, Martin K, Lorensen B. The visualization toolkit. 4th ed. Clifton Park: Kitware; 2006.
24. Friston K, et al. Statistical parametric mapping. London: Academic Press; 2007. p. vii.
25. Li L, et al. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Sci Transl Med. 2015;7(311):311ra174.
26. Valikodath NG, et al. Agreement of ocular symptom reporting between patient-reported outcomes and medical records. JAMA Ophthalmol. 2017;135(3):225–31.
27. Fromme EK, et al. How accurate is clinician reporting of chemotherapy adverse effects? A comparison with patient-reported symptoms from the Quality-of-Life Questionnaire C30. J Clin Oncol. 2004;22(17):3485–90.
28. Beckles GL, et al. Agreement between self-reports and medical records was only fair in a cross-sectional study of performance of annual eye examinations among adults with diabetes in managed care. Med Care. 2007;45(9):876–83.
29. Echaiz JF, et al. Low correlation between self-report and medical record documentation of urinary tract infection symptoms. Am J Infect Control. 2015;43(9):983–6.
30. Belle A, et al. Big data analytics in healthcare. Biomed Res Int. 2015;2015:370194.
31. Adler-Milstein J, Pfeifer E. Information blocking: is it occurring and what policy strategies can address it? Milbank Q. 2017;95(1):117–35.
32. Or-Bach Z. A 1,000x improvement in computer systems by bridging the processor-memory gap. In: 2017 IEEE SOI-3D-subthreshold microelectronics technology unified conference (S3S). 2017.
33. Mahapatra NR, Venkatrao B. The processor-memory bottleneck: problems and solutions. XRDS. 1999;5(3es):2.
34. Voronin AA, Panchenko VY, Zheltikov AM. Supercomputations and big-data analysis in strong-field ultrafast optical physics: filamentation of high-peak-power ultrashort laser pulses. Laser Phys Lett. 2016;13(6):065403.
35. Dollas A. Big data processing with FPGA supercomputers: opportunities and challenges. In: 2014 IEEE computer society annual symposium on VLSI; 2014.
36. Saffman M. Quantum computing with atomic qubits and Rydberg interactions: progress and challenges. J Phys B: At Mol Opt Phys. 2016;49(20):202001.
37. Nielsen MA, Chuang IL. Quantum computation and quantum information. 10th anniversary ed. Cambridge: Cambridge University Press; 2011. p. 708.
38. Raychev N. Quantum computing models for algebraic applications. Int J Scientific Eng Res. 2015;6(8):1281–8.
39. Harrow A. Why now is the right time to study quantum computing. XRDS. 2012;18(3):32–7.
40. Lloyd S, Garnerone S, Zanardi P. Quantum algorithms for topological and geometric analysis of data. Nat Commun. 2016;7:10138.
41. Buchanan W, Woodward A. Will quantum computers be the end of public key encryption? J Cyber Secur Technol. 2017;1(1):1–22.
42. De Domenico M, et al. Structural reducibility of multilayer networks. Nat Commun. 2015;6:6864.
43. Mott A, et al. Solving a Higgs optimization problem with quantum annealing for machine learning. Nature. 2017;550:375.
44. Rebentrost P, Mohseni M, Lloyd S. Quantum support vector machine for big data classification. Phys Rev Lett. 2014;113(13):130503.
45. Gandhi V, et al. Quantum neural network-based EEG filtering for a brain-computer interface. IEEE Trans Neural Netw Learn Syst. 2014;25(2):278–88.
46. Nazareth DP, Spaans JD. First application of quantum annealing to IMRT beamlet intensity optimization. Phys Med Biol. 2015;60(10):4137–48.
47. Reardon S. Quantum microscope offers MRI for molecules. Nature. 2017;543(7644):162.

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

