Managing Big Data Analytics Workflows with a Database System

Carlos Ordonez (University of Houston, USA)
Javier García-García (UNAM University, Mexico)

Abstract—A big data analytics workflow is long and complex, with many programs, tools and scripts interacting together. In modern organizations a significant amount of big data analytics processing is performed outside a database system, which creates many issues in managing and processing big data analytics workflows. In general, data preprocessing is the most time-consuming task in a big data analytics workflow. In this work, we defend the idea of preprocessing, computing models and scoring data sets inside a database system. In addition, we discuss recommendations and experiences to improve big data analytics workflows by pushing data preprocessing (i.e., data cleaning, aggregation and column transformation) into the database system. We present a discussion of practical issues and common solutions when transforming and preparing data sets to improve big data analytics workflows. As a case study validation, based on experience from real-life big data analytics projects, we compare pros and cons between running big data analytics workflows inside and outside the database system. We highlight which tasks in a big data analytics workflow are easier to manage and faster when processed by the database system, compared to external processing.

I. INTRODUCTION

In a modern organization, transaction processing [4] and data warehousing [6] are managed by database systems. On the other hand, big data analytics projects are a different story. Despite the data mining functionality [5], [6] offered by most database systems, many statistical tasks are generally performed outside the database system [7], [6], [15]. This is due to the existence of sophisticated statistical tools and libraries, the lack of expertise of statistical analysts to write correct and efficient SQL queries, a somewhat limited set of statistical algorithms and techniques in the database system (compared to statistical packages), and an abundance of legacy code (generally well tuned and difficult to rewrite in a different language). In such a scenario, users exploit the database system just to extract data with ad-hoc SQL queries. Once large tables are exported, they are summarized and further transformed depending on the task at hand, but outside the database system. Finally, when the data set has the desired variables (features), statistical models are computed, interpreted and tuned. In general, preprocessing data for analysis is the most time-consuming task [5], [16], [17] because it requires cleaning, joining, aggregating and transforming files and tables to obtain "analytic" data sets appropriate for cube or machine learning processing. Unfortunately, in such environments manipulating data sets outside the database system creates many data management issues: data sets must be recreated and re-exported every time there is a change, models need to be imported back into the database system and then deployed, different users may have inconsistent versions of the same data set on their own computers, and security is compromised. Therefore, we defend the thesis that it is better to transform data sets and compute statistical models inside the database system. We provide evidence that such an approach improves the management and processing of big data analytics workflows. We argue database system-centric big data analytics workflows are a promising processing approach for large organizations.

The experiences reported in this work are based on analytics projects where the first author participated as a developer and consultant in a major DBMS company. Based on experience from big data analytics projects, we have become aware that statistical analysts generally cannot write correct and efficient SQL code to extract data from the database system or to transform their data sets for the big data analytics task at hand. To overcome such limitations, statistical analysts use big data analytics tools, statistical software and data mining packages to manipulate and transform their data sets. Consequently, users end up creating a complicated collection of programs mixing data transformation and statistical modeling tasks together. In a typical big data analytics workflow most of the development (programming) effort is spent on transforming the data set: this is the main aspect studied in this article. The data set represents the core element in a big data analytics workflow: all data transformation and modeling tasks must work on the entire data set. We present evidence that running workflows entirely inside a database system improves workflow management and accelerates workflow processing.

The article is organized as follows. Section II provides the specification of the most common big data analytics workflows. Section III discusses practical issues and solutions when preparing and transforming data sets inside a database system; we contrast processing workflows inside and outside the database system. Section IV compares advantages and time performance of processing big data analytics workflows inside the database system and outside it with external data mining tools. Such comparisons are based on experience from real-life data mining projects. Section V discusses related work. Section VI presents conclusions and research issues for future work.

II. BIG DATA ANALYTICS WORKFLOWS

We consider big data analytics workflows in which Task i must be finished before Task i + 1. Processing is mostly sequential from task to task, but it is feasible to have cycles from Task i back to Task 1 because data mining is an iterative process. We distinguish two complementary big data analytics workflows: (1) computing statistical models; (2) scoring data sets based on a model. In general, statistical models are computed on a new data set, whereas scoring takes place after the model is well understood and an acceptable model has been produced.

A typical big data analytics workflow to compute a model consists of the following major tasks (this basic workflow can vary depending on users' knowledge and focus):

1) Data preprocessing: building data sets for analysis (selecting records, denormalization and aggregation)
2) Exploratory analysis (descriptive statistics, OLAP)
3) Computing statistical models
4) Tuning models: adding or modifying data set attributes, feature/variable selection, training/testing models
5) Creating scoring scripts based on the selected features and the best model

This modeling workflow tends to be linear (Task i executed before Task i + 1), but in general it has cycles. This is because there are many iterations building data sets, analyzing variable behavior and computing models before an acceptable model can be obtained. Initially the bottleneck of the workflow is data preprocessing. Later, when the data mining project evolves, most workflow processing is testing and tuning the desired model. In general, the database system performs some denormalization and aggregation, but it is common to perform most transformations outside the database system. When features need to be added or modified, queries need to be recomputed, going back to Task 1. Therefore, only a part of the first task of the workflow is computed inside the database system; the remaining stages of the workflow are computed with external data mining tools.

The second workflow we consider is actually deploying a model on arriving (new) data. This task is called scoring [7], [12]. Since the best features and the best model are already known, this is generally processed inside the database system. A scoring workflow is linear and non-iterative: processing starts with Task 1, Task i is executed before Task i + 1, and processing ends with the last task. In a scoring workflow the tasks are as follows:

1) Data preprocessing: building the data set for scoring (select records, compute the best features).
2) Deploying the model on the test data set (e.g., predict the class or target value), storing the model output per record as a new table or view in the database system.
3) Evaluating queries and producing summary reports on scored data sets joined with source tables; explaining models by tracing data back to source tables in the database system.

Building the data set tends to be the most important task in both workflows since it requires gathering data from many source tables. In this work, we defend the idea of preprocessing big data inside a database system. This approach makes sense only when the raw data originates from a data warehouse. That is, we do not tackle the problem of migrating the processing of external data sets into the database system.

III. TASKS TO PREPARE DATA SETS

We discuss issues and recommendations to improve the data processing stage in a big data analytics workflow. Our main motivation is to bypass the use of external tools to perform data preprocessing, using the SQL language instead.

A. Reasons for Processing Big Data Analytics Workflows Outside the Database System

In general, the main reason is of a practical nature: users do not want to translate existing code working on external tools. Commonly such code has existed for a long time (legacy programs), it is extensive (there exist many programs), and it has been debugged and tuned. Therefore, users are reluctant to rewrite it in a different language, given the associated risks. A second complaint is that, in general, the database system provides elementary statistical functionality compared to sophisticated statistical packages. Nevertheless, this gap has been shrinking over the years. As explained before, in a data mining environment most user time is spent on preparing data sets for analytic purposes. We discuss some of the most important database issues when tables are manipulated and transformed to prepare a data set for data mining or statistical analysis.

B. Main Data Preprocessing Tasks

We consider three major tasks:

1) selecting records for analysis (filtering)

2) denormalization (including math transformation)

3) aggregation

Selecting Records for Analysis: Selection of rows can be done at several stages, on different tables. Such filtering is done to discard outliers, to discard records with significant missing information (including referential integrity problems [14]), to discard records whose potential contribution to the model provides no insight, or to isolate sets of records whose characteristics deserve separate analysis. It is well known that pushing selection is the basic strategy to accelerate SPJ (select-project-join) queries, but it is not straightforward to apply across multiple queries. A common solution we have used is to perform as much filtering as possible on one data set. This makes code maintenance easier and the query optimizer is able to exploit filtering predicates as much as possible.

In general, for a data mining task it is necessary to select a set of records from one of the largest tables in the database based on a date range. In general, this selection requires a scan on the entire table, which is slow. When there is a secondary index based on date it is more efficient to select rows, but such an index is not always available. The basic issue is that such a transaction table is much larger than the other tables in the database. For instance, this time window defines a set of active customers or bank accounts that have recent activity. Common solutions to this problem include creating materialized views (avoiding join recomputation on large tables) and lean tables with the primary keys of the object being analyzed (record, product, etc.) to act as filters in further data preparation tasks, as sketched below.
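As an illustration, the following is a minimal sketch of such a lean "filter" table, using Teradata-style CREATE TABLE ... AS syntax; the sales table, its sale_date column and the date range are assumptions, not taken from the projects described here.

-- Hypothetical lean table: primary keys of customers active in the
-- analysis time window; it acts as a filter in later joins.
CREATE TABLE active_customer AS (
  SELECT DISTINCT customer_id
    FROM sales
   WHERE sale_date BETWEEN DATE '2016-01-01' AND DATE '2016-03-31'
) WITH DATA;

Subsequent data preparation queries join this small table with the large source tables, so the expensive date-range predicate is evaluated only once.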

Denormalization: In general, it is necessary to gather data from many tables and store the data elements in one place. It is well known that on-line transaction processing (OLTP) database systems update normalized tables. Normalization makes transaction processing faster and ACID semantics [4] easier to ensure. Queries that retrieve a few records from normalized tables are relatively fast. On the other hand, analysis on the database requires precisely the opposite: a large set of records is required, and such records gather information from many tables. Such processing typically requires complex queries with joins and aggregations. Therefore, normalization works against efficiently building data sets for analytic purposes. One solution is to keep a few key denormalized tables from which specialized tables can be built. In general, such tables cannot be dynamically maintained because they involve join computation with large tables. Therefore, they are periodically recreated as a batch process. The query shown below builds a denormalized table from which several aggregations can be computed (e.g., similar to cube queries).

SELECT customer_id
      ,customer_name
      ,product.product_id
      ,product_name
      ,department.department_id
      ,department_name
  FROM sales
  JOIN product
    ON sales.product_id = product.product_id
  JOIN department
    ON product.department_id = department.department_id;

For analytic purposes it is always best to use as much data as possible. There are strong reasons for this. Statistical models are more reliable, and it is easier to deal with missing information and skewed distributions, to discover outliers, and so on, when there is a large data set at hand. In a large database, joining tables coming from a normalized schema with tables used in the past for analytic purposes may involve records whose foreign keys are not found in some table. That is, natural joins may discard potentially useful records. The net effect of this issue is that the resulting data set does not include all potential objects (e.g., records, products). The solution is to define a universe data set containing all objects, gathered with a union over all tables, and then use such a table as the fundamental table on which to perform outer joins. For simplicity and elegance, left outer joins are preferred. Then left outer joins are propagated everywhere in data preparation and completeness of records is guaranteed. In general such left outer joins have a "star" form in their joining conditions, where the primary key of the master table is left joined with the primary keys of the other tables, instead of joining them with chained conditions (the FK of table T1 joined with the PK of table T2, the FK of table T2 joined with the PK of T3, and so on). The query below computes a global data set built from individual data sets. Notice the data sets may or may not overlap each other.

SELECT T0.record_id
      ,T1.A1
      ,T2.A2
      ..
      ,Tk.Ak
  FROM T0
  LEFT JOIN T1 ON T0.record_id = T1.record_id
  LEFT JOIN T2 ON T0.record_id = T2.record_id
  ..
  LEFT JOIN Tk ON T0.record_id = Tk.record_id;

Aggregation: Unfortunately, most data mining tasks require dimensions (variables) that are not readily available in the database. Such dimensions typically require computing aggregations at several granularity levels. This is because most columns required by statistical or machine learning techniques are measures (or metrics), which translate into sums or counts computed with SQL. Unfortunately, the granularity levels are not hierarchical (unlike cubes or OLAP [4]), making separate summary tables necessary (e.g., summarization by product or by customer in a retail database). A straightforward optimization is to compute as many dimensions as possible in the same statement, exploiting the same group-by clause. In general, for a statistical analyst it is best to create as many variables (dimensions) as possible in order to isolate those that can help build a more accurate model. Summarization then tends to create tables with hundreds of columns, which makes query processing slower. However, most state-of-the-art statistical and machine learning techniques are designed to perform variable (feature) selection [7], [10], and many of those columns end up being discarded. A typical query to derive dimensions from a transaction table is as follows:

SELECT customer_id
      ,count(*) AS cntItems
      ,sum(salesAmt) AS totalSales
      ,sum(CASE WHEN salesAmt < 0 THEN 1 ELSE 0 END) AS cntReturns
  FROM sales
 GROUP BY customer_id;

Transaction tables generally have two or even more levels of detail, sharing some columns in their primary keys. The typical example is a store transaction, with a header (total count of items and total amount paid) and individual line items. This means that many times it is not possible to perform a statistical analysis from one table only; there may be unique pieces of information at each level. Therefore, such large tables need to be joined with each other and then aggregated at the appropriate granularity level, depending on the data mining task at hand. In general, such queries are optimized by indexing both tables on their common columns so that hash-joins can be used.
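As an illustration, the query below is a sketch that combines both levels per customer; the sales_header table keyed by ticket_id and the sales_line detail table are assumed names. Aggregating the line items to ticket granularity before the join avoids multiplying header amounts by the number of matching detail rows.

-- Sketch: combine header and line-item levels of a store transaction,
-- aggregated per customer (table and column names are assumptions).
SELECT h.customer_id
      ,sum(h.total_amount) AS totalPaid
      ,sum(d.cntItems)     AS totalItems
  FROM sales_header h
  JOIN (SELECT ticket_id
              ,count(*) AS cntItems
          FROM sales_line
         GROUP BY ticket_id) d
    ON h.ticket_id = d.ticket_id
 GROUP BY h.customer_id;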

Combining Denormalization and Aggregation; Multiple Primary Keys: A final aspect is the interaction between aggregation and denormalization when there are multiple primary keys. In a large database, different subsets of tables have different primary keys. In other words, such tables are not compatible with each other for further aggregation. The key issue is that at some point large tables with different primary keys must be joined and summarized. Join operations will be slow because indexing involves foreign keys with large cardinalities. Two solutions are common: creating a secondary index on the alternative primary key of the largest table, or creating a denormalized table having both primary keys in order to enable fast join processing.

For instance, consider a data mining project in a bank that requires analysis by customer id, but also by account id. One customer may have multiple accounts, and an account may have multiple account holders. Joining and manipulating such tables is challenging given their sizes. A sketch of both solutions appears below.
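The following is a minimal sketch for the bank example; the account_holder source table and all column names are assumptions.

-- Hypothetical denormalized table holding both primary keys, so either
-- key can drive fast join processing during data preparation.
CREATE TABLE customer_account (
  customer_id INTEGER NOT NULL,
  account_id  INTEGER NOT NULL,
  PRIMARY KEY (customer_id, account_id)
);

INSERT INTO customer_account
SELECT DISTINCT customer_id, account_id
  FROM account_holder;

-- The first solution: a secondary index on the alternative key speeds
-- up joins that start from accounts instead of customers.
CREATE INDEX ca_account_idx ON customer_account (account_id);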

C. Workflow Processing

Dependent SQL statements: A data transformation script is a long sequence of SELECT statements. Their dependencies are complex, although there exists a partial order defined by the order in which temporary tables and data sets for analysis are created. For debugging SQL code it is a bad idea to create a single query with multiple query blocks. On the other hand, separate SQL statements are not amenable to joint optimization by the query optimizer, unless it can keep track of historic usage patterns of queries. A common solution is to create intermediate tables that can be shared by several statements. Those intermediate tables commonly have columns that can later be aggregated at the appropriate granularity levels, as sketched below.
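A minimal sketch of one shared intermediate table feeding two later aggregations at different granularity levels; table and column names are assumptions, and the CREATE TABLE ... WITH DATA form follows Teradata-style syntax.

-- Shared intermediate table created once by the transformation script.
CREATE TABLE sales_product AS (
  SELECT customer_id
        ,product.product_id
        ,department_id
        ,salesAmt
    FROM sales
    JOIN product ON sales.product_id = product.product_id
) WITH DATA;

-- Several dependent statements reuse it at different granularities.
SELECT customer_id, sum(salesAmt) AS totalSales
  FROM sales_product GROUP BY customer_id;

SELECT department_id, sum(salesAmt) AS totalSales
  FROM sales_product GROUP BY department_id;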

Computer resource usage: This aspect includes both disk and CPU usage, the latter being the more valuable resource. The problem is compounded by the fact that most data mining tasks work on the entire data set or large subsets of it. In an active database environment, running data preparation tasks during peak usage hours can degrade performance since, generally speaking, large tables are read and large tables are created. Therefore, it is necessary to use workload management tools to optimize queries from several users together. In general, the solution is to give data preparation tasks a lower priority than queries from interactive users. As a longer-term strategy, it is best to organize data mining projects around common data sets, but such a goal is difficult to reach given the mathematical nature of the analysis and the ever-changing nature of the variables (dimensions) in the data sets.

Comparing views and temporary tables: Views provide limited control over storage and indexing. It may be better to create temporary tables, especially when many primary keys are used in summarization. Nevertheless, disk space usage grows fast, and such tables/views need to be refreshed when new records are inserted or new variables (dimensions) are created.

Scoring: Even though many models are built outside the database system with statistical packages and data mining tools, in the end the model must be applied inside the database system [6]. When data volumes are not large it is feasible to perform model deployment outside: exporting data sets, applying the model and building reports can be done in no more than a few minutes. However, as data volume increases, exporting data from the database system becomes a bottleneck. This problem is compounded at results-interpretation time, when it is necessary to relate statistical numbers back to the original tables in the database. Therefore, it is common to build models outside, frequently based on samples, and once an acceptable model is obtained, it is imported back into the database system. Nowadays, model deployment basically happens in two ways: using SQL queries if the mathematical computations are relatively simple, or with UDFs [2], [12] if the computations are more sophisticated. In most cases, such a scoring process can work in a single table scan, providing good performance.
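To make this concrete, below is a hypothetical scoring query for a logistic regression model expressed in plain SQL; the customer_dataset table, its columns and the coefficient values are illustrative assumptions, not a model from the projects reported here.

-- Score every record in one table scan: the model coefficients are
-- embedded in the expression (illustrative values).
SELECT customer_id
      ,1.0 / (1.0 + EXP(-(0.8 + 0.02 * totalSales - 0.5 * cntReturns)))
         AS churn_probability
  FROM customer_dataset;

Since the expression is evaluated per row, the query reads the data set exactly once, matching the single-scan behavior mentioned above.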

IV. COMPARING PROCESSING OF WORKFLOWS INSIDE AND OUTSIDE A DATABASE SYSTEM

In this section we present a qualitative and quantitative comparison between running big data analytics workflows inside and outside the database system. This discussion is a summary of representative successful projects. We first discuss a typical database environment. Second, we present a summary of the data mining projects, presenting their practical application and the statistical and data mining techniques used. Third, we discuss advantages and accomplishments for each workflow running inside the database system, as well as the main objections or concerns against such an approach (i.e., migrating external transformation code into the database system). Fourth, we present time measurements taken from actual projects at each organization, running big data analytics workflows completely inside and partially outside the database system.

A. Data Warehousing Environment

The environment was a data warehouse, where several databases had already been integrated into a large enterprise-wide database. The database server was surrounded by specialized servers performing OLAP and statistical analysis. One of those servers was a statistical server with a fast network connection to the database server.

First of all, an entire set of statistical-language programs was translated into SQL using the Teradata data mining program, the translator tool and customized SQL code. Second, in every case the data sets were verified to have the same contents in the statistical language and in SQL. In most cases the numeric output from the statistical and machine learning models was the same, but sometimes there were slight numeric differences, given variations in algorithmic improvements and advanced parameters (e.g., epsilon for convergence, step-wise regression procedures, the pruning method in decision trees, and so on).

B. Organizations: Statistical Models and Business Application

We now give a brief discussion of the organizations where the statistical code migration took place. We also discuss the specific type of data mining techniques used in each case. Due to privacy concerns we omit discussion of specific information about each organization, their databases and the hardware configuration of their database system servers. The big data analytics workflows were executed on the organizations' database servers, concurrently with other users (analysts, managers, DBAs, and so on).

We now describe the computers processing the workflows in more detail. The database system server was, in general, a parallel multiprocessor computer with a large number of CPUs, ample memory per CPU and several terabytes of parallel disk storage in high-performance RAID configurations. On the other hand, the statistical server was generally a smaller computer with less than 500 GB of disk space but ample memory. Statistical and data mining analysis inside the database system was performed only with SQL. In general, a workstation was connected to each server with the appropriate client utilities. The connection to the database system was made with ODBC. All time measurements discussed herein were taken on 32-bit CPUs over the course of several years. Therefore, they cannot be compared with each other and should only be used to understand performance gains within the same organization.

The first organization was an insurance company. The data mining goal involved segmenting customers into tiers according to their profitability, based on demographic data, billing information and claims. The statistical techniques used to determine segments involved histograms and clustering. The final data set had about n = 300k records and d = 25 variables. There were four segments, categorizing customers from best to worst.

The second organization was a cellular telephone service provider. The data mining task involved predicting which customers were likely to upgrade their call service package or purchase a new handset. The default technique was logistic regression [7] with a stepwise procedure for variable selection. The data set used for scoring had about n = 10M records and d = 120 variables. The predicted variable was binary.

The third organization was an Internet Service Provider (ISP). The predictive task was to detect which customers were likely to disconnect service within a time window of a few months, based on their demographic data, billing information and service usage. The statistical techniques used in this case were decision trees and logistic regression, and the predicted variable was binary. The final data set had n = 3.5M records and d = 50 variables.

C. Database System-Centric Big Data Analytics Workflows: Users' Opinion

We summarize pros and cons of running big data analytics workflows inside and outside the database system. Table I contains a summary of outcomes. As we can see, faster scoring of data sets and faster transformation of data sets are positive outcomes in every case. Building the models faster turned out not to be as important because users relied on sampling to build models, and several samples were collected to tune and test models. Since all databases and servers were behind the same firewall, security was not a major concern. In general, improving data management was not seen as a major concern because there existed a data warehouse, but users acknowledged a "distributed" analytic environment could be a potential management issue. We now summarize the main objections, despite the advantages discussed above. We exclude cost as a decisive factor to preserve the anonymity of users' opinions and give an unbiased discussion. First, many users preferred a traditional programming language like Java or C++ over a set-oriented language like SQL. Second, some specialized techniques are not available in the database system due to their mathematical complexity; relevant examples include Support Vector Machines, non-linear regression and time series models. Finally, sampling is a standard mechanism to analyze large data sets.

D. Workflow Processing Time Comparison

We compare workflow processing time inside the database system, using SQL and UDFs, and outside the database system, using an external server to transform exported data sets and compute models. In general, the external server ran existing data transformation programs developed by each organization. In addition, each organization had diverse data mining tools that analyzed flat files. We must mention the comparison is not entirely fair because the database system server was in general a powerful parallel computer and the external server was a smaller computer. Nevertheless, such a setting represents a common IT environment where the fastest computer is precisely the database system server.

We now discuss the database tables in more detail. There were several input tables coming from a large normalized database that were transformed and denormalized to build the data sets used by statistical or machine learning techniques. In short, the inputs were tables and the outputs were tables as well. No data sets were exported in this case: all processing happened inside the database system. On the other hand, analysis on the external server relied on SQL queries to extract data from the database system, transform the data to produce data sets on the statistical server, and then build models or score data sets based on a model. In general, data extraction from the database system was performed with bulk utilities, which exported data records in blocks. Clearly, there was an export bottleneck from the database system to the external server.

Table II compares performance between both workflow processing alternatives: inside and outside the database system.

TABLE I
ADVANTAGES AND DISADVANTAGES OF RUNNING BIG DATA ANALYTICS WORKFLOWS INSIDE A DATABASE SYSTEM.

                                          Insur.  Phone  ISP
Advantages:
Improve workflow management                 X       X     X
Accelerate workflow execution               X
Prepare data sets more easily               X       X     X
Compute models faster without sampling      X
Score data sets faster                      X       X     X
Enforce database security                   X       X
Disadvantages:
Workflow output independent                 X       X
Workflow outside database system OK         X       X
Prefer programming language over SQL        X       X
Samples accelerate modeling                 X       X
Database system lacks statistical models    X
Legacy code                                 X

TABLE II
WORKFLOW PROCESSING TIME INSIDE AND OUTSIDE A DATABASE SYSTEM (TIME IN MINUTES).

Task                  Outside  Inside
Build model:
Segmentation                2       1
Predict propensity         38       8
Predict churn             120      20
Score data set:
Segmentation                5       1
Predict propensity        150       2
Predict churn              10       1

TABLE III
TIME TO COMPUTE MODELS INSIDE THE DATABASE SYSTEM AND TIME TO EXPORT THE DATA SET (SECS).

n × 1000    d   SQL/UDF   BULK    ODBC
     100    8         4     17     168
     100   16         5     32     311
     100   32         6     63     615
     100   64         8    121    1204
    1000    8        40    164    1690
    1000   16        51    319    3112
    1000   32        62    622    6160
    1000   64        78   1188   12010

As introduced in Section II, we distinguish two big data analytics workflows: computing the model and scoring the model on large data sets. The times shown in Table II include the time to transform the data set with joins and aggregations, the time to compute the model, and the time to score by applying the best model. As we can see, the database system is significantly faster. We must mention that to build the predictive models both approaches exploited samples from a large data set; the models were then tuned with further samples. For scoring data sets the gap is wider, highlighting the efficiency of SQL in computing the joins and aggregations that build the data set and then evaluating mathematical equations on the data set. In general, the main reason the external server was slower was the time to export data from the database system. A secondary reason was its more limited computing power.

Table III compares the processing time to compute a model inside the database system and the time to export the data set (with a bulk utility and ODBC). In this case the database system ran on a relatively small computer with a 3.2 GHz CPU, 4 GB of memory and 1 TB of disk. The models include PCA, Naive Bayes and linear regression, which can be derived from the correlation matrix [7] of the data set in a single table scan using SQL queries and UDFs. Exporting the data set is a bottleneck for performing data mining processing outside the database system, regardless of how fast the external server is. Exporting a sample of the data set can be done quickly, but analyzing a large data set without sampling is much faster inside the database system. Moreover, sampling can also be exploited inside the database system.
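To illustrate the single-scan computation mentioned above, the query below is a sketch that gathers sufficient statistics (the record count n, the linear sums L and the sums of squares and cross-products Q) for a data set with two numeric columns; the table and column names are assumptions. The correlation matrix, and hence models such as linear regression or PCA, can then be derived from these few aggregates without rescanning the table.

-- One table scan collecting sufficient statistics for two variables
-- x1 and x2 of a data set X (names are illustrative).
SELECT count(*)     AS n
      ,sum(x1)      AS L1
      ,sum(x2)      AS L2
      ,sum(x1 * x1) AS Q11
      ,sum(x2 * x2) AS Q22
      ,sum(x1 * x2) AS Q12
  FROM X;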

V. RELATED WORK

There exist many proposals that extend SQL with data mining functionality. Most proposals add syntax to SQL and optimize queries using the proposed extensions. Several techniques to execute aggregate UDFs in parallel are studied in [9]; these ideas are currently used by modern parallel relational database systems such as Teradata. SQL extensions to define, query and deploy data mining models are proposed in [11]; such extensions provide a friendly language interface to manage data mining models. That proposal focuses on managing models rather than computing them, and therefore such extensions are complementary to UDFs. Query optimization techniques and a simple SQL syntax extension to compute multidimensional histograms are proposed in [8], where a multiple grouping clause is optimized.

Some related work on exploiting SQL for data manipulation tasks includes the following. Data mining primitive operators are proposed in [1], including an operator to pivot a table and another one for sampling, both useful to build data sets. The pivot/unpivot operators are extremely useful to transpose and transform data sets for data mining and OLAP tasks [3], but they have not been standardized. Horizontal aggregations were proposed to create tabular data sets [13], as required by statistical and machine learning techniques, combining pivoting and aggregation in one function. For the most part, research work on preparing data sets for analytic purposes in a relational database system remains scarce. Data quality is a fundamental aspect in a data mining application. Referential integrity quality metrics are proposed in [14], where users can isolate tables and columns with invalid foreign keys. Such referential problems must be solved in order to build a data set without missing information.

Mining workflow logs, a different problem, has received attention [18]; the basic idea is to discover patterns in workflow process logs. To the best of our knowledge, there is scarce research work dealing with migrating data preprocessing into a database system to improve the management and processing of big data analytics workflows.

VI. CONCLUSIONS

We presented practical issues and discussed common solutions to push data preprocessing into a database system to improve the management and processing of big data analytics workflows. It is important to emphasize that data preprocessing is generally the most time-consuming and error-prone task in a big data analytics workflow. We identified specific data preprocessing issues. Summarization generally has to be done at different granularity levels, and such levels are generally not hierarchical. Rows are selected based on a time window, which calls for indexes on date columns. Row selection (filtering) with complex predicates happens on many tables, making code maintenance and query optimization difficult. In general, it is necessary to create a "universe" data set on which to define left outer joins. Model deployment requires importing models as SQL queries or UDFs to deploy a model on large data sets.

Based on experience from real-life projects, we compared advantages and disadvantages of running big data analytics workflows entirely inside and outside a database system. Our general observations are the following. Big data analytics workflows are easier to manage inside a database system. A big data analytics workflow is generally faster to run inside a database system, assuming a data warehousing environment (i.e., the data originates from the database). From a practical perspective, workflows are easier to manage inside a database system because users can exploit the extensive capabilities of the database system (querying, recovery, security and concurrency control) and there is less data redundancy. However, external statistical tools may provide more flexibility than SQL and more statistical techniques. From an efficiency (processing time) perspective, transforming and scoring data sets and computing models are faster to develop and run inside the database system. Nevertheless, sampling represents a practical solution to accelerate big data analytics workflows running outside a database system.

Improving and optimizing the management and processing of big data analytics workflows provides many opportunities for future work. It is necessary to specify models which take into account processing with external data mining tools. The data set tends to be the bottleneck in a big data analytics workflow from both programming and processing time perspectives. Therefore, it is necessary to specify workflow models which can serve as templates to prepare data sets. The role of the statistical model in a workflow needs to be studied in the context of the source tables that were used to build the data set; very few organizations reach that stage.

REFERENCES

[1] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. Non-stop SQL/MX primitives for knowledge discovery. In ACM KDD Conference, pages 425-429, 1999.
[2] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In Proc. VLDB Conference, pages 1481-1492, 2009.
[3] C. Cunningham, G. Graefe, and C.A. Galindo-Legaria. PIVOT and UNPIVOT: Optimization and execution strategies in an RDBMS. In Proc. VLDB Conference, pages 998-1009, 2004.
[4] R. Elmasri and S.B. Navathe. Fundamentals of Database Systems. Addison-Wesley, 4th edition, 2003.
[5] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27-34, November 1996.
[6] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2006.
[7] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, New York, 1st edition, 2001.
[8] A. Hinneburg, D. Habich, and W. Lehner. COMBI-operator: Database support for data mining applications. In Proc. VLDB Conference, pages 429-439, 2003.
[9] M. Jaedicke and B. Mitschang. On parallel processing of aggregate and scalar functions in object-relational DBMS. In ACM SIGMOD Conference, pages 379-389, 1998.
[10] T.M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
[11] A. Netz, S. Chaudhuri, U. Fayyad, and J. Bernhardt. Integrating data mining with SQL databases: OLE DB for data mining. In Proc. IEEE ICDE Conference, pages 379-387, 2001.
[12] C. Ordonez. Statistical model computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(12):1752-1765, 2010.
[13] C. Ordonez and Z. Chen. Horizontal aggregations in SQL to prepare data sets for data mining analysis. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(4):678-691, 2012.
[14] C. Ordonez and J. García-García. Referential integrity quality metrics. Decision Support Systems Journal, 44(2):495-508, 2008.
[15] C. Ordonez and J. García-García. Database systems research on data mining. In Proc. ACM SIGMOD Conference, pages 1253-1254, 2010.
[16] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.
[17] E. Rahm and H.H. Do. Data cleaning: Problems and current approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4), 2000.
[18] W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A.J.M.M. Weijters. Workflow mining: A survey of issues and approaches. Data & Knowledge Engineering, 47(2):237-267, 2003.


Database Integrated Analytics using R: Initial Experiences with SQL-Server + R

Josep Ll. Berral and Nicolas Poggi
Barcelona Supercomputing Center (BSC)
Universitat Politècnica de Catalunya (BarcelonaTech)
Barcelona, Spain

Abstract—Most data scientists nowadays use functional or semi-functional languages like SQL, Scala or R to treat data obtained directly from databases. Such a process requires fetching data, processing it, then storing it again, and it tends to be done outside the DB, in often complex data-flows. Recently, database service providers have decided to integrate "R-as-a-Service" in their DB solutions. The analytics engine is called directly from the SQL query tree, and results are returned as part of the same query. Here we show a first taste of such technology by testing the portability of our ALOJA-ML analytics framework, coded in R, to Microsoft SQL-Server 2016, one of the SQL+R solutions released recently. In this work we discuss some data-flow schemes for porting a local DB + analytics engine architecture towards Big Data, focusing specially on the new DB Integrated Analytics approach, and commenting on the first experiences in usability and performance obtained from such new services and capabilities.

I. INTRODUCTION

Current data mining methodologies, techniques and algorithms are based on heavy data browsing, slicing and processing. For data scientists, who are also users of analytics, the capability of easily defining the data to be retrieved and the operations to be applied over this data is essential. This is the reason why functional languages like SQL, Scala or R are so popular in such fields: although these languages allow high-level programming, they free the user from programming the infrastructure for accessing and browsing data.

The usual trend when processing data is to fetch the data from the source or storage (file system or relational database), bring it into a local environment (memory, distributed workers, ...), treat it, and then store back the results. In such a schema, functional-language applications are used to retrieve and slice the data, while imperative-language applications are used to process the data and manage the data-flow between systems. In most languages and frameworks, database connection protocols like ODBC or JDBC are available to enhance this data-flow, allowing applications to retrieve data directly from DBs. And although most SQL-based DB services allow user-written procedures and functions, these do not include a wide variety of primitive functions or operators.

The arrival of Big Data favored distributed frameworks like Apache Hadoop and Apache Spark, where the data is distributed "in the Cloud" and the data processing can also be distributed to where the data is placed; results are then joined and aggregated. Such technologies have the advantage of distributed computing, but the schema for accessing and using the data is still the same; only the data distribution is transparent to the user. The user is still responsible for adapting any analytics to a Map-Reduce schema, and for the data infrastructure.

Recently, companies like Microsoft, IBM or Cisco, providers of Analytics-as-a-Service platforms, put special effort into complementing their solutions by adding scripting mechanisms into their DB engines, allowing analytics mechanisms to be embedded into the same DB environment. All of them selected R [17] as the language and analytics engine, a free and open-source statistics-oriented language and engine, embraced by the data mining community since long ago. The current paradigm of "Fetch from DB, Process, Dump to DB" is shifted towards an "In-DB Processing" schema, so the operations to be done on the selected data are provided by the same DB procedures catalog. All the computation remains inside the DB service, so the daily user can proceed by simply querying the DB in a SQL style. New R-based procedures, built-in or user-created by invoking R scripts and libraries, are executed as regular operations inside the query execution tree. The idea of such integration is that not only will this processing be more usable by querying the DB, where the data is managed and distributed, but it will also reduce the overhead of the data pipe-line, as everything remains inside a single data framework. Further, for continuous data processing, analytics procedures and functions can be directly called from triggers when data is continuously introduced or modified.

As a case of use of such an approach, in this paper we present some experiences on porting the ALOJA framework [13] to the recently released SQL-Server 2016, which incorporates this R-Service functionality. The ALOJA-ML framework is a collection of predictive analytics functions (machine learning and data mining), written in R, originally intended for modeling and predicting High Performance Computing (HPC) benchmarking workloads as part of the ALOJA Project [16], deployed to be called after retrieving data from a MySQL database. The project collects traces and profiling of Big Data framework technologies, and analyzing this data requires predictive analytics. These procedures are deployed in R, and communicating the framework with the database and the R engine results in a complex architecture, with a fragile data-pipeline susceptible to failures.

In this work we present how we could adapt our current data-processing approach to a more Big-Data-oriented architecture, and how we tested it using the ALOJA data-set [12], as an example and as a review from the user's point of view.

After testing the Microsoft version of SQL+R services [10], we saw that the major complication, far beyond deploying the service, is to build the SQL wrapping procedures for the R scripts to be executed. When reusing or porting already existing code or R applications, the wrapper just has to source (the R function for "import") the original code and its libraries, and execute the corresponding functions (plus bridging the parameters and the return value of those functions). Current services are prepared to transform two-dimensional R data frames into tables as the result of such procedures. Aside from the Microsoft services, we took a look at Cisco ParStream [6], which displays a very similar approach, differing in the way the R scripts are instantiated: through R file calls instead of direct scripting.
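To give an idea of such a wrapper, below is a minimal sketch of a SQL-Server 2016 procedure calling R through the built-in sp_execute_external_script mechanism; the R file path, the aloja_predict function, the input query and the result columns are hypothetical names, not the actual ALOJA-ML code.

CREATE PROCEDURE dbo.PredictBenchmark
AS
BEGIN
  EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
      source("C:/aloja/functions.r")                # import existing R code
      OutputDataSet <- aloja_predict(InputDataSet)  # must return a data.frame
    ',
    @input_data_1 = N'SELECT * FROM aloja_logs'
  WITH RESULT SETS ((benchmark VARCHAR(64), predicted_time FLOAT));
END;

A daily user can then obtain the analytics with a plain EXEC dbo.PredictBenchmark, with no visible data pipe-line.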

It remains as future work to test and compare the performance among platforms, and to include some experiences with IBM PureData Services [7] and any other new platform providing such services.

This article is structured as follows: Section II presents the current state of the art and recent approaches used when processing data from databases. Section III explains the current and new data-flow paradigms, and the required changes. Section IV shows some code porting examples from our native R scripts to Microsoft SQL-Server. Section V provides comments and details on some experiments on running the ALOJA framework in the new architecture. Finally, Section VI summarizes this current work and presents the conclusions and future work.

II. STATE OF THE ART

Current efforts on systems processing Big Data are mostly focused on building and improving distributed systems. Platforms like Apache Hadoop [1] and Apache Spark [4], with all their "satellite" technologies, are on the rise in Big Data processing environments. Although those platforms were originally designed towards Java or Python applications, the significant weight of the data mining community using R, Scala or SQL interfaces encouraged the platform strategists over the years to include interfaces and methodologies for using such languages. Revolution Analytics published RHadoop [9] in 2011, a set of packages for R users to launch Hadoop tasks. Such packages included HDFS [14] and HBase [2] handlers, with Map-Reduce and data processing function libraries adapted for Hadoop. This way, R scripts could dispatch parallelizable functions (e.g., R "apply" functions) to be executed on distributed worker computing machines.

The Apache Spark platform, developed by Berkeley's AMPlab [11] and Databricks [5] and released in 2014, focuses on four main applied data science topics: graph processing, machine learning, data streaming, and relational algebraic queries (SQL). For these, Spark is divided into four big packages: GraphX, MLlib, SparkStream, and SparkSQL. SparkSQL provides a library for treating data through SQL syntax or through relational algebraic functions. Recently, an R scripting interface has also been added to the initial Java, Python and Scala interfaces, through the SparkR package, providing the Spark-based parallelism functions, Map-Reduce, and HBase or Hive [3] handlers. This way, R users can connect to Spark deployments and process data frames in a Map-Reduce manner, the same way they could do with RHadoop or using other languages.

Being able to move the processing towards the database side becomes a challenge, but it allows integrating analytics into the same data management environment, letting the same framework that receives and stores data also process it, in whatever way it is configured (local, distributed, ...). For this purpose, companies providing database platforms and services put effort into adding data processing engines as integrated components of their solutions. Microsoft recently acquired Revolution Analytics and its R engine, re-branded as R-Server [8] and connected to the Microsoft SQL-Server 2016 release [10]. IBM also released their platform Pure Data Systems for Analytics [7], providing database services that include the vanilla R engine from the Comprehensive R Archive Network [17]. Also, Cisco recently acquired ParStream [6], a streaming database product incorporating user-defined functions, programmable as shared object libraries in C++ or as external scripts in R.

Here we describe our first approach to integrating our R-based analytics engine into a SQL+R platform, the Microsoft SQL-Server 2016, primarily looking at the user experience, and discussing cases of use where one architecture would be preferred over the others.

III. DATA-FLOW ARCHITECTURES

Here we show three basic schemes of data processing: the local ad-hoc schema of pull-process-push data, the new distributed schemes for Hadoop and Spark, and the "In-DataBase" approach using the DB integrated analytics services. Note that there is no universal schema that works for every situation; each one serves better or worse depending on the case. We use the ALOJA-ML framework as an example of such architectures: how it currently operates, its problems, and how it would be adapted to the new approaches.

A. Local Environments

We consider architectures where the data and the processing capacity are in the same location, with data-sets stored in local file systems or direct-access DBs (local or remote databases that the user or the application can access to retrieve or store data). When an analytics application wants to process data, it can access the DB to fetch the required data, store it locally, and then pass it to the analytics engine. Results are collected by the application and pushed again to the DB, if needed. This is the classical schema predating distributed systems with distributed databases, used for systems where the processing is not considered big enough to build a distributed computing-power environment, or when the process to be applied on the data cannot be distributed.

For systems where the analytics application is just a user of the data, this schema might be the appropriate one, as the application fetches the data it is granted to view, then does whatever it wants with it. The same holds for applications whose computation does not require selecting big amounts of data: as the data required is a small fraction of the total Big Data, it is affordable to fetch the slice of data and process it locally. It also holds for systems using libraries like snowfall [X], which lets the user tune the parallelism of R functions locally, up to the point where this distribution requires heavier mechanisms like Hadoop or Spark.

There are mechanisms and protocols, like ODBC or JDBC (e.g., the RODBC library for R), allowing applications to access DBs directly and fetch data without passing through the file system, but through the application memory space. This applies when the analytics application has been granted direct access to the DB and understands the returned format. Figure 1 shows both archetypical local schemes.

Fig. 1. Schema of local execution approaches, passing data through the File System or ODBC mechanisms.

In the first case of Figure 1 we suppose a scenario where the engine has no direct access to the DB, as everything that goes into analytics is triggered by the base application. Data is retrieved and pre-processed, then piped to the analytics engine through the file system (files, pipes, ...). This requires coordination between application and engine, in order to communicate the data properly. Such scenarios can happen when the DB, the application and the engine do not belong to the same organization, or when they belong to services from different providers. It can also happen when security issues arise, as the analytics can be provided by a not-so-trusted provider, and data must be anonymized before exiting the DB.

Also, if the DB and the analytics engine are provided by the same service provider, there will be means to communicate the storage with the engine through ODBC or some other way, allowing the engine to fetch the data, process it, and then return the results to the DB. As an example, the Microsoft Azure-ML services [15] allow the connection of their machine learning components with the Azure storage service, as well as the inclusion of R code as part of user-defined components. In such a service, the machine learning operations do not happen on the storage service; data is pulled into the engine and then processed.

Until now the ALOJA Project, consisting of a web user interface (front-end and back-end), a MySQL database and the R engine for analytics, used this schema of data pipe-line. The web interface provided the user the requested data from the ALOJA database, which was then displayed in the front-end. If analytics were required, the back-end dumped the data to be treated into the file system, and then started the R engine. Results were collected by the back-end and displayed (and also stored in a cache). As most of the information passed through user filters, this data pipe-lining was preferred over a direct connection between the scripts and the DB. It also kept the engine independent of the DB queries, which are in constant development in the main framework.

The next steps for the project are planned towards incorporating a version of the required analytics into the DB, so the back-end of the platform can invoke any of the provided analytics, coded generically for any kind of input data, as a single SQL query.

B. Distributed Environments

An alternative architecture for the project would be to upload the data into a Distributed Database (DDB), thinking of expanding the ALOJA database towards Big Data (we still have terabytes of data to be unpacked into the DB and analyzed at this time). The storage of the data could be done using Hive or HBase technologies, and the analytics could be adapted towards Hadoop or Spark. For Hadoop, the RHadoop package could handle the analytics, while the SparkSQL + SparkR packages could do the same for Spark. Processing the analytics could be done on a distributed system with worker nodes, as most of the analytics in our platform can be parallelized. Further, Hadoop and Spark have machine learning libraries (Mahout and MLlib) that could be used natively instead of some of the functionalities of our ALOJA-ML framework. Figure 2 shows the execution schema of this approach.

Fig. 2. Schema of the distributed execution approach, with Distributed Databases or File Systems

Such an approach would concentrate all the data retrieval and processing in a distributed system, not necessarily near the application, but providing parallelism and agility for data browsing, slicing and processing. However, it requires a complex set-up of all the involved services and machines, and the adaptation of the analytics towards parallelism at a different level than when using snowfall or similar packages. SparkR provides a Distributed Data-Frame structure with properties similar to regular R data-frames, but due to its distributed nature some classic operations are not available or must be performed in a different way (e.g. column binding, or aggregates like "mean" and "count").
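For instance, a per-group mean over a Distributed Data-Frame goes through "groupBy" and "agg" rather than the usual R idioms; a minimal sketch with the Spark 1.x SparkR API (column names taken from the ALOJA examples) would be:

library(SparkR);

sc <- sparkR.init(master = "yarn-client");   # connect to the cluster
sqlContext <- sparkRSQL.init(sc);
ddf <- createDataFrame(sqlContext, ds);      # distribute a local R data frame
means <- collect(agg(groupBy(ddf, "maps"),   # the mean must be expressed via agg()
                     exe_time = "mean"));
sparkR.stop();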

This option would be chosen at the point where the data becomes too large to be dealt with by a single system, considering that DB managers do not themselves provide means to distribute and retrieve the data. Also, as in the previous approach, the data pipe-line relies on calling the analytics engine to produce the queries and invoke the analytics (in a Map-Reduce manner) when they are parallelizable, or to collect the data locally and apply the non-parallelizable analytics. The price to be paid is the set-up of the platform and the proper adjustment of the analytics towards a Map-Reduce schema.

C. Integrated Analytics

The second alternative to the local analytics schema is to incorporate those functions into the DB services. At that point the DB can offer Analytics as a Service: users can query for analytics directly against the DB in SQL, and data will be retrieved, processed and stored by the same framework, sparing the user from planning a full data pipe-line. As the presented services and products appear to accept R scripts and programs directly, no adaptation of the R code is required; the effort goes into coding the wrapping procedures so that they can be called from a SQL interface. Figure 3 shows the basic data-flow for this option.

Fig. 3. Schema of “In-DB” execution approach

The advantages of such an approach are that 1) analytics, data and data management are integrated in the same software package, and the only deployment and configuration needed is that of the database; 2) the management of distributed data in Big Data deployments is provided by the same DB software (when implemented as part of the service!), so the user does not need to implement Map-Reduce functions and data can be aggregated in the same SQL query; 3) simple DB users can invoke the analytics through a SQL query, and analytics can be programmed generically so they apply to any required input SQL query result; 4) DBs providing triggers can produce Continuous Analytic Queries each time data is introduced or modified.
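As a hypothetical illustration of point 4, a trigger could re-train a model whenever new data arrives, using a wrapping procedure like the one presented later in Figure 5; the trigger itself is a sketch and not part of the current ALOJA deployment:

%% Hypothetical trigger producing a Continuous Analytic Query on insertion
%% (run inside the aloja database; dbo.MLPredictTrain is defined in Fig. 5)
CREATE TRIGGER dbo.trg_retrain ON dbo.aloja6
AFTER INSERT AS
BEGIN
INSERT INTO dbo.trained_models (model, id_hash)
EXEC dbo.MLPredictTrain @inquery = 'SELECT exe_time, maps, iofilebuf FROM dbo.aloja6',
    @varin = 'maps,iofilebuf', @varout = 'exe_time';
END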

This option still has issues to be managed, such as the capacity for optimization and parallelism of the embedded scripts. In the case of the Microsoft R Service, any R code can be inserted inside a procedure, with no apparent optimization applied, as any script is sent directly to an external R sub-service managed by the principal DB service. This R sub-service promises multi-threading in the Enterprise editions; since the classic R engine is single-threaded, multi-threading could then be applied to vectorized functions like "apply" without having to load the previously mentioned snowfall (loadable, as the script runs on a compatible R engine). Also, comparing this approach to the distributed computing system, if the service supported "partitioning" of data according to determined columns and values, SQL queries could be distributed over a cluster of machines (as distribution, not replication) and aggregated afterwards.

IV. ADAPTING AND EMBEDDING CODE

According to the published SQL-Server documentation and API, the principal way to introduce external user-defined scripts is to wrap the R code inside a procedure. SQL-Server procedures accept the script, which like any "Rscript" can source external R files and load libraries (previously copied into the corresponding R library path set up by the Microsoft R Service); they also accept parameters to be bridged to the script, as well as SQL queries to be executed before the script starts in order to fill its input data-frames. The procedure also defines the return values, in the form of data-frames as tables, single values, or tuples of values.

When the wrapping procedure is called, the input SQL queries are executed and passed as input data-frame parameters, direct parameters are also passed to the script, and then the script is executed in the R engine just as an "Rscript" or a script typed into the R command line interface would be. The variables mapped as outputs are returned from the procedure to the invoking SQL query.

Figure 4 shows an example of calling the ALOJA-ML functions in charge of learning a linear model from the ALOJA data-set, read from a file while indicating the input and output variables, and of calling a function to predict a data-set using the previously created model. The training process creates a model and its hash ID; the prediction process then applies the model to the whole testing data-set. In the current ALOJA data pipe-line, "ds" would be retrieved from the DB (into a file to be read, or directly into a variable if RODBC is used).

source("functions.r");
library(digest);

## Training process
model <- aloja_linreg(ds = read.table("aloja6.csv"),
                      vin = c("maps", "iofilebuf"),
                      vout = "exe_time");
id_hash <- digest(x = model, algo = "md5"); # ID for storing the model in the DB

## Prediction example
predictions <- aloja_predict_dataset(learned_model = model,
                                     ds = read.table("aloja6test.csv"));

Fig. 4. Example of modeling and prediction using the ALOJA-ML libraries. Loading the ALOJA-ML functions allows the code to execute "aloja_linreg" to model a linear regression, and "aloja_predict_dataset" to process a new data-set using a previously trained model.

The results ("model", "id_hash" and "predictions") would then be reintroduced into the DB, again using RODBC, or by writing the results into a file and the model into a serialized R object file.

Figures 5 and 6 show how this R code is wrapped as a procedure, together with examples of how these procedures are invoked. Under this schema, in the modeling procedure "ds" is passed to the procedure as a SQL query, "vin" and "vout" are bridged parameters, and "model" and "id_hash" are returned as two values (a serialized blob/string and a string) that are saved into a models table. Likewise, the prediction procedure accepts an "id_hash" for retrieving the model (using a SQL query inside the procedure) and returns a data-frame/table with the row IDs and the predictions.
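Although the text does not give the exact schema, a models table consistent with the procedures of Figures 5 and 6 could be defined as in this sketch:

%% Plausible definition of the models table used by the wrapping procedures
CREATE TABLE aloja.dbo.trained_models (
    id_hash nvarchar(50) PRIMARY KEY,  -- MD5 digest identifying the model
    model   nvarchar(max)              -- base64-encoded serialized R object
);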

We observed that, when preparing the embedded script, all sources, libraries and file paths must be set up as for an Rscript executed from the command line: the environment for the script must be prepared anew, as it is executed each time in a fresh R session.

In the usual ALOJA-ML work-flow, models (serialized R objects) are stored in the file system and then uploaded to the database as binary blobs. Working directly from the DB server makes it possible to encode serialized objects directly into the available DB formats. Although SQL-Server includes a "large binary" data format, we found problems when returning binary information from procedures (the syntax does not allow returning that data type inside tuples), so serialized objects can instead be converted to text formats like base64 and stored as a "variable size character array".
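The round trip behind this workaround, as used in Figures 5 and 6, amounts to a few lines of R (assuming the base64enc package):

library(base64enc);

serial <- as.raw(serialize(model, NULL));          # R object -> raw bytes
txt <- base64encode(serial);                       # raw bytes -> varchar-safe string
model2 <- unserialize(as.raw(base64decode(txt)));  # string -> bytes -> R object again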

V. EXPERIMENTS AND EXPERIENCES

A. Usability and Performance

To test the approach we used SQL-Server 2016 Basic with the R Service installed, on Windows Server 2012, from a default image available in the Azure repositories. Being the basic rather than the enterprise service, we assumed that R Services would not provide multi-threading improvements, so additional packages for such functions were installed (snowfall). The data from the ALOJA data-set was imported through the CSV importing tools of the SQL-Server Manager framework, the "Visual-Studio"-like integrated development environment (IDE) for managing the services.

Data importing presented some problems concerning data-type transformations, as some numeric columns could not be imported properly due to precision and very large values, and had to be treated as varchar (thus not treatable as numbers). After making sure the CSV data was properly imported, the IDE allowed us to perform simple queries (SELECT, INSERT, UPDATE). At this point, a table storing key-value entries, containing the models with their specific ID hash as key, was created, and procedures wrapping the basic available ALOJA-ML functions were created as specified in the previous Section IV. After executing some example calls, the system worked as expected.
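Regarding the columns left as varchar, one plausible workaround, sketched below with illustrative column names, is to cast them back to numbers in the input query itself:

%% Cast a varchar-imported numeric column back to a number on read
SELECT CAST(exe_time AS float) AS exe_time, maps, iofilebuf
FROM aloja.dbo.aloja6;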

During the process of creating the wrapping procedures we found some issues, probably unreported bugs or internal system limitations. For instance, a procedure can return a large binary type (the large blob of MySQL and similar solutions) and can return tuples of diverse data types, but it crashed with an internal error when trying to return a tuple of a varchar and a large binary. A workaround was found by converting the serialized model object (binary type) into a base64-encoded string (varchar type), stored with its ID hash key. As no information about this issue was found in the documentation at the time of recording these experiences, we expect such issues to be solved by the development team in the future.

We initially ran some tests using the ALOJA-ML modeling and prediction functions over the data; comparing times against a local "vanilla" R setup, performance was almost identical. This indicates that, in this "basic" version, the R Server (formerly R from Revolution Analytics) is still essentially the same engine.
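Such comparisons can be made with simple wall-clock measurements of the function calls; in R this amounts to something like the following sketch:

## Measure only the time spent inside the modeling function
t <- system.time(
    model <- aloja_linreg(ds = ds, vin = c("maps", "iofilebuf"), vout = "exe_time")
);
t["elapsed"]  # wall-clock seconds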

Another test was to run the outlier classification function of our framework. The function, explained in detail in its corresponding work [13], compares the original output variables with their predictions: if the difference is k times greater than the expected standard deviation plus the modeling error, and the value does not have enough support from similar values in the rest of the data-set, the data point is considered an outlier. This implies constantly reading the data table for prediction and for comparisons between entries. The performance results were similar to those in a local execution environment, measuring only the time spent in the function.

%% Creation of the Training Procedure, wrapping the R call
CREATE PROCEDURE dbo.MLPredictTrain @inquery nvarchar(max), @varin nvarchar(max),
    @varout nvarchar(max) AS
BEGIN
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
source("functions.r");
library(digest); library(base64enc);
model <- aloja_linreg(ds = InputDataSet, vin = unlist(strsplit(vin, ",")), vout = vout);
serial <- as.raw(serialize(model, NULL));
OutputDataSet <- data.frame(model = base64encode(serial),
                            id_hash = digest(serial, algo = "md5"));
',
    @input_data_1 = @inquery,
    @input_data_1_name = N'InputDataSet',
    @output_data_1_name = N'OutputDataSet',
    @params = N'@vin nvarchar(max), @vout nvarchar(max)',
    @vin = @varin,
    @vout = @varout
WITH RESULT SETS (("model" nvarchar(max), "id_hash" nvarchar(50)));
END

%% Example of creating a model and storing it into the DB
INSERT INTO aloja.dbo.trained_models (model, id_hash)
EXEC dbo.MLPredictTrain @inquery = 'SELECT exe_time, maps, iofilebuf FROM aloja.dbo.aloja6',
    @varin = 'maps,iofilebuf', @varout = 'exe_time';

Fig. 5. Version of the modeling call for the ALOJA-ML functions as a SQL-Server procedure. The procedure generates the data-set for "aloja_linreg" from a parametrized query and bridges the remaining parameters into the script. It also declares the format of the output, be it a value, a tuple or a table (data frame).

%% Creation of the Predicting Procedure, wrapping the R call
CREATE PROCEDURE dbo.MLPredict @inquery nvarchar(max), @id_hash nvarchar(max) AS
BEGIN
DECLARE @modelt nvarchar(max) = (SELECT TOP 1 model FROM aloja.dbo.trained_models
                                 WHERE id_hash = @id_hash);
EXECUTE sp_execute_external_script
    @language = N'R',
    @script = N'
source("functions.r");
library(base64enc);
results <- aloja_predict_dataset(learned_model = unserialize(as.raw(base64decode(model))),
                                 ds = InputDataSet);
OutputDataSet <- data.frame(results);
',
    @input_data_1 = @inquery,
    @input_data_1_name = N'InputDataSet',
    @output_data_1_name = N'OutputDataSet',
    @params = N'@model nvarchar(max)',
    @model = @modelt;
END

%% Example of predicting a data-set from a SQL query with a previously trained model in the DB
EXEC aloja.dbo.MLPredict @inquery = 'SELECT exe_time, maps, iofilebuf FROM aloja.dbo.aloja6test',
    @id_hash = 'aa0279e9d32a2858ade992ab1de8f82e';

Fig. 6. Version of the prediction call for the ALOJA-ML functions as a SQL-Server procedure. Like the training procedure in Figure 5, the procedure first retrieves the data to be processed from a SQL query and passes it, with the rest of the parameters, into the script. Here the result is directly a table (data frame).

On an HDI-A2 instance (2 virtual cores, only 1 used, 3.5 GB of memory), it took 1h:9m:56s to process the 33,147 rows, selecting just 3 features. Then, to improve performance given the limitation of the single-threaded R Server, we loaded snowfall, invoked from the ALOJA-ML "outlier_dataset" function, on an HDI-A8 instance (8 virtual cores, all used, 14 GB of memory). The data-set was processed in 11m:4s, barely 1/7 of the previous time, even considering the overhead of sharing data among the R processes created by snowfall. This demonstrates that, despite the set-up not being multi-threaded, R procedures can be scaled using the traditional resources available in R.
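The kind of parallelization applied is sketched below, assuming the snowfall package; "check_outlier" is a hypothetical stand-in for the per-entry work done inside "outlier_dataset":

library(snowfall);

sfInit(parallel = TRUE, cpus = 8);    # one worker per virtual core of the HDI-A8
sfExport("ds");                       # share the data-set with the worker processes
flags <- sfLapply(seq_len(nrow(ds)),  # parallel map over the entries
                  function(i) check_outlier(ds, i));
sfStop();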

B. Discussion

One concern about the usage of such a service is, despite and because of the capability of multi-processing using built-in or loaded libraries, the management of the pool of R processes. R is not just a scripting language to be embedded in a procedure; it is a high-level language that allows everything from creating system calls to parallelizing work among networked worker nodes. Given the complexity that a user-created R function can reach, in cases where such a procedure is heavily requested the R server should be detachable from the SQL-server and deployable on dedicated HPC deployments.

In the same way that snowfall can be deployed for multi-threading (and also for cluster computing), clever hacks can be created by loading RHadoop or SparkR inside a procedure, connecting the script with a distributed processing system. As the SQL-server bridges tables and query results as R data frames, such data frames can be converted to Hadoop's Resilient Distributed Data-sets or Spark's Distributed Data Frames, uploaded to HDFS, processed, and then returned to the database. This could lead to a new architecture where SQL-Server (or equivalent solutions) connects to distributed processing environments acting as slave HPC workers for the database. A further improvement would be for the DB-server itself to return Distributed Data Frames instead of producing input tables/data frames, with the database distributed over worker nodes (by partitioning, not replication). All in all, the fact that the embedded R code is passed directly to a nearly independent R engine allows a data scientist to do anything that could be done in a typical R session.
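Such a hack could be sketched as follows inside the embedded script of a procedure, using the Spark 1.x SparkR API; the cluster address is illustrative:

library(SparkR);

sc <- sparkR.init(master = "spark://hpc-master:7077"); # attach to the HPC workers
sqlContext <- sparkRSQL.init(sc);
ddf <- createDataFrame(sqlContext, InputDataSet);      # bridged table -> Distributed Data-Frame
## ... distributed processing of "ddf" ...
OutputDataSet <- collect(ddf);                         # back to a local frame for the DB
sparkR.stop();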

VI. SUMMARY AND CONCLUSIONS

The incorporation of R, the semi-functional statistical programming language, as an embedded analytics service in databases will improve the ease and usability of analytics over any kind of data, from regular amounts to Big Data. Letting data scientists introduce their analytics functions as database procedures, avoiding complex data-flows from the DB to analytics engines, gives users and experts a quick tool for treating data in-situ and continuously.

This study discussed different architectures for data processing, involving fetching data from DBs, distributing data and processing power, and embedding the data processing into the DB, all using the ALOJA-ML framework as a reference: a framework written in R dedicated to modeling, predicting and classifying data from Hadoop executions, stored as the ALOJA data-set. The examples and use cases shown correspond to the port of the current ALOJA architecture towards SQL-Server 2016 with its integrated R Services.

After testing the analytics functions once ported into a SQL database, we observed that the major effort of the porting lies in the SQL structures wrapping the R calls into the DB, without modifying the original R code. As performance is similar to that of standalone R distributions, the advantages come from the retrieval and storage of the input data.

In future work we plan to test the system more in depth, and also to compare different SQL+R solutions, as companies offering DB products have started putting effort into integrating R engines in their DB platforms. As this study focused on an initial hands-on with this new technology, future studies will focus more on comparing performance, also against other architectures for processing Big Data.

ACKNOWLEDGMENTS

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595).

REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org (Aug 2016).
[2] Apache HBase. https://hbase.apache.org/ (Aug 2016).
[3] Apache Hive. https://hive.apache.org/ (Aug 2016).
[4] Apache Spark. https://spark.apache.org/ (Aug 2016).
[5] Databricks inc. https://databricks.com/ (Aug 2016).
[6] ParStream. Cisco corporation. http://www.cisco.com/c/en/us/products/analytics-automation-software/parstream/index.html (Aug 2016).
[7] PureData Systems for Analytics. IBM corporation. https://www-01.ibm.com/software/data/puredata/analytics/ (Aug 2016).
[8] R-Server. Microsoft corporation. https://www.microsoft.com/en-us/cloud-platform/r-server (Aug 2016).
[9] RHadoop. Revolution Analytics. https://github.com/RevolutionAnalytics/RHadoop/wiki (Aug 2016).
[10] SQL-Server 2016. Microsoft corporation. https://www.microsoft.com/en-us/cloud-platform/sql-server (Aug 2016).
[11] UC Berkeley, AMPlab. https://amplab.cs.berkeley.edu/ (Aug 2016).
[12] Barcelona Supercomputing Center. ALOJA home page. http://aloja.bsc.es/ (Aug 2016).
[13] J. L. Berral, N. Poggi, D. Carrera, A. Call, R. Reinauer, and D. Green. ALOJA-ML: A framework for automating characterization and knowledge discovery in hadoop deployments. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages 1701–1710, 2015.
[14] D. Borthakur. The Hadoop Distributed File System: Architecture and Design. http://hadoop.apache.org/docs/r0.18.0/hdfs_design.html. The Apache Software Foundation, 2007.
[15] Microsoft Corporation. Azure 4 Research. http://research.microsoft.com/en-us/projects/azure/default.aspx (Jan 2016).
[16] N. Poggi, J. L. Berral, D. Carrera, A. Call, F. Gagliardi, R. Reinauer, N. Vujic, D. Green, and J. A. Blakeley. From performance profiling to predictive analytics while evaluating hadoop cost-efficiency in ALOJA. In 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29 – November 1, 2015, pages 1220–1229, 2015.
[17] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014.
