Advanced Analytics – Theory and Methods
Module 4: Analytics Theory/Methods
Time Series Analysis
During this lesson the following topics are covered:
• Time Series Analysis and its applications in forecasting
• ARMA and ARIMA Models
• Implementing the Box-Jenkins Methodology using R
• Reasons to Choose (+) and Cautions (-) with Time Series Analysis
The topics covered in this lesson are listed. ARIMA and Box-Jenkins methodology are explained
in the following slides.
Time Series Analysis
• Time Series: Ordered sequence of equally spaced values over time
• Time Series Analysis: Accounts for the internal structure of
observations taken over time
Trend
Seasonality
Cycles
Random
• Goals
To identify the internal structure of the time series
To forecast future events
Example: Based on sales history, what will next December's sales be?
• Method: Box-Jenkins (ARMA)
Businesses perform sales forecasting to look ahead in order to plan their investments, launch
new products, decide when to close or withdraw products, etc. The sales forecasting process is
a critical one for most businesses. Part of the sales forecasting process is to examine the past.
How well did we do in the last few months or what were our sales in the same time period for
the last few years? Time Series Analysis provides a scientific methodology for sales forecasting.
Time Series Analysis is the analysis of sequential data collected at equally spaced points in time. It is a basic research methodology in which data for one or more variables are collected for many observations across different time periods. The main objectives of Time Series Analysis are:
• To understand the underlying structure of the time series by breaking it down into its components
• To fit a mathematical model and then use it to forecast the future
The time periods are usually regularly spaced, and the observations may be either univariate or multivariate. Univariate time series measure only one variable over time, whereas multivariate time series measure multiple variables simultaneously. The internal structure of the data may exhibit a trend, seasonality, or cycles.
Box-Jenkins Method: What is it?
• Models historical behavior to forecast the future
• Applies ARMA (Autoregressive Moving Average) models
Input: Time Series
Accounting for Trends and Seasonality components
Output: Expected future value of the time series
The Box-Jenkins methodology, developed by G.E.P. Box and G.M. Jenkins, enables forecasting with time series data with both high accuracy and low computational requirements. The technique may be applied to quickly produce forecasts that are as uncomplicated in form as the simple smoothing methods, or that involve a number of economic variables. In either case, the technique makes efficient use of the predictive information contained in the data, and it aims for the highest forecasting accuracy attainable from the variables on which the forecast is based.
The input to the model is the trend- and seasonality-adjusted time series, and the output is the expected future value of the time series.
The Box-Jenkins methodology applies autoregressive moving average (ARMA) models to find the best fit of a time series to its own past values, in order to make forecasts.
Use Cases
Forecast:
• Next month’s sales
• Tomorrow’s stock price
• Hourly power demand
The key application of Time Series Analysis is forecasting. Economic and business planning, inventory control, and production control of industrial processes are some of the key areas in which time series analysis is deployed.
Time Series data provide useful information about the physical, biological, social or economic
systems generating the time series, such as:
Economics/Finance: share prices, profits, imports, exports, stock exchange indices
Sociology: school enrollments, unemployment, crime rate
Environment: Amount of pollutants, such as suspended particulate matter (SPM), in the
environment
Meteorology: Rainfall, temperature, wind speed
Epidemiology: Number of SARS cases over time
Medicine: Blood pressure measurements over time for evaluating drugs to control
hypertension
Modeling a Time Series
• Let’s model the time series as
Yt = Tt + St + Rt,   t = 1, …, n
• Tt: Trend term
Air travel steadily increased over the last few years
• St: The seasonal term
Air travel fluctuates in a regular pattern over the course of a year
• Rt: Random component
To be modeled with ARMA
We present a simple model for the time series with the trend, seasonality and a random
fluctuation. There is sometimes a low frequency cyclic term as well, but we are ignoring that
for simplicity.
Examples of trend and seasonality are also detailed in the slide.
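To make the decomposition concrete, here is a minimal R sketch (our own illustration, not part of the original slides) using the built-in stl() function on the built-in monthly co2 series:

# Decompose a monthly series into trend, seasonal, and remainder
# components, mirroring Yt = Tt + St + Rt. The co2 series is an
# illustrative choice.
fit <- stl(co2, s.window = "periodic")
plot(fit)               # panels: data, seasonal (St), trend (Tt), remainder (Rt)
head(fit$time.series)   # columns: seasonal, trend, remainder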
Stationary Sequences
• Box-Jenkins methodology assumes the random component is a
stationary sequence
Constant mean
Constant variance
Autocorrelation does not change over time
Constant correlation of a variable with itself at different times
• In practice, to obtain a stationary sequence, the data must be:
De-trended
Seasonally adjusted
A stationary sequence is a random sequence whose joint probability distribution does not vary over time; in other words, the mean, variance, and autocorrelations of the sequence do not change over time.
To render a sequence stationary, we need to remove the effects of trend and seasonality. The ARIMA model (implemented via Box-Jenkins) uses the method of differencing to render the data stationary.
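As a minimal sketch of differencing in R (the built-in AirPassengers series is an illustrative choice):

# First differencing to help render a series stationary.
y  <- log(AirPassengers)        # the log damps the growing variance
dy <- diff(y, differences = 1)  # one difference removes a linear trend
plot(dy)                        # now fluctuates around a roughly constant mean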
De-trending
• In this example, we see a
linear trend, so we fit a
linear model
Tt = m·t + b
• The de-trended series is then
Y1t = Yt – Tt
• In some cases, may have to
fit a non-linear model
Quadratic
Exponential
Trend in a time series is a slow, gradual change in some property of the series over the whole interval under investigation.
De-trending is a pre-processing step that prepares a time series for analysis by methods that assume stationarity.
A simple linear trend can be removed by subtracting a least-squares-fit straight line. In the example shown, we fit a linear model and take the difference; the graph shown next is a de-trended time series.
More complicated trends might require different procedures, such as fitting a non-linear model, e.g. a quadratic or an exponential model. As a rule of thumb:
Use a linear trend model if the first differences are more or less constant: (y2 − y1) = (y3 − y2) = … = (yn − yn−1)
Use a quadratic trend model if the second differences are more or less constant: (y3 − y2) − (y2 − y1) = … = (yn − yn−1) − (yn−1 − yn−2)
Use an exponential trend model if the percentage differences are more or less constant: ((y2 − y1)/y1) × 100% = … = ((yn − yn−1)/yn−1) × 100%
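A minimal sketch of this least-squares de-trending in R (the use of the built-in co2 series is an illustrative assumption):

# Remove a linear trend by subtracting a least-squares-fit line.
y   <- as.numeric(co2)   # illustrative example series
t   <- seq_along(y)
fit <- lm(y ~ t)         # fits Tt = m*t + b
y1  <- y - fitted(fit)   # de-trended series: Y1t = Yt - Tt
plot(t, y1, type = "l")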
Seasonal Adjustment
• Plotting the de-trended
series identifies seasons
For CO2 concentration, we
can model the period as
being a year, with variation
at the month level
• Simple ad-hoc adjustment:
take several years of data,
calculate the average value
for each month, and
subtract that from Y1t
Y2t = Y1t – St
Unlike the trend and cyclical components, seasonal components theoretically recur with similar magnitude during the same time period each year.
The holiday sales spike is an example of seasonality. By removing the seasonal component, it is easier to focus on the other components, since the seasonal component of a series typically makes the series harder to interpret.
A simple adjustment for seasonality is made by taking several years of data, calculating the average value for each month, and subtracting it from the actual values.
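A minimal sketch of this ad-hoc monthly adjustment in R, continuing the de-trended co2 illustration from above:

# Average each calendar month across years (St), then subtract.
y1 <- ts(as.numeric(co2) - fitted(lm(as.numeric(co2) ~ seq_along(co2))),
         start = start(co2), frequency = 12)   # de-trended monthly series
st <- tapply(y1, cycle(y1), mean)   # St: average value for each month
y2 <- y1 - st[cycle(y1)]            # seasonally adjusted: Y2t = Y1t - St
plot(y2)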
ARMA(p, q) Model
• The simplest Box-Jenkins Model
Yt is de-trended and seasonally adjusted
• Combination of two process models
Autoregressive: Yt is a linear combination of its last p values
Moving average: Yt is a constant value plus the effects of a
dampened white noise process over the last q time values (lags)
Autoregressive (AR) models can be coupled with moving average (MA) models to form a
general and useful class of time series models called Autoregressive Moving Average (ARMA)
models. This is the simplest Box-Jenkins model.
An AR model predicts Yt as a linear combination of its last p values. An autoregressive model is simply a linear regression of the current value of the series on one or more prior values of the same series. Several options are available for analyzing autoregressive models, including standard linear least-squares techniques, and they have a straightforward interpretation. The time series Yt is called an autoregressive process of order p, denoted AR(p).
A moving average (MA) model adds to Yt the effects of a dampened white noise process (a sequence of random shocks, loosely analogous to the increments of a random walk or Brownian motion) over the last q steps. The related simple moving average is one of the most basic forecasting methods: moving backwards in time (t−1, t−2, t−3, and so forth) until we have n data points, we divide the sum of those points by n, and that gives the forecast for the next period. The forecast is simply a constant value projected onto the next time period, and n is also the order of the moving average.
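To make the ARMA(p, q) idea concrete, here is a minimal R sketch that simulates an ARMA(2, 1) process and fits it back with arima(); the coefficients are arbitrary illustrative values:

# Simulate an ARMA(2,1) series; with d = 0, arima() fits a plain ARMA.
set.seed(42)
y   <- arima.sim(model = list(ar = c(0.5, -0.3), ma = 0.4), n = 500)
fit <- arima(y, order = c(2, 0, 1))
fit$coef   # estimated ar1, ar2, ma1, and intercept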
ARIMA(p, d, q) Model
• ARIMA adds a differencing term, d, to the ARMA model
Autoregressive Integrated Moving Average
Includes the de-trending as part of the model
linear trend can be removed by d=1
quadratic trend by d=2
and so on for higher order trends
• The general non-seasonal model is known as ARIMA (p, d, q):
p is the number of autoregressive terms
d is the number of differences
q is the number of moving average terms
ARMA models can be used when the series is weakly stationary; in other words, when the series has a constant variance around a constant mean. This class of models can be extended to non-stationary series by allowing differencing of the data series; the resulting models are called Autoregressive Integrated Moving Average (ARIMA) models, and they come in a large variety.
ARIMA differences Yt d times to "induce stationarity"; d is usually 1 or 2. The "I" stands for integrated: the outputs of the model are summed up (or "integrated") to recover Yt.
The general ARIMA(p, d, q) model gives a tremendous variety of patterns in the ACF and PACF, so it is not practical to state rules for identifying general ARIMA models. In practice, it is seldom necessary to deal with values of p, d, or q other than 0, 1, or 2. It is remarkable that such a small range of values can cover such a large range of practical forecasting situations.
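A minimal sketch of fitting an ARIMA model in R; the order (1, 1, 1) and the AirPassengers series are illustrative choices, not prescriptions:

# Fit an ARIMA(1,1,1); d = 1 differences the series once inside the
# model (the "I" in ARIMA), so no manual de-trending is needed here.
y   <- log(AirPassengers)
fit <- arima(y, order = c(1, 1, 1))
predict(fit, n.ahead = 12)$pred   # forecast the next 12 time periods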
ACF & PACF
• Auto Correlation Function (ACF)
Correlation of the values of the time series with itself
Autocorrelation “carries over”
Helps to determine the order, q, of a MA model
Where does ACF go to zero?
• Partial Auto Correlation Function (PACF)
An autocorrelation calculated after removing the linear
dependence of the previous terms
Helps to determine the order, p, of an AR model
Where does PACF go to zero?
A common assumption in many time series techniques is that the time series is stationary. A
stationary process has the property that the mean, variance and autocorrelation structure do
not change over time.
An ACF plot provides an indication of the stationarity of the data. If the time series is not
stationary, we can often transform it to stationarity with the simple technique of differencing.
It should be noted that the autocorrelation carries over; if Yt is correlated with Yt-1, it is also
correlated with Yt-2 (though to a lesser degree).
PACF – The partial autocorrelation at lag k is the autocorrelation between Yt and Yt-k that is not
accounted for by lags 1 through k-1.
One looks for the point on the plot where the partial autocorrelations for all higher lags are
essentially zero.
We will look into ACF and PACF graphs in the next Lab.
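As a preview, here is a minimal sketch of producing these plots in R (the differenced log AirPassengers series is our own illustrative choice):

# Read q off the ACF and p off the PACF of a stationary (differenced) series.
dy <- diff(log(AirPassengers))
par(mfrow = c(1, 2))
acf(dy)    # where the ACF goes to zero suggests the MA order q
pacf(dy)   # where the PACF goes to zero suggests the AR order p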
Model Selection
• Based on the data, the Data Scientist selects p, d and q
An “art form” that requires domain knowledge, modeling
experience, and a few iterations
Use a simple model when possible
AR model (q = 0)
MA model (p = 0)
• Multiple models need to be built
and compared
Using ACF and PACF
Identification of the most appropriate model is the most important part of the process, where
it becomes as much ‘art’ as ‘science’.
The first step is to determine whether the time series is stationary. This can be done with a correlogram: plots of the ACF and PACF. If the time series is not stationary, it needs to be first-differenced (and may need to be differenced again to induce stationarity).
The next stage is to determine the p and q in the ARIMA(p, d, q) model (the d refers to how many times the data needs to be differenced to produce a stationary series).
In the diagnostic stage, we assess the model's adequacy by checking whether the model assumptions are satisfied. If the model is inadequate, this stage will provide information that helps us re-identify the model. We also check the normality, constant-variance, and independence assumptions on the residuals.
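One common way to compare the candidate models is an information criterion such as AIC, together with residual diagnostics. A minimal sketch with illustrative candidate orders:

# Compare a few (p, d, q) candidates by AIC, then check the residuals
# of the winner for independence with a Ljung-Box test.
y    <- log(AirPassengers)
cand <- list(c(1, 1, 0), c(0, 1, 1), c(1, 1, 1))
fits <- lapply(cand, function(ord) arima(y, order = ord))
sapply(fits, AIC)                      # smaller AIC is better
best <- fits[[which.min(sapply(fits, AIC))]]
Box.test(residuals(best), lag = 12, type = "Ljung-Box")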
Time Series Analysis – Reasons to Choose (+) & Cautions (-)
Reasons to Choose (+)
• Minimal data collection
	Only have to collect the series itself
	Do not need to input drivers
• Designed to handle the inherent autocorrelation of lagged time series
• Accounts for trends and seasonality
Cautions (-)
• No meaningful drivers: prediction based only on past performance
	No explanatory value
	Can't do "what-if" scenarios
	Can't stress test
• It's an "art form" to select appropriate parameters
• Only suitable for short-term predictions
The Reasons to Choose (+) and Cautions (-) of Time Series Analysis are listed above.
Time Series Analysis is not a common "tool" in a Data Scientist's tool kit. Though the models require minimal data collection and handle the inherent autocorrelation of lagged time series, they do not produce meaningful drivers for the prediction.
Selecting (p, d, q) appropriately is not very straightforward; thorough domain knowledge and a very detailed analysis of trend and seasonality may be required. Further, this method is suitable for short-term predictions only.
Time Series Analysis with R
• The function ts() is used to create time series objects
	mydata <- ts(…)
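A minimal sketch of creating a time series object; the sales vector and its start date are hypothetical:

# Turn a numeric vector into a monthly time series object.
sales  <- rnorm(24, mean = 100, sd = 10)   # 24 hypothetical monthly values
mydata <- ts(sales, start = c(2013, 1), frequency = 12)
plot(mydata)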
Decision Tree Classifier – What is it?
• Used for classification: returns class labels, and in some implementations probability scores
• Input variables can be continuous or discrete
• Output: a tree that describes the decision flow
	The tree can be converted into a set of "decision rules", e.g.
	"IF … > $100K THEN default=T with 75% probability"
Decision Trees are a flexible method very commonly deployed in data mining applications. In this lesson we focus on Decision Trees used for classification problems.
There are two types of trees: Classification Trees and Regression (or Prediction) Trees.
• Classification Trees are used to segment observations into more homogeneous groups (assign class labels). They usually apply to outcomes that are binary or categorical in nature.
• Regression Trees are variations of regression; what is returned at each node is the average value of the observations in that node (a type of step function). Regression trees can be applied to outcomes that are continuous (like account spend or personal income).
The input values can be continuous or discrete. Decision Tree models output a tree that
describes the decision flow. The leaf nodes return class labels and in some implementations
they also return the probability scores. In theory the tree can be converted into decision rules
such as the example shown in the slide.
Decision Trees are a popular method because they can be applied to a variety of situations. The rules of classification are very straightforward, and the results can easily be presented visually. Additionally, because the end result is a series of logical "if-then" statements, there is no underlying assumption of a linear (or non-linear) relationship between the predictor variables and the dependent variable.
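A minimal sketch of such a classification tree in R, assuming the rpart package is installed; the built-in iris data set is an illustrative choice:

# Fit a classification tree; print() shows it as nested if-then splits.
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)
plot(fit); text(fit)   # the flow-chart view of the same tree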
Decision Tree – Example of Visual Structure
[Figure: an example decision tree drawn as a flow chart. Internal nodes (decisions on a variable) test Gender, Income, and Age; branches (the outcomes of each test) split Gender into Female and Male, Income at 45,000, and Age at 40; the leaf nodes carry the class labels Yes and No.]
Decision Trees are typically depicted in a flowchart-like manner.
Branches refer to the outcome of a decision and are represented by the connecting lines here.
When the decision is numerical, the “greater than” branch is usually shown on the right and
“less than” on the left.
Depending on the nature of the variable, you may need to include an “equal to” component on
one branch.
Internal Nodes are the decision or test points. Each refers to a single variable or attribute.
In the example here the outcomes are binary, although there could be more than 2 branches
stemming from an internal node.
For example, if the variable was categorical and had 3 choices, you might need a branch for
each choice.
The Leaf Nodes are at the end of the last branch on the tree. These represent the outcome of
all the prior decisions. The leaf nodes are the class labels, or the segment in which all
observations that follow the path to the leaf would be placed.
Decision Tree Classifier – Use Cases
• When a series of yes/no questions is answered to arrive at a classification
Biological species classification
Checklist of symptoms during a doctor’s evaluation of a patient
• When “if-then” conditions are preferred to linear models.
Customer segmentation to predict response rates
Financial decisions such as loan approval
Fraud detection
• Short Decision Trees are the most popular “weak learner” in
ensemble learning techniques
An example of Decision Trees in practice is the method for classifying biological species. A
series of questions (yes/no) are answered to arrive at a classification.
Another example is a checklist of symptoms during a doctor’s evaluation of a patient. People
mentally perform these types of analysis frequently when assessing a situation.
Other use cases include customer segmentation to better predict response rates to marketing and promotions. Computers can be "taught" to evaluate a series of criteria and automatically approve or deny an application for a loan. In the case of loan approval, computers can use the logical "if-then" statements to predict whether the customer will default on the loan. For customers with a clear (strong) outcome, no human interaction is required; for observations that may not generate a clear response, a human is needed for the decision.
Short Decision Trees (where we have limited the number of splits) are often used as components (called "weak learners" or "base learners") in ensemble techniques (a set of predictive models that all vote, with the decision based on a combination of the votes) such as random forests, bagging, and boosting (beyond the scope of this class). The very simplest of the short trees are decision stumps: Decision Trees with one internal node (the root) immediately connected to the terminal nodes. A decision stump makes a prediction based on the value of just a single input feature.
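A minimal sketch of a decision stump, again assuming the rpart package; the depth limit of one split is the stump's defining property:

# A tree constrained to a single split at the root: a decision stump,
# the classic weak learner used in boosting and bagging.
library(rpart)
stump <- rpart(Species ~ ., data = iris, method = "class",
               control = rpart.control(maxdepth = 1))
print(stump)   # one internal node (the root) and two leaf nodes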
Example: The Credit Prediction Problem
[Figure: a decision tree for the credit data set. The root node covers all 1,000 applicants: 700/1000, p(good) = 0.70. The first split is on savings: applicants with savings of $500 to $1000, >= $1000, or "no known savings" reach a node with 245/294, p(good) = 0.83. The remaining applicants are split on housing (free/rent vs. own); the housing = own branch shows 349/501, p(good) = 0.70 and is split further on personal status (female or male divorced/separated vs. male married/widowed or male single) into leaf nodes with 36/88, p(good) = 0.41 and 70/117, p(good) = 0.60.]
We will use the same example we used in the previous lesson with the Naïve Bayesian classifier.
Starting at the top of the tree, the probability of good credit is 70% (700 out of 1000 people have good credit). The process has decided to split on how much is in the savings account, into two groups:
One group with savings of less than $100, or between $100 and $500.
The second group is the rest of the population, with savings of $500 to $1000, greater than $1000, or no known savings.
We compute the probability of good credit at the second node: 245 out of 294 in the second savings category have good credit, so the probability at this node is 83%.
Looking at the other node (Savings …