Advanced Analytics – Theory and Methods
Advanced Analytics – Theory and Methods
Naïve Bayesian Classifiers
During this lesson the following topics are covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier
The topics covered in this lesson are listed.
Classifiers
Where in the catalog should I place this product listing?
Is this email spam?
Is this politician Democrat/Republican/Green?
• Classification: assign labels to objects.
• Usually supervised: training set of pre-classified examples.
• Our examples:
Naïve Bayesian
Decision Trees
(and Logistic Regression)
The primary task performed by classifiers is to assign labels to objects. Labels in classifiers are
pre-determined, unlike in clustering, where we discover the structure and assign labels.
Classifier problems are supervised learning methods: we start with a training set of pre-classified
examples and, with the knowledge of probabilities, we assign class labels.
Some use case examples are shown in the slide. Based on voting patterns on issues, we could
classify a politician's party affiliation. Retailers use classifiers to assign the proper catalog entry
locations for their products. Classifying email as spam is another widely used application of
classifier methods.
Logistic regression, discussed in the previous lesson, can be viewed and used as a classifier. We
will discuss Naïve Bayesian Classifiers in this lesson and the use of Decision Trees in the next
lesson.
Naïve Bayesian Classifier
• Determine the most probable class label for each object
Based on the observed object attributes
Naïvely assumed to be conditionally independent of each other
Example:
Based on the object's attributes {shape, color, weight}
A given object that is {spherical, yellow, < 60 grams},
may be classified (labeled) as a tennis ball
Class label probabilities are determined using Bayes’ Law
• Input variables are discrete
• Output:
Probability score – proportional to the true probability
Class label – based on the highest probability score
The Naïve Bayesian Classifier is a probabilistic classifier based on Bayes' Law and naïve
conditional independence assumptions. In simple terms, a Naïve Bayes Classifier assumes
that the presence (or absence) of a particular feature of a class is unrelated to the presence (or
absence) of any other feature.
For example, an object can be classified into a particular category based on its attributes such
as shape, color, and weight. A reasonable classification for an object that is spherical, yellow,
and less than 60 grams in weight may be a tennis ball. Even if these features depend on each
other or upon the existence of the other features, a Naïve Bayesian Classifier considers all of
these properties to independently contribute to the probability that the object is a tennis ball.
The input variables are generally discrete (categorical) but there are variations to the
algorithms that work with continuous variables as well. For this lesson, we will consider only
discrete input variables. Although weight may be considered a continuous variable, in the
tennis ball example, weight was grouped into intervals in order to make weight a categorical
variable.
The output typically returns a probability score and class membership. The output from most
implementations is a log probability score for each class (we will address this later in the lesson),
and we assign the class label that corresponds to the highest log probability score.
Naïve Bayesian Classifier - Use Cases
• Preferred method for many text classification problems.
Try this first; if it doesn't work, try something more complicated
• Use cases
Spam filtering, other text classification tasks
Fraud detection
Naïve Bayesian Classifiers are among the most successful known algorithms for learning to
classify text documents. Spam filtering is the best known use of Naïve Bayesian Text
Classification. Bayesian Spam Filtering has become a popular mechanism to distinguish
illegitimate spam email from legitimate email. Many modern mail clients implement Bayesian
Spam Filtering.
Naïve Bayesian Classifiers are also used to detect fraud. For example, in auto insurance, based on
a training data set with attributes (such as driver's rating, vehicle age, vehicle price, whether the
claim was made by the policy holder, police report status, and whether the claim was genuine),
we can classify a new claim as genuine or not.
References:
Bayesian spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering)
http://www.cisjournal.org/archive/vol2no4/vol2no4_1.pdf
Hybrid Recommender System Using Naive Bayes Classifier and Collaborative Filtering
(http://eprints.ecs.soton.ac.uk/18483/)
Online applications (http://www.convo.co.uk/x02/)
Building a Training Dataset to Predict Good or Bad Credit
• Predict the credit behavior of
a credit card applicant from
applicant's attributes:
Personal status
Job type
Housing type
Savings amount
• These are all categorical
variables and are better suited
to Naïve Bayesian Classifier
than to logistic regression.
Let us look into a specific use case example. We present here the same example we worked
with in Lesson 2 of this module with the Apriori algorithm. The training dataset consists of
attributes: personal status, job type, housing type and amount of money in their savings
account. They are represented as categorical variables which are well suited for Naïve Bayesian
Classifier.
With this training set we want to predict the credit behavior of a new customer. This problem
could be solved with logistic regression as well. If there are multiple levels for the outcome you
want to predict, then Naïve Bayesian Classifier is a better solution.
Next, we will go through the technical basis for Naïve Bayesian Classifiers and will revisit this
credit dataset later.
Technical Description - Bayes' Law
P(C | A) = P(A ∩ C) / P(A) = P(A | C) · P(C) / P(A)
• C is the class label:
C ∈ {C1, C2, …, Cn}
• A is the observed object attributes
A = (a1, a2, … am)
• P(C | A) is the probability of C given A is observed
Called the conditional probability
Bayes' Law states: P(C | A) · P(A) = P(A | C) · P(C) = P(A ∧ C).
That is, the conditional probability that C is true given that A is true, denoted P(C|A), times the probability of A, is
the same as the conditional probability that A is true given that C is true, denoted P(A|C), times the probability of
C. Both of these terms are equal to P(A ∧ C), the probability that A and C are simultaneously true. If we divide all
three terms by P(A), we get the form shown on the slide.
The reason that Bayes’ Law is important is that we may not know P(C|A) (and we want to), but we do know P(A|C)
and P(C) for each possible value of C from the training data. As we will see later, it is not necessary to know P(A)
for the purposes of Naïve Bayes Classifiers.
An example using Bayes' Law:
John flies frequently and likes to upgrade his seat to first class. He has determined that, if he checks in for his
flight at least two hours early, the probability that he will get the upgrade is .75; otherwise, the probability that he
will get the upgrade is .35. With his busy schedule, he checks in at least two hours before his flight only 40% of the
time. Suppose John didn’t receive an upgrade on his most recent attempt. What is the probability that he arrived
late?
C = John arrives late
A = John did not receive an upgrade
P(C) = Probability John arrives late = .6
P(A) = Probability John did not receive an upgrade = 1 − (.4 × .75 + .6 × .35) = 1 − .51 = .49
P(A|C) = Probability that John did not receive an upgrade given that he arrived late = 1 − .35 = .65
P(C|A) = Probability that John arrived late given that he did not receive his upgrade
= P(A|C)P(C)/P(A) = (.65 × .6)/.49 ≈ .80
In this simple example, C can take one of two possible values {arriving early, arriving late} and there is only one
attribute, which can take one of two possible values {received upgrade, did not receive upgrade}. Next, we will
generalize Bayes' Law to multiple attributes and apply the naïve independence assumptions.
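To make the arithmetic concrete, here is a minimal sketch in R (the variable names are our own, not part of the course materials):

# Probabilities from the example
p_late <- 0.6                              # P(C): John arrives late (he checks in early only 40% of the time)
p_up_given_early <- 0.75                   # upgrade probability with early check-in
p_up_given_late  <- 0.35                   # upgrade probability otherwise

# P(A): total probability of NOT receiving an upgrade
p_no_up <- 1 - (0.4 * p_up_given_early + p_late * p_up_given_late)   # 0.49

# P(A|C): no upgrade given a late arrival
p_no_up_given_late <- 1 - p_up_given_late  # 0.65

# Bayes' Law: P(C|A) = P(A|C) * P(C) / P(A)
p_no_up_given_late * p_late / p_no_up      # approximately 0.80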
Apply the Naïve Assumption and Remove a Constant
• For observed attributes A = (a1, a2, … am), we want to compute
P(Ci | A) = P(a1, a2, ..., am | Ci) · P(Ci) / P(a1, a2, ..., am),   i = 1, 2, ..., n
and assign the classifier, Ci , with the largest P(Ci|A)
• Two simplifications to the calculations
Apply naïve assumption - each aj is conditionally independent of
each other, then
P(a1, a2, ..., am | Ci) = P(a1 | Ci) · P(a2 | Ci) ··· P(am | Ci) = Π(j=1..m) P(aj | Ci)
Denominator P(a1,a2,…am) is a constant and can be ignored
The general approach is to assign the classifier label, Ci, to the object with attributes A = (a1, a2,
… am) that corresponds to the largest value of P(Ci|A).
The probability that a set of attribute values A (comprising m variables a1 through am) should be
labeled with classification Ci equals the probability of observing the variables a1 through am
given that Ci is true, times the probability of Ci, all divided by the probability of observing the
set of attribute values a1 through am.
The conditional independence assumption is that the probability of observing the value of a
particular attribute given Ci is independent of the other attributes. This naïve assumption
simplifies the calculation of P(a1, a2, …, am|Ci) as shown on the slide.
Since P(a1, a2, …, am) appears in the denominator of P(Ci|A), for all values of i, removing the
denominator will have no impact on the relative probability scores and will simplify
calculations. Next, these two simplifications to the calculations will be applied to build the
Naïve Bayesian Classifier.
Building a Naïve Bayesian Classifier
• Applying the two simplifications
P(Ci | a1, a2, ..., am) ∝ Π(j=1..m) P(aj | Ci) · P(Ci),   i = 1, 2, ..., n
• To build a Naïve Bayesian Classifier, collect the following
statistics from the training data:
P(Ci) for all the class labels.
P(aj| Ci) for all possible aj and Ci
Assign the classifier label, Ci, that maximizes the value of
Π(j=1..m) P(aj | Ci) · P(Ci),   i = 1, 2, ..., n
Applying the two simplifications, P(Ci|a1, a2, …, am) is proportional to the product of the
various P(aj|Ci), for j=1,2,…m, times P(Ci). From a training dataset, these probabilities can be
computed and stored for future classifier assignments. We now return to the credit applicant
example.
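In code form, the classifier score is just a product over stored lookup tables. A minimal sketch in R (a generic helper of our own, not the course's implementation):

# Unnormalized score for class Ci: P(Ci) * prod_j P(aj | Ci).
# The denominator P(a1, ..., am) is dropped because it is the same for every class.
naive_bayes_score <- function(prior, cond_probs) {
  prior * prod(cond_probs)
}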
Naïve Bayesian Classifiers for the Credit Example
• Class labels: {good, bad}
P(good) = 0.7
P(bad) = 0.3
• Conditional Probabilities
P(own|bad) = 0.62
P(own|good) = 0.75
P(rent|bad) = 0.23
P(rent|good) = 0.14
… and so on
To build a Naïve Bayesian Classifier we need to collect the following statistics:
1. Probability of all class labels – the probability of good credit and the probability of bad credit.
From all the data available in the training set, we determine P(good) = 0.7 and P(bad) = 0.3.
2. In the training set, there are several attributes: personal_status, job, housing, and
saving_status. For each attribute and its possible values, we need to compute the
conditional probabilities given bad or good credit. For example, relative to the housing
attribute, we need to compute P(own|bad), P(own|good), P(rent|bad), P(rent|good), etc.
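Collecting these statistics is a couple of table operations in R. A sketch, assuming a hypothetical data frame credit_data with columns credit and housing:

# Class priors P(Ci)
priors <- table(credit_data$credit) / nrow(credit_data)

# Conditional probabilities P(aj | Ci) for the housing attribute;
# margin = 1 makes each row (each class) sum to 1
cond_housing <- prop.table(table(credit_data$credit, credit_data$housing), margin = 1)
cond_housing["bad", "own"]    # P(own | bad)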
Naïve Bayesian Classifier for a Particular Applicant
• Given applicant attributes of
A = {female single,
owns home,
self-employed,
savings > $1000}
• Since P(good|A) > P(bad|A),
assign the applicant the label
“good” credit
aj              Ci      P(aj | Ci)
female single   good    0.28
female single   bad     0.36
own             good    0.75
own             bad     0.62
self emp        good    0.14
self emp        bad     0.17
savings>1K      good    0.06
savings>1K      bad     0.02
P(good|A) ~ (0.28*0.75*0.14*0.06)*0.7 = 0.0012
P(bad|A) ~ (0.36*0.62*0.17*0.02)*0.3 = 0.0002
Here we have an example of an applicant who is female, single, owns a home, is self-employed,
and has more than $1000 in her savings account. How will we classify this person? Will she
be scored as a person with good or bad credit?
Having built the classifier with the training set we find P(good|A) which is equal to 0.0012 (see
the computation on the slide) and P(bad|A) is 0.0002. Since P(good|A) is the maximum of the
two probability scores, we assign the label “good” credit.
The score is only proportional to the probability. It doesn’t equal the probability, because we
haven’t included the denominator. However, both formulas have the same denominator, so we
don’t need to calculate it in order to know which quantity is bigger.
Notice, though, how small in magnitude these scores are. When we are looking at problems
with a large number of attributes, or attributes with a very high number of levels, these values
can become very small in magnitude.
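Using the helper sketched earlier, the slide's arithmetic can be reproduced directly; the probabilities are the ones tabulated on the slide:

naive_bayes_score(0.7, c(0.28, 0.75, 0.14, 0.06))   # score for good: ~0.0012
naive_bayes_score(0.3, c(0.36, 0.62, 0.17, 0.02))   # score for bad:  ~0.0002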
Naïve Bayesian Implementation Considerations
• Numerical underflow
Resulting from multiplying several probabilities near zero
Preventable by computing the logarithm of the products
• Zero probabilities due to unobserved attribute/classifier pairs
Resulting from rare events
Handled by smoothing (adjusting each probability by a small amount)
• Assign the classifier label, Ci, that maximizes the value of
Σ(j=1..m) log P'(aj | Ci) + log P'(Ci)
where i = 1,2,…,n and
P’ denotes the adjusted probabilities
Multiplying several probability values, each possibly close to zero, invariably leads to the
problem of numerical underflow. So an important implementation guideline is to compute the
logarithm of the product of the probabilities, which is equivalent to the summation of the
logarithm of the probabilities. Although the risk of underflow increases as the number of
attributes increases, the use of logarithms should be applied regardless of the number of
attribute dimensions.
Additionally, to address the possibility of probabilities equal to zero, smoothing techniques can
be employed to adjust the probabilities to ensure non-zero values. Applying a smoothing
technique assigns a small non-zero probability to rare events not included in the training
dataset. Also, the smoothing addresses the possibility of taking the logarithm of zero.
The R implementation of Naïve Bayes incorporates the smoothing directly into the probability
tables. Essentially, the Laplace smoothing that R uses adds one (or a small value) to every
count. For example, if we have 100 "good" customers, and 20 of them rent their housing, the
"raw" P(rent | good) = 20/100 = 0.2; with Laplace smoothing adding one to the counts,
the calculation becomes P(rent | good) ≈ (20 + 1)/(100 + 3) ≈ 0.204, where there are 3
possible values for housing (own, rent, for free).
Fortunately, the use of the logarithms and the smoothing techniques are already implemented
in standard software packages for Naïve Bayes Classifiers. However, if for performance
reasons, the Naïve Bayes Classifier algorithm needs to be coded directly into an application,
these considerations should be implemented.
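As a sketch of both considerations in R (toy counts of our own, not the course data):

# Laplace smoothing: add 1 to each count; the denominator grows by the
# number of attribute levels so the smoothed probabilities still sum to 1
laplace_prob <- function(count, total, n_levels) (count + 1) / (total + n_levels)
laplace_prob(20, 100, 3)      # P(rent | good) ~ 21/103 = 0.204

# Log score: sum of log conditional probabilities plus log prior
log_score <- function(prior, cond_probs) sum(log(cond_probs)) + log(prior)

# In the e1071 package, smoothing is a single argument:
# library(e1071)
# model <- naiveBayes(credit ~ ., data = training, laplace = 1)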
Diagnostics
• Hold-out data
How well does the model classify new instances?
• Cross-validation
• ROC curve/AUC
The diagnostics we used in regression can be used to validate the effectiveness of the model
we built. The techniques of using hold-out data, performing N-fold cross-validation, and using
ROC curves with the Area Under the Curve (AUC) can be applied to the Naïve Bayesian Classifier
as well.
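A minimal hold-out sketch in R, assuming a hypothetical data frame credit_data with a credit label column (the ROCR package is one common way to obtain the ROC curve and AUC):

set.seed(42)
train_idx <- sample(nrow(credit_data), size = 0.8 * nrow(credit_data))  # 80/20 split
train <- credit_data[train_idx, ]
test  <- credit_data[-train_idx, ]

library(e1071)
model <- naiveBayes(credit ~ ., data = train)
probs <- predict(model, test, type = "raw")       # class probability scores

library(ROCR)
pred <- prediction(probs[, "good"], test$credit == "good")
performance(pred, "auc")@y.values[[1]]            # area under the ROC curve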
Diagnostics: Confusion Matrix
                      Prediction
                      good     bad
Actual Class   good    671      29      700
               bad      38     262      300
                       709     291     1000

true positives (TP) = 671     false negatives (FN) = 29
false positives (FP) = 38     true negatives (TN) = 262

Overall success rate (or accuracy):
(TP + TN) / (TP + TN + FP + FN) = (671 + 262)/1000 ≈ 0.93
TPR: TP / (TP + FN) = 671 / (671 + 29) = 671/700 ≈ 0.96
FPR: FP / (FP + TN) = 38 / (38 + 262) = 38/300 ≈ 0.13
FNR: FN / (TP + FN) = 29 / (671 + 29) = 29/700 ≈ 0.04
Precision: TP / (TP + FP) = 671/709 ≈ 0.95
Recall (or TPR): TP / (TP + FN) ≈ 0.96
A confusion matrix is a specific table layout that allows visualization of the performance of a model. In
the hypothetical example of a confusion matrix shown:
Of 1000 credit samples, the model predicted 709 as good credit and 291 as bad. Of the 700 actual good
credits, the model predicted 29 as bad; similarly, 38 of the 300 actual bad credits were predicted as
good. All correct guesses are located on the diagonal of the table, so it is easy to visually inspect the
table for errors: they are represented by any non-zero values outside the diagonal.
We define the overall success rate (or accuracy) as a metric describing what we got right: the ratio of
the sum of the diagonal values (i.e., TP and TN) to the sum of the whole table. In other words, the
confusion matrix of a good model has large numbers on the diagonal and small (ideally zero) numbers
off the diagonal.
We saw a true positive rate (TPR) and a false positive rate (FPR) when we discussed ROC curves:
• TPR – what percent of positive instances did we correctly identify.
• FPR – what percent of negatives we marked positive.
Additionally, we can measure the false negative rate (FNR):
• FNR – what percent of positives we marked negative
The computation of TPR, FPR and FNR are shown in the slide.
Precision and Recall are accuracy metrics used by the information retrieval community; they are often
used to characterize classifiers as well. We will detail these metrics in lesson 8 of this module.
Note:
• precision – what percent of things we marked positive really are positive
• recall – what percent of positive instances did we correctly identify. Recall is equivalent to TPR.
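The slide's metrics, computed directly from the four cell counts (a sketch in R):

TP <- 671; FN <- 29; FP <- 38; TN <- 262

(TP + TN) / (TP + TN + FP + FN)   # accuracy ~0.93
TP / (TP + FN)                    # TPR (recall) ~0.96
FP / (FP + TN)                    # FPR ~0.13
FN / (TP + FN)                    # FNR ~0.04
TP / (TP + FP)                    # precision ~0.95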
Naïve Bayesian Classifier – Reasons to Choose (+)
and Cautions (-)
Reasons to Choose (+)
Handles missing values quite well
Robust to irrelevant variables
Easy to implement
Easy to score data
Resistant to over-fitting
Computationally efficient
Handles very high dimensional problems
Handles categorical variables with a lot of levels

Cautions (-)
Numeric variables have to be discretized (categorized) into intervals
Sensitive to correlated variables ("double-counting")
Not good for estimating probabilities; stick to the class label or yes/no
The Reasons to Choose (+) and Cautions (-) of the Naïve Bayesian Classifier are listed. Unlike
Logistic regression, missing values are handled well by the Naïve Bayesian Classifier. It is also
very robust to irrelevant variables (irrelevant variables are distributed among all the classes and
their effects are not pronounced).
The model is easy to implement, and we will see how easily a basic version can be implemented
in the lab without using any packages. Scoring data (predicting) is very simple, and the model is
resistant to over-fitting. (Over-fitting refers to fitting the training data so well that the model
captures its idiosyncrasies, noise that is not relevant to characterizing the underlying population.)
It is computationally efficient and handles high dimensional problems efficiently. Unlike logistic
regression, the Naïve Bayesian Classifier handles categorical variables with a lot of levels.
The Cautions (-) are that it is sensitive to correlated variables, as the algorithm double counts
the effect of the correlated variables. For example, people with low income tend to default, and
people with low credit tend to default. It is also true that people with low income tend to have
low credit. If we try to score "default" with both low income and low credit as variables, we will
see the double counting effect in our model output and in the scoring.
Though probabilities are provided as an output of the scored data, the Naïve Bayesian Classifier
is not very reliable for probability estimation and should be used for class label assignments
only. The Naïve Bayesian Classifier in its simple form is used only with categorical variables, and
any continuous variables should be discretized into intervals. You will learn more about this
in the lab. However, discretization is not strictly necessary: several standard implementations
can handle continuous variables as well.
Check Your Knowledge
1. Consider the following Training Data Set:
Apply the Naïve Bayesian Classifier to this data set and compute the probability score for P(y = 1|X) for X = (1,0,0)
Show your work
Your Thoughts?
Training Data Set
X1   X2   X3   Y
1    1    1    0
1    1    0    0
0    0    0    0
0    1    0    1
1    0    1    1
0    1    1    1
2. List some prominent use cases of the Naïve Bayesian Classifier.
3. What gives the Naïve Bayesian Classifier the advantage of being
computationally inexpensive?
4. Why should we use log-likelihoods rather than pure probability
values in the Naïve Bayesian Classifier?
Record your answers here. More Check Your Knowledge questions are on the next page.
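One way to check your work on question 1 is the following R sketch, which computes the unnormalized Naïve Bayes scores for both classes from the six training rows above:

train <- data.frame(
  X1 = c(1, 1, 0, 0, 1, 0),
  X2 = c(1, 1, 0, 1, 0, 1),
  X3 = c(1, 0, 0, 0, 1, 1),
  Y  = c(0, 0, 0, 1, 1, 1)
)
x_new <- c(X1 = 1, X2 = 0, X3 = 0)

score <- function(y) {
  rows  <- train[train$Y == y, ]
  prior <- nrow(rows) / nrow(train)                 # P(Y = y)
  conds <- sapply(names(x_new), function(v) mean(rows[[v]] == x_new[[v]]))
  prior * prod(conds)                               # P(Y = y) * prod_j P(xj | Y = y)
}
score(1)                           # unnormalized score for P(y = 1 | X)
score(1) / (score(0) + score(1))   # normalized probability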
Check Your Knowledge (Continued)
5. What is a confusion matrix and how is it used to evaluate the effectiveness of the model?
Your Thoughts?
6. Consider the following data set with two input features, temperature and season:
• What is the Naïve Bayesian assumption?
• Is the Naïve Bayesian assumption satisfied for this problem?
Temperature    Season   Electricity Usage
-10 to 50 F    Winter   High
50 to 70 F     Winter   Low
70 to 85 F     Summer   Low
85 to 110 F    Summer   High
Record your answers here.
Advanced Analytics – Theory and Methods
Naïve Bayesian Classifiers – Summary
During this lesson the following topics were covered:
• Naïve Bayesian Classifier
• Theoretical foundations of the classifier
• Use cases
• Evaluating the effectiveness of the classifier
• The Reasons to Choose (+) and Cautions (-) with the use of
the classifier
This lesson covered these topics. Please take a moment to review them.
Advanced Analytics – Theory and Methods
Decision Trees
During this lesson the following topics are covered:
• Overview of Decision Tree classifier
• General algorithm for Decision Trees
• Decision Tree use cases
• Entropy, Information gain
• Reasons to Choose (+) and Cautions (-) of Decision Tree
classifier
• Classifier methods and conditions in which they are best
suited
The topics covered in this lesson are listed.
Decision Tree Classifier – What is it?
• Used for classification:
Returns probability scores of class membership
Well-calibrated, like logistic regression
Assigns label based on highest scoring class
Some Decision Tree algorithms return simply the most likely class
Regression Trees: a variation for regression
Returns average value at every node
Predictions can be discontinuous at the decision boundaries
• Input variables can be continuous or discrete
• Output:
A tree that describes the decision flow.
Leaf nodes return either a probability score, or simply a classification.
Trees can be converted to a set of "decision rules"
"IF income < $50,000 AND mortgage_amt > $100K THEN default=T with
75% probability"
Decision Trees are a flexible method very commonly deployed in data mining applications. In
this lesson we will focus on Decision Trees used for classification problems.
There are two types of trees: Classification Trees and Regression (or Prediction) Trees.
• Classification Trees – are used to segment observations into more homogeneous groups
(assign class labels). They usually apply to outcomes that are binary or categorical in nature.
• Regression Trees – are a variation for regression; each leaf node returns the average value of
the outcome for the observations that reach that node (a type of step function). Regression
trees can be applied to outcomes that are continuous (like account spend or personal income).
The input values can be continuous or discrete. Decision Tree models output a tree that
describes the decision flow. The leaf nodes return class labels and in some implementations
they also return the probability scores. In theory the tree can be converted into decision rules
such as the example shown in the slide.
Decision Trees are a popular method because they can be applied to a variety of situations. The
rules of classification are very straightforward and the results can easily be presented visually.
Additionally, because the end result is a series of logical “if-then” statements, there is no
underlying assumption of a linear (or non-linear) relationship between the predictor variables
and the dependent variable.
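In R, one common implementation is the rpart package. A minimal sketch, again assuming a hypothetical credit_data frame with a credit label column:

library(rpart)
fit <- rpart(credit ~ savings + housing + personal + job,
             data = credit_data, method = "class")
print(fit)                                        # the tree as a set of splits
predict(fit, credit_data[1:5, ], type = "prob")   # class probability scores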
Decision Tree – Example of Visual Structure
[Figure: example decision tree structure. The root internal node tests Gender (branches: Female / Male); lower internal nodes test Income (e.g., against 45,000) and Age (e.g., against 40); branches represent the outcome of each test, and the leaf nodes carry the class labels (Yes / No). Callouts: Branch – outcome of test; Internal Node – decision on variable; Leaf Node – class label.]
Decision Trees are typically depicted in a flow-chart like manner.
Branches refer to the outcome of a decision and are represented by the connecting lines here.
When the decision is numerical, the “greater than” branch is usually shown on the right and
“less than” on the left.
Depending on the nature of the variable, you may need to include an “equal to” component on
one branch.
Internal Nodes are the decision or test points. Each refers to a single variable or attribute.
In the example here the outcomes are binary, although there could be more than 2 branches
stemming from an internal node.
For example, if the variable was categorical and had 3 choices, you might need a branch for
each choice.
The Leaf Nodes are at the end of the last branch on the tree. These represent the outcome of
all the prior decisions. The leaf nodes are the class labels, or the segment in which all
observations that follow the path to the leaf would be placed.
Decision Tree Classifier – Use Cases
• When a series of questions (yes/no) is answered to arrive at a
classification
Biological species classification
Checklist of symptoms during a doctor’s evaluation of a patient
• When “if-then” conditions are preferred to linear models.
Customer segmentation to predict response rates
Financial decisions such as loan approval
Fraud detection
• Short Decision Trees are the most popular “weak learner” in
ensemble learning techniques
An example of Decision Trees in practice is the method for classifying biological species: a
series of yes/no questions is answered to arrive at a classification.
Another example is a checklist of symptoms during a doctor’s evaluation of a patient. People
mentally perform these types of analysis frequently when assessing a situation.
Other use cases can be customer segmentation to better predict response rates to marketing
and promotions. Computers can be “taught” to evaluate a series of criteria and automatically
approve or deny an application for a loan. In the case of loan approval, computers can use the
logical “if-then” statements to predict whether the customer will default on the loan. For
customers with a clear (strong) outcome, no human interaction is required; for observations
that may not generate a clear response, a human is needed for the decision.
Short Decision Trees (where we have limited the number of splits) are often used as
components (called “weak learners” or “base learners”) in ensemble techniques (a set of
predictive models which all vote, and we take decisions based on the combination of the
votes) such as random forests, bagging, and boosting (beyond the scope of this class). The
very simplest of the short trees are decision stumps: Decision Trees with one internal node (the
root) which is immediately connected to the terminal nodes. A decision stump makes a
prediction based on the value of just a single input feature.
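As a sketch, a decision stump can be grown in rpart by capping the tree at a single split (maxdepth is an rpart.control parameter; credit_data is our hypothetical frame from earlier):

library(rpart)
stump <- rpart(credit ~ ., data = credit_data, method = "class",
               control = rpart.control(maxdepth = 1))  # one internal node: a decision stump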
Example: The Credit Prediction Problem
[Figure: Decision Tree for the credit prediction example. Root node: 700/1000, p(good) = 0.70, split on savings. The branch for savings of 500 to 1000, >= 1000, or no known savings reaches a node with 245/294, p(good) = 0.83. The other branch splits on housing (free/rent vs. own); the housing = own node, 349/501, p(good) = 0.70, splits on personal status (female or male div/sep vs. male mar/wid or male single), reaching leaves with 36/88, p(good) = 0.41 and 70/117, p(good) = 0.60.]
We will use the same example we used in the previous lesson with Naïve Bayesian classifier.
We start at the top of the tree: the probability of good credit is 70% (700 out of 1000 people have good
credit). The process has decided to split on how much is in the savings account, forming two groups.
One group has savings less than $100 or between $100 and $500.
The second group is the rest of the population, with savings of $500 to $1000, greater than $1000, or no
known savings.
We compute the probability of good credit at the second node: in the second savings category, 245 out of
294 have good credit, and the probability at this node is 83%.
Looking at the other node (Savings