Question 1
Give the definitions of
Training set and Test set:
Also, explain the function of each one:
Question 2
For the tree given below:
a) Estimate the training error (optimistic error) for the parent: (2 pts)
b) Estimate the training error (optimistic error) for the children: (8 pts)
c) Estimate the pessimistic error for the parent: (2 pts)
d) Estimate the pessimistic error for the children: (8 pts)
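Note: the two estimates in a) through d) can be sketched as below. This is a minimal illustration, not the exam's worked solution. The leaf counts are hypothetical (the exam's tree is not reproduced here), and the penalty of 0.5 per leaf is an assumed, commonly used textbook default; the function names are illustrative only.

```python
def optimistic_error(leaf_errors, total_records):
    """Training error: misclassified training records divided by all records."""
    return sum(leaf_errors) / total_records


def pessimistic_error(leaf_errors, total_records, penalty=0.5):
    """Training errors plus an assumed per-leaf penalty, divided by all records."""
    return (sum(leaf_errors) + penalty * len(leaf_errors)) / total_records


# Hypothetical counts, NOT taken from the exam's tree:
# kept as a single leaf, the parent misclassifies 10 of 30 training records;
# after the split, its 4 child leaves misclassify 3, 2, 2 and 2 records.
total_records = 30
parent_leaf_errors = [10]
child_leaf_errors = [3, 2, 2, 2]

parent_opt = optimistic_error(parent_leaf_errors, total_records)
children_opt = optimistic_error(child_leaf_errors, total_records)
parent_pess = pessimistic_error(parent_leaf_errors, total_records)
children_pess = pessimistic_error(child_leaf_errors, total_records)

print(parent_opt, children_opt, parent_pess, children_pess)
```

With these made-up counts the split lowers the optimistic error but raises the pessimistic one, because the penalty grows with the number of leaves.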
Question 3
Using the tables given, calculate the following, and answer part e) with a reason:
a. Accuracy of M1 (as a percentage, e.g. 65%; include the % sign) [4 pts]
b. Accuracy of M2 (as a percentage, e.g. 65%; include the % sign) [4 pts]
c. Cost (M1) [6 pts]
d. Cost (M2) [6 pts]
e. Which model would you choose based on your calculations, and why? (Assume that if the cost difference is less
than 10%, you can choose the model with higher accuracy; if it is more than 20%, you can choose the model with
lower accuracy and lower cost.) [5 pts]
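Note: accuracy and cost can both be read off a confusion matrix, as in the sketch below. The matrices here are hypothetical stand-ins, since the exam's tables are not reproduced in this text. Rows are assumed to be actual classes and columns predicted classes, and the cost matrix is assumed to use the same layout.

```python
def accuracy(confusion):
    """Fraction of records on the diagonal (correct predictions)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def total_cost(confusion, cost):
    """Sum of count * cost over every (actual, predicted) cell."""
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


# Hypothetical cost matrix: rows = actual (+, -), columns = predicted (+, -).
cost_matrix = [[-1, 100],
               [1, 0]]

# Hypothetical confusion matrices for two models, same layout.
m1 = [[150, 40],
      [60, 250]]
m2 = [[250, 45],
      [5, 200]]

acc_m1, cost_m1 = accuracy(m1), total_cost(m1, cost_matrix)
acc_m2, cost_m2 = accuracy(m2), total_cost(m2, cost_matrix)

print(f"M1: accuracy = {acc_m1:.0%}, cost = {cost_m1}")
print(f"M2: accuracy = {acc_m2:.0%}, cost = {cost_m2}")
```

In this made-up case the more accurate model also has the higher cost, which is the kind of trade-off part e) asks about.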
Question 4
The method that predicts the value of a given continuous-valued variable based on the values of
other variables, assuming a linear or nonlinear model of dependency, is:
a. Correlation
b. Cohesion
c. Regression
d. Separation
Question 5
Assume two attributes have a correlation of 0.02. What does this tell you about the
relationship between the two attributes? Answer the same question assuming the correlation is -0.98.
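Note: assuming "correlation" here means the Pearson correlation coefficient, the sketch below shows how values near 0 and near -1 arise. The data are made up for illustration, and `pearson` is a hypothetical helper, not part of the exam.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y_unrelated = [2, 5, 1, 5, 2]        # no linear trend as x grows
y_decreasing = [10, 8.1, 5.9, 4.2, 2]  # falls almost linearly as x grows

r_weak = pearson(x, y_unrelated)
r_strong_negative = pearson(x, y_decreasing)

print(r_weak, r_strong_negative)
```

The sign gives the direction of the linear relationship and the magnitude gives its strength; a value near 0 rules out only a linear relationship, not every kind of dependence.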
Question 6
In your own words, discuss why Data Analysis and Data Mining are becoming more
and more popular (almost to the point of being a requirement for any mid-size or large
business). Give at least 3 reasons and explain them (please number your 3
reasons):
Question 7
What is overfitting? Why is it so problematic for Decision Tree Induction? How can overfitting be
addressed?
Question 8
Given two classification models:
– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances
What test would help to find which model is better?
a. Test of Reliability
b. Test of Accuracy
c. Test of Model Fitness
d. Test of Significance
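Note: one way to check whether a difference in accuracy between two models could be due to chance is a confidence interval on the difference of their error rates. The sketch below assumes independent test sets, a normal approximation (rough for a test set of only 30 instances), and a 95% confidence level (z = 1.96); the function name is hypothetical.

```python
import math


def error_difference_interval(acc1, n1, acc2, n2, z=1.96):
    """Approximate confidence interval for the difference in error rates."""
    e1, e2 = 1 - acc1, 1 - acc2
    diff = abs(e1 - e2)
    # Variance of the difference, treating each error rate as a binomial proportion.
    variance = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    margin = z * math.sqrt(variance)
    return diff - margin, diff + margin


# Figures from the question: M1 is 85% accurate on 30 instances,
# M2 is 75% accurate on 5000 instances.
low, high = error_difference_interval(0.85, 30, 0.75, 5000)

# If the interval contains 0, the observed difference may be due to chance.
difference_is_significant = not (low <= 0 <= high)

print(f"interval = [{low:.3f}, {high:.3f}]")
print("significant:", difference_is_significant)
```

Under these assumptions the interval is wide mainly because M1 was tested on so few instances.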