Question in text doc
Running Head: BASIC CONCEPTS AND TECHNIQUES 1
Basic Concepts and Techniques
Classification
Name
Institution
Course
Tutor
Date
Basic Concepts in Data Classification
Data classification is the process of organizing data into categories so that it can be used effectively. Classifying data makes it easier to locate and retrieve. It also reduces duplication of data, thereby lowering storage and backup costs. The main types of data classification are content-based, context-based, and user-based (Ghaddar & Naoum-Sawaya, 2018). Content-based classification inspects the data itself for sensitive information. Context-based classification searches for indirect indicators of sensitive information.
General framework for classification
Data classification consists of grouping data according to its relevance. Data is classified on the basis of the content it carries and the knowledge involved. One necessity in data classification is a data framework, which provides the structure for the classification. The framework is significant to enterprise organisations that benefit from big data.
What is a decision tree and decision tree modifier?
A decision tree is a supervised machine learning model in which the data is split according to specific parameters. A decision tree consists of nodes, edges (branches), and leaf nodes. Each internal node tests the value of a specific attribute. Each branch corresponds to an outcome of the test. Each leaf node predicts the outcome. A decision tree modifier, on the other hand, refers to the discriminator class that separates the training set such that each portion consists entirely of one class.
What is a hyperparameter?
A hyperparameter is a configuration that is external to the model and whose value cannot be estimated from the data. Hyperparameters are specified by the practitioner and are mostly used to help estimate the model parameters. They are adjustable parameters used to obtain a model with optimal performance (Wu et al., 2019).
Hyperparameter optimization is a challenge, especially when selecting a set of optimal hyperparameters. Hyperparameters are used to regulate the learning process.
Model evaluation is a significant part of the development process. It is important both for finding the model that best represents the data and for judging how well the selected model performs. Cross-validation pitfalls when selecting and assessing models include the choice of the model performance measure, the selection of variables, and reliance on a single cross-validation.
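To make the difference between a hyperparameter and a fitted parameter concrete, the following is a minimal sketch, not a method taken from the cited sources. It assumes a k-nearest-neighbour classifier (chosen only because its k is an obvious hyperparameter), a hypothetical one-dimensional toy dataset, and hypothetical function names. The value of k is set by the practitioner and compared across candidates with 3-fold cross-validation.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(data, k, folds=3):
    """Mean held-out accuracy of k-NN over `folds` cross-validation folds."""
    scores = []
    for i in range(folds):
        held_out = data[i::folds]  # records i, i + folds, i + 2 * folds, ...
        train = [p for j, p in enumerate(data) if j % folds != i]
        hits = sum(knn_predict(train, x, k) == y for x, y in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / folds


# Hypothetical toy data: (feature value, class label).
DATA = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "a"), (6, "a"),
        (7, "b"), (8, "b"), (9, "a"), (10, "b"), (11, "b"), (12, "b")]

# k is a hyperparameter: the practitioner chooses it; it is not fitted from the data.
candidates = [1, 3, 5]
scores = {k: cv_accuracy(DATA, k) for k in candidates}
best_k = max(candidates, key=scores.get)
print(scores, best_k)
```

Because the same cross-validation scores are used here both to choose k and to report performance, the reported accuracy is likely to be optimistic. A separate test set or a nested cross-validation would give a fairer estimate.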
References
Ghaddar, B., & Naoum-Sawaya, J. (2018). High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research, 265(3), 993-1004.
Wu, J., Chen, X. Y., Zhang, H., Xiong, L. D., Lei, H., & Deng, S. H. (2019). Hyperparameter optimization for machine learning models based on Bayesian optimization. Journal of Electronic Science and Technology, 17(1), 26-40.
Data collection methods
Student’s name
Affiliation
Course
Professor
Date of submission
This study source was downloaded by 100000802314458 from CourseHero.com on 01-24-2022 19:10:04 GMT -06:00
https://www.coursehero.com/file/107950179/w4downloaddocx/
Introduction
Data collection is an essential part of our daily activities. The data collection process can
be seen as the engine that drives quality improvement in the area under investigation. For
instance, Capri’s (2015) reading on manual data collection has helped many readers understand
how data is collected. Learning how such systems operate tends to make one more interested in
the data collection process. Through data collection, we can investigate a wide range of
research questions.
What were the traditional methods of data collection in the transit system?
Several methods of data collection were traditionally used in the transit system. The
traditional methods mainly included invasive techniques, in which a piezo-sensor or a magnet
served as a local data collection point. Human surveyors also collected data, either in the
terrain or from video (Lai et al., 2020). This was mainly done through direct and indirect
personal interviews, usually supported by questionnaires. Radar and other simple techniques,
as well as image analysis through machine vision, were also used. The traditional methods were
used primarily because they offered a workable platform for collecting data.
Why are the traditional methods insufficient in satisfying the requirement of data
collection?
The traditional methods of data collection were less effective because they had several
limitations. The data collected often fell short of the required standards, because those
standards were given little consideration. This reduced the comparability of the data. The
context of data collection was also given little consideration; for instance, those who
collected the data were not much valued. The process was complex and involved many risks and
dangers (Baumfeld Andre et al., 2020). These barriers made it difficult to attain the required
data quality. Finally, the data collected through the traditional methods fell short of the
standards because collectors lacked training. These challenges made it difficult for the
conventional approaches to satisfy the requirements of data collection.
Give a synopsis of the Capri (2015) case study and your thoughts regarding the
requirements of the optimization and performance measurement requirements and the
impact to expensive and labor-intensive nature
After reading the chapter by Capri (2015) on manual data collection, I noticed a clear
breakdown of the subject. The chapter is mainly about equipping researchers with the best
techniques to apply in the data collection process. It has chiefly helped the engineering
sector become more conversant with data collection methods. In my view, anyone who takes the
time to read this chapter will be better equipped with the best data collection methods. With
that knowledge, the optimization and performance measurement requirements can be emphasized,
and the expensive and labor-intensive nature of manual data collection can be managed more
effectively.
The Capri (2015) chapter has helped many readers minimize the expense they would otherwise
incur in data collection.
The chapter has also taught us how to work through the data collection process without
experiencing many challenges. It clearly outlines the labor-intensive nature of the work, which
is an improvement overall. The techniques and alternatives it offers are likely to impact the
engineering sector as a whole, as evidenced by better optimization and improved performance
against the requirements.
In conclusion, the engineering department has improved in general because there are now
guidelines and processes pertaining to its data collection methods. The situation today cannot
be compared with how it was before, when several challenges impacted the data collection
process. In the current situation, we are optimistic, as there is a clear view of where we will
be in the future.
References
Baumfeld Andre, E., Reynolds, R., Caubel, P., Azoulay, L., & Dreyer, N. A. (2020). Trial designs
using real‐world data: The changing landscape of the regulatory approval
process. Pharmacoepidemiology and Drug Safety, 29(10), 1201-1212.
Lai, X., Teng, J., & Ling, L. (2020, September). Evaluating Public Transportation Service in a
Transit Hub based on Passengers Energy Cost. In 2020 IEEE 23rd International
Conference on Intelligent Transportation Systems (ITSC) (pp. 1-7). IEEE.
Dr. Oner Celepcikay
ITS 632
Week 4
Classification
Machine Learning Methods – Classification
Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
A test set is used to estimate the accuracy of the model.
Goal: previously unseen records (test set) should be assigned a class as accurately as possible.
Machine Learning – Classification Example
[Figure: classification workflow. Labels: Training Set, Learn Classifier, Model, Test Set. Attribute types: categorical, categorical, continuous, class.]
Model: Decision Tree
Splitting Attributes: Refund, MarSt, TaxInc
  Refund?
    Yes -> NO
    No  -> MarSt?
      Married          -> NO
      Single, Divorced -> TaxInc?
        < 80K -> NO
        > 80K -> YES
Machine Learning – Classification Example
Another Example of Decision Tree
Attribute types: categorical, categorical, continuous, class.
  MarSt?
    Married          -> NO
    Single, Divorced -> Refund?
      Yes -> NO
      No  -> TaxInc?
        < 80K -> NO
        > 80K -> YES
There could be more than one tree that fits the same data!
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of tree.
  Refund?
    Yes -> NO
    No  -> MarSt?
      Married          -> NO
      Single, Divorced -> TaxInc?
        < 80K -> NO
        > 80K -> YES
Refund = No, so follow the No branch to MarSt. MarSt = Married, so follow the Married branch to the leaf NO.
Assign “Cheat” No
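The traversal can be written out as a short function. This is a minimal sketch of the Refund, MarSt, TaxInc tree from the slides, not a general decision tree implementation. The function and variable names are hypothetical, incomes are given in thousands, and an income of exactly 80K is assumed to follow the ">= 80K" branch used in the Hunt's Algorithm figure.

```python
def classify(record):
    """Walk the Refund -> MarSt -> TaxInc tree and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: exactly 80K is sent down the ">= 80K" branch.
    return "No" if record["TaxInc"] < 80 else "Yes"


# Test record from the slides: Refund = No, Married, 80K.
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))  # No

# Training set from the slides: (Refund, MarSt, TaxInc in K, Cheat).
TRAINING = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
correct = sum(
    classify({"Refund": r, "MarSt": m, "TaxInc": t}) == cheat
    for r, m, t, cheat in TRAINING
)
print(correct, "of", len(TRAINING), "training records reproduced")  # 10 of 10
```

The second check only shows that the tree fits the training records. Accuracy on the training set is not an estimate of accuracy on unseen records, which is what the test set is for.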
Machine Learning – Classification Example
[Figure: classification workflow. Labels: Learning Algorithm, Induction, Model, Deduction. Attribute types: categorical, categorical, continuous, class.]
General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t.
General Procedure:
If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
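The procedure above can be sketched recursively. The slides do not say how the attribute test is chosen, so this sketch assumes the attribute with the lowest weighted Gini index is picked, assumes categorical attributes only (Taxable Income is binarized at 80K for illustration), and uses hypothetical names throughout. The root it picks may differ from the tree in the slides, since more than one tree can fit the same data.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def hunt(records, attributes, default_class):
    """records: list of (attribute_dict, label). Returns a label or (attribute, branches)."""
    if not records:                      # Dt is empty -> leaf labeled with the default class
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:            # all records in Dt share one class -> leaf
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                   # assumption: no tests left -> majority-class leaf
        return majority

    def weighted_gini(attr):
        groups = {}
        for row, label in records:
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    best = min(attributes, key=weighted_gini)   # assumed attribute test selection
    remaining = [a for a in attributes if a != best]
    partitions = {}
    for row, label in records:
        partitions.setdefault(row[best], []).append((row, label))
    return (best, {value: hunt(subset, remaining, majority)
                   for value, subset in partitions.items()})


def predict(tree, row):
    """Follow branches until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


def make(refund, marst, income_k, cheat):
    taxinc = "<80K" if income_k < 80 else ">=80K"
    return ({"Refund": refund, "MarSt": marst, "TaxInc": taxinc}, cheat)


TRAINING = [
    make("Yes", "Single", 125, "No"), make("No", "Married", 100, "No"),
    make("No", "Single", 70, "No"), make("Yes", "Married", 120, "No"),
    make("No", "Divorced", 95, "Yes"), make("No", "Married", 60, "No"),
    make("Yes", "Divorced", 220, "No"), make("No", "Single", 85, "Yes"),
    make("No", "Married", 75, "No"), make("No", "Single", 90, "Yes"),
]
tree = hunt(TRAINING, ["Refund", "MarSt", "TaxInc"], default_class="No")
print(tree)
```

With this selection rule the root test is on MarSt rather than Refund, and the resulting tree still classifies all ten training records correctly.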
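Hunt’s Algorithm
[Figure: the tree grown step by step on the training set]
Step 1: Don’t Cheat (single leaf)
Step 2: Refund? Yes -> Don’t Cheat; No -> Don’t Cheat
Step 3: Refund? Yes -> Don’t Cheat; No -> Marital Status? (Single, Divorced -> Cheat; Married -> Don’t Cheat)
Step 4: Refund? Yes -> Don’t Cheat; No -> Marital Status? (Single, Divorced -> Taxable Income? (< 80K -> Don’t Cheat; >= 80K -> Cheat); Married -> Don’t Cheat)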
Decision Tree Application to Oil & Gas Data
British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms that replaced an earlier rule-based expert system.
We will do a similar (but simpler) decision tree example towards the end of the semester.
Tree Induction
Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
Issues
– Determine how to split the records
  – How to specify the attribute test condition?
  – How to determine the best split?
– Determine when to stop splitting
How to determine the Best Split
Before Splitting: 10 records of class 0, 10 records of class 1
Which test condition is the best?
How to determine the Best Split
Greedy approach:
– Nodes with homogeneous class distribution are preferred
Need a measure of node impurity:
– Non-homogeneous: high degree of impurity
– Homogeneous: low degree of impurity
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
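The slides work through the Gini index and the misclassification error but give no formula for entropy. As a hedged illustration, the sketch below uses the standard definition, Entropy(t) = – Σ_j p(j | t) log2 p(j | t), with empty classes contributing 0. The function name is hypothetical, and the class counts are the ones used in the Gini examples.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a node given its per-class record counts."""
    total = sum(counts)
    # p * log2(1 / p) equals -p * log2(p); classes with zero records contribute 0.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(entropy(counts), 3))
# (0, 6) 0.0
# (1, 5) 0.65
# (2, 4) 0.918
# (3, 3) 1.0
```

Like the other two measures, entropy is lowest (0) when all records belong to one class and highest when the records are spread equally over the classes.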
How to Find the Best Split
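Before Splitting: impurity M0
Split on A? (Yes -> Node N1, No -> Node N2): node impurities M1, M2; combined impurity M12
Split on B? (Yes -> Node N3, No -> Node N4): node impurities M3, M4; combined impurity M34
Gain = M0 – M12 vs M0 – M34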
Measure of Impurity: GINI
Gini Index for a given node t: $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
(NOTE: p( j | t) is the relative frequency of class j at node t).
Maximum (0.5 for two classes; 1 – 1/nc in general, where nc is the number of classes) when records are equally distributed among all classes, implying least interesting information
Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for computing GINI
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
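The worked examples can be checked with a few lines of code. This is a minimal sketch with a hypothetical function name. It takes the per-class record counts at a node and applies Gini = 1 – Σ p^2. The last call uses an assumed three-class node to show that the maximum depends on the number of classes.

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: treat an empty node as pure
    return 1.0 - sum((c / total) ** 2 for c in counts)


for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))
# (0, 6) 0.0
# (1, 5) 0.278
# (2, 4) 0.444
# (3, 3) 0.5

# Equal split over three classes: the maximum is 1 - 1/3, not 0.5.
print(round(gini((2, 2, 2)), 3))  # 0.667
```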
Examples for computing GINI
Split on A? (Yes -> Node N1, No -> Node N2)
Gini(N1) = 1 – (4/7)^2 – (3/7)^2 = 0.4898
Gini(N2) = 1 – (2/5)^2 – (3/5)^2 = 0.48
Gini(Children) = 7/12 * 0.4898 + 5/12 * 0.48 = 0.486
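The weighted Gini of a split, and the gain relative to the parent, can be computed as sketched below. The function names are hypothetical. The counts are the ones for split A: the parent has 6 records of each class, N1 has 4 and 3, and N2 has 2 and 3.

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gini(children):
    """Gini of a split: child Gini values weighted by the share of records in each child."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


parent = (6, 6)              # (C1, C2) before splitting
split_a = [(4, 3), (2, 3)]   # (C1, C2) in N1 and N2

children = split_gini(split_a)
gain = gini(parent) - children   # Gain = M0 - M12 with Gini as the impurity measure
print(round(children, 3), round(gain, 3))  # 0.486 0.014
```

The same two functions can be applied to any other candidate split. The split with the largest gain, which is the one with the lowest weighted Gini, is preferred.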
Examples for computing GINI
Split on B? (Yes -> Node N1, No -> Node N2)
Gini(N1) = 1 – ( / )^2 – ( / )^2 =
Gini(N2) = 1 – ( / )^2 – ( / )^2 =
Gini(Children) =
Splitting Criteria based on Classification Error
Classification error at a node t: $Error(t) = 1 - \max_i P(i \mid t)$
Measures the misclassification error made by a node.
Maximum (0.5 for two classes; 1 – 1/nc in general, where nc is the number of classes) when records are equally distributed among all classes, implying least interesting information
Minimum (0) when all records belong to one class, implying most interesting information
Splitting Criteria based on Classification Error
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0
P(C1) = 1/6 P(C2) = 5/6
Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6
P(C1) = 2/6 P(C2) = 4/6
Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
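The three error examples can be reproduced with exact fractions. This is a minimal sketch with a hypothetical function name. It takes the per-class record counts at a node and returns 1 minus the largest class proportion.

```python
from fractions import Fraction


def classification_error(counts):
    """Misclassification error of a node: 1 - max_i P(i | t), as an exact fraction."""
    total = sum(counts)
    return 1 - Fraction(max(counts), total)


for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, classification_error(counts))
# (0, 6) 0
# (1, 5) 1/6
# (2, 4) 1/3
```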
Tree Induction
Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
Issues
– Determine how to split the records
  – How to specify the attribute test condition?
  – How to determine the best split?
– Determine when to stop splitting (Next class!)
ANY IDEAS??
Classification Methods
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Test Data record:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Candidate test conditions (class counts per branch):
Own Car?     Yes: C0: 6, C1: 4    No: C0: 4, C1: 6
Car Type?    Family: C0: 1, C1: 3    Sports: C0: 8, C1: 0    Luxury: C0: 1, C1: 7
Student ID?  c1: C0: 1, C1: 0    …    c10: C0: 1, C1: 0    c11: C0: 0, C1: 1    …    c20: C0: 0, C1: 1

Non-homogeneous node (high degree of impurity): C0: 5, C1: 5
Homogeneous node (low degree of impurity): C0: 9, C1: 1

Class counts for finding the best split:
Before splitting: C0 = N00, C1 = N01
Node N1: C0 = N10, C1 = N11
Node N2: C0 = N20, C1 = N21
Node N3: C0 = N30, C1 = N31
Node N4: C0 = N40, C1 = N41

$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$

Gini by class counts:
C1  C2  Gini
0   6   0.000
1   5   0.278
2   4   0.444
3   3   0.500

Split A: class counts
    Parent  N1  N2
C1  6       4   2
C2  6       3   3
Gini(Parent) = 0.500, Gini(Children) = 0.486

Split B: class counts
    Parent  N1  N2
C1  6       1   5
C2  6       4   2
Gini(Parent) = 0.500, Gini(Children) = ?

$Error(t) = 1 - \max_i P(i \mid t)$
In essay format, answer the following questions after reading the chapter by Capri (2015) on manual data collection:
What were the traditional methods of data collection in the transit system?
Why are the traditional methods insufficient in satisfying the requirement of data collection?
Give a synopsis of the case study and your thoughts regarding the requirements of the optimization and performance measurement requirements and the impact to expensive and labor-intensive nature.
Answer all of the questions above in an APA 7 formatted essay, with a heading for each question. Ensure there are at least two peer-reviewed sources to support your work. The paper should be at least two pages of content (this does not include the cover page or reference page).