After completing the reading this week, answer the following questions:
Chapter 2:
- What is an attribute, and why is it important?
- What are the different types of attributes?
- What is the difference between discrete and continuous data?
- Why is data quality important?
- What occurs in data preprocessing?
- In Section 2.4, review the measures of similarity and dissimilarity, select one topic, and note the key factors.
©Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
Dr. Oner Celepcikay
ITS 632
Data Mining
Summer 2019
Week 2: Data & Data Exploration
Chapter 3 Exploring Data
1st Step of Machine Learning
What is data exploration?
A preliminary exploration of the data to
better understand its characteristics.
! Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans’ abilities to recognize patterns
u People can recognize patterns not captured by data analysis
tools
! Related to the area of Exploratory Data Analysis (EDA)
– Created by statistician John Tukey
– Seminal book is Exploratory Data Analysis by Tukey
– A nice online introduction can be found in Chapter 1 of the NIST
Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
Techniques Used In Data Exploration
! In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as
exploratory techniques
– In data mining, clustering and anomaly detection are
major areas of interest, and not thought of as just
exploratory
! In our discussion of data exploration, we focus on
– Summary statistics
– Visualization
– Online Analytical Processing (OLAP)
Summary Statistics
! Summary statistics are numbers that summarize
properties of the data
– Summarized properties include frequency, location and
spread
u Examples: location – mean
spread – standard deviation
– Most summary statistics can be calculated in a single
pass through the data
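The single-pass claim above can be sketched for the mean and standard deviation. This is a minimal illustration using Welford's update, which is one common way to do it and is not taken from the slides; the function name running_summary and the sample values are made up for the example.

```python
import math

def running_summary(values):
    """Return (count, mean, sample standard deviation) in one pass (Welford's update)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std

# Made-up values, read once from start to finish
count, mean, std = running_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Each value is looked at once and then discarded, so the data never has to be stored or revisited.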
Frequency and Mode
! The frequency of an attribute value is the
percentage of times the value occurs in the
data set
– For example, given the attribute ‘gender’ and a
representative population of people, the gender
‘female’ occurs about 50% of the time.
! The mode of an attribute is the most frequent
attribute value
! The notions of frequency and mode are typically
used with categorical data
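As a small sketch of frequency and mode for a categorical attribute, the following uses Python's collections.Counter. The attribute eye_color and its values are invented for illustration, and the helper names are assumptions, not part of the course material.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to the fraction of objects that have it."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}

def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical categorical attribute
eye_color = ["brown", "blue", "brown", "green", "brown", "blue"]
freq = frequencies(eye_color)
most_frequent = mode(eye_color)
```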
Measures of Location: Mean and Median
! The mean is the most common measure of the
location of a set of points.
! However, the mean is very sensitive to outliers.
! Thus, the median or a trimmed mean is also
commonly used.
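To make the sensitivity to outliers concrete, here is a minimal sketch comparing the mean, the median, and a trimmed mean on made-up data with one extreme value. The trimmed_mean helper is hypothetical; it drops the same proportion of values from each end before averaging.

```python
from statistics import mean, median

def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return mean(kept)

# Hypothetical data: nine typical values plus one extreme outlier
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
plain = mean(data)                  # pulled far upward by the outlier
middle = median(data)               # barely affected
trimmed = trimmed_mean(data, 0.10)  # outlier removed before averaging
```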
Measures of Spread: Range and Variance
! Range is the difference between the max and min
! The variance or standard deviation is the most
common measure of the spread of a set of points.
! However, these are also sensitive to outliers, so
other measures are often used.
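The slide does not name the other measures. As an assumption, the sketch below uses two common robust choices, the interquartile range (IQR) and the median absolute deviation (MAD), and compares them with the range and the sample variance on made-up data. The percentile helper uses linear interpolation, which is only one of several conventions.

```python
from statistics import median, stdev, variance

def percentile(ordered, p):
    """Linear-interpolation percentile (0 <= p <= 100) of an already sorted list."""
    rank = (len(ordered) - 1) * p / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def spread_summary(values):
    """Collect several measures of spread for one numeric attribute."""
    ordered = sorted(values)
    med = median(ordered)
    return {
        "range": ordered[-1] - ordered[0],
        "variance": variance(ordered),  # sample variance
        "std": stdev(ordered),          # sample standard deviation
        "iqr": percentile(ordered, 75) - percentile(ordered, 25),
        "mad": median([abs(x - med) for x in ordered]),
    }

# Hypothetical data: the second list replaces the largest value with an outlier
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
noisy = [1, 2, 3, 4, 5, 6, 7, 8, 100]
clean_spread = spread_summary(clean)
noisy_spread = spread_summary(noisy)
```

On this data the range and variance change sharply when the outlier is introduced, while the IQR and MAD stay the same.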
Visualization
Visualization is the conversion of data into a visual
or tabular format so that the characteristics of the
data and the relationships among data items or
attributes can be analyzed or reported.
! Visualization of data is one of the most powerful
and appealing techniques for data exploration.
– Humans have a well-developed ability to analyze large
amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature
! The following shows the Sea Surface
Temperature (SST) for July 1982
– Tens of thousands of data points are summarized in a
single figure
Representation
! Is the mapping of information to a visual format
! Data objects, their attributes, and the relationships
among data objects are translated into graphical
elements such as points, lines, shapes, and
colors.
! Example:
– Objects are often represented as points
– Their attribute values can be represented as the
position of the points or the characteristics of the
points, e.g., color, size, and shape
– If position is used, then the relationships of points, i.e.,
whether they form groups or a point is an outlier, are
easily perceived.
One Great Example
! The Power of Visualization by Hans Rosling
https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen?language=en
Arrangement
! Is the placement of visual elements within a
display
! Can make a large difference in how easy it is to
understand the data
! Example:
Selection
! Is the elimination or the de-emphasis of certain
objects and attributes
! Selection may involve choosing a subset of
attributes
– Dimensionality reduction is often used to reduce the
number of dimensions to two or three
– Alternatively, pairs of attributes can be considered
! Selection may also involve choosing a subset of
objects
– A region of the screen can only show so many points
– Can sample, but want to preserve points in sparse
areas
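One way to sample while preserving points in sparse areas is to thin the data per grid cell: crowded cells are sampled down, and cells with few points are kept whole. This is a sketch of one possible approach, not the method used in the textbook; grid_thin, cell_size, and max_per_cell are hypothetical names, and the points are generated for illustration.

```python
import random

def grid_thin(points, cell_size, max_per_cell, seed=0):
    """Keep at most max_per_cell points in each square grid cell of side cell_size."""
    rng = random.Random(seed)
    cells = {}
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, []).append((x, y))
    kept = []
    for members in cells.values():
        if len(members) > max_per_cell:
            members = rng.sample(members, max_per_cell)  # thin only crowded cells
        kept.extend(members)
    return kept

# Hypothetical data: a dense cluster in one cell plus two isolated points
gen = random.Random(1)
dense = [(gen.random(), gen.random()) for _ in range(100)]
isolated = [(5.5, 5.5), (9.2, 0.3)]
kept = grid_thin(dense + isolated, cell_size=1.0, max_per_cell=10)
```

Plain uniform sampling at the same rate would probably drop the isolated points, which are often the most interesting ones to see.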
Visualization Techniques: Histograms
! Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of
objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
! Example: Petal Width (10 and 20 bins, respectively)
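The binning step can be sketched without a plotting library. The following assumes equal-width bins and counts the objects in each; the petal width values are invented for illustration and are not the Iris measurements.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bins would have zero width")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value is placed in the last bin rather than a bin of its own
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical petal widths in centimeters
petal_width = [0.2, 0.2, 0.4, 0.3, 1.3, 1.5, 1.4, 1.0, 2.5, 2.1, 1.8, 2.3]
coarse = histogram(petal_width, 3)
fine = histogram(petal_width, 6)
```

Comparing coarse and fine shows how the same data gives a different shape when the number of bins changes.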
Two-Dimensional Histograms
! Show the joint distribution of the values of two
attributes
! Example: petal width and petal length
– What does this tell us?
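A two-dimensional histogram applies the same binning to two attributes at once and counts objects per cell. The sketch below assumes equal-width bins on both axes; the function names are hypothetical and the data is made up so that the two attributes rise together.

```python
from collections import Counter

def bin_index(x, lo, hi, num_bins):
    """Index of the equal-width bin that x falls into (maximum goes in the last bin)."""
    width = (hi - lo) / num_bins
    return min(int((x - lo) / width), num_bins - 1)

def histogram_2d(xs, ys, num_bins):
    """Count objects in each (x bin, y bin) cell of a num_bins by num_bins grid."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    cells = Counter()
    for x, y in zip(xs, ys):
        cell = (bin_index(x, x_lo, x_hi, num_bins),
                bin_index(y, y_lo, y_hi, num_bins))
        cells[cell] += 1
    return cells

# Hypothetical data in which both attributes increase together
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
cells = histogram_2d(xs, ys, 2)
```

When two attributes are strongly related, as in this made-up data, the counts concentrate in the cells along the diagonal and the off-diagonal cells stay empty.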
Visualization Techniques: Box Plots
! Box Plots
– Invented by J. Tukey
– Another way of displaying the distribution of data
– Following figure shows the basic part of a box plot
[Figure: box plot labeled, from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
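The numbers behind a box plot can be computed directly. This sketch assumes the whiskers sit at the 10th and 90th percentiles and treats anything beyond them as a candidate outlier; many plotting tools instead use 1.5 times the interquartile range, so this is only one convention. The percentile helper uses linear interpolation and the data is made up.

```python
def percentile(ordered, p):
    """Linear-interpolation percentile (0 <= p <= 100) of an already sorted list."""
    rank = (len(ordered) - 1) * p / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def box_plot_summary(values):
    """Return the five percentiles a box plot draws, plus points beyond the whiskers."""
    ordered = sorted(values)
    summary = {p: percentile(ordered, p) for p in (10, 25, 50, 75, 90)}
    outliers = [x for x in ordered if x < summary[10] or x > summary[90]]
    return summary, outliers

# Hypothetical attribute values: the integers 1 through 21
summary, outliers = box_plot_summary(list(range(1, 22)))
```

The box spans summary[25] to summary[75], the line inside the box is summary[50], and the whiskers end at summary[10] and summary[90].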
Example of Box Plots
! Box plots can be used to compare attributes
Visualization Techniques: Scatter Plots
! Scatter plots
– Attribute values determine the position
– Two-dimensional scatter plots most common, but can
have three-dimensional scatter plots
– Often additional attributes can be displayed by using
the size, shape, and color of the markers that
represent the objects
– Arrays of scatter plots are useful because they can
compactly summarize the relationships of several pairs
of attributes
u See example on the next slide
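To see how many panels an array of scatter plots needs, the attribute pairs can be enumerated. This sketch only lists the pairs and draws nothing; the function name scatter_panels is hypothetical, and the attribute names are the four Iris attributes used in this deck.

```python
from itertools import combinations

attributes = ["sepal length", "sepal width", "petal length", "petal width"]

def scatter_panels(names):
    """Return the unordered attribute pairs, one per distinct scatter plot panel."""
    return list(combinations(names, 2))

panels = scatter_panels(attributes)
```

With four attributes there are six distinct pairs. A full four-by-four array shows each pair twice, once with the axes swapped.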
Iris Sample Data Set
! Many of the exploratory data techniques are illustrated
with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician Ronald A. Fisher
– Three flower types (classes):
u Setosa
u Virginica
u Versicolour
– Four (non-class) attributes
u Sepal width and length
u Petal width and length
[Image credit: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Scatter Plot Array of Iris Attributes