Please use the attached sample and Project reports.
How to Prepare the Portfolio
The portfolio report must be typewritten and should be a minimum of 3 complete pages in length. A project report that is two and a half pages (2.5 pages) is not acceptable. Large margins that have been increased to meet the length requirement are also not acceptable. If your report is not submitted according to the stated specifications, you will be asked to re-write it.
Do not submit the original project; the report is meant to capture the project highlights and significant points of the original project.
You will write a report on the project that includes:
- Introduction
- Explanation of the solution
- Description of the results
- Description of your contributions to the project
- Explanation of what new skills, techniques, or knowledge you acquired from the project. If it was a group project, also include a list of the team members who worked on it.
- A reference section with at least 4 references and citations to those references inside the text. Use the IEEE Referencing Style Sheet for citation guidelines.
Individual Contribution Report
Pradeep Peddnade
Id: 122096257
Reflection:
My overall role in the team was Data Analyst, responsible for combining theory and practice to produce and communicate data insights that enabled my team to make informed inferences regarding the data. Using skills such as data analytics and statistical modeling, my role as a data analyst was crucial in mining and gathering data. Once the data was ready, I performed exploratory analysis of the native-country, race, education, and workclass variables of the dataset.
The other role I was charged with as a data analyst in the group was to apply statistical tools to interpret the mined data, giving specific attention to the trends and patterns that would feed predictive analytics and enable the group to make informed decisions and predictions.
Another role I took on for the group was data cleansing. This involved managing the data through a procedure that ensures it is properly formatted and that irrelevant data points are removed; a minimal sketch of such a step is shown below.
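The following is only an illustrative sketch of the cleansing described above, assuming the UCI Adult dataset layout used later in this portfolio; the path and column names are taken from the project code.

import pandas as pd

# engine="python" because sep=", " is a multi-character separator
df = pd.read_csv("data/adult.data", header=None, sep=", ", engine="python")
df.columns = ["age", "workclass", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week",
              "native-country", "class"]
# Drop rows where any categorical field holds the '?' placeholder.
for col in ["workclass", "education", "marital-status", "occupation",
            "relationship", "race", "sex", "native-country"]:
    df = df[df[col] != "?"]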
Lessons Learned:
The wisdom that I would share with others regarding research design is to keep the design straightforward and aimed at answering the research question. An appropriate research design will help the group answer the research question effectively. I would also share with the team that it is important, at the time of data collection, to consider the sources and to shape the collected data into something the team will actually want to analyze. As for how to best apply these lessons, the team should ensure that the data is analyzed and structured appropriately, that it is cleansed, and that outliers are removed or normalized.
From this group project, we can conclude that the research was an honest effort, and the lessons learned extend beyond the project itself. Our data analytics skills ensured that the analyzed data was collected from primary sources, which protected the group from the biases of research that was previously conducted. In today's data world there is practically unlimited data, so choosing the right variables to answer the research questions is very important, using correlation and other techniques; a small sketch of such a check follows.
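A minimal sketch of such a variable check, assuming a cleansed DataFrame df whose 'class' column has already been mapped to 0/1 as in the project code:

numeric = df.select_dtypes("number")                         # numeric columns only
corr_with_target = numeric.corr()["class"].drop("class")     # correlation with income class
print(corr_with_target.abs().sort_values(ascending=False))   # strongest candidates first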
Assessment:
Additional skills that I learned from the course and during the project work include choosing the visualization type and the variables from the data set, which is very important in the analysis of data. Through this skill I was able to conceptualize, properly analyze, and interpret big data that requires data modeling and management. It was also through the group that I was able to develop my communication skills, since the data analyst role needed an excellent communicator who could interpret and explain the various inferences to my group. Because group members were in different time zones, scheduling a time to meet was strenuous, but everyone in the team was accommodating.
Future Application:
In my current role, I analyze cluster metrics and logs to monitor the health of different services using Elasticsearch, Kibana, and Grafana. The topics I learned in this course will be greatly useful: I can apply them to build a metrics-based Kibana dashboard for management to see the usage and cost incurred by each service running in the cluster, and I will use statistical methods to pick the fields of interest from among the thousands of available fields.
In [71]: import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
import warnings
%matplotlib inline
# engine="python" is needed because sep=", " is a multi-character (regex) separator.
df = pd.read_csv("data/adult.data", header=None, sep=", ", engine="python")
df.columns = ["age", "workclass", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week",
              "native-country", "class"]
# Drop rows containing the '?' placeholder in any categorical column.
df = df[df["workclass"] != "?"]
df = df[df["education"] != "?"]
df = df[df["marital-status"] != "?"]
df = df[df["occupation"] != "?"]
df = df[df["relationship"] != "?"]
df = df[df["race"] != "?"]
df = df[df["sex"] != "?"]
df = df[df["native-country"] != "?"]
below = df[df["class"] == "<=50K"]
above = df[df["class"] == ">50K"]
In [61]: above_50k = Counter(above['native-country'])
below_50k = Counter(below['native-country'])
print('native-country')
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct='%1.0f%%')
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct='%1.0f%%')
axes[1].set_title("<=50K")
plt.show()
native-country
In [62]: above_50k = Counter(above['race'])
below_50k = Counter(below['race'])
print('race')
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct='%1.0f%%')
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct='%1.0f%%')
axes[1].set_title("<=50K")
plt.show()
race
In [63]: above_50k = Counter(above['education'])
below_50k = Counter(below['education'])
print('education')
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct='%1.0f%%')
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct='%1.0f%%')
axes[1].set_title("<=50K")
plt.show()
education
In [64]: above_50k = Counter(above['workclass'])
below_50k = Counter(below['workclass'])
print('workclass')
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct='%1.0f%%')
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct='%1.0f%%')
axes[1].set_title("<=50K")
plt.show()
workclass
In [65]: fig, axes = plt.subplots(ncols=2, nrows=3, figsize=(8, 8))
fig.subplots_adjust(hspace=.5)
x = below['capital-gain']
y = below['age']
axes[0, 0].scatter(x, y)
axes[0, 0].set_title("<=50K")
axes[0, 0].set_xlabel('capital-gain')
axes[0, 0].set_ylabel('age')
x = above['capital-gain']
y = above['age']
axes[0, 1].scatter(x, y)
axes[0, 1].set_title(">50K")
axes[0, 1].set_xlabel('capital-gain')
axes[0, 1].set_ylabel('age')
x = below['age']
y = below['hours-per-week']
axes[1, 0].scatter(x, y)
axes[1, 0].set_title("<=50K")
axes[1, 0].set_xlabel('age')
axes[1, 0].set_ylabel('hours-per-week')
x = above['age']
y = above['hours-per-week']
axes[1, 1].scatter(x, y)
axes[1, 1].set_title(">50K")
axes[1, 1].set_xlabel('age')
axes[1, 1].set_ylabel('hours-per-week')
x = below['hours-per-week']
y = below['capital-gain']
axes[2, 0].scatter(x, y)
axes[2, 0].set_title("<=50K")
axes[2, 0].set_xlabel('hours-per-week')
axes[2, 0].set_ylabel('capital-gain')
x = above['hours-per-week']
y = above['capital-gain']
axes[2, 1].scatter(x, y)
axes[2, 1].set_title(">50K")
axes[2, 1].set_xlabel('hours-per-week')
axes[2, 1].set_ylabel('capital-gain')
plt.show()
In [50]: fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15, 10))
fig.subplots_adjust(hspace=.5)
mosaic(df, ['occupation', 'class'], ax=axes, axes_label=False)
plt.show()
In [51]: fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15, 10))
fig.subplots_adjust(hspace=.5)
mosaic(df, ['marital-status', 'class'], ax=axes, axes_label=False)
plt.show()
In [54]: fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15, 12))
fig.subplots_adjust(hspace=.5)
mosaic(df, ['education-num', 'class'], ax=axes, axes_label=False)
plt.show()
In [90]: train = df
train = train.drop("capital-loss", axis=1)
train = train.drop("native-country", axis=1)
train = train.drop("fnlwgt", axis=1)
train = train.drop("education", axis=1)
def get_occupation(x):
    if x in ["Exec-managerial", "Prof-specialty", "Protective-serv"]:
        return 1
    elif x in ["Sales", "Transport-moving", "Tech-support", "Craft-repair"]:
        return 2
    else:
        return 3
def get_relationship(x):
    if x == "Own-child":
        return 6
    elif x == "Other-relative":
        return 5
    elif x == "Unmarried":
        return 4
    elif x == "Not-in-family":
        return 3
    elif x == "Husband":
        return 2
    else:
        return 1
def get_race(x):
    if x == "Other":
        return 5
    elif x == "Amer-Indian-Eskimo":
        return 4
    elif x == "Black":
        return 3
    elif x == "White":
        return 2
    else:
        return 1
def get_sex(x):
    if x == "Male":
        return 2
    else:
        return 1
def get_class(x):
    if x == ">50K":
        return 1
    else:
        return 0
def get_workclass(x):
    if x == "Without-pay":
        return 7
    elif x == "Private":
        return 6
    elif x == "State-gov":
        return 5
    elif x == "Self-emp-not-inc":
        return 4
    elif x == "Local-gov":
        return 3
    elif x == "Federal-gov":
        return 2
    else:
        return 1
def get_marital_status(x):
    if x == "Never-married":
        return 7
    elif x == "Separated":
        return 6
    elif x == "Married-spouse-absent":
        return 5
    elif x == "Widowed":
        return 4
    elif x == "Divorced":
        return 3
    elif x == "Married-civ-spouse":
        return 2
    else:
        return 1
train['workclass'] = train['workclass'].apply(get_workclass)
train['marital-status'] = train['marital-status'].apply(get_marital_status)
train['occupation'] = train['occupation'].apply(get_occupation)
train['relationship'] = train['relationship'].apply(get_relationship)
train['race'] = train['race'].apply(get_race)
train['sex'] = train['sex'].apply(get_sex)
train['class'] = train['class'].apply(get_class)
Out[90]:
   age  workclass  education-num  marital-status  occupation  relationship  race  sex  capital-gain  hours-per-week  class
0   39          5             13               7           3             3     2    2          2174              40      …
1   50          4             13               2           1             2     2    2             0              13      …
2   38          6              9               3           3             3     2    2             0              40      …
3   53          6              7               2           3             2     3    2             0              40      …
4   28          6             13               2           1             1     3    1             0              40      …
In [96]: test = pd.read_csv("data/adult.test", header=None, sep=", ", engine="python")
feature = train.iloc[:, :-1]
labels = train.iloc[:, -1]
feature_matrix1 = feature.values
labels1 = labels.values
train_data, test_data, train_labels, test_labels = train_test_split(feature_matrix1, labels1, test_size=0.2, random_state=42)
transformed_train_data = MinMaxScaler().fit_transform(train_data)
transformed_test_data = MinMaxScaler().fit_transform(test_data)
In [114]: mod = LogisticRegression().fit(transformed_train_data, train_labels)
test_predict = mod.predict(transformed_test_data)
acc = accuracy_score(test_labels, test_predict)
f1 = f1_score(test_labels, test_predict)
prec = precision_score(test_labels, test_predict)
rec = recall_score(test_labels, test_predict)
In [115]: print("%.4f\t%.4f\t%.4f\t%.4f\t%s" % (acc, f1, prec, rec, 'Logistic Regression'))
0.8409 0.6404 0.7500 0.5588 Logistic Regression
Course Title Portfolio
Name
Abstract—This document …
Keywords—mean, standard deviation, variance, probability
density function, classifier
I. INTRODUCTION
This document … [1].
This project practiced the use of density estimation through several calculations via the Naïve Bayes classifier. Two features, the mean and the standard deviation of the pixel brightness values, were extracted from the training images for digit 0 and digit 1; the mean can be calculated with the formula in Equ. 1 and the standard deviation follows from the variance formula in Equ. 2, and in the code they were computed with 'numpy.mean()' and 'numpy.std()'. The test images were then classified based on these calculations, and the accuracy of the classifications was determined.
The project consisted of 4 tasks:
A. Extract features from the original training set
There were two features that needed to be extracted from
the original training set for each image. The first feature was
the average pixel brightness values within an image array.
The second was the standard deviation of all pixel
brightness values within an image array.
B. Calculate the parameters for the two-class Naïve Bayes
Classifiers
Using the features extracted from task A, multiple
calculations needed to be performed. For the training set
involving digit 0, the mean of all the average brightness
values was calculated. The variance was then calculated for
the same feature, regarding digit 0. Next, the mean of the
standard deviations involving digit 0 had to be computed. In
addition, the variance for the same feature was determined.
These four calculations had to then be repeated using the
training set for digit 1.
C. Classify all unknown labels of incoming data
Using the parameters obtained in task B, every image in
each testing sample had to be compared with the
corresponding training set for that particular digit, 0 or 1.
The probability of that image being a 0 or a 1 needed to be
determined so the image could then be classified.
D. Calculate the accuracy of the classifications
Using the predicted classifications from task C, the
accuracy of the predictions needed to be calculated for both
digit 0 and digit 1, respectively.
The first feature, the mean, is defined by the formula in Equ. 1, and the second feature, the standard deviation, follows from the variance formula in Equ. 2; in this project they were computed for each image by calling 'numpy.mean()' and 'numpy.std()'. These features helped formulate the probability density function when determining the classification.
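As an illustration of this feature extraction, a minimal sketch is shown here, assuming the training images are available as NumPy arrays of pixel brightness values; the names are illustrative:

import numpy as np

def extract_features(images):
    # images: array of shape (n_images, height, width) of pixel brightness values
    means = np.array([np.mean(img) for img in images])  # feature 1: average brightness
    stds = np.array([np.std(img) for img in images])    # feature 2: spread of brightness
    return means, stds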
II. DESCRIPTION OF SOLUTION
This project required a series of computations in order to successfully classify the test images. The data was provided as NumPy arrays of pixel brightness values for the training and test sets of digit 0 and digit 1. Once the data was acquired, the appropriate calculations could be made.
A. Finding the mean and standard deviation
The data was provided in the form of NumPy arrays, which made it convenient to perform routine mathematical operations. For each image in the training set for digit 0, the mean of the pixel brightness values was determined by calling 'numpy.mean()', and the standard deviation of the pixel brightness values was calculated by calling 'numpy.std()', another useful NumPy function. The same two features also had to be extracted from the training set for digit 1. Once all the features for each image were obtained from both training sets, the next task could be completed.
Equ. 1. Mean formula
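Equ. 1 presumably shows the standard sample-mean formula, which for N pixel values x_i is

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$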
B. Determining the parameters for the Naïve Bayes
Classifiers
To determine the parameters of the two Naïve Bayes classifiers, the mean and the variance of each extracted feature had to be computed per digit: the mean and variance of the array of average brightness values, and the mean and variance of the array of standard deviations, created separately for digit 0 and for digit 1.
Equ. 2. Variance formula
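Equ. 2 presumably shows the standard variance formula,

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2$$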
These parameters define a Gaussian probability density function for each feature and each digit. For a test image, the probability of its mean and the probability of its standard deviation were evaluated under the distributions for a given digit, and the result was multiplied by the prior probability, which is 0.5 in this case because the value is either a 0 or a 1.
This entire procedure had to be conducted once again but
utilizing the test sample for digit 1 instead. This meant
finding the mean and standard deviation of each image, using
the probability density function to calculate the probability of
the mean and probability of the standard deviation for digit 0,
and calculating the probability that the image is classified as
digit 0. The same operations had to be performed again, but
for the training set for digit 1. The probability of the image
being classified as digit 0 had to be compared to the
probability of the image being classified as digit 1. Again,
the larger of the two values suggested which digit to classify
as the label.
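A minimal sketch of this classification step, assuming the per-digit parameters (the mean and variance of each feature) have already been computed as described above; the parameter names are illustrative:

import numpy as np

def gaussian_pdf(x, mean, var):
    # Gaussian probability density function evaluated at x
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def classify(img, params0, params1, prior=0.5):
    # params0 / params1: dicts holding the mean and variance of each feature
    # for digit 0 and digit 1
    m, s = np.mean(img), np.std(img)
    p0 = (gaussian_pdf(m, params0["mean_of_means"], params0["var_of_means"])
          * gaussian_pdf(s, params0["mean_of_stds"], params0["var_of_stds"]) * prior)
    p1 = (gaussian_pdf(m, params1["mean_of_means"], params1["var_of_means"])
          * gaussian_pdf(s, params1["mean_of_stds"], params1["var_of_stds"]) * prior)
    return 0 if p0 > p1 else 1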
C. Determining the accuracy of the label
For each test sample, the features of every image were computed with 'numpy.mean()' and 'numpy.std()' and the image was classified as described above. The accuracy for digit 0 was then obtained by dividing the number of test images classified as digit 0 by the total number of images in the test sample for digit 0, and likewise the accuracy for digit 1 by dividing the number classified as digit 1 by the total number of images in the test sample for digit 1.
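A small sketch of that accuracy computation, assuming 'predictions' holds the labels produced by the classifier for a test sample whose true digit is 'true_digit':

def accuracy(predictions, true_digit):
    # fraction of test images whose predicted label matches the true digit
    correct = sum(1 for p in predictions if p == true_digit)
    return correct / len(predictions)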
III. RESULTS
For the test images, the mean and the standard deviation of the pixel brightness values were extracted in the same way as for the training sets, by calling 'numpy.mean()' and 'numpy.std()' for each image. For the images of digit 0, these feature values were generally also higher.
TABLE I. TRAINING SET FOR DIGIT 0
When comparing the test images, higher values of the means and the standard deviations were typically labeled as digit 0 and lower ones as digit 1. This was not always the case, however; otherwise the calculated accuracy would have been 100%.
After classifying all the images in the test sample for digit 0, the total number predicted as digit 0 was 899. This meant that the accuracy of classification was 0000%, which is represented in Fig. 5.
Fig. 1. Accuracy of classification for digit 0
The total number of images in the test sample for digit 1 was 0000. After classifying all the images in that test sample, the total number predicted as digit 1 was 00000. This meant that the accuracy of classification was 00000%, which is represented in Fig. 6.
IV. LESSONS LEARNED
The procedures practiced in this project required skill in
the Python programming language, as well as understanding
concepts of statistics. It required plenty of practice to
implement statistical equations, such as finding the mean,
the standard deviation, and the variance. My foundational
knowledge of mathematical operations helped me gain an
initial understanding of how to set up classification
problems. My lack of understanding of the Python language
made it difficult to succeed initially. Proper syntax and
built-in functions had to be learned first before continuing
with solving the classification issue. For example, I had very
little understanding of NumPy prior to this project. I learned
that it was extremely beneficial for producing results of
mathematical operations. One of the biggest challenges for
me was creating and navigating through NumPy arrays rather than plain Python lists. Looking back, it was a simple
issue that I solved after understanding how they were
uniquely formed. Once I had a grasp on the language and
built-in functions, I was able to create the probability
density function in the code and then apply classification
towards each image.
One aspect of machine learning that I understood better
after completion of the project was Gaussian distribution.
This normalized distribution style displays a bell-shape of
data in which the peak of the bell is where the mean of the
data is located [4]. A bimodal distribution is one that
displays two bell-shaped distributions on the same graph.
After calculating the features for both digit 0 and digit 1, the
probability density function gave statistical odds of that
particular image being classified under a specific bell-
shaped curve. An example of a bimodal distribution can be
seen in Fig. 7 below.
Fig. 2. Bimodal distribution example [5]
[Figure legend: Accuracy for Digit 0, predicted as digit 0 vs. predicted as digit 1]
V. REFERENCES
[1] N. Kumar, Naïve Bayes Classifiers, GeeksforGeeks, May 15, 2020.
Accessed on: Oct. 15, 2021. [Online]. Available:
https://www.geeksforgeeks.org/naive-bayes-classifiers/
[2] J. Brownlee, How to Develop a CNN for MNIST Handwritten Digit
Classification, Aug. 24, 2020. Accessed on: Oct. 15, 2021. [Online].
Available: https://machinelearningmastery.com/how-to-develop-a-
convolutional-neural-network-from-scratch-for-mnist-handwritten-
digit-classification/
[3] “What is NumPy,” June 22, 2021. Accessed on: Oct. 15, 2021.
[Online]. Available:
https://numpy.org/doc/stable/user/whatisnumpy.html
[4] J. Chen, Normal Distribution, Investopedia, Sept. 27, 2021. Accessed
on: Oct. 15, 2021. [Online]. Available:
https://www.investopedia.com/terms/n/normaldistribution.asp
[5] “Bimodal Distribution,” Velaction, n.d. Accessed on: Oct. 15, 2021.
[Online]. Available: https://www.velaction.com/bimodal-distribution/
CSE578: Data Visualization
Systems Documentation Report
Members of Team 44: Pradeep Peddnade, Jieqiong Zhou, Tian Liang, Sukhwan Yun
1. Roles and responsibilities
Product owners: XYZ corporation
Stakeholders: UVW College
Data analysis team members:
• Pradeep Peddnade: exploratory analysis of the native-country, race, education, and workclass variables of the dataset; machine learning model training and testing on these variables.
• Jieqiong Zhou: progress report; exploratory analysis of the sex and marital-status variables of the dataset.
• Tian Liang: systems documentation report; exploratory analysis of the occupation, capital-loss, final weight, and hours-per-week variables of the dataset; insight analysis for two variables.
• Sukhwan Yun: executive report; data exploration and data analysis of the age, education-num, capital-gain, and relationship variables of the data set.
2. Team goals and a business objective
Our understanding of the project is that we are to assist UVW College in their effort to boost enrollment. They believe they should target individuals based on annual income; they drew a line at $50,000 and would like us to classify individuals into two categories, annual salary above and below $50,000.
We are going to use US Census Bureau data to establish correlations between annual income and other attributes of an individual, such as capital gain, capital loss, education, work class, marital status, etc. We will start with an exploratory analysis to determine which parameters are important and which ones are irrelevant. Then we will select the most relevant data for in-depth visualization and machine learning. Eventually, we will be able to predict an individual's annual salary based on that person's other attributes.
3. Assumptions
UVW College assumes people within a certain salary range are more likely to enroll in their degree
program. Therefore, they need to know if a person’s annual salary is above or below $50,000.
UVW College assumes the US census data can be used to indicate the likelihood of a person's annual income based on other attributes such as age, gender, education status, marital status, occupation, etc.
It is assumed that the data from the United States Census Bureau is accurate. The data used for this
study is representative of the individuals to be included in this data analysis.
4. User Stories
User Story #1: To increase the enrollment number, a staff member of the UVW marketing team would like to know the relationship between occupation and income.
User Story #2: An associate in the UVW marketing group would like to get an understanding of the capital loss of people in the data.
User Story #3: A marketing analyst suggested that work hours per week could be a factor affecting people's income and would like data to back this hypothesis.
User Story #4: The director of marketing would like to know if final weight has anything to do with the income of the people interviewed in the census data.
User Story #5: A senior staff member in the marketing department is interested in how education-num is correlated with income.
User Story #6: The marketing group just had a meeting on how to increase enrollment numbers. One of the action items is to understand whether an individual's marital status is related to that person's salary.
User Story #7: An intern in the marketing group suggested studying the relationship between capital gain and income of the individuals in this data.
User Story #8: The director of marketing asked the team members to analyze the relationship between work class and the annual salary of an individual.
5. Visualizations
Figure 1. Percentage of people with a salary > 50K for each occupation.
To visualize the occupation data, we chose a bar chart since it is very good for visualizing categorical data. Figure 1 shows the percentage of people with a salary > 50K for each occupation. Certain occupations, such as executive and managerial positions, professional specialty, protective services, and tech support, have a higher percentage (>30%) of individuals with an annual salary above $50,000, while occupations such as private household services, other services, and handlers-cleaners have a lower percentage (<10%) of individuals with an annual salary above $50,000.
Figure 2. Education number of individuals with salary above (left) and below (right) $ 50,000.
To visualize the education-num data, we chose a box plot since the values are widespread and a box plot is very good at visualizing this type of data. Figure 2 shows the education number of individuals with a salary above and below $50,000. In the group with an annual salary above $50,000, the median education number is higher than in the group with an annual salary below $50,000. In addition, the top quartile of the above-$50,000 group is higher than that of the below-$50,000 group. These two clear differences show that individuals with more education are more likely to be in the group with an annual salary above $50,000.
Figure 3. Percentage of individuals with annual salary above $ 50,000.
To visualize the capital-gain data, we chose a scatter plot since it is suitable for showing the relationship between continuous data points. Figure 3 shows the percentage of individuals with an annual salary above $50,000. We processed the data in the following way to make the graph clearer: each point in Figure 3 represents a group of individuals within a capital-gain range of $2,000. For example, the point on the far left of the figure represents individuals with a capital gain between $0 and $2,000.
Based on Figure 3, we can see that for individuals with a capital gain below $10,000, the percentage of individuals with an annual salary above $50,000 increases with capital gain. This means there is a correlation between salary and capital gain in this data range.
However, for individuals with a capital gain above $10,000, the correlation between salary and capital gain is very weak, if present at all. The data points jump between 0 and 100%, and no pattern or correlation can be found. This is due to the scarcity of capital-gain data points above $10,000: as Figure 4, the histogram of capital-gain, shows, the majority of capital-gain values are below $10,000, and the data points above $10,000 are very scarce and have no statistical significance.
In summary, annual salary is positively correlated with capital gain for individuals with a capital gain of less than $10,000. For individuals with a capital gain above $10,000, it is hard to draw a conclusion due to the lack of statistical significance of the data.
Figure 4. Histogram of capital-gain.
Figure 5. Histogram of capital-loss.
To explore the capital-loss data, we chose a histogram to get an idea of whether there is a correlation between it and salary. Figure 5 shows the histogram of capital-loss. The vast majority of data points fall in the very first bin on the left side of the figure. This indicates that there is no statistical correlation between capital-loss and annual salary.
Figure 6. Percentage of individuals with annual salary above $ 50,000 as a function of work hours-per-
week.
To visualize the hours-per-week data, we chose a scatter plot since it gives the reader a very good idea of continuous correlations. Figure 6 shows the percentage of individuals with an annual salary above $50,000 as a function of work hours per week. We processed the data in the same way as we did for Figure 3 for clarity: each point in Figure 6 represents a group of people who work within a certain range of hours per week. For example, the point on the far left of the graph represents people who work between 20 and 25 hours per week.
We can see a fairly weak correlation between salary and hours-per-week. Generally, individuals who work less than 40 hours per week are less likely to earn more than $50,000, while people who work more than 40 hours a week are much more likely to make more than $50,000 annually.
Figure 7. Percentage of individuals with annual salary above $ 50,000 as a function of final weight.
We chose a scatter plot to show the salary data as a function of final weight since final weight is a continuous variable. Figure 7 shows the percentage of individuals with an annual salary above $50,000 as a function of final weight. We processed the data in the same way as we did for Figure 6. Based on the scatter plot, we can see there is no correlation between salary and final weight.
Figure 8. Percentage of individuals in each work class with annual salary above $50,000 (top) and below
$50,000 (bottom).
To visualize the work-class data, we chose pie charts to see the different classes in each salary group, since a pie chart is very good at showing the composition of a data set. Figure 8 shows the percentage of individuals in each work class with an annual salary above $50,000 (top) and below $50,000 (bottom). It is very clear that individuals who work in the private sector are less likely to make more than $50,000 a year, while the self-employed (incorporated) are more likely to do so.
6. Questions
Question 1: There is a total of 14 parameters. Some of them are relevant to the annual salary of an
individual, while some of them are not. We need to determine which parameters to use for in-depth
analysis and machine learning.
Solution to question 1: We started with a thorough discussion during one of our team meetings at the beginning of this project. After the discussion, we decided to start with an exploratory analysis of each parameter. After the initial analysis, we picked the 4-8 parameters most relevant to an individual's annual salary for the next step.
Question 2: Which machine learning method should we use for this study?
Solution to question 2: There are a total of 14 parameters, some numerical and the others categorical. The outcome is either above or below $50,000. For this type of binary outcome, logistic regression is typically an ideal method, so we chose logistic regression and determined how certain parameters are related to people's annual salary.
7. Not doing
In the machine learning analysis, we chose a logistic regression model. This model is good enough to provide information on the relationship between the parameters we chose and an individual's annual salary. In the future, more models could be included in the analysis, and several other metrics could be used to compare the different models and their accuracy; a possible sketch of such a comparison appears at the end of this section.
In this study, we did not include parameters such as age, education, relationship, race, sex, and native country. These parameters can be included in further studies in the future.
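One hedged illustration of how an additional model could be compared in future work: KNeighborsClassifier is already imported in the appendix code but never used, so a possible comparison, reusing the transformed train/test split and metric functions defined there, might look like this (the k value of 5 is only an example):

knn = KNeighborsClassifier(n_neighbors=5).fit(transformed_train_data, train_labels)
knn_predict = knn.predict(transformed_test_data)
for name, pred in [("Logistic Regression", test_predict), ("KNN (k=5)", knn_predict)]:
    print("%.4f\t%.4f\t%.4f\t%.4f\t%s" % (
        accuracy_score(test_labels, pred), f1_score(test_labels, pred),
        precision_score(test_labels, pred), recall_score(test_labels, pred), name))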
8. Appendix
Code:
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
import warnings
%matplotlib inline
# engine="python" is needed because sep=", " is a multi-character separator.
df = pd.read_csv("/content/adult.data", header=None, sep=", ", engine="python")
df.columns = ["age", "workclass", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week",
              "native-country", "class"]
df = df[df["workclass"] != "?"]
df = df[df["education"] != "?"]
df = df[df["marital-status"] != "?"]
df = df[df["occupation"] != "?"]
df = df[df["relationship"] != "?"]
df = df[df["race"] != "?"]
df = df[df["sex"] != "?"]
df = df[df["native-country"] != "?"]
below = df[df["class"] == "<=50K"]
above = df[df["class"] == ">50K"]
figg = plt.figure()
axx = figg.gca()
below.boxplot(column='education-num', ax=axx)
axx.set_title("Boxplot with a salary <= 50K for each education-num")
plt.show()
above_50k = Counter(above['workclass'])
below_50k = Counter(below['workclass'])
print('workclass')
fig, axes = plt.subplots(ncols=1, nrows=2, figsize=(5, 10))
axes[0].pie(above_50k.values(), labels=above_50k.keys(), autopct='%1.0f%%')
axes[0].set_title(">50K")
axes[1].pie(below_50k.values(), labels=below_50k.keys(), autopct='%1.0f%%')
axes[1].set_title("<=50K")
plt.show()
fig, axes = plt.subplots(ncols=1, nrows=1, figsize=(15, 10))
fig.subplots_adjust(hspace=.5)
mosaic(df, ['marital-status', 'class'], ax=axes, axes_label=False)
plt.show()
# Percentage of people with a salary > 50K per occupation.
# Note: the income column was named "class" above, with values ">50K" / "<=50K".
occupation_list = df.groupby('occupation')
occupations = occupation_list.groups.keys()
occupation_salary = []
for occupation in occupations:
    occupation_member = df[df['occupation'] == occupation]
    above_total = sum(occupation_member['class'] == '>50K')
    below_total = sum(occupation_member['class'] == '<=50K')
    occupation_salary.append([occupation, 100 * above_total / (below_total + above_total)])
occupation_salary_df = pd.DataFrame(occupation_salary, columns=['occupation', 'per of >50K'])
plt.barh(occupation_salary_df['occupation'], occupation_salary_df['per of >50K'])
plt.ylabel('Occupation')
plt.xlabel('Percentage of people with a salary > 50K (%)')
plt.title('Percentage of people with a salary > 50K for each occupation', fontdict={'fontsize': 20})
plt.show()
train = df
train = train.drop("capital-loss", axis=1)
train = train.drop("native-country", axis=1)
train = train.drop("fnlwgt", axis=1)
train = train.drop("education", axis=1)
def get_occupation(x):
    if x in ["Exec-managerial", "Prof-specialty", "Protective-serv"]:
        return 1
    elif x in ["Sales", "Transport-moving", "Tech-support", "Craft-repair"]:
        return 2
    else:
        return 3
def get_relationship(x):
    if x == "Own-child":
        return 6
    elif x == "Other-relative":
        return 5
    elif x == "Unmarried":
        return 4
    elif x == "Not-in-family":
        return 3
    elif x == "Husband":
        return 2
    else:
        return 1
def get_race(x):
    if x == "Other":
        return 5
    elif x == "Amer-Indian-Eskimo":
        return 4
    elif x == "Black":
        return 3
    elif x == "White":
        return 2
    else:
        return 1
def get_sex(x):
    if x == "Male":
        return 2
    else:
        return 1
def get_class(x):
    if x == ">50K":
        return 1
    else:
        return 0
def get_workclass(x):
    if x == "Without-pay":
        return 7
    elif x == "Private":
        return 6
    elif x == "State-gov":
        return 5
    elif x == "Self-emp-not-inc":
        return 4
    elif x == "Local-gov":
        return 3
    elif x == "Federal-gov":
        return 2
    else:
        return 1
def get_marital_status(x):
    if x == "Never-married":
        return 7
    elif x == "Separated":
        return 6
    elif x == "Married-spouse-absent":
        return 5
    elif x == "Widowed":
        return 4
    elif x == "Divorced":
        return 3
    elif x == "Married-civ-spouse":
        return 2
    else:
        return 1
train['workclass'] = train['workclass'].apply(get_workclass)
train['marital-status'] = train['marital-status'].apply(get_marital_status)
train['occupation'] = train['occupation'].apply(get_occupation)
train['relationship'] = train['relationship'].apply(get_relationship)
train['race'] = train['race'].apply(get_race)
train['sex'] = train['sex'].apply(get_sex)
train['class'] = train['class'].apply(get_class)
test = pd.read_csv("/content/adult.data", header=None, sep=", ", engine="python")
feature = train.iloc[:, :-1]
labels = train.iloc[:, -1]
feature_matrix1 = feature.values
labels1 = labels.values
train_data, test_data, train_labels, test_labels = train_test_split(feature_matrix1, labels1, test_size=0.2, random_state=42)
transformed_train_data = MinMaxScaler().fit_transform(train_data)
transformed_test_data = MinMaxScaler().fit_transform(test_data)
mod = LogisticRegression().fit(transformed_train_data, train_labels)
test_predict = mod.predict(transformed_test_data)
acc = accuracy_score(test_labels, test_predict)
f1 = f1_score(test_labels, test_predict)
prec = precision_score(test_labels, test_predict)
rec = recall_score(test_labels, test_predict)
print("%.4f\t%.4f\t%.4f\t%.4f\t%s" % (acc, f1, prec, rec, 'Logistic Regression'))
# Bin capital-gain into $2,000 ranges and plot the share earning > 50K per bin.
factorC = 2000
df['capitalGainBin'] = df['capital-gain'] / factorC
df['capitalGainBin'] = df['capitalGainBin'].apply(np.ceil)
df['capitalGainBin'] = df['capitalGainBin'] * factorC
capitalGainBin_list = df.groupby('capitalGainBin')
capitalGainBins = capitalGainBin_list.groups.keys()
capitalGainBin_salary = []
for capitalGainBin in capitalGainBins:
    capitalGainBin_member = df[df['capitalGainBin'] == capitalGainBin]
    above_total = sum(capitalGainBin_member['class'] == '>50K')
    below_total = sum(capitalGainBin_member['class'] == '<=50K')
    capitalGainBin_salary.append([capitalGainBin, 100 * above_total / (below_total + above_total)])
capitalGainBin_salary_df = pd.DataFrame(capitalGainBin_salary, columns=['capital-gain', 'per of >50K'])
plt.scatter(capitalGainBin_salary_df['capital-gain'], capitalGainBin_salary_df['per of >50K'])
plt.xlabel('Capital-gain')
plt.ylabel('Percentage of people with a salary > 50K (%)')
plt.title('Percentage of people with a salary > 50K for different capital gains', fontdict={'fontsize': 20})
plt.show()
plt.hist(df['capital-loss'])
plt.xlabel('Capital-loss')
plt.ylabel('Count')
plt.title('Distribution of capital-loss', fontdict={'fontsize': 20})
plt.show()
factorA = 100000
df['wgtBin'] = df['fnlwgt'] / factorA
df['wgtBin'] = df['wgtBin'].apply(np.ceil)
df['wgtBin'] = df['wgtBin'] * factorA
plt.hist(df['wgtBin'])
plt.xlabel('Final weight')
plt.ylabel('Count')
plt.title('Distribution of final weight', fontdict={'fontsize': 20})
plt.show()
wgtBin_list = df.groupby('wgtBin')
wgtBins = wgtBin_list.groups.keys()
wgtBins_salary = []
for wgtBin in wgtBins:
    wgtBin_member = df[df['wgtBin'] == wgtBin]
    above_total = sum(wgtBin_member['class'] == '>50K')
    below_total = sum(wgtBin_member['class'] == '<=50K')
    wgtBins_salary.append([wgtBin, 100 * above_total / (below_total + above_total)])
wgtBins_salary_df = pd.DataFrame(wgtBins_salary, columns=['fnlwgt', 'per of >50K'])
plt.scatter(wgtBins_salary_df['fnlwgt'], wgtBins_salary_df['per of >50K'])
plt.xlabel('Final weight')
plt.ylabel('Percentage of people with a salary > 50K (%)')
plt.title('Percentage of people with a salary > 50K for different final weight', fontdict={'fontsize': 20})
plt.show()
plt.hist(df['hours-per-week'])
plt.xlabel('Hours-per-week')
plt.ylabel('Count')
plt.title('Distribution of hours-per-week', fontdict={'fontsize': 20})
plt.show()
factorB = 10
df['hours_per_weekBin'] = df['hours-per-week'] / factorB
df['hours_per_weekBin'] = df['hours_per_weekBin'].apply(np.ceil)
df['hours_per_weekBin'] = df['hours_per_weekBin'] * factorB
hours_per_weekBin_list = df.groupby('hours_per_weekBin')
hours_per_weekBins = hours_per_weekBin_list.groups.keys()
hours_per_weekBin_salary = []
for hours_per_weekBin in hours_per_weekBins:
    hours_per_weekBin_member = df[df['hours_per_weekBin'] == hours_per_weekBin]
    above_total = sum(hours_per_weekBin_member['class'] == '>50K')
    below_total = sum(hours_per_weekBin_member['class'] == '<=50K')
    hours_per_weekBin_salary.append([hours_per_weekBin, 100 * above_total / (below_total + above_total)])
hours_per_weekBin_salary_df = pd.DataFrame(hours_per_weekBin_salary, columns=['hours-per-week', 'per of >50K'])
plt.scatter(hours_per_weekBin_salary_df['hours-per-week'], hours_per_weekBin_salary_df['per of >50K'])
plt.xlabel('hours-per-week')
plt.ylabel('Percentage of people with a salary > 50K (%)')
plt.title('Percentage of people with a salary > 50K for different hours-per-week', fontdict={'fontsize': 20})
plt.show()