I need help coding k-fold cross-validation and a confusion matrix for a classifier.
IT348/448 Intro to ML
Project 1: Classification and Performance Evaluation
You are to write a Naïve Bayes classification system.
The user of your program must be able to do each of the following (in any reasonable
order and repeatedly). Training must be a single menu item that then requests both
required files (the metadata file and the data file); you must not have a separate menu
item requesting the metadata file. The user must be able to train on new files as often
as desired.
1) Part 1: Cross validation
a) Use the data file named car.data.
b) Ask the user for the number of folds. For example, if the user enters k, your
program performs k-fold cross-validation.
c) Perform k-fold cross-validation and print the status of each fold. That is, print
the accuracy of each fold as it completes, so that for k folds your program prints
k accuracies.
d) Compute the average accuracy over the k folds. Print a report of the average
accuracy to the screen.
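The steps in Part 1 could be sketched roughly as follows in Python (the assignment allows any language). The function names `k_fold_indices` and `cross_validate`, the seeded shuffle, and the `train_fn`/`predict_fn` callables are assumptions made for illustration; they are not part of the assignment.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k folds of nearly equal size."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n))
    # Shuffle so the folds do not depend on the order of lines in the file.
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(rows, k, train_fn, predict_fn, seed=0):
    """Return (per-fold accuracies, average accuracy).

    rows: list of examples, each a list whose last item is the class label.
    train_fn(rows) -> model; predict_fn(model, attributes) -> label.
    """
    folds = k_fold_indices(len(rows), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        held_out = set(test_idx)
        # Train on every example outside the current fold, test on the fold.
        train = [r for j, r in enumerate(rows) if j not in held_out]
        test = [rows[j] for j in test_idx]
        model = train_fn(train)
        correct = sum(predict_fn(model, r[:-1]) == r[-1] for r in test)
        acc = correct / len(test)
        accuracies.append(acc)
        print(f"Fold {i + 1}/{k}: accuracy = {acc:.4f}")
    average = sum(accuracies) / k
    print(f"Average accuracy over {k} folds: {average:.4f}")
    return accuracies, average
```

Passing the classifier in as two callables keeps the fold logic independent of the Naïve Bayes code, so the same training and prediction routines can be reused unchanged in Part 2.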
2) Part 2: Confusion matrix.
a) Use the data files named car.train and car.test.
b) Have the system train based on training data (.train file). Ask the user for
the names of the files with the training data.
c) Have the system read a set of data that may or may not have classifications and
provide classifications for each instance. Ask the user for input and output file
names. The output must be in exactly the same format as the training data. The
user must be able to classify as many different files as desired before retraining.
Set up your program so that it accepts data without labels and also accepts data
with labels, ignoring the labels in that case.
d) Have the system read a set of data (.test file) that has labels and determine
its accuracy by comparing its computed labels to the actual labels. Ask the user
for the name of the data file.
e) Print a report of the accuracy to the screen. This must be completely independent
of the previous bullet as far as the user is concerned (though obviously there will
be significant code reuse between the two).
f) Generate and print the confusion matrix for the test data.
g) Also make sure the user can quit the program.
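For items d) through f), a minimal Python sketch of the accuracy and confusion matrix computation might look like this. It assumes the actual and predicted labels have already been collected into two parallel lists; the function names and the row/column layout (rows are actual classes, columns are predicted classes) are illustrative choices, not something the assignment prescribes.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def confusion_matrix(actual, predicted, labels):
    """Return a nested dict: matrix[actual_label][predicted_label] = count."""
    counts = Counter(zip(actual, predicted))
    # Counter returns 0 for pairs that never occurred, so every cell is filled.
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

def format_confusion_matrix(matrix, labels):
    """Render the matrix as text: one row per actual class, one column per predicted class."""
    width = max(len(str(label)) for label in labels) + 2
    corner = "actual \\ predicted"
    lines = [corner + "".join(str(p).rjust(width) for p in labels)]
    for a in labels:
        cells = "".join(str(matrix[a][p]).rjust(width) for p in labels)
        lines.append(str(a).ljust(len(corner)) + cells)
    return "\n".join(lines)
```

A usage example with made-up placeholder labels (not the real classes in car.test):

```python
actual = ["a", "a", "b", "b", "b"]
predicted = ["a", "b", "b", "b", "a"]
matrix = confusion_matrix(actual, predicted, ["a", "b"])
print(f"Accuracy: {accuracy(actual, predicted):.2%}")
print(format_confusion_matrix(matrix, ["a", "b"]))
```

The diagonal cells count correct classifications, so the accuracy equals the sum of the diagonal divided by the number of test examples.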
You may wish to have an option to print computed probabilities to the screen for
debugging purposes. That’s fine. However, you must not force me to deal with any sort
of debugging output during grading.
The training process will have 2 inputs: a metadata file that will list a set of attributes (or
features or variables) and each attribute’s possible values, ending with the classification,
and a data file in which each line represents one example. Examples consist of a comma-
separated list of attributes (in the same order as the metadata file), ending with the
classification. A key challenge here will be to design a data structure to hold all of the
counts you’ll need to compute your needed probabilities. A second key challenge is
getting the math right.
Smooth your data by adding one to all of your counts (so that you have no zero counts).
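One possible shape for the count data structure and the smoothed probabilities is sketched below in Python. It assumes the metadata file has already been parsed into `attribute_values` (one list of possible values per attribute) and `class_values`; those names, the dictionary layout, and the decision to smooth the class prior as well as the conditional counts are assumptions for illustration. With add-one smoothing, the estimate used here is P(v | c) = (count(v, c) + 1) / (count(c) + number of possible values of that attribute).

```python
import math

def train(rows, attribute_values, class_values):
    """Count class occurrences and (attribute, class, value) occurrences.

    rows: examples as lists of attribute values ending with the class label.
    """
    class_counts = {c: 0 for c in class_values}
    # value_counts[i][c][v] = number of class-c examples where attribute i == v
    value_counts = [{c: {v: 0 for v in values} for c in class_values}
                    for values in attribute_values]
    for row in rows:
        *attrs, label = row
        class_counts[label] += 1
        for i, v in enumerate(attrs):
            value_counts[i][label][v] += 1
    return {"class_counts": class_counts, "value_counts": value_counts,
            "attribute_values": attribute_values, "total": len(rows)}

def log_scores(model, attrs):
    """Return log P(c) + sum_i log P(v_i | c) per class, with add-one smoothing."""
    n_classes = len(model["class_counts"])
    scores = {}
    for c, n_c in model["class_counts"].items():
        # Smoothed prior (an assumption; smoothing only the conditionals also works).
        score = math.log((n_c + 1) / (model["total"] + n_classes))
        for i, v in enumerate(attrs):
            n_values = len(model["attribute_values"][i])
            count = model["value_counts"][i][c].get(v, 0)
            score += math.log((count + 1) / (n_c + n_values))
        scores[c] = score
    return scores

def predict(model, attrs):
    """Return the class with the highest smoothed log score."""
    scores = log_scores(model, attrs)
    return max(scores, key=scores.get)
```

Summing logarithms instead of multiplying probabilities avoids underflow when there are many attributes, and it does not change which class scores highest.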
The program may be written in a language of your choice. Any of the available
languages should work fine.
Quality of the user interface counts. I am more concerned about functionality and
convenience than about looks, but it should look reasonably professional. It should not
be annoying to use. For example, I must be able to do things in different orders (as long
as they make sense) as indicated above and to repeat activities without having to repeat
other activities. So I must be able to train once and test on several different files.
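A menu loop that keeps the trained model in shared state is one way to meet the "train once, test on several files" requirement. The sketch below is only an outline in Python: `run_menu`, the `state` dictionary, and the injectable `read`/`write` functions are hypothetical names chosen for illustration, and the actions themselves are left to the caller.

```python
def run_menu(actions, read=input, write=print):
    """Show a numbered menu until the user chooses to quit.

    actions: list of (label, function) pairs. Each function receives the shared
    state dict, so a model stored by a training action remains available to
    every later classify or test action until the user retrains.
    """
    state = {"model": None}
    while True:
        for number, (label, _) in enumerate(actions, start=1):
            write(f"{number}) {label}")
        write("q) Quit")
        choice = read("Choice: ").strip().lower()
        if choice == "q":
            return state
        if choice.isdecimal() and 1 <= int(choice) <= len(actions):
            actions[int(choice) - 1][1](state)
        else:
            # Invalid input never ends the program; the menu is simply shown again.
            write("Unrecognized choice, please try again.")
```

Because the loop only returns on "q", the user can repeat any action in any order, and an action that needs a model can check `state["model"]` and ask the user to train first if it is still `None`.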
Create a README file with a description of the program, instructions for compiling the
code, and instructions for using your program. A README that has little more than
“Follow the program instructions” will be deemed unacceptable and cost you points.
Writing good instructions for a user of your program is an important job skill. Here’s a
chance to practice.