Car Evaluation - Another Classifier Program

After my successful first classifier program in Python, I took up another data set, “Car Evaluation”, for my second classifier program. I took this data set from the UCI Machine Learning Repository. I was expecting this program to be a repeat of what I did with the Iris data set. Surprisingly, it was not a repeat, and I learned a couple of things from this exercise.

This data set has the following 6 attributes, each with these possible values:

buying: vhigh, high, med, low.
maint: vhigh, high, med, low.
doors: 2, 3, 4, 5more.
persons: 2, 4, more.
lug_boot: small, med, big.
safety: low, med, high.

The output class values are:

unacc (Unacceptable)
acc (Acceptable)
good
vgood

If we look at the attributes carefully, all of them have string values. This was my first obstacle in the program: some scikit-learn methods did not work because of the string values. After a little Googling, I understood that I needed to convert these strings to numbers. This is where encoding categorical features helped!

So, after I got my array of attributes and labels, something like this:

array = dataset.values
X = array[:, 0:6]  # the six attribute columns
y = array[:, 6]    # the class column
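
To see why the slice has to cover six columns here (the Iris data set had only four), here is a minimal sketch on three made-up rows in a comma-separated layout. The column name "class", the sample rows, and the use of dtype=str are my assumptions for illustration; they are not taken from the original program.

```python
import io

import pandas as pd

# Three made-up rows: six attribute values followed by the class label.
sample_csv = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,4,4,big,high,vgood\n"
    "med,high,5more,more,med,med,acc\n"
)

# Attribute names come from the list above; "class" is an assumed name.
names = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# dtype=str keeps values such as "2" and "4" as strings in this tiny sample.
dataset = pd.read_csv(sample_csv, names=names, dtype=str)

array = dataset.values
X = array[:, 0:6]   # columns 0..5: the six attributes
y = array[:, 6]     # column 6: the class label

print(X.shape)
print(y)
```

With seven columns in total, columns 0 to 5 hold the attributes and column 6 holds the label, so X ends up with six columns per row.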

I encoded the string values in array X to integers using OrdinalEncoder:

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
enc.fit(X)
XX = enc.transform(X)
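
One thing worth knowing is that OrdinalEncoder, by default, numbers the categories of each column in sorted (alphabetical) order, so "high" gets a smaller code than "low". The sketch below shows this on three made-up rows with only the buying and safety attributes, and also shows the optional categories parameter for supplying an explicit order. The sample rows and the explicit ordering are my own illustration, not part of the original program.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Three made-up rows: (buying, safety)
sample = np.array([
    ["vhigh", "low"],
    ["med", "high"],
    ["low", "med"],
], dtype=object)

# Default: categories are sorted alphabetically within each column.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(sample)
print(default_enc.categories_)
print(default_codes)

# Explicit order, so that a larger code means a higher level.
ordered_enc = OrdinalEncoder(categories=[
    ["low", "med", "high", "vhigh"],  # buying
    ["low", "med", "high"],           # safety
])
ordered_codes = ordered_enc.fit_transform(sample)
print(ordered_codes)
```

For a decision tree the alphabetical codes are usually not a problem, because the tree can split the codes wherever it needs to. For algorithms that treat the codes as distances or magnitudes, an explicit order may be a better fit.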

Now the array XX contains only numerical values. I did not change the array y, so the labels remain strings and the end result is easy to read. Once the strings were converted to numbers, all the other steps were pretty much the same as in the first program.

These were the accuracies of the different algorithms:

LR: 0.678811 (0.024192)
LDA: 0.676595 (0.024399)
KNN: 0.898090 (0.023665)
CART: 0.977592 (0.011820)
NB: 0.624506 (0.024056)
SVM: 0.898862 (0.027459)
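
A comparison like the one above can be sketched with cross-validation, roughly as follows. This is not the original program: the 10-fold stratified split, the seed, the model settings, and the choice of GaussianNB for "NB" are all assumptions. The data is also a stand-in, built from every combination of attribute values and labelled by a made-up rule so the sketch runs without downloading anything, so the printed numbers will not match the ones above.

```python
from itertools import product

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(XX, y, n_splits=10, seed=1):
    """Return {name: (mean accuracy, std)} from stratified k-fold CV."""
    models = {
        "LR": LogisticRegression(max_iter=1000),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN": KNeighborsClassifier(),
        "CART": DecisionTreeClassifier(random_state=seed),
        "NB": GaussianNB(),
        "SVM": SVC(gamma="auto"),
    }
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, XX, y, cv=kfold, scoring="accuracy")
        results[name] = (scores.mean(), scores.std())
    return results


# Stand-in data (NOT the real labels): all combinations of attribute values.
levels = [
    ["vhigh", "high", "med", "low"],  # buying
    ["vhigh", "high", "med", "low"],  # maint
    ["2", "3", "4", "5more"],         # doors
    ["2", "4", "more"],               # persons
    ["small", "med", "big"],          # lug_boot
    ["low", "med", "high"],           # safety
]
X = np.array(list(product(*levels)), dtype=object)

# Made-up labelling rule, used only so that the sketch has something to learn.
y = np.array(["unacc" if row[3] == "2" or row[5] == "low" else "acc" for row in X])

XX = OrdinalEncoder().fit_transform(X)

results = compare_models(XX, y)
for name, (mean, std) in results.items():
    print("%s: %f (%f)" % (name, mean, std))
```

To run it on the real data, pass the encoded XX and the original y from the steps above to compare_models instead of the stand-in arrays.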

Comparing different algorithms in a Box and Whisker plot:


As we can see, the Decision Tree Classifier (CART) showed the best accuracy in this comparison. So I chose this algorithm for the final validation, and this is the end result:

It gives approximately 97% accuracy, which is pretty good.
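
A final validation step of this kind can be sketched by holding back part of the data, training the decision tree on the rest, and scoring the predictions. The 80/20 split, the seed, and the stand-in data with its made-up labelling rule are assumptions for illustration only, so the accuracy printed here says nothing about the real data set.

```python
from itertools import product

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (NOT the real labels): all combinations of attribute values.
levels = [
    ["vhigh", "high", "med", "low"],  # buying
    ["vhigh", "high", "med", "low"],  # maint
    ["2", "3", "4", "5more"],         # doors
    ["2", "4", "more"],               # persons
    ["small", "med", "big"],          # lug_boot
    ["low", "med", "high"],           # safety
]
X = np.array(list(product(*levels)), dtype=object)
y = np.array(["unacc" if row[3] == "2" or row[5] == "low" else "acc" for row in X])

# Encode the attributes only; the labels stay as readable strings.
XX = OrdinalEncoder().fit_transform(X)

# Hold back 20% of the rows for validation (assumed split size).
X_train, X_val, y_train, y_val = train_test_split(
    XX, y, test_size=0.20, random_state=1, stratify=y
)

model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)
predictions = model.predict(X_val)

val_accuracy = accuracy_score(y_val, predictions)
print(val_accuracy)
print(confusion_matrix(y_val, predictions, labels=["unacc", "acc"]))
```

Because y was left as strings, the predictions come back as class names such as "unacc" and "acc", which keeps the confusion matrix easy to interpret.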

The source code is available on GitHub.

