My first Classifier program in Python

I’m very excited to have run my first classifier program in Python. I learnt Python a few months back, so some of the commands were not really alien. I followed the steps described in this article.

The key takeaways from this exercise are:

  1. The process involved in an ML solution
  2. How to use Python and its libraries and methods
  3. How data slicing works, and how to split data into training, test and validation sets
  4. Finally, how to interpret the output

Let me highlight some of the key notes I took during this exercise.

An ML project more or less involves these steps:

  • Define Problem
  • Prepare Data
  • Evaluate Algorithms
  • Improve Results
  • Present Results

Since the article provides step-by-step instructions, I’m not going to repeat them.

The Iris dataset was divided into two parts: 80% was used to train and evaluate the models, and the remaining 20% was held back for validation.
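Here is a minimal sketch of such an 80/20 split. It assumes the copy of Iris bundled with scikit-learn (`load_iris`), which may not be how the article loads the data, and `random_state=1` is just an arbitrary seed I picked so the split is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use the 150-sample Iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold back 20% of the rows for validation; the seed is arbitrary.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

print(X_train.shape, X_validation.shape)  # (120, 4) (30, 4)
```

With 150 rows in total, this leaves 120 rows for training and 30 for validation.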

I also used stratified 10-fold cross-validation to estimate model accuracy. Basically, the 80% training portion is split into 10 parts: train on 9 parts, test on the remaining part, and repeat until every part has served as the test set once.
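To see what the folds look like, here is a rough sketch using scikit-learn's `StratifiedKFold`. The bundled `load_iris` data, the `shuffle=True` setting and the seed values are my assumptions, not necessarily what the article uses.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split

# Assumption: bundled Iris data and an arbitrary seed for the 80/20 split.
X, y = load_iris(return_X_y=True)
X_train, _, Y_train, _ = train_test_split(X, y, test_size=0.20, random_state=1)

# 10 folds; "stratified" keeps the class proportions similar in every fold.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

fold_count = 0
tested = []
for train_idx, test_idx in kfold.split(X_train, Y_train):
    fold_count += 1
    tested.extend(int(i) for i in test_idx)
    print(f"fold {fold_count}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")

# Every training row ends up in a test fold exactly once.
covered_once = sorted(tested) == list(range(len(X_train)))
print(covered_once)
```

Each pass trains on nine tenths of the training rows and tests on the remaining tenth, so the accuracy estimate is an average over 10 different test folds.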

I also tested six algorithms:

  1. Logistic Regression (LR)
  2. Linear Discriminant Analysis (LDA)
  3. K-Nearest Neighbors (KNN)
  4. Classification and Regression Trees (CART)
  5. Gaussian Naive Bayes (NB)
  6. Support Vector Machines (SVM)
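The comparison of the six algorithms listed above can be sketched roughly as follows. The hyperparameters (`max_iter=1000`, `gamma="auto"`), the seeds and the use of the bundled `load_iris` data are assumptions on my part, so the numbers printed will not necessarily match the ones in this post.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris data, arbitrary seeds.
X, y = load_iris(return_X_y=True)
X_train, _, Y_train, _ = train_test_split(X, y, test_size=0.20, random_state=1)

# One estimator per algorithm; the settings are illustrative defaults.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=1),
    "NB": GaussianNB(),
    "SVM": SVC(gamma="auto"),
}

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

results = {}
for name, model in models.items():
    # 10 accuracy scores per model, one for each fold.
    scores = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Keeping the per-fold scores in `results` is handy because the same 10 values per model are what a box-and-whisker plot would summarise.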

My results

Accuracy of each model looks like this:


As you can observe, the SVM model shows 98.3% accuracy, the best of all six models.

Represented as a box-and-whisker plot:


Across all six models, accuracy varies between roughly 84% and 100%.

Since the SVM model gave the best accuracy, it was chosen to be checked against the 20% validation set.

The end result is:


As you can see, the accuracy on the validation set is 96.6%, which is pretty good. You can also see the confusion matrix and the classification report.
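The validation step can be sketched like this: fit the chosen model on the training portion, predict on the held-back portion, and print the three summaries. Again, the bundled `load_iris` data, the seed and `SVC(gamma="auto")` are my assumptions, so the output may differ from the figures quoted here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: bundled Iris data and an arbitrary seed for the 80/20 split.
X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# Fit on the 80% portion only, then predict the unseen 20%.
model = SVC(gamma="auto")
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

accuracy = accuracy_score(Y_validation, predictions)
cm = confusion_matrix(Y_validation, predictions)

print(accuracy)
print(cm)
print(classification_report(Y_validation, predictions))
```

The confusion matrix has one row per true class and one column per predicted class, so the correct predictions sit on the diagonal.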

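Precision is the positive predictive value: TP / (TP + FP), where TP is the number of true positives and FP the number of false positives.

Recall is the true positive rate (TPR): TP / (TP + FN), where FN is the number of false negatives.

The f1-score is the harmonic mean of precision and recall: 2 × precision × recall / (precision + recall).

Support is the number of occurrences of each class in y_true (here, Y_validation).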
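To make these four columns concrete, here is a small worked sketch in plain Python. The confusion matrix in it is made up for illustration (it is not my actual result); rows are true classes and columns are predicted classes, which is the layout scikit-learn uses.

```python
# Hypothetical 3-class confusion matrix (made-up counts, NOT the results from this post).
# Rows are true classes, columns are predicted classes.
labels = ["setosa", "versicolor", "virginica"]
cm = [
    [10, 0, 0],
    [0, 9, 1],
    [0, 0, 10],
]

report = {}
for i, label in enumerate(labels):
    tp = cm[i][i]                                # correctly predicted as this class
    fp = sum(row[i] for row in cm) - tp          # predicted as this class, but wrong
    fn = sum(cm[i]) - tp                         # this class, but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    support = sum(cm[i])                         # true occurrences of this class
    report[label] = (precision, recall, f1, support)
    print(f"{label:<12} precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
```

In this made-up example one versicolor flower is mistaken for virginica, so versicolor loses some recall (9 of 10 found) while virginica loses some precision (10 of 11 predictions correct).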

This entire exercise taught me a lot about ML problem solving. Yes, I don’t know the specifics of each model yet, and they are all black boxes to me. That’s exactly the top-down approach. Without getting into the mechanics of each algorithm, I now know how to test and validate each model on a given dataset. My next step is to repeat this with some other dataset. Exciting days ahead!

The entire source code for this example is available on GitHub.
