### My first Classifier program in Python

I’m very excited to have run my first classifier program in Python. I learnt Python a few months back, so the commands were not entirely alien to me. I followed the steps described in this article.

The key take-aways from this exercise are:

- The process involved in the ML solution
- How to use Python and its supporting libraries and methods
- How data splitting and slicing works, and how we divide data into train, test, and validation sets
- Finally, how to interpret the output

Let me highlight some of the key notes I have taken in this exercise.

When it comes to an ML project, these steps more or less need to be carried out:

- Define Problem
- Prepare Data
- Evaluate Algorithms
- Improve Results
- Present Results

Since the article provides step-by-step instructions, I’m not going to repeat them here.

The Iris dataset was divided into two parts: 80% was used to train and evaluate the models, and the remaining 20% was held out for validation.
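That split can be sketched with scikit-learn like this (variable names follow the tutorial's convention; the `random_state` value is my own choice for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for final validation
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)

print(X_train.shape, X_validation.shape)  # (120, 4) (30, 4)
```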

I also used stratified 10-fold cross-validation to estimate model accuracy: the 80% training portion is split into 10 parts, the model is trained on 9 parts and tested on the remaining 1, and this is repeated for all 10 train-test combinations.
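A minimal sketch of that cross-validation step, shown here on logistic regression (the split and fold parameters are assumptions on my part):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)

# Stratified 10-fold CV keeps the class proportions equal in every fold
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=200), X_train, Y_train,
                         cv=kfold, scoring="accuracy")

# One accuracy per fold; the mean is the cross-validation estimate
print(scores.mean(), scores.std())
```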

I also tested six algorithms:

- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN)
- Classification and Regression Trees (CART)
- Gaussian Naive Bayes (NB)
- Support Vector Machines (SVM)
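The comparison loop can be sketched like this (one way of doing it; the hyperparameters shown are defaults or common choices, not necessarily the article's):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, _, Y_train, _ = train_test_split(X, y, test_size=0.20, random_state=1)

models = [
    ("LR", LogisticRegression(max_iter=200)),
    ("LDA", LinearDiscriminantAnalysis()),
    ("KNN", KNeighborsClassifier()),
    ("CART", DecisionTreeClassifier()),
    ("NB", GaussianNB()),
    ("SVM", SVC(gamma="auto")),
]

# Score every model with the same stratified 10-fold splits
results = {}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for name, model in models:
    scores = cross_val_score(model, X_train, Y_train, cv=kfold,
                             scoring="accuracy")
    results[name] = scores
    print(f"{name}: {scores.mean():.3f} ({scores.std():.3f})")
```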

My results

The cross-validation accuracy of each model looked like this:

As you can observe, the SVM model shows 98.3% accuracy, the best of all six models.

If we represent the results in a box-and-whisker plot, we can see that across all six models the accuracies vary from roughly 84% to 100%.
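Such a plot can be produced from the per-fold scores with matplotlib; a sketch, assuming the same models and splits as above:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, _, Y_train, _ = train_test_split(X, y, test_size=0.20, random_state=1)

models = [("LR", LogisticRegression(max_iter=200)),
          ("LDA", LinearDiscriminantAnalysis()),
          ("KNN", KNeighborsClassifier()),
          ("CART", DecisionTreeClassifier()),
          ("NB", GaussianNB()),
          ("SVM", SVC(gamma="auto"))]

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
names, results = [], []
for name, model in models:
    names.append(name)
    results.append(cross_val_score(model, X_train, Y_train, cv=kfold))

# One box per model: the box spans the middle 50% of the 10 fold scores,
# and the whiskers show the spread
fig, ax = plt.subplots()
bp = ax.boxplot(results)
ax.set_xticklabels(names)
ax.set_title("Algorithm comparison")
fig.savefig("algorithm_comparison.png")
```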

Since the SVM model provided the best accuracy, it was chosen to validate against the 20% validation set.
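That validation step amounts to refitting the model on the full training portion and predicting the held-out set; a sketch (the split and SVM parameters are my assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)

# Refit the winning model on the full 80% training portion
model = SVC(gamma="auto")
model.fit(X_train, Y_train)

# Score it on data it has never seen
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
```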

The end result is:

As you can see, the accuracy on the validation set is 96.6%, which is pretty good. You can also examine the confusion matrix and classification report.

The precision is the positive predictive value: TP / (TP + FP).

The recall is the true positive rate (TPR): TP / (TP + FN).

The f1-score gives the harmonic mean of precision and recall.

The support is the number of occurrences of each class in `y_true` (`Y_validation`).
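Both reports come from scikit-learn's metrics helpers; a sketch of how they are produced (split and model parameters are my assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)

predictions = SVC(gamma="auto").fit(X_train, Y_train).predict(X_validation)

# Rows are true classes, columns are predicted classes; the diagonal
# counts correct predictions
print(confusion_matrix(Y_validation, predictions))

# Per-class precision, recall, f1-score, and support
print(classification_report(Y_validation, predictions))
```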

This entire exercise taught me a lot about ML problem solving. Yes, I don’t know the specifics of each model, and they are all black boxes to me, but that’s exactly the top-down approach: without getting into the mechanics of each algorithm, I now know how to test and validate models on a given dataset. With this learning, my next step would be to repeat this for some other dataset. Exciting days ahead!

The entire source code for this example is available on GitHub.
