My first Classifier program in Python
I’m very excited to have run my first classifier program in Python. I learnt Python a few months back, so some of the commands were not entirely alien to me. I followed the steps mentioned in this article.
The key take-aways from this exercise are:
- The process involved in the ML solution
- How to use Python and other libraries and methods
- In the process, how data splitting and slicing works, and how we split data into train, test and validation sets
- Finally, what do we understand from the output
Let me highlight some of the key notes I have taken in this exercise.
When it comes to an ML project, more or less these steps need to be carried out:
- Define Problem
- Prepare Data
- Evaluate Algorithms
- Improve Results
- Present Results
Since the article provides step-by-step instructions, I’m not going to repeat them.
The Iris dataset was divided into two parts: 80% was used to train and evaluate the models, and the remaining 20% was held back for validation.
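The split above can be sketched with scikit-learn; this is my own minimal version, and the variable names and `random_state` are my choices, not necessarily those used in the article:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Hold back 20% as a validation set; the remaining 80% is used
# for training and cross-validated evaluation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=1)

print(X_train.shape, X_val.shape)  # (120, 4) (30, 4)
```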
I also used stratified 10-fold cross-validation to estimate model accuracy. Basically, the 80% training portion is split into 10 parts: the model is trained on 9 parts and tested on the remaining 1, and this is repeated for all 10 train-test combinations.
I tested six algorithms:
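As a concrete illustration of that procedure, here is a small sketch using `StratifiedKFold` and `cross_val_score` (logistic regression is just a stand-in model here; the seed and iteration cap are my own assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Stratified 10-fold CV: each fold preserves the class proportions.
# The model is trained on 9 folds and tested on the held-out fold,
# repeated so every fold serves as the test set exactly once.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=200), X, y,
                         cv=kfold, scoring='accuracy')

print(f'mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})')
```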
- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN)
- Classification and Regression Trees (CART)
- Gaussian Naive Bayes (NB)
- Support Vector Machines (SVM)
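The six models above can be compared in one loop; this is a sketch with scikit-learn's default hyperparameters (my assumption — the article may configure them differently), so exact scores will vary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

models = [
    ('LR', LogisticRegression(max_iter=200)),
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier()),
    ('NB', GaussianNB()),
    ('SVM', SVC(gamma='auto')),
]

# Evaluate every model with the same stratified 10-fold CV
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
results = {}
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    results[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} ({scores.std():.3f})')
```

Keeping the same `kfold` object for every model makes the comparison fair: each algorithm sees exactly the same train-test splits.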
Accuracy of each model looks like this:
As you can observe, the SVM model shows 98.3% accuracy and is the best of all six models.
If we represent the results in a box-and-whisker plot, the accuracies across all six models range from around 84% to 100%.
Since the SVM model provides the best accuracy, it was chosen to validate against the 20% validation set.
The end result is:
As you can see, the accuracy on the validation set is 96.6%, which is pretty good. You can also observe the confusion matrix and the classification report.
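The final validation step can be sketched like this; the model settings and seed are my own assumptions, so the numbers may differ slightly from the article's:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=1)

# Fit the chosen model on the 80% training split
model = SVC(gamma='auto')
model.fit(X_train, y_train)

# Score it once against the held-back 20% validation set
predictions = model.predict(X_val)
print(accuracy_score(y_val, predictions))
print(confusion_matrix(y_val, predictions))
print(classification_report(y_val, predictions))
```

The validation set is touched only once, at the very end, so this score is an honest estimate of how the model would do on unseen data.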
The precision is Positive Predictive Value = TP / (TP + FP).
The recall is True Positive Rate (TPR) = TP / (TP + FN).
The f1-score gives the harmonic mean of precision and recall.
The support is the number of occurrences of each class in the validation set.
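To make the formulas concrete, here is a tiny worked example with made-up counts for a single class (TP=8, FP=2, FN=1 — purely illustrative numbers):

```python
# Hypothetical counts for one class, just to exercise the formulas
TP, FP, FN = 8, 2, 1

precision = TP / (TP + FP)   # 8 / 10 = 0.8
recall = TP / (TP + FN)      # 8 / 9  ~= 0.889

# f1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, round(recall, 3), round(f1, 3))
```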
This entire exercise gave me a lot of knowledge on ML problem solving. Yes, I don’t have any knowledge of the specifics of each model, and those are all black boxes to me; that’s exactly the top-down approach. Without getting into the mechanics of each algorithm, I now know how to test and validate each model for a given dataset. With this learning, my next step would be to repeat this for some other dataset. Exciting days ahead!
The entire source code for this example is available on GitHub.