Different Train/Test Split

April 01, 2020

Continuing with the car classification program, I ran a small experiment on how the data is split into training and test sets. In the previous programs, the split was 80:20. That is, 80% of the data was used to train the model and the remaining 20% was used to test its predictions.

I wanted to see how the accuracy changes if we change this split ratio. I tested four additional ratios for the test data: 10%, 30%, 40%, and 50%.
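A rough sketch of this kind of experiment with scikit-learn is below. It is not the exact program from the earlier posts: the data here is a synthetic stand-in generated with `make_classification` rather than the car data set, only three of the classifiers are shown, the default hyperparameters are used, and the `random_state` value and the `models` and `results` names are my own placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): replace with the real features and labels.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=7
)

# Three example classifiers with default settings (assumption).
models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
}

# Fraction of the data held out for testing in each run.
test_sizes = [0.1, 0.2, 0.3, 0.4, 0.5]

results = {}
for test_size in test_sizes:
    # Same seed for every ratio so only the split size changes.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=7, stratify=y
    )
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        results[(name, test_size)] = accuracy_score(y_test, predictions)

# Print one accuracy per classifier and test ratio.
for (name, test_size), accuracy in results.items():
    print(f"{name:>3}  test={test_size:.0%}  accuracy={accuracy:.3f}")
```

Keeping the seed fixed makes the runs repeatable, but a single split per ratio is still just one sample, so small differences between ratios may only be noise.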

I ran each of those 6 algorithms at each of the 5 ratios (including 20%). Here is the interesting result. The y-axis represents the accuracy of the model.

I was expecting a considerable change, but the data doesn’t reflect that. Perhaps the data set is too small to produce the change I was expecting. I’m not sure about that at this point, but I’ll test with a much larger data set in the future.

Only SVM, KNN, and NB show a noticeable change in accuracy from 10% to 50%. The rest of the algorithms show negligible change.

As I said before, I need to run a similar test with a much larger amount of data before drawing any conclusion about the optimal split ratio. Though this exercise doesn’t yield a concrete result, I’m happy that I did this experiment.

The ML Journey of a Developer
