One level deeper into the ML program

In the last post, I talked about my first ML program in Python. Though I got into the details of the procedure, it was still explained at a high level. After executing that program, I decided to go one level deeper to understand why we did what we did. Going one level deeper helped me understand some key concepts and patterns that can be applied in future programs as well. So, let me get into that.

I was working on the Iris dataset, which has 150 records. The first step was to split this dataset into 2 sets - one for training a model, and the other for validating it. The dataset was split into an 80% training set and a 20% validation set. So out of 150 records, 120 records are used as the training set and 30 records are used as the validation set. This was achieved with the train_test_split function from sklearn.model_selection.

X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)

In the 80% training set, input data points (features) will be stored in X_train, and the output data points (labels) will be stored in Y_train. Similarly, in the remaining 20% validation set, input data points will be stored in X_validation, and the output data points will be stored in Y_validation.
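A minimal sketch of this split, to check the 120/30 record counts. It assumes the data comes from scikit-learn's bundled copy via load_iris, which may differ from how the data was loaded in the original program.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled Iris data (150 rows, 4 features).
X, y = load_iris(return_X_y=True)

# Same call as in the post: 80% for training, 20% for validation.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

print(X_train.shape, Y_train.shape)            # features and labels for training
print(X_validation.shape, Y_validation.shape)  # features and labels for validation
```

Fixing random_state means the same records land in each set on every run, which makes later scores reproducible.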

The next step was to use this 80% dataset to train models. Instead of using this 80% dataset as it is, it was further divided into 10 groups (folds).

In each round, 1 group is used as test data and the remaining 9 groups are used as training data. This approach is called cross-validation, or more specifically k-fold cross-validation, where in my case k=10.

So, for example, when group 1 is taken as test data, the remaining groups 2-10 are used for training a model. With this arrangement, a model is trained and evaluated. After the evaluation score is recorded, the model is discarded and the same steps are repeated for the next group. So, in the next iteration group 2 is taken as test data, and group 1 and groups 3-10 are used for training. This is repeated for all 10 folds.
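The loop described above can be written out by hand. This is only a sketch: it uses plain KFold and an SVC with default settings as stand-ins (both are assumptions, not necessarily what the original program used), and it loads Iris from scikit-learn's bundled copy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, _, Y_train, _ = train_test_split(X, y, test_size=0.20, random_state=1)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
fold_scores = []
test_sizes = []

for fold, (train_idx, test_idx) in enumerate(kfold.split(X_train), start=1):
    # A fresh model per fold: the previous one is discarded after scoring.
    model = SVC()
    model.fit(X_train[train_idx], Y_train[train_idx])

    # Accuracy on the one group held out in this round.
    score = model.score(X_train[test_idx], Y_train[test_idx])
    fold_scores.append(score)
    test_sizes.append(len(test_idx))
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} accuracy={score:.3f}")
```

With 120 training records and k=10, each round holds out 12 records and trains on the other 108.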

One more thing to note here: the example used stratified cross-validation, which ensures each fold contains approximately the same percentage of samples of each target class as the complete set.
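To see what stratification does in practice, one option is to count the samples of each class in every test fold. The sketch below assumes the bundled load_iris data, where the three classes are encoded as 0, 1 and 2.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = load_iris(return_X_y=True)
X_train, _, Y_train, _ = train_test_split(X, y, test_size=0.20, random_state=1)

# Class proportions in the full training set, for comparison.
overall = np.bincount(Y_train, minlength=3) / len(Y_train)
print("training set proportions:", overall.round(2))

kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
fold_counts = []
for fold, (_, test_idx) in enumerate(kfold.split(X_train, Y_train), start=1):
    # Number of samples of each class in this fold's test group.
    counts = np.bincount(Y_train[test_idx], minlength=3)
    fold_counts.append(counts)
    print(f"fold {fold}: samples per class = {counts}")

fold_counts = np.array(fold_counts)
```

The per-class counts should be nearly identical from fold to fold, which is the "approximately same percentage" property in action.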

These two lines do what is explained above.

kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)

cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')

The function cross_val_score returns an array of scores, one per fold (10 accuracy values in this case). Taking the mean and standard deviation of that array gives a summary for each model. At this stage, we know which model gives better accuracy.
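A short sketch of turning the per-fold scores into a mean and standard deviation for comparison. The two candidate models here (KNeighborsClassifier and SVC, both with default settings) are picked only for illustration; the set of models compared in the original program may differ.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, _, Y_train, _ = train_test_split(X, y, test_size=0.20, random_state=1)

# Assumption: two illustrative candidates with default hyperparameters.
models = {"KNN": KNeighborsClassifier(), "SVM": SVC()}

summary = {}
for name, model in models.items():
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    # One accuracy value per fold, returned as a NumPy array of length 10.
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    summary[name] = (cv_results.mean(), cv_results.std())
    print(f"{name}: mean={cv_results.mean():.3f} std={cv_results.std():.3f}")
```

A higher mean suggests better accuracy, while a smaller standard deviation suggests the score is more stable across folds.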

Finally, SVC (Support Vector Classification) was chosen to predict on the validation set. First, this model is fit on the training data (X_train and Y_train) using the SVC.fit() method. Next, the fitted model predicts labels for the validation data (X_validation) using the SVC.predict() method.

As the last step, we wanted to check the accuracy of the prediction by comparing the predicted values with the actual output values in the validation set (Y_validation). The functions accuracy_score, confusion_matrix and classification_report, used for this comparison, are defined in sklearn.metrics.
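Putting the last two steps together, a minimal sketch might look like this. It assumes an SVC with default hyperparameters and the bundled load_iris data; the original program may have set other parameters.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# Fit on the full 80% training set, then predict the held-out 20%.
model = SVC()  # assumption: default hyperparameters
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

# Compare predicted labels with the true labels of the validation set.
accuracy = accuracy_score(Y_validation, predictions)
matrix = confusion_matrix(Y_validation, predictions)

print(accuracy)
print(matrix)
print(classification_report(Y_validation, predictions))
```

In the confusion matrix, rows are the true classes and columns are the predicted classes, so correct predictions sit on the diagonal.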

As the experts say, k-fold cross-validation is popular because it generally gives a less biased estimate of model performance than a single train/test split. With this exercise, I was able to execute an ML program and also go one level deeper to understand some of the concepts behind this method. As I said earlier, I'm enjoying every new learning from these exercises.

