OrdinalEncoder, OneHotEncoder and LabelBinarizer in Python

The data sets we work on usually contain both numerical data and categorical (non-numeric) data. Many algorithms do not work on categorical data directly; they are designed to operate on numbers, which also keeps them efficient. So how do we train a model on categorical data? This is where a technique called encoding comes in.

Encoding transforms categorical data into numeric data. The scikit-learn library provides various encoders. In this post, I'll share what I have learned about three of them.

Ordinal Encoder

OrdinalEncoder performs ordinal (integer) encoding: each category in a column is replaced by an integer. Let's look at an example. I have a 2-dimensional array of categorical data, where each row holds a state name and its abbreviation.

multiArray = np.array([['Karnataka', 'KA'], ['Maharastra', 'MH'], ['Gujarat', 'GJ']])

Then I use OrdinalEncoder to transform this data.

ordinalEncoder = OrdinalEncoder()

ordinalEncoderArray = ordinalEncoder.fit_transform(multiArray)

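The output looks like this. Compare the input with the output: the encoder has assigned an integer to each category, column by column. The integers follow the sorted order of the categories in each column, so Gujarat becomes 0, Karnataka 1 and Maharastra 2.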

Before encoding

[['Karnataka' 'KA']
 ['Maharastra' 'MH']
 ['Gujarat' 'GJ']]
After Ordinal Encoding
[[1. 1.]
 [2. 2.]
 [0. 0.]]
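
To see where those integers come from, here is a minimal, self-contained sketch of the same example. It assumes a reasonably recent scikit-learn and uses the fitted encoder's categories_ attribute and inverse_transform method; the variable name restored is only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

multiArray = np.array([['Karnataka', 'KA'], ['Maharastra', 'MH'], ['Gujarat', 'GJ']])

ordinalEncoder = OrdinalEncoder()
ordinalEncoderArray = ordinalEncoder.fit_transform(multiArray)

# categories_ holds one sorted array of categories per input column;
# the integer code of a category is its position in that array.
for column, categories in enumerate(ordinalEncoder.categories_):
    print(column, list(categories))

# inverse_transform maps the integer codes back to the original strings.
restored = ordinalEncoder.inverse_transform(ordinalEncoderArray)
print(restored)
```

Because the codes are positions in a sorted list, they only reflect alphabetical order here, not any meaningful ranking of the states.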

One-hot Encoder

Now let's look at the One-hot Encoder. The code and the output are shown below.

oneHotEncoder = OneHotEncoder(sparse_output=False)  # on scikit-learn versions before 1.2, use sparse=False instead

onehotArray = oneHotEncoder.fit_transform(multiArray)

Before encoding
[['Karnataka' 'KA']
 ['Maharastra' 'MH']
 ['Gujarat' 'GJ']]

After One-hot encoding
[[0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0.]]

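As you can observe, instead of substituting a single integer, the One-hot encoder represents each category as a numeric array of 0s and 1s. Categories differ only in the position of the digit 1. Since the input has two columns with three categories each, every output row has six entries: three for the state name and three for the abbreviation.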
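
The following sketch labels the six output columns so that the position of each 1 can be read off. It is only an illustration: it relies on the encoder's default sparse output and converts it with toarray(), which sidesteps the sparse / sparse_output keyword, and columnLabels is a name introduced here.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

multiArray = np.array([['Karnataka', 'KA'], ['Maharastra', 'MH'], ['Gujarat', 'GJ']])

# The default output is a SciPy sparse matrix; toarray() makes it dense.
oneHotEncoder = OneHotEncoder()
onehotArray = oneHotEncoder.fit_transform(multiArray).toarray()

# One output column per category, laid out input column by input column.
columnLabels = [str(c) for categories in oneHotEncoder.categories_ for c in categories]
print(columnLabels)
print(onehotArray)

# Each row has exactly one 1 per input column, so every row sums to 2 here.
print(onehotArray.sum(axis=1))
```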

LabelBinarizer

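The scikit-learn documentation suggests using LabelBinarizer to encode y labels. One major difference between LabelBinarizer and OneHotEncoder is that LabelBinarizer works on a single column of labels only. If you give it a 2-dimensional array with several columns, such as multiArray above, it raises an error.

So I have defined an array that holds a single column of labels and use this encoder to transform the data. Strictly speaking, this array has shape (3, 1), which LabelBinarizer treats like a 1-d array.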

labelArray = np.array([['Male'], ['Female'], ['Unknown']])

binarizer = LabelBinarizer()

labelBinarizerArray = binarizer.fit_transform(labelArray)

Before Label Encoding
[['Male']
 ['Female']
 ['Unknown']]
After LabelBinarizer
[[0 1 0]
 [1 0 0]
 [0 0 1]]
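
As a rough sketch of the points above, the snippet below runs LabelBinarizer on a truly 1-d array, maps the result back with inverse_transform, and then tries the two-column multiArray. The names labels1d, encoded and multiColumnRejected are hypothetical and not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical 1-d version of labelArray: one label per entry.
labels1d = np.array(['Male', 'Female', 'Unknown'])

binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(labels1d)

# classes_ is sorted; the column index of the 1 is the position of the label.
print(list(binarizer.classes_))
print(encoded)

# inverse_transform recovers the original labels.
print(binarizer.inverse_transform(encoded))

# A 2-d array with more than one column is rejected.
multiArray = np.array([['Karnataka', 'KA'], ['Maharastra', 'MH'], ['Gujarat', 'GJ']])
try:
    LabelBinarizer().fit_transform(multiArray)
    multiColumnRejected = False
except ValueError as error:
    multiColumnRejected = True
    print(error)
```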

As you can see, the output is a binary array for each label. What if I apply OneHotEncoder to the same single-column array? This is what I get as output.

After One-hot for 1d Array

[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]

Doesn't it look similar to the LabelBinarizer output? It does, apart from the values being floats rather than integers. So can I use either of these for a 1-dimensional y label? It turns out there is a fundamental difference between the two encoders, and which one to apply depends on the algorithm we choose. I'm going to learn that next, and I hope to share it in my next post.
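
In the meantime, here is a small sketch of two differences that can be observed directly. It is not necessarily the fundamental difference mentioned above, only what the outputs show. The two-class array binaryArray is hypothetical data made up for this comparison.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

labelArray = np.array([['Male'], ['Female'], ['Unknown']])

lbOutput = LabelBinarizer().fit_transform(labelArray)
oheOutput = OneHotEncoder().fit_transform(labelArray).toarray()

# Same 0/1 pattern, but integer versus float values.
print(np.array_equal(lbOutput, oheOutput))
print(lbOutput.dtype, oheOutput.dtype)

# Hypothetical two-class data: here the output shapes differ.
binaryArray = np.array([['Male'], ['Female'], ['Male']])
lbBinary = LabelBinarizer().fit_transform(binaryArray)
oheBinary = OneHotEncoder().fit_transform(binaryArray).toarray()

# LabelBinarizer collapses two classes into a single 0/1 column,
# while OneHotEncoder (with default settings) keeps one column per class.
print(lbBinary.shape, oheBinary.shape)
```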

As a note, if you use these encoders, do not forget to import them from the sklearn library, along with NumPy for the arrays, as shown below.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelBinarizer
