Stemming in Python

When working on NLP, we come across a huge volume of words. Before these words are fed to an ML algorithm, a lot of pre-processing happens. This pre-processing ensures we keep only the relevant words, which can then be used to train the model. One such technique is called stemming.

As Wikipedia puts it, in linguistics a stem is a part of a word, and the term is used with slightly different meanings depending on the morphology of the language in question.

A simple example: the words approximation, approximate, approximated and approximately all share the same root. Stemming makes sure these are treated as a single word instead of four different words.
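To make the idea concrete before reaching for a library, here is a deliberately naive sketch. It is not a real stemming algorithm: the suffix list and the `naive_stem` helper are made up for this example and only cover the four words above.

```python
# Toy suffix stripper, for illustration only (not the Porter algorithm).
# The suffix list is hand-picked for this example and is an assumption.
SUFFIXES = ("ately", "ation", "ated", "ate")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["approximation", "approximate", "approximated", "approximately"]
stems = {naive_stem(w) for w in words}
print(stems)  # {'approxim'}
```

All four words collapse to one entry. Real stemmers apply far more rules and conditions than this.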

So how do we achieve this in Python? The NLTK library provides support for stemming. Two of the stemmers it offers are PorterStemmer and LancasterStemmer. I'll not get into the details of how these two work, but at a high level LancasterStemmer applies its rules iteratively and is generally more aggressive than PorterStemmer, so it tends to cut words down further.
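If you want to see how the two differ on your own words, a small side-by-side sketch like the one below can help. It assumes NLTK is installed. No expected output is shown, since the Lancaster results are best checked by running it.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

words = ["approximation", "approximate", "approximated", "approximately"]

# Print each word next to the stem produced by each algorithm.
for word in words:
    print(f"{word:15} porter={porter.stem(word):12} lancaster={lancaster.stem(word)}")
```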

Let's get into the programming to understand how stemming works.

The first step is to import the stemmer from nltk. I'm using PorterStemmer as the example.

import nltk

from nltk.stem import PorterStemmer

Then create a stemmer object.

porter = PorterStemmer()

Now I'll use the stem method on each word.

print(porter.stem("approximation"))
approxim

print(porter.stem("approximate"))
approxim

print(porter.stem("approximated"))
approxim

print(porter.stem("approximately"))
approxim

As you can see, the output for each of these words is the same stem, "approxim". Note that a stem does not have to be a real dictionary word. This stem will be used instead of the four individual words.
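The same check can be written as a loop rather than one print per word. This is a minimal sketch assuming NLTK is installed. The names `stem_of` and `unique_stems` are just for illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ["approximation", "approximate", "approximated", "approximately"]

# Map each word to its stem, then collect the distinct stems.
stem_of = {word: porter.stem(word) for word in words}
unique_stems = set(stem_of.values())

print(stem_of)
print(unique_stems)  # {'approxim'}
```

Four vocabulary entries shrink to one, which is the whole point of stemming before training.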

The stem method works on individual words: it treats whatever string it receives as a single word, so passing an entire sentence will not stem each word in it. To stem a sentence, we first have to tokenize it and then stem each token.
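Here is one way that tokenize-then-stem step might look. It assumes NLTK is installed and uses `wordpunct_tokenize`, a regex-based tokenizer that needs no extra data download. `word_tokenize` is a common alternative but requires NLTK's tokenizer data to be downloaded first. The `stem_sentence` helper is a name made up for this sketch.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

porter = PorterStemmer()


def stem_sentence(sentence):
    """Split a sentence into tokens and stem each token individually."""
    tokens = wordpunct_tokenize(sentence)
    return [porter.stem(token) for token in tokens]


sentence = "approximation approximated approximately"
print(stem_sentence(sentence))  # ['approxim', 'approxim', 'approxim']
```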
