Removing Stop Words with the NLTK Library
In the previous example, I learnt how to do basic text processing using the NLTK library. To set the context, I have a sentence like this:
sentence = "I'm learning Natural Language Processing. Natural Language Processing is also called as NLP"
After tokenizing this text into words, this is the output I got.
['I',
However, we are usually not interested in stop words, as they carry little information on their own. Stop words are the most commonly used words in a language, such as "a", "the", and "i". Such words are normally filtered out before natural language text is processed further.
The NLTK library supports this by shipping predefined lists of stop words for multiple languages. Because each list comes back as an ordinary Python list of strings, we can also customize it to suit our needs.
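To make the customization point concrete, here is a minimal sketch. It does not call NLTK at all: `base_stop_words` is a tiny hand-written stand-in for the real list, and the words added or removed are hypothetical choices picked only for illustration.

```python
# Stand-in for set(stopwords.words('english')); the real NLTK list is much longer.
base_stop_words = {"a", "the", "i", "is", "as", "not"}

# Work on a copy so the original set stays untouched.
custom_stop_words = set(base_stop_words)

# Add words that are just noise for this particular task (hypothetical choices).
custom_stop_words.update({"etc", "also"})

# Keep a word that matters for the task by taking it out of the set.
custom_stop_words.discard("not")

print(sorted(custom_stop_words))
```

Keeping the words in a set rather than a list makes adding and removing entries simple, and membership checks stay fast however long the list grows.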
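For learning purposes, I'll use the built-in stop words for now. First, we need to import the relevant module. (If the stop word data is not installed yet, running nltk.download('stopwords') once will fetch it.)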
from nltk.corpus import stopwords
Let's load the stop words for English:
stop_words = set(stopwords.words('english'))
If you print the stop_words variable, you can see the stop words that NLTK defines for English.
Next, I'll go through each word in the tokens variable and filter out the stop words.
newTokens = []
In the code above, note that each word is converted to lowercase before it is checked, since the stop words in NLTK's English list are all lowercase. If I print the newTokens variable, this is the output I get.
["'m",
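The filtering step described above can be sketched roughly as follows. This is my reconstruction under a few assumptions, not the exact code from the post: the `tokens` list is written out by hand the way `word_tokenize` would typically split the sentence, `stop_words` is a small stand-in for NLTK's English list, and `remove_stop_words` is a hypothetical helper name. The sketch keeps each surviving word in its original casing and only lowercases it for the comparison.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [word for word in tokens if word.lower() not in stop_words]


# Tokens as word_tokenize would typically produce them (assumed, written by hand).
tokens = ["I", "'m", "learning", "Natural", "Language", "Processing", ".",
          "Natural", "Language", "Processing", "is", "also", "called", "as", "NLP"]

# Small stand-in for set(stopwords.words('english')).
stop_words = {"i", "is", "as", "a", "the"}

newTokens = remove_stop_words(tokens, stop_words)
print(newTokens)
```

Without the `lower()` call, the token "I" would slip through, because only the lowercase "i" is in the stop word set.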
As you can observe, the words "I", "is", and "as" are removed. Interestingly, "'m" is not removed: the tokenizer splits "I'm" into "I" and "'m", and the fragment "'m" is not in the stop word list. So I can add it to the stop words to suit my requirement. With this exercise, I learnt about stop words and how they can be removed using the NLTK library. This is just the second step in my NLP journey and I'm enjoying it! I'll be back with more learning and sharing.