Stemming in Python

While working on NLP, we come across a huge volume of words. Before these words are fed to an ML algorithm, a lot of pre-processing happens. This pre-processing ensures we keep only the relevant words, which can then be used to train an ML algorithm. One such technique is called stemming. As defined in Wikipedia: "In linguistics, a stem is a part of a word used with slightly different meanings and would depend on the morphology of the language in question." A simple example: the words approximation, approximate, approximated and approximately all have the same root. So as part of stemming, we need to make sure these words are treated as a single word instead of four different words. So how do we achieve this in Python? The NLTK library provides support for stemming. Two of the stemmers it offers are PorterStemmer and LancasterStemmer. I'll not get into the details of how these two work, but at a high level LancasterStemmer does a lot more iterations compared to Porte
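As a rough illustration of the idea above, here is a minimal sketch that runs the four example words through both stemmers. It assumes NLTK is installed; the variable names (words, porter_stems, lancaster_stems) are only for this illustration and are not taken from the original post.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

# The four related words used as the example in the post.
words = ["approximation", "approximate", "approximated", "approximately"]

# Both stemmers are rule-based and need no extra NLTK data downloads.
porter = PorterStemmer()
lancaster = LancasterStemmer()

# Reduce every word to its stem with each algorithm.
porter_stems = [porter.stem(word) for word in words]
lancaster_stems = [lancaster.stem(word) for word in words]

# Print the stems side by side so the two algorithms can be compared.
for word, p_stem, l_stem in zip(words, porter_stems, lancaster_stems):
    print(f"{word:15} porter={p_stem:12} lancaster={l_stem}")
```

With the Porter stemmer, the four words should collapse to one shared stem, which is exactly the "single word instead of four" effect described above. The Lancaster output may differ, since it applies its rules more aggressively.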

Removing Stop Words in NLTK library

In the previous example, I learnt how to do basic text processing using the NLTK library. To set the context, I have a sentence like this:

    sentence = "I'm learning Natural Language Processing. Natural Language Processing is also called as NLP"

After tokenizing this text into words, this is the output I got:

    ['I', "'m", 'learning', 'Natural', 'Language', 'Processing', '.', 'Natural', 'Language', 'Processing', 'is', 'also', 'called', 'as', 'NLP']

However, we are not interested in stop words, as they do not carry much information. Stop words are commonly used words in a language, such as "a", "the", "i" etc. Normally such words are filtered out while processing natural language. The NLTK library provides support for filtering out stop words. In fact NLTK library has define
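To make the filtering step concrete, here is a minimal sketch in plain Python. It does not use NLTK's own stop word list: STOP_WORDS is a tiny hand-picked set assumed for illustration, and remove_stop_words is a hypothetical helper, not an NLTK function. The tokens are the ones shown in the output above.

```python
# Tokens copied from the tokenizer output shown in the post.
tokens = ['I', "'m", 'learning', 'Natural', 'Language', 'Processing', '.',
          'Natural', 'Language', 'Processing', 'is', 'also', 'called', 'as', 'NLP']

# Assumption: a tiny hand-picked stop word set standing in for a real list.
STOP_WORDS = {"a", "the", "i", "is", "as"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words whose lowercase form is not in stop_words."""
    return [word for word in words if word.lower() not in stop_words]


filtered = remove_stop_words(tokens)
print(filtered)
```

The comparison is done on the lowercase form so that "I" matches the stop word "i". A real stop word list would be much longer than this one.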

Natural Language Toolkit (NLTK)

To work on Natural Language Processing (NLP), we need a library. One popular Python library is the Natural Language Toolkit, or NLTK. As mentioned on the NLTK website, NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Some of these terms, like stemming and semantic reasoning, are new to me. I'll explore them in future. For now I wanted to test this library, so I took the example provided on that website. First, I imported the nltk library:

    import nltk

Then, I formed a sample sentence:

    sentence = "I'm learning Natural Language Processing. Natural Language Processing is also called as NLP"

Then, I tokenized the text into words:

    tokens = nltk.word_tokenize(sentence)

When I ran this code, I got this error. LookupError: ********

Natural Language Processing (NLP)

So far I have learnt how to apply Machine Learning to specific problems. I have primarily taken classifier examples to understand ML and how it is applied. In all these examples, I worked on "clean data". That means the data was already processed and available for ML algorithms. In reality, this will never be the case. We have to extract data and do a lot of processing before it can be fed to ML models. This pre-processing turns out to be a major chunk of the work in an ML project. So as a next step, I need to start learning the techniques involved, such as data extraction and cleansing. There is specific terminology for these processes, which I will start using as I go through the learning. For now, I'll keep it in layman's language. When I was discussing this with my colleague who is an ML engineer, he suggested that I get into NLP. He said 'if you get into NLP, you will end up working on end-to-end project from data extraction to prediction/classific

Data Analysis and Manipulation using Pandas

So far I have learnt how to apply ML algorithms on "clean data". In practice, we don't get clean data. We get a lot of raw data, and we have to analyze and transform it before it can be used for training. What I got to know is that the Pandas library can be used for data analysis and manipulation. So I spent some time learning the useful commands in Pandas. In this post, I'll explain some of these commands. For this exercise, I'll use the familiar Iris data set. I used this data set during my initial days of learning ML. The very first thing we need to do is to load the Pandas library:

    import pandas as pd

Then I'll read data from the CSV file and store it in a DataFrame object:

    url = ""
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    df = pd.read_csv(url, names=names)

Data
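The loading step can be tried without a download. The sketch below assumes Pandas is installed and reads from an in-memory string instead of the URL, which is left blank above. The four rows in sample_csv are typed in purely for illustration in the Iris column layout; sample_csv and class_counts are names made up for this example.

```python
import io

import pandas as pd

# Column names used in the post for the Iris data set.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# Assumption: a few illustrative rows in the Iris layout, held in memory
# so that the example does not depend on any URL or local file.
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# names= supplies the header, because the data itself has no header row.
df = pd.read_csv(sample_csv, names=names)

# A few basic inspection commands.
print(df.shape)    # (number of rows, number of columns)
print(df.head())   # first rows of the DataFrame
class_counts = df['class'].value_counts()
print(class_counts)
```

pd.read_csv accepts a file-like object as well as a path or URL, so the same call should work unchanged once a real location for the CSV file is supplied.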