OrdinalEncoder, OneHotEncoder and LabelBinarizer in Python

Usually the data sets we work on contain both numerical data and categorical (non-numeric) data. There are algorithms which do not work on categorical data; they are designed that way for efficiency. So in such cases, how do we train a model with categorical data? This is where we need to apply a technique called Encoding. Encoding transforms categorical data into numeric data. There are various encoders available in the scikit-learn library. In this post, I'll share my learning on three encoders.

Ordinal Encoder

This performs ordinal (integer) encoding of categorical data. Let's look at an example. I have a 2-dimensional array of categorical data.

multiArray = np.array([['Karnataka', 'KA'], ['Maharastra', 'MH'], ['Gujarat', 'GJ']])

Then I use OrdinalEncoder to transform this data.

ordinalEncoder = OrdinalEncoder()
ordinalEncoderArray = ordinalEncoder.fit_transform(multiArray)
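Putting the snippets above together, here is a minimal runnable sketch (assuming scikit-learn and NumPy are installed). OrdinalEncoder sorts the categories in each column alphabetically and assigns integers in that order:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# 2-dimensional array of categorical data: state name and state code
multiArray = np.array([['Karnataka', 'KA'],
                       ['Maharastra', 'MH'],
                       ['Gujarat', 'GJ']])

# Fit the encoder and transform the data in one step
ordinalEncoder = OrdinalEncoder()
ordinalEncoderArray = ordinalEncoder.fit_transform(multiArray)

print(ordinalEncoderArray)
# Categories are sorted per column, so 'Gujarat'/'GJ' -> 0.0,
# 'Karnataka'/'KA' -> 1.0, 'Maharastra'/'MH' -> 2.0
```

The learned categories themselves are available afterwards via `ordinalEncoder.categories_`, which is handy for mapping the integers back to the original labels.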

My first Kaggle competition - Women in Data Science (WiDS) 2021

As a beginner to the Data Science and ML world, I had no idea what these competitions would be like. I always thought such competitions were for experts. And then I happened to attend a session by Kaggle Expert Usha, who decoded many aspects of a Kaggle competition and motivated me and others to take part. The competition was "Women in Data Science (WiDS) Datathon 2021". This competition's purpose was to inspire women to learn about data science and also to create a supportive environment. The competition was open to men as well; however, at least 50% of the members in a team had to be women. So I partnered with my friend and ex-colleague. This year's competition focused on models to determine whether a patient admitted to an ICU has been diagnosed with a particular type of diabetes, Diabetes Mellitus. Usha had shared a notebook with a lot of background work done for this problem, so her notebook became a bible for understanding the whole process

Joined Kaggle

For the last 2-3 months, I have been playing around with Kaggle . Kaggle is the world's largest data science community. What is exciting about this site is that it provides a platform to participate in data science competitions, to learn and to evaluate our learning and knowledge. Another exciting feature in Kaggle is that we can create "Notebooks", which are basically workbenches to try and execute code! As I'm using Python in my learning journey, I was able to write Python code using all the supported libraries. There are a lot of courses as well to learn new skills, though I've not tried them yet. There is a discussion forum where we can ask questions and get answers from the community. There are plenty of data sets available which we can use for learning purposes. Gamification is part of this platform, so users get "rankings" based on their activities and how their contributions were received by the community. That means - all you need to

Stemming in Python

While working on NLP, we come across a huge volume of words. Before feeding these words to an ML algorithm, a lot of pre-processing happens. This pre-processing ensures we keep only relevant words, which can then be used to train an ML algorithm. One such technique is called Stemming . As defined in Wikipedia - "In linguistics, a stem is a part of a word used with slightly different meanings and would depend on the morphology of the language in question." A simple example: the words approximation , approximate , approximated and approximately have the same root. So as part of Stemming, we need to make sure these words are treated as a single word instead of four different words. So how do we achieve this in Python? The NLTK library provides support for Stemming. There are two algorithms for stemming - PorterStemmer and LancasterStemmer . I'll not get into the details of how these two work, but at a high level LancasterStemmer does a lot more iterations compared to PorterStemmer
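As a quick sketch of the idea, assuming NLTK is installed, both stemmers can be tried on the four example words above:

```python
from nltk.stem import PorterStemmer, LancasterStemmer

words = ['approximation', 'approximate', 'approximated', 'approximately']

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Each stemmer reduces the variants toward a common root,
# though the exact stem string differs between the two algorithms
print([porter.stem(w) for w in words])
print([lancaster.stem(w) for w in words])
```

Note that a stem need not be a dictionary word; PorterStemmer, for instance, reduces these words to a truncated root rather than to "approximate".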

Removing Stop Words in NLTK library

In the previous example , I learnt how to do basic text processing using the NLTK library. To set the context, I have a sentence like this.

sentence = "I'm learning Natural Language Processing. Natural Language Processing is also called as NLP"

After tokenizing this text into words, this is the output I got.

['I', "'m", 'learning', 'Natural', 'Language', 'Processing', '.', 'Natural', 'Language', 'Processing', 'is', 'also', 'called', 'as', 'NLP']

However, we are not interested in stop words as they do not carry much information. Stop words are commonly used words in a language such as "a", "the", "i" etc. Normally such words are filtered out while processing natural language. The NLTK library provides support to filter out stop words. In fact, the NLTK library has predefined stop word lists for several languages

Natural Language Toolkit (NLTK)

To work on Natural Language Processing (NLP), we need a library. One popular Python library is the Natural Language Toolkit, or NLTK. As mentioned on the NLTK website , NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Some of the terms, like stemming and semantic reasoning, are new to me; I'll explore them in future. For now I wanted to test this library, so I took the example provided on that website.

First I imported the nltk library:

import nltk

Then, formed a sample sentence:

sentence = "I'm learning Natural Language Processing. Natural Language Processing is also called as NLP"

Then, tokenized the text into words:

tokens = nltk.word_tokenize(sentence)

When I ran this code, I got this error.

LookupError: ********

Natural Language Processing (NLP)

So far I have learnt how to apply Machine Learning to specific problems. I have primarily taken classifier examples to understand ML and how it is applied. In all these examples, I worked on "clean data". That means the data was already processed and available for ML algorithms. In reality, that will never be the case. We have to extract data and do a lot of processing before it can be fed to ML models. This pre-processing turns out to be a major chunk of the work in an ML project. So as a next step, I need to start learning these techniques, which involve data extraction, cleansing etc... There is specific terminology for these processes which I will start using as I go through the learning. For now, I'll keep it in layman's language. When I was discussing this with my colleague, who is an ML engineer, he suggested I get into NLP. He said 'if you get into NLP, you will end up working on an end-to-end project from data extraction to prediction/classification'