Posts

Changes in this site

For almost 2 years, I did not write anything here! Not that I had nothing to learn or write, but I was heavily occupied with some professional engagements. I'm back now, and with some changes. When I created this site, I planned to keep it dedicated to the world of Data Science and ML. But a lot has changed in my professional career, so I'll be changing a few things on this site as well. Let me first share what has changed in my professional life. I was a SharePoint professional for more than a decade. I started my journey with SharePoint somewhere in 2007 as a Developer and grew into a Team Lead, Architect and then Senior Architect. Anything around SharePoint was my comfort zone! If people came to me with a requirement, I could confidently say whether SharePoint was the right tool for it or not, and if it was, what the architecture and design of the application would be. As they say, being in a comfort zone for long may not be good

Ordinal Encoder, OneHotEncoder and LabelBinarizer in Python

The data sets we work on usually contain both numerical and categorical (non-numeric) data. There are algorithms which do not work on categorical data; they are designed that way to enhance efficiency. So in such cases, how do we train the model with categorical data? This is where we need to apply a technique called Encoding. Encoding transforms categorical data into numeric data. There are various Encoders available in the scikit-learn library. In this post, I'll share my learning on three Encoders. Ordinal Encoder This performs ordinal (integer) encoding of categorical data. Let's look at an example. I have a two-dimensional array of categorical data. multiArray = np.array([['Karnataka', 'KA'], ['Maharastra', 'MH'], ['Gujarat', 'GJ']]) Then I use OrdinalEncoder to transform this data. ordinalEncoder = OrdinalEncoder() ordinalEncoderArray = ordinalEncoder.fit_transform(mult
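To make the snippet above easier to follow, here is a minimal runnable sketch of ordinal encoding, assuming scikit-learn and NumPy are installed; the exact integers assigned follow scikit-learn's alphabetical ordering of the categories in each column.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Two-dimensional array of categorical data: (state name, state code)
multiArray = np.array([['Karnataka', 'KA'],
                       ['Maharastra', 'MH'],
                       ['Gujarat', 'GJ']])

# Replace each category in each column with an integer
ordinalEncoder = OrdinalEncoder()
ordinalEncoderArray = ordinalEncoder.fit_transform(multiArray)

print(ordinalEncoderArray)
# e.g. [[1. 1.]
#       [2. 2.]
#       [0. 0.]]  -> Gujarat/GJ = 0, Karnataka/KA = 1, Maharastra/MH = 2 (alphabetical order)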

My first Kaggle competition - Women in Data Science (WiDS) 2021

As a beginner to the Data Science and ML world, I had no idea what these competitions would be like. I always thought such competitions were for the experts. Then I happened to attend a session from Kaggle Expert Usha, who decoded many aspects of a Kaggle competition and motivated me and others to take part. The competition was "Women in Data Science (WiDS) Datathon 2021". Its purpose was to inspire women to learn about data science and to create a supportive environment. The competition was open to men as well; however, at least 50% of the team members had to be women. So I partnered with my friend and ex-colleague. This year's competition focused on models to determine whether a patient admitted to an ICU has been diagnosed with a particular type of diabetes, Diabetes Mellitus. Usha had shared a notebook with a lot of background work done for this problem, so her notebook became a bible to understand the whole proces

Joined Kaggle

For the last 2-3 months, I have been playing around with Kaggle. Kaggle is the world's largest data science community. What is exciting about this site is that it provides a platform to participate in data science competitions, to learn and to evaluate our learning and knowledge. Another exciting feature in Kaggle is "Notebooks", which are basically a workbench to write and execute code! As I'm using Python in my learning journey, I was able to write Python code using all the supported libraries. There are a lot of courses as well to learn new skills, though I've not tried them yet. There is a discussion forum where we can ask questions and get answers from the community. There are plenty of data sets available which we can use for learning purposes. Gamification is part of this platform, so users get "rankings" based on their activities and how their contributions were received by the community. That means - all you need to

Stemming in Python

While working on NLP, we come across a huge volume of words. Before feeding these words to an ML algorithm, a lot of pre-processing happens. This pre-processing ensures we keep only the relevant words, which can then be used to train an ML algorithm. One such technique is called Stemming. As defined in Wikipedia - "In linguistics, a stem is a part of a word used with slightly different meanings and would depend on the morphology of the language in question." A simple example: the words approximation, approximate, approximated and approximately have the same root. So as part of Stemming, we need to make sure these words are treated as a single word instead of four different words. So how do we achieve this in Python? The NLTK library provides support for Stemming. Two stemming algorithms it offers are PorterStemmer and LancasterStemmer. I'll not get into the details of how these two work, but at a high level LancasterStemmer does a lot more iterations compared to Porte
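Since the post is truncated above, here is a minimal sketch of both stemmers in action, assuming the nltk package is installed; the stems noted in the comments are indicative rather than guaranteed.

from nltk.stem import PorterStemmer, LancasterStemmer

words = ['approximation', 'approximate', 'approximated', 'approximately']

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Both stemmers reduce all four words to a common root (something like 'approxim'),
# with LancasterStemmer generally being the more aggressive of the two
print([porter.stem(w) for w in words])
print([lancaster.stem(w) for w in words])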

Removing Stop Words in NLTK library

In the previous example, I learnt how to do basic text processing using the NLTK library. To set the context, I have a sentence like this. sentence = "I'm learning Natural Language Processing. Natural Language Processing is also called as NLP" After tokenizing this text into words, this is the output I got. ['I', "'m", 'learning', 'Natural', 'Language', 'Processing', '.', 'Natural', 'Language', 'Processing', 'is', 'also', 'called', 'as', 'NLP'] However, we are not interested in stop words as they do not carry much information. Stop words are commonly used words in a language, such as "a", "the", "i" etc. Normally such words are filtered out while processing natural language. The NLTK library provides support for filtering out stop words. In fact, the NLTK library has define
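As a rough end-to-end sketch of the filtering step (assuming nltk is installed and the 'punkt' and 'stopwords' resources have been downloaded):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # list of English stop words

sentence = "I'm learning Natural Language Processing. Natural Language Processing is also called as NLP"

stop_words = set(stopwords.words('english'))
tokens = word_tokenize(sentence)

# Keep only the tokens that are not in the English stop-word list
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)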

Natural Language Toolkit (NLTK)

To work on Natural Language Processing (NLP), we need a library. One popular Python library is the Natural Language Toolkit, or NLTK. As mentioned on the NLTK website, NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Some of the terms, like stemming and semantic reasoning, are new to me; I'll explore them in the future. For now, I wanted to test this library, so I took the example provided on that website. First, I imported the nltk library: import nltk Then, formed a sample sentence: sentence = "I'm learning Natural Language Processing. Natural Language Processing is also called as NLP" Then, tokenized the text into words: tokens = nltk.word_tokenize(sentence) When I ran this code, I got this error. LookupError: ********
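The error message is cut off above, but a LookupError right after calling word_tokenize is usually NLTK reporting that the 'punkt' tokenizer data has not been downloaded yet. Here is a minimal sketch of the same example with that resource fetched first, assuming that is indeed the missing piece:

import nltk
nltk.download('punkt')  # downloads the tokenizer models that word_tokenize relies on

sentence = "I'm learning Natural Language Processing. Natural Language Processing is also called as NLP"
tokens = nltk.word_tokenize(sentence)
print(tokens)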