My first Kaggle competition - Women in Data Science (WiDS) 2021

As a beginner to the Data Science and ML world, I had no idea what these competitions would be like. I always thought they were for experts. Then I happened to attend a session by Kaggle Expert Usha, who decoded many aspects of a Kaggle competition and motivated me and others to take part.

The competition was the "Women in Data Science (WiDS) Datathon 2021". Its purpose was to inspire women to learn about data science and to create a supportive environment. The competition was open to men as well, but at least 50% of each team's members had to be women. So I partnered with a friend and ex-colleague.

This year's competition focused on building models to determine whether a patient admitted to an ICU had been diagnosed with a particular type of diabetes, diabetes mellitus. Usha had shared a notebook with a lot of background work done on this problem, and it became my bible for understanding the whole process.

I had always heard that feature engineering is a major task in an ML project, and I experienced first-hand why people say so.

The training data and test data were shared as part of the competition. There were 180 features and a label. The training data had 130,157 rows with the label, and the test data had 10,234 rows without it, of course.
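A quick way to confirm those numbers is to load both files and check their shapes. This is just a minimal sketch; the file names below are assumptions, not necessarily the exact ones from the competition.

```python
import pandas as pd

# Assumed file names for the competition's training and unlabeled test sets
train = pd.read_csv("TrainingWiDS2021.csv")
test = pd.read_csv("UnlabeledWiDS2021.csv")

print(train.shape)  # expected: (130157, 181) -> 180 features + 1 label
print(test.shape)   # expected: (10234, 180) -> no label column
```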

As part of the feature engineering, the first step was to understand these features. Without domain knowledge it was not easy to judge which of them were useful. As a first step, I went through both the numerical and the categorical data. Then I removed a few columns that I thought added no value to the model, like the hospital room number, followed by the columns that had more than 70% missing values. After every step I checked the score to see what difference the change made to the model. As suggested by experts, we used the CatBoost algorithm. CatBoost is an open-source library based on gradient-boosted decision trees, and it handles categorical features natively. With some trial and error we reached a score of 83.941%, which placed us in the top 64% of the leaderboard. A rough sketch of these steps is shown below.
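Here is a minimal sketch of that pipeline: dropping identifier-like columns, dropping columns with more than 70% missing values, and training a CatBoost classifier. The file name, the column names (`encounter_id`, `hospital_id`, `icu_id`, `diabetes_mellitus`), and the hyperparameters are assumptions for illustration, not the exact ones we used.

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

train = pd.read_csv("TrainingWiDS2021.csv")  # assumed file name

# Drop identifier-like columns that add no predictive value (assumed names)
id_like = [c for c in ["encounter_id", "hospital_id", "icu_id"] if c in train.columns]
train = train.drop(columns=id_like)

# Drop columns with more than 70% missing values
missing_ratio = train.isna().mean()
train = train.drop(columns=missing_ratio[missing_ratio > 0.70].index)

# Separate features and label ("diabetes_mellitus" is the assumed label name)
y = train["diabetes_mellitus"]
X = train.drop(columns=["diabetes_mellitus"])

# CatBoost handles categorical features natively; it only needs their names.
# It does not accept NaN in categorical columns, so fill those first.
cat_cols = X.select_dtypes(include="object").columns.tolist()
X[cat_cols] = X[cat_cols].fillna("missing")

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = CatBoostClassifier(iterations=500, eval_metric="AUC", verbose=100)
model.fit(X_tr, y_tr, cat_features=cat_cols, eval_set=(X_val, y_val))
```

Checking the validation score after each change like this makes it easy to see whether a dropped column actually helped or hurt the model.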

Leaderboard position aside, the key takeaway was understanding how to solve a business problem using ML. I learned how to apply exploratory data analysis and feature engineering. Due to other commitments I could not spend much time on this competition, but the time I did spend gave me good experience of an end-to-end ML project. I'm excited to compete in more competitions, which will only improve my knowledge and experience in ML.
