Data Visualization using Pandas - Univariate Plots

Data visualization is an important step in ML because it lets us understand the data visually. Why does it matter? It shows the distribution of the data, reveals any outliers, and exposes relationships between multiple attributes. Experts also mention that understanding the distribution helps in choosing the correct algorithm. I cannot comment on that statement yet, as I do not have enough knowledge of it. When I reach that stage, I’ll definitely share.

The pandas library provides methods to plot data in multiple ways. In fact, I used one of those plots, the Box and Whisker Plot, in my previous examples. However, there the plots were used to compare multiple algorithms. In this post, I’ll share how to visualize the data itself, which is an important step well before we apply any ML algorithm.

I’ll use the data set that was used for Binary Classification: the Pima Indians Diabetes Database, which has 8 features and a class label.

If we plot attributes individually, they are called Univariate Plots. If we need to understand the relationship between multiple attributes, we need to create Multivariate Plots. In this post, I’ll concentrate on Univariate Plots.

Before we get into the plots, we need to import the necessary libraries and load the data. Even though this piece of code appears in all the previous examples, I’ll repeat it here.

# Load Libraries
import pandas
import matplotlib.pyplot as plt
from pandas import read_csv

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(url, names=names)

Univariate Plots

I’m going to draw 3 types of plots that help us understand the distribution of the data.

In Histograms, data is put into bins and the number of values in each bin is drawn as a bar. The two lines of code to plot Histograms are:

dataset.hist()
plt.show()

This is the output:


Unfortunately, the plots overlap, but they should still give a meaningful picture of the data distribution. For instance, the attributes Age (age), Diabetes pedigree function (pedi), and 2-Hour serum insulin (test) appear to have a roughly exponential distribution. The attributes Body mass index (mass), Diastolic blood pressure (pres), and Plasma glucose concentration (plas) look approximately Gaussian (normal). The attributes Number of times pregnant (preg) and Triceps skin fold thickness (skin) seem to have a skewed distribution.
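Two small additions can help with the points above: a larger figure with automatic spacing to reduce the overlap, and a numeric skewness value per attribute to cross-check what the eye sees. The following is only a minimal sketch. The helper name `plot_histograms`, the figure size, and the synthetic `demo` frame are my own assumptions for illustration; the synthetic columns are not the Pima values and are used only so the snippet runs without a download.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (NOT the Pima values), used only to keep the sketch offline.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "expo_like": rng.exponential(scale=2.0, size=5000),
    "normal_like": rng.normal(loc=70.0, scale=10.0, size=5000),
})

def plot_histograms(df, figsize=(10, 8), bins=10):
    """Draw one histogram per column on a larger figure (hypothetical helper)."""
    axes = df.hist(figsize=figsize, bins=bins)
    # Re-space the subplots so titles and tick labels are less likely to collide.
    np.ravel(axes)[0].get_figure().tight_layout()
    return axes

axes = plot_histograms(demo)

# Sample skewness per column: close to 0 for symmetric data,
# clearly positive when the distribution has a long right tail.
skewness = demo.skew().sort_values(ascending=False)
print(skewness)
```

With the data from this post, `plot_histograms(dataset)` followed by `plt.show()` would display the figure, and `dataset.skew()` would give the matching numbers for each attribute.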

Let’s look at the same data through Density Plots.

dataset.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()


As you can see, if we draw a smooth curve over the tips of the bars in the Histograms, we get roughly the Density Plots.
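To make the link between Histograms and Density Plots concrete: pandas draws density plots using SciPy's Gaussian kernel density estimate, which places a small bell curve on every data point and averages them. The sketch below reimplements that idea in plain NumPy. The function name `gaussian_kde_curve`, the hand-picked bandwidth of 1.5, and the synthetic sample are assumptions for illustration only; SciPy chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_curve(values, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred on each data point."""
    values = np.asarray(values, dtype=float)
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic stand-in sample (NOT taken from the Pima data set).
rng = np.random.default_rng(0)
sample = rng.normal(loc=30.0, scale=5.0, size=500)

# Evaluate the smooth curve on a grid that extends well past the data.
grid = np.linspace(sample.min() - 15.0, sample.max() + 15.0, 400)
density = gaussian_kde_curve(sample, grid, bandwidth=1.5)

# A histogram normalised to a density and the smooth curve both enclose an area of about 1.
heights, edges = np.histogram(sample, bins=20, density=True)
area_hist = np.sum(heights * np.diff(edges))
area_kde = np.sum(density) * (grid[1] - grid[0])  # simple Riemann sum
print(round(area_hist, 3), round(area_kde, 3))
```

A smaller bandwidth makes the curve follow the bars more closely, while a larger one smooths out more detail.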

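Similarly, there is another type of plot: the Box and Whisker Plot. It summarizes each attribute by its median and quartiles and draws the values beyond the whiskers as individual points, which makes possible outliers easy to spot.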

dataset.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()
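To read a Box and Whisker Plot, it helps to know the numbers behind it. By default, matplotlib draws the box from the first to the third quartile, marks the median inside it, extends each whisker to the furthest value within 1.5 times the interquartile range of the box, and plots anything beyond that as a separate point. Here is a minimal sketch that computes those numbers with pandas. The helper `box_plot_summary` and the ten values are made up for illustration and are not taken from the data set.

```python
import pandas as pd

def box_plot_summary(series, whis=1.5):
    """Compute the values a default box plot would draw (hypothetical helper)."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme values that are still inside the fences.
    inside = series[(series >= lower_fence) & (series <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "n_outliers": int(len(series) - len(inside)),
    }

# Made-up values with one obviously extreme entry.
values = pd.Series([21, 22, 24, 25, 27, 29, 31, 33, 36, 81])
summary = box_plot_summary(values)
print(summary)
```

Applied to a column of the post's data, for example `box_plot_summary(dataset['age'])`, this should give the numbers behind the corresponding panel of the plot.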


 
