Data Visualization using Pandas - Multivariate Plots

In the previous post, I covered data visualization with univariate plots. In this post, I’ll describe multivariate plots.

Multivariate plots show how multiple variables relate to one another. I have used the same Pima Indians Diabetes Database as before.

Correlation Matrix Plot

Correlation describes how two variables change together. Let’s take an example. I’ve taken the code as is from https://machinelearningmastery.com/.

# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

The output looks like this:


It took me some time to understand how to read this plot. The variables are listed along the top from left to right, and down the left side from top to bottom.

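The color scale is shown on the right side. Positive 1 means two variables are perfectly positively correlated (they rise together), negative 1 means they are perfectly negatively correlated (one rises as the other falls), and 0 means there is no linear correlation at all. So, as expected, we see yellow blocks running diagonally from the top-left corner to the bottom-right corner, as each variable is perfectly correlated with itself.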
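A small sketch can make the scale concrete. The values below are made up for illustration and are not taken from the Pima dataset; the column names are hypothetical too. The correlation of 'x' with the other columns should come out close to 1, -1 and 0 respectively.

```python
import pandas

# Toy values, made up for illustration (not taken from the Pima dataset)
toy = pandas.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'double_x': [2, 4, 6, 8, 10],  # rises in step with x
    'minus_x': [5, 4, 3, 2, 1],    # falls as x rises
    'zigzag': [1, -1, 0, -1, 1],   # no linear relationship with x
})

# Pairwise Pearson correlation, the same call used on the real data
toy_corr = toy.corr()
print(toy_corr['x'])
```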

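So, if we observe the variables “age” and “number of times pregnant”, they are highly correlated compared to others. Similarly, the variables “Triceps skin fold thickness” and “Age” are comparatively less correlated. It turns out that understanding these correlations helps in choosing the right algorithm. For example, some algorithms, such as linear and logistic regression, can perform poorly when the input variables are highly correlated with each other.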
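Judging colors by eye can be tricky, so it may help to list the most strongly correlated pairs as numbers. Here is a minimal sketch, assuming a hypothetical helper named strongest_pairs (it is not part of pandas) and a small made-up DataFrame in place of the real data.

```python
from itertools import combinations

import pandas


def strongest_pairs(frame, top=3):
    """Return (column_a, column_b, correlation) tuples, strongest first."""
    corr = frame.corr()
    # Visit each pair of columns once, skipping the diagonal
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    # Sort by absolute value so strong negative correlations rank high too
    rows.sort(key=lambda row: abs(row[2]), reverse=True)
    return rows[:top]


# Made-up values for illustration (not taken from the Pima dataset)
toy = pandas.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
})
pairs = strongest_pairs(toy)
for col_a, col_b, value in pairs:
    print(col_a, col_b, round(value, 3))
```

With the DataFrame from the post, the call would be strongest_pairs(data).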

Scatterplot Matrix

This plot shows the relationship between two variables as dots, with one scatter plot for every pair of variables. I’ll take the code example as is.

# Scatterplot Matrix
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
scatter_matrix(data)
plt.show()

The output looks like this:


If you look along the diagonal from top-left to bottom-right, each cell is a histogram of a single variable. These are exactly the same histograms I got from the univariate plots in my previous post.

Similarly, if we observe the plot for the variables “age” and “number of times pregnant”, we can draw a line through these dots to summarize the relationship between the two variables.
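One common way to draw such a summary line is a least-squares fit of a straight line. The sketch below uses numpy.polyfit on made-up points rather than the real dataset; the variable names are only for illustration.

```python
import numpy

# Made-up points for illustration (not taken from the Pima dataset)
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = numpy.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Degree 1 asks polyfit for a straight line: y = slope * x + intercept
slope, intercept = numpy.polyfit(x, y, 1)

# Points on the fitted line, one for each x value
fitted = slope * x + intercept
print(round(slope, 2), round(intercept, 2))
```

With the data from the post, x and y could be data['age'] and data['preg'], and the line could be drawn over the dots with plt.scatter(x, y) followed by plt.plot(x, fitted).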

With these exercises, I have learnt the concepts of visualization and why it is important. However, only when I apply these concepts as part of “Data Analysis” will I fully appreciate their usefulness. I’m hoping to do that soon. Of course, I will keep sharing what I learn.
