Data Analysis and Manipulation using Pandas

October 31, 2020

So far I have learnt how to apply ML algorithms to "clean data". In practice, we rarely get clean data. We get a lot of raw data, which we have to analyze and transform before it can be used for training.

I learnt that the Pandas library can be used for data analysis and manipulation, so I spent some time learning its most useful commands. In this post, I'll explain some of them.

For this exercise, I'll use the familiar Iris data set, which I used during my initial days of learning ML.

The very first thing we need to do is import the Pandas library.

import pandas as pd

Then I'll read the data from the CSV file and store it in a DataFrame object. The file has no header row, so I pass the column names explicitly.

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv(url, names=names)
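To see what the names argument does, here is a minimal sketch that reads a few header-less rows from an in-memory string instead of the URL. The variable names csv_text, sample and no_names are my own for illustration, and the rows are copied from the outputs shown in this post.

```python
import io

import pandas as pd

# A few header-less rows in the same layout as iris.csv (copied from the
# outputs in this post), so the sketch runs without network access.
csv_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# With names=..., every line of the file is treated as data.
sample = pd.read_csv(io.StringIO(csv_text), names=names)

# Without names=..., the first data row is consumed as the column labels.
no_names = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (4, 5)
print(no_names.shape)  # (3, 5): one row was used up as the header
print(sample.dtypes)   # measurement columns are float64, class is text
```

If the names are left out, the first flower silently becomes the header, which is easy to miss on a 150-row file.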

Data Exploration

A DataFrame object is like a matrix or a table with rows and columns. Now let's say I want to check what the data looks like. I can use the head method of the DataFrame object:

df.head()

This method returns the first 5 rows by default, as shown below.

	sepal-length	sepal-width	petal-length	petal-width	class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

I can pass a number to the head method to get a specific number of rows. For example,

df.head(10)

returns the first 10 rows. There is another method called tail, which returns the last n rows.

df.tail(10)

	sepal-length	sepal-width	petal-length	petal-width	class
140	6.7	3.1	5.6	2.4	Iris-virginica
141	6.9	3.1	5.1	2.3	Iris-virginica
142	5.8	2.7	5.1	1.9	Iris-virginica
143	6.8	3.2	5.9	2.3	Iris-virginica
144	6.7	3.3	5.7	2.5	Iris-virginica
145	6.7	3.0	5.2	2.3	Iris-virginica
146	6.3	2.5	5.0	1.9	Iris-virginica
147	6.5	3.0	5.2	2.0	Iris-virginica
148	6.2	3.4	5.4	2.3	Iris-virginica
149	5.9	3.0	5.1	1.8	Iris-virginica

If you want to see data of a specific column, you can provide the column name:

df["class"]

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: class, Length: 150, dtype: object
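Selecting a single column name returns a Series, while passing a list of names returns a DataFrame. A small sketch of the difference, using a hypothetical three-row frame df_small built from rows shown in this post:

```python
import pandas as pd

# Hypothetical miniature of the Iris DataFrame, for illustration only.
df_small = pd.DataFrame({
    'sepal-length': [5.1, 4.9, 6.3],
    'sepal-width': [3.5, 3.0, 3.3],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
})

one_col = df_small['class']                     # single name -> Series
two_cols = df_small[['sepal-length', 'class']]  # list of names -> DataFrame
as_frame = df_small[['class']]                  # one-element list -> one-column DataFrame

print(type(one_col).__name__)   # Series
print(type(two_cols).__name__)  # DataFrame
print(as_frame.shape)           # (3, 1)
```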

I can also fetch rows based on their integer position using the iloc indexer.

df.iloc[1:3]

The above command returns this result.

	sepal-length	sepal-width	petal-length	petal-width	class
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa

As you can observe, it returned the rows at positions 1 and 2. As with Python slices, the end position (3) is excluded. Similarly, you can specify positions for columns as well.

df.iloc[:,:2]

This returns only the first 2 columns.

	sepal-length	sepal-width
0	5.1	3.5
1	4.9	3.0
2	4.7	3.2
3	4.6	3.1
4	5.0	3.6
...	...	...
145	6.7	3.0
146	6.3	2.5
147	6.5	3.0
148	6.2	3.4
149	5.9	3.0
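The slicing rules of iloc are easy to mix up with those of its label-based sibling, loc, so here is a short sketch of both. It uses a hypothetical five-row frame df_small (the first five Iris rows); loc is not covered elsewhere in this post and is shown only for contrast.

```python
import pandas as pd

# Hypothetical miniature frame: the first five Iris rows, two columns.
df_small = pd.DataFrame({
    'sepal-length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal-width': [3.5, 3.0, 3.2, 3.1, 3.6],
})

by_position = df_small.iloc[1:3]    # positions 1 and 2; the end is excluded
by_label = df_small.loc[1:3]        # labels 1, 2 and 3; the end is included

single_value = df_small.iloc[0, 1]  # row 0, column 1
last_two = df_small.iloc[-2:]       # negative positions count from the end
first_col = df_small.iloc[:, :1]    # all rows, first column only

print(len(by_position), len(by_label))  # 2 3
print(single_value)                     # 3.5
```

With the default 0, 1, 2, ... index, labels and positions coincide, so the only visible difference is whether the end of the slice is included.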

Filter

Another feature I really liked is the ability to apply filters to the DataFrame object.

For instance, with this command I could apply a filter on the class column. Only the first few matching rows are shown below.

df[df["class"] == 'Iris-virginica']

	sepal-length	sepal-width	petal-length	petal-width	class
100	6.3	3.3	6.0	2.5	Iris-virginica
101	5.8	2.7	5.1	1.9	Iris-virginica
102	7.1	3.0	5.9	2.1	Iris-virginica
103	6.3	2.9	5.6	1.8	Iris-virginica
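The expression inside the brackets is a Series of True/False values (a boolean mask), and masks can be combined. A minimal sketch, assuming a hypothetical five-row frame df_small built from rows shown in this post:

```python
import pandas as pd

# Hypothetical miniature of the Iris DataFrame, for illustration only.
df_small = pd.DataFrame({
    'sepal-length': [5.1, 4.9, 6.3, 5.8, 7.1],
    'petal-length': [1.4, 1.4, 6.0, 5.1, 5.9],
    'class': ['Iris-setosa', 'Iris-setosa',
              'Iris-virginica', 'Iris-virginica', 'Iris-virginica'],
})

# The comparison yields one boolean per row.
mask = df_small['class'] == 'Iris-virginica'
virginica = df_small[mask]

# Combine conditions with & (and), | (or) and ~ (not).
# Each condition needs its own parentheses.
long_virginica = df_small[(df_small['class'] == 'Iris-virginica')
                          & (df_small['sepal-length'] > 6.0)]
not_setosa = df_small[~(df_small['class'] == 'Iris-setosa')]

print(mask.tolist())               # [False, False, True, True, True]
print(list(long_virginica.index))  # [2, 4]
```

Python's own and/or keywords do not work here because they cannot operate element by element on a Series.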

Data Aggregation

I can also compute aggregate values on the data set. Let's first look at individual columns.

This command gets me the mean value of the "sepal-length" column.

df["sepal-length"].mean()

5.843333

If I want all the summary statistics for that column, I can use the describe method:

df["sepal-length"].describe()

count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal-length, dtype: float64

What if I apply the describe method to a non-numerical column?

df["class"].describe()

count             150
unique              3
top       Iris-setosa
freq               50
Name: class, dtype: object

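As you can observe, for text columns it shows a different set of data points: the number of values (count), the number of distinct values (unique), the most frequent value (top) and how often that value occurs (freq).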
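The describe method can also be called on the whole DataFrame. Here is a rough sketch, assuming a hypothetical five-row frame df_small; the value_counts call is my own addition to show where top and freq come from.

```python
import pandas as pd

# Hypothetical miniature of the Iris DataFrame, for illustration only.
df_small = pd.DataFrame({
    'sepal-length': [5.1, 4.9, 6.3, 5.8, 7.1],
    'class': ['Iris-setosa', 'Iris-setosa',
              'Iris-virginica', 'Iris-virginica', 'Iris-virginica'],
})

# By default, a mixed DataFrame is summarized on its numeric columns only.
numeric_summary = df_small.describe()

# include='all' adds the text columns; statistics that do not apply are NaN.
full_summary = df_small.describe(include='all')

# value_counts lists every distinct value with its frequency;
# describe() reports only the most frequent one as top/freq.
counts = df_small['class'].value_counts()
class_summary = df_small['class'].describe()

print(list(numeric_summary.columns))                 # ['sepal-length']
print(class_summary['top'], class_summary['freq'])   # Iris-virginica 3
```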

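We can use the corr method to understand the correlation between numeric columns. Since pandas 2.0, corr raises an error if the DataFrame contains a text column such as class, so I pass numeric_only=True (available from pandas 1.5) to skip it.

df.corr(numeric_only=True)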

	sepal-length	sepal-width	petal-length	petal-width
sepal-length	1.000000	-0.109369	0.871754	0.817954
sepal-width	-0.109369	1.000000	-0.420516	-0.356544
petal-length	0.871754	-0.420516	1.000000	0.962757
petal-width	0.817954	-0.356544	0.962757	1.000000

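This method returns the correlation matrix. Each cell holds the correlation coefficient (Pearson by default) between a pair of columns, so the diagonal is always 1 and the matrix is symmetric.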
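A version-independent way to get the same matrix is to drop the text columns first. The sketch below assumes a hypothetical five-row frame df_small; select_dtypes and Series.corr are not used elsewhere in this post.

```python
import pandas as pd

# Hypothetical miniature of the Iris DataFrame, for illustration only.
df_small = pd.DataFrame({
    'sepal-length': [5.1, 4.9, 6.3, 5.8, 7.1],
    'petal-length': [1.4, 1.4, 6.0, 5.1, 5.9],
    'class': ['Iris-setosa', 'Iris-setosa',
              'Iris-virginica', 'Iris-virginica', 'Iris-virginica'],
})

# Keep only the numeric columns, then compute the correlation matrix.
numeric_only_df = df_small.select_dtypes(include='number')
corr_matrix = numeric_only_df.corr()

# For a single pair of columns, Series.corr returns one number.
pair_corr = df_small['sepal-length'].corr(df_small['petal-length'])

print(corr_matrix)
print(pair_corr)
```

The pairwise value matches the corresponding off-diagonal cell of the matrix.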

Sorting

We can also sort the data.

This command sorts on the class column in descending order.

df.sort_values(by='class', ascending=False)

	sepal-length	sepal-width	petal-length	petal-width	class
149	5.9	3.0	5.1	1.8	Iris-virginica
111	6.4	2.7	5.3	1.9	Iris-virginica
122	7.7	2.8	6.7	2.0	Iris-virginica
121	5.6	2.8	4.9	2.0	Iris-virginica
120	6.9	3.2	5.7	2.3	Iris-virginica
...	...	...	...	...	...
31	5.4	3.4	1.5	0.4	Iris-setosa
30	4.8	3.1	1.6	0.2	Iris-setosa
29	4.7	3.2	1.6	0.2	Iris-setosa
28	5.2	3.4	1.4	0.2	Iris-setosa
0	5.1	3.5	1.4	0.2	Iris-setosa

If you have sorted on a column and want to restore the original order, you can sort by index. If you look at the index column in the above result, the index is no longer in order because we sorted on the class column. Now let's look at the example below.

df.sort_index()

	sepal-length	sepal-width	petal-length	petal-width	class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	Iris-virginica
146	6.3	2.5	5.0	1.9	Iris-virginica
147	6.5	3.0	5.2	2.0	Iris-virginica
148	6.2	3.4	5.4	2.3	Iris-virginica
149	5.9	3.0	5.1	1.8	Iris-virginica

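As you can observe, the data is sorted on index. This is the same order as in the original DataFrame. Note that sort_values and sort_index both return a new DataFrame and leave df itself unchanged, so the result has to be assigned to a variable if you want to keep it.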
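Here is a small sketch of that round trip, assuming a hypothetical four-row frame df_small. Sorting on two columns and reset_index are my own additions to the commands covered above.

```python
import pandas as pd

# Hypothetical miniature of the Iris DataFrame, for illustration only.
df_small = pd.DataFrame({
    'sepal-length': [5.1, 4.9, 6.3, 5.8],
    'class': ['Iris-setosa', 'Iris-setosa',
              'Iris-virginica', 'Iris-virginica'],
})

# Sort by class (descending), then by sepal-length (ascending) within a class.
sorted_df = df_small.sort_values(by=['class', 'sepal-length'],
                                 ascending=[False, True])

# The original index labels travel with their rows...
print(list(sorted_df.index))   # [3, 2, 1, 0]
# ...and the original frame is untouched.
print(list(df_small.index))    # [0, 1, 2, 3]

# sort_index restores the original order.
restored = sorted_df.sort_index()

# reset_index(drop=True) keeps the sorted order and renumbers from 0.
renumbered = sorted_df.reset_index(drop=True)
print(renumbered)
```

Use sort_index when the old order matters, and reset_index(drop=True) when the sorted order is the one you want to keep working with.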

This is not an exhaustive list of the commands available for data analysis and manipulation. I picked a few of them to understand the capabilities of the Pandas library. You can refer to the Pandas documentation for the complete list.

The ML Journey of a Developer
