Data Analysis and Manipulation using Pandas

So far I have learnt how to apply ML algorithms on "clean data". In practical, we don't get clean data. We get a lot of raw data and we have to analyze and transform the data which can then be used for training.

What I got to know is Pandas library can be used for data analysis and manipulation. So I spent some time learning the useful commands in Pandas. In this post, I'll explain some of these commands.

For this exercise, I'll use a familiar Iris data set. I used this data set during my initial days of learning ML.

The very first thing we need to do is to load the Pandas library.

import pandas as pd
from pd import read_csv

Then I'll read data from the CSV file and store it into a DataFrame object.

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length''sepal-width''petal-length''petal-width''class']
df = pd.read_csv(url, names=names)

Data Exploration 

DataFrame object is like a matrix or a table with rows and columns. Now lets say I want to check how the data looks like. I can use head method of DataFrame object...

df.head()

This method returns top 5 records as shown below.

sepal-lengthsepal-widthpetal-lengthpetal-widthclass
05.13.51.40.2Iris-setosa
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa
34.63.11.50.2Iris-setosa
45.03.61.40.2Iris-setosa


I can pass a number to the head method to get specific number of rows. For example,

df.head(10)

returns top 10 rows. There is another method called tail which returns bottom n rows.

df.tail(10)

sepal-lengthsepal-widthpetal-lengthpetal-widthclass
1406.73.15.62.4Iris-virginica
1416.93.15.12.3Iris-virginica
1425.82.75.11.9Iris-virginica
1436.83.25.92.3Iris-virginica
1446.73.35.72.5Iris-virginica
1456.73.05.22.3Iris-virginica
1466.32.55.01.9Iris-virginica
1476.53.05.22.0Iris-virginica
1486.23.45.42.3Iris-virginica
1495.93.05.11.8Iris-virginica

If you want to see data of a specific column, you can provide the column name:

df["class"]

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: class, Length: 150, dtype: object

I can also fetch rows based on index/ position using iloc method.

df.iloc[1:3]

The above command returns this result.

sepal-lengthsepal-widthpetal-lengthpetal-widthclass
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa

As you can observe it returned rows from position 1 to position 2 (3 - 1). Similarly you can specify positions for columns as well.

df.iloc[:,:2]

This returns only first 2 columns.

sepal-lengthsepal-width
05.13.5
14.93.0
24.73.2
34.63.1
45.03.6
.........
1456.73.0
1466.32.5
1476.53.0
1486.23.4
1495.93.0


Filter

Another feature I really liked is ability to apply Filters to the DataFrame object. 

For instance, with this command I could apply a filter on class column.

df[df["class"] == 'Iris-virginica']

sepal-lengthsepal-widthpetal-lengthpetal-widthclass
1006.33.36.02.5Iris-virginica
1015.82.75.11.9Iris-virginica
1027.13.05.92.1Iris-virginica
1036.32.95.61.8Iris-virginica

Data Aggregation

I can also check certain aggregation values on the data set. Lets first look at on the individual columns.

This command gets me the mean value of "sepal-length" column.

df["sepal-length"].mean()

5.843333

If I want to get all numerical data for that column, I can use describe method

df["sepal-length"].describe()

count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal-length, dtype: float64

What if I apply describe method on a non-numerical column?

df["class"].describe()

count             150
unique              3
top       Iris-setosa
freq               50
Name: class, dtype: object

As you can observe, for text columns it shows relevant data points.

We can use corr method to understand correlation between columns.

df.corr()

sepal-lengthsepal-widthpetal-lengthpetal-width
sepal-length1.000000-0.1093690.8717540.817954
sepal-width-0.1093691.000000-0.420516-0.356544
petal-length0.871754-0.4205161.0000000.962757
petal-width0.817954-0.3565440.9627571.000000

This method shows correlation matrix.

Sorting


We can also do data sorting.

This command sorts on class column by Descending order.

df.sort_values(by='class'ascending=False)

sepal-lengthsepal-widthpetal-lengthpetal-widthclass
1495.93.05.11.8Iris-virginica
1116.42.75.31.9Iris-virginica
1227.72.86.72.0Iris-virginica
1215.62.84.92.0Iris-virginica
1206.93.25.72.3Iris-virginica
..................
315.43.41.50.4Iris-setosa
304.83.11.60.2Iris-setosa
294.73.21.60.2Iris-setosa
285.23.41.40.2Iris-setosa
05.13.51.40.2Iris-setosa

In case you had sorted on columns and you want to reset the order, you can sort by index. That is if you look at index column in the above result, the index is not sorted as we sorted based on class column. Now lets look at the below example.

df.sort_index()

sepal-lengthsepal-widthpetal-lengthpetal-widthclass
05.13.51.40.2Iris-setosa
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa
34.63.11.50.2Iris-setosa
45.03.61.40.2Iris-setosa
..................
1456.73.05.22.3Iris-virginica
1466.32.55.01.9Iris-virginica
1476.53.05.22.0Iris-virginica
1486.23.45.42.3Iris-virginica
1495.93.05.11.8Iris-virginica

 
As you can observe, the data is sorted on index. This is the same order as the original DataFrame.

This is not an exhaustive list of commands available for data analysis and manipulation. I picked some of them to understand the capabilities of Pandas library. You can refer Pandas documentation to get the complete list of commands.

Comments

Popular posts from this blog

Ordinal Encoder, OneHotEncoder and LabelBinarizer in Python

Natural Language Toolkit (NLTK)

Failed to create Anaconda menus