So far I have learnt how to apply ML algorithms to "clean data". In practice, we don't get clean data. We get a lot of raw data, which we have to analyze and transform before it can be used for training.
I found out that the Pandas library can be used for data analysis and manipulation, so I spent some time learning its most useful commands. In this post, I'll explain some of them.
For this exercise, I'll use the familiar Iris data set, which I used during my initial days of learning ML.
The very first thing we need to do is to load the Pandas library.
import pandas as pd
Then I'll read data from the CSV file and store it into a DataFrame object.
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
df = pd.read_csv(url, names=names)
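If you want to try the same call without a network connection, here is a minimal sketch. It assumes a few rows copied from the Iris file and inlined as a string; the names `csv_text` and `sample` are mine and used only for illustration.

```python
import io

import pandas as pd

# A few rows copied from the Iris file, inlined so the sketch runs offline.
csv_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
6.3,3.3,6.0,2.5,Iris-virginica
"""

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# The file has no header row. Passing names= makes read_csv treat the
# first line as data instead of as column names.
sample = pd.read_csv(io.StringIO(csv_text), names=names)

print(sample.shape)   # (3, 5)
print(sample.dtypes)  # one dtype per column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the columns were parsed the way you expected.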
Data Exploration
A DataFrame object is like a table with rows and columns. Let's say I want to check what the data looks like. I can use the head method of the DataFrame object:
df.head()
This method returns the first 5 rows, as shown below.
| | sepal-length | sepal-width | petal-length | petal-width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
I can pass a number to the head method to get a specific number of rows. For example, df.head(10) returns the first 10 rows. There is another method called tail, which returns the last n rows. Here is the output of df.tail(10):
| | sepal-length | sepal-width | petal-length | petal-width | class |
|---|---|---|---|---|---|
| 140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
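To make the behaviour of head and tail easier to see, here is a small sketch. It assumes a hypothetical 12-row frame called `toy` (not part of the Iris data) so that the row counts are easy to check.

```python
import pandas as pd

# Hypothetical toy frame with 12 rows, used only to show the row counts.
toy = pd.DataFrame({"value": range(12)})

first_default = toy.head()   # no argument: first 5 rows
first_three = toy.head(3)    # first 3 rows
last_three = toy.tail(3)     # last 3 rows, original index labels are kept

print(len(first_default))         # 5
print(last_three.index.tolist())  # [9, 10, 11]
```

Note that tail keeps the original index labels, which is why the Iris output above starts at 140 rather than 0.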
If you want to see the data of a specific column, you can index the DataFrame with the column name. For example, df["class"] returns:
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
Name: class, Length: 150, dtype: object
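One detail worth knowing: selecting a single column gives back a Series, while selecting a list of columns gives back a DataFrame. Here is a minimal sketch, assuming a three-row frame named `sample` built from rows of the Iris data.

```python
import pandas as pd

# Three rows taken from the Iris data, enough to show the two selection forms.
sample = pd.DataFrame({
    "sepal-length": [5.1, 4.9, 6.3],
    "sepal-width": [3.5, 3.0, 3.3],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

one_column = sample["class"]                           # single name -> Series
two_columns = sample[["sepal-length", "sepal-width"]]  # list of names -> DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```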
I can also fetch rows by integer position using the iloc indexer. For example, df.iloc[1:3] returns this result.
| | sepal-length | sepal-width | petal-length | petal-width | class |
|---|---|---|---|---|---|
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
As you can observe, it returned the rows at positions 1 and 2: the start position is included and the end position (3) is excluded. Similarly, you can specify positions for columns as well. For example, df.iloc[:, 0:2] returns only the first 2 columns.
| | sepal-length | sepal-width |
|---|---|---|
| 0 | 5.1 | 3.5 |
| 1 | 4.9 | 3.0 |
| 2 | 4.7 | 3.2 |
| 3 | 4.6 | 3.1 |
| 4 | 5.0 | 3.6 |
| ... | ... | ... |
| 145 | 6.7 | 3.0 |
| 146 | 6.3 | 2.5 |
| 147 | 6.5 | 3.0 |
| 148 | 6.2 | 3.4 |
| 149 | 5.9 | 3.0 |
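The position-based slicing of iloc is easy to confuse with label-based slicing. Here is a short sketch of both, assuming a four-row frame named `sample`. The loc indexer is not covered in this post; I include it only for contrast.

```python
import pandas as pd

# First four rows of the Iris data.
sample = pd.DataFrame({
    "sepal-length": [5.1, 4.9, 4.7, 4.6],
    "sepal-width": [3.5, 3.0, 3.2, 3.1],
    "class": ["Iris-setosa"] * 4,
})

by_position = sample.iloc[1:3]        # positions 1 and 2; the end is exclusive
by_label = sample.loc[1:3]            # labels 1, 2 and 3; the end is inclusive
first_two_cols = sample.iloc[:, 0:2]  # all rows, first two columns

print(len(by_position))              # 2
print(len(by_label))                 # 3
print(list(first_two_cols.columns))  # ['sepal-length', 'sepal-width']
```

With the default 0, 1, 2, ... index the two look alike, so the off-by-one difference at the end of the slice is the thing to watch for.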
Filter
Another feature I really liked is the ability to filter the rows of a DataFrame object.
For instance, this command keeps only the rows whose class column equals 'Iris-virginica'.
df[df["class"] == 'Iris-virginica']
| | sepal-length | sepal-width | petal-length | petal-width | class |
|---|---|---|---|---|---|
| 100 | 6.3 | 3.3 | 6.0 | 2.5 | Iris-virginica |
| 101 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
| 102 | 7.1 | 3.0 | 5.9 | 2.1 | Iris-virginica |
| 103 | 6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
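The expression inside the brackets is a boolean Series (a mask), and masks can be combined. Here is a minimal sketch, assuming a four-row frame named `sample` and a threshold of 6.0 that I picked only for illustration.

```python
import pandas as pd

# Four rows taken from the Iris data.
sample = pd.DataFrame({
    "sepal-length": [5.1, 6.3, 5.8, 7.1],
    "class": ["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-virginica"],
})

# Each comparison yields a boolean Series. Combine them with & (and) or | (or),
# and wrap each comparison in parentheses because & binds tighter than == and >.
mask = (sample["class"] == "Iris-virginica") & (sample["sepal-length"] > 6.0)
filtered = sample[mask]

print(filtered.index.tolist())  # [1, 3]
```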
Data Aggregation
I can also compute aggregate values on the data set. Let's first look at individual columns.
This command gets me the mean value of the "sepal-length" column.
df["sepal-length"].mean()
5.843333
If I want the full set of summary statistics for that column, I can use the describe method.
df["sepal-length"].describe()
count 150.000000
mean 5.843333
std 0.828066
min 4.300000
25% 5.100000
50% 5.800000
75% 6.400000
max 7.900000
Name: sepal-length, dtype: float64
What if I apply the describe method to a non-numerical column, as in df["class"].describe()?
count 150
unique 3
top Iris-setosa
freq 50
Name: class, dtype: object
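As you can observe, for text columns it shows a different set of statistics: the number of values, the number of unique values, the most frequent value and how often it occurs.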
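Here is a small sketch of where those text-column statistics come from, assuming a four-row frame named `sample`. The value_counts method is not covered in this post; I use it only to show the counts behind "top" and "freq".

```python
import pandas as pd

# Four class labels taken from the Iris data.
sample = pd.DataFrame({
    "class": ["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-virginica"],
})

summary = sample["class"].describe()
counts = sample["class"].value_counts()  # occurrences of each distinct value

print(summary["unique"])  # 2
print(summary["top"])     # Iris-virginica
print(summary["freq"])    # 3
```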
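We can use the corr method to understand the correlation between columns. Because the class column is text, recent pandas versions need it excluded, for example with df.corr(numeric_only=True).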
| | sepal-length | sepal-width | petal-length | petal-width |
|---|---|---|---|---|
| sepal-length | 1.000000 | -0.109369 | 0.871754 | 0.817954 |
| sepal-width | -0.109369 | 1.000000 | -0.420516 | -0.356544 |
| petal-length | 0.871754 | -0.420516 | 1.000000 | 0.962757 |
| petal-width | 0.817954 | -0.356544 | 0.962757 | 1.000000 |
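The result is the correlation matrix: each cell holds the correlation between the row's column and the column's column, so the diagonal is always 1.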
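Here is a minimal sketch of computing a correlation matrix when the frame also contains a text column. It assumes a four-row frame named `sample`, and it uses select_dtypes to keep only the numeric columns first, which is one way (not the only one) to avoid an error on the text column.

```python
import pandas as pd

# Four rows taken from the Iris data, including the text column.
sample = pd.DataFrame({
    "petal-length": [1.4, 1.4, 6.0, 5.1],
    "petal-width": [0.2, 0.2, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = sample.select_dtypes(include="number")
corr = numeric.corr()

print(list(corr.columns))  # ['petal-length', 'petal-width']
```

The matrix is symmetric, so `corr.loc["petal-length", "petal-width"]` and `corr.loc["petal-width", "petal-length"]` hold the same value.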
Sorting
We can also sort the data.
This command sorts the rows by the class column in descending order.
df.sort_values(by='class', ascending=False)
| | sepal-length | sepal-width | petal-length | petal-width | class |
|---|---|---|---|---|---|
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
| 111 | 6.4 | 2.7 | 5.3 | 1.9 | Iris-virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
| 121 | 5.6 | 2.8 | 4.9 | 2.0 | Iris-virginica |
| 120 | 6.9 | 3.2 | 5.7 | 2.3 | Iris-virginica |
| ... | ... | ... | ... | ... | ... |
| 31 | 5.4 | 3.4 | 1.5 | 0.4 | Iris-setosa |
| 30 | 4.8 | 3.1 | 1.6 | 0.2 | Iris-setosa |
| 29 | 4.7 | 3.2 | 1.6 | 0.2 | Iris-setosa |
| 28 | 5.2 | 3.4 | 1.4 | 0.2 | Iris-setosa |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
If you have sorted on a column and want to get back the original order, you can sort by index. If you look at the index column in the above result, the index is no longer in order because we sorted on the class column. Calling the sort_index method on that sorted result gives the following.
| | sepal-length | sepal-width | petal-length | petal-width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
As you can observe, the data is sorted on index. This is the same order as the original DataFrame.
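Here is a small sketch of the sort and restore round trip, assuming a three-row frame named `sample`. The variable names `by_class` and `restored` are mine.

```python
import pandas as pd

# Three rows taken from the Iris data.
sample = pd.DataFrame({
    "sepal-length": [5.1, 6.3, 5.8],
    "class": ["Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

# sort_values returns a new DataFrame; sample itself is left unchanged.
by_class = sample.sort_values(by="class", ascending=False)

# Sorting the result by its index labels brings back the original row order.
restored = by_class.sort_index()

print(by_class["class"].tolist())  # ['Iris-virginica', 'Iris-virginica', 'Iris-setosa']
print(restored.equals(sample))     # True
```

Since sort_values does not modify the frame it is called on, the original DataFrame keeps its order unless you assign the result back to it.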
This is not an exhaustive list of the commands available for data analysis and manipulation. I picked some of them to understand the capabilities of the Pandas library. You can refer to the Pandas documentation for the complete list of commands.