Chapter 8 - Plotting and Visualization

In [1]:
from pandas import DataFrame, Series
import pandas as pd
iris_data = pd.read_csv("/home/mridul/nilmtk/iris.data", names=['Sepal Length', 'Sepal Width', 'Petal Length', \
                                                                'Petal Width', 'Class'])

We read the given data file and load it into a DataFrame object, supplying the column names ourselves because the file has no header row.
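
As a minimal sketch of the same loading step without the local file, the snippet below reads three rows (copied from the outputs later in this post) from an in-memory string. It is written for Python 3 and a recent pandas, which is an assumption; the names `raw`, `columns` and `sample` are only illustrative.

```python
import io
import pandas as pd

# Three rows copied from the outputs in this post, in the same
# headerless, comma-separated layout as iris.data.
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']

# header=None states that the data has no header row;
# names= supplies the column labels instead.
sample = pd.read_csv(raw, header=None, names=columns)
print(sample.dtypes)
```

Passing `names` on its own already implies that there is no header row; spelling out `header=None` just makes the intent explicit.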

In [2]:
iris_data[:5]
Out[2]:
   Sepal Length  Sepal Width  Petal Length  Petal Width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
In [55]:
print 'Distribution of Dataset as per classes'
class_count = iris_data['Class'].value_counts()
print class_count
Distribution of Dataset as per classes
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64

Initial distribution of the dataset across its classes: each of the three classes has 50 samples.

In [20]:
%matplotlib inline
import matplotlib.pyplot as plt
class_count.plot(kind='barh', rot=0, xlim=(0,60))
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x9777450>
In [34]:
print "Stacked area plot of all lengths and widths"
iris_data.plot(kind='area', color=['red','blue','green','orange'], ylim=(0,30), title="Area Graph")
plt.show()
Stacked area plot of all lengths and widths

Because the dataset is sorted by class, the stacked area plot shows how the summed lengths and breadths change from one class to the next.

In [52]:
print "We will now try to measure the overall and class-wise means and standard deviations respectively."
print iris_data.ix[[2,78,132]] # a few arbitrary rows to test ix
We will now try to measure the overall and class-wise means and standard deviations respectively.
     Sepal Length  Sepal Width  Petal Length  Petal Width            Class
2             4.7          3.2           1.3          0.2      Iris-setosa
78            6.0          2.9           4.5          1.5  Iris-versicolor
132           6.4          2.8           5.6          2.2   Iris-virginica
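
The `.ix` indexer used above was deprecated in pandas 0.20 and removed in pandas 1.0. On a newer installation, the same label-based lookup can be sketched with `.loc` (or `.iloc` for integer positions). The small frame below holds the three rows from the output above, with two of their columns, so that the example is self-contained; the names `sample`, `by_label` and `by_position` are illustrative, and the code assumes Python 3.

```python
import pandas as pd

# The three rows from the output above, keeping their original index labels.
sample = pd.DataFrame(
    {
        'Sepal Length': [4.7, 6.0, 6.4],
        'Petal Length': [1.3, 4.5, 5.6],
        'Class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
    },
    index=[2, 78, 132],
)

# .loc selects rows by index label, .iloc by integer position.
by_label = sample.loc[[2, 132]]
by_position = sample.iloc[[0, 2]]
print(by_label)
```

Both selections return the same two rows here, because labels 2 and 132 sit at positions 0 and 2 of this small frame.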

In [53]:
data_mean = iris_data.mean()
data_std = iris_data.std()
print "Iris Dataset overall mean\n",data_mean
print '\nIris Dataset overall standard deviation\n',data_std
Iris Dataset overall mean
Sepal Length    5.843333
Sepal Width     3.054000
Petal Length    3.758667
Petal Width     1.198667
dtype: float64

Iris Dataset overall standard deviation
Sepal Length    0.828066
Sepal Width     0.433594
Petal Length    1.764420
Petal Width     0.763161
dtype: float64

In [60]:
class_val =  iris_data[:-1]['Class'].unique()
for i in class_val:
    print i,
Iris-setosa Iris-versicolor Iris-virginica

In [65]:
for i in class_val:    
    cur_class = iris_data[iris_data['Class'] == i]
    print cur_class[:2]
   Sepal Length  Sepal Width  Petal Length  Petal Width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
    Sepal Length  Sepal Width  Petal Length  Petal Width            Class
50           7.0          3.2           4.7          1.4  Iris-versicolor
51           6.4          3.2           4.5          1.5  Iris-versicolor
     Sepal Length  Sepal Width  Petal Length  Petal Width           Class
100           6.3          3.3           6.0          2.5  Iris-virginica
101           5.8          2.7           5.1          1.9  Iris-virginica

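Next we compute the mean and standard deviation for every class and collect them in a new DataFrame.

Each row also records how far the class value lies from the overall mean and the overall standard deviation. These differences are stored in the 'Mean Var' and 'Dev Var' columns; they are plain differences, not variances.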

In [108]:
data_mean = iris_data.mean()
data_std = iris_data.std()
data = []
for j in list(iris_data.columns)[:-1]:
    data+=[('All', j, data_mean[j], 0.0, data_std[j], 0.0)]
#Initialized the Dataset to be added to the Dataframe

for i in list(class_val):
    data_mean_temp = iris_data[iris_data['Class'] == i].mean()
    data_std_temp = iris_data[iris_data['Class'] == i].std()
    for j in list(iris_data.columns)[:-1]:
        mean_diff_temp = data_mean_temp[j] - data_mean[j]
        std_diff_temp = data_std_temp[j] - data_std[j]
        data+=[(i, j, data_mean_temp[j], mean_diff_temp, data_std_temp[j], std_diff_temp)]
plot_df = pd.DataFrame(data,columns=['Class','Type','Mean','Mean Var','Deviation','Dev Var'])
print plot_df.sort('Class')
              Class          Type      Mean  Mean Var  Deviation   Dev Var
0               All  Sepal Length  5.843333  0.000000   0.828066  0.000000
1               All   Sepal Width  3.054000  0.000000   0.433594  0.000000
2               All  Petal Length  3.758667  0.000000   1.764420  0.000000
3               All   Petal Width  1.198667  0.000000   0.763161  0.000000
4       Iris-setosa  Sepal Length  5.006000 -0.837333   0.352490 -0.475576
5       Iris-setosa   Sepal Width  3.418000  0.364000   0.381024 -0.052570
6       Iris-setosa  Petal Length  1.464000 -2.294667   0.173511 -1.590909
7       Iris-setosa   Petal Width  0.244000 -0.954667   0.107210 -0.655951
8   Iris-versicolor  Sepal Length  5.936000  0.092667   0.516171 -0.311895
9   Iris-versicolor   Sepal Width  2.770000 -0.284000   0.313798 -0.119796
10  Iris-versicolor  Petal Length  4.260000  0.501333   0.469911 -1.294509
11  Iris-versicolor   Petal Width  1.326000  0.127333   0.197753 -0.565408
12   Iris-virginica  Sepal Length  6.588000  0.744667   0.635880 -0.192187
13   Iris-virginica   Sepal Width  2.974000 -0.080000   0.322497 -0.111098
14   Iris-virginica  Petal Length  5.552000  1.793333   0.551895 -1.212526
15   Iris-virginica   Petal Width  2.026000  0.827333   0.274650 -0.488511
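
The nested loop above can also be expressed with `groupby`. The following is a sketch rather than a drop-in replacement: it assumes Python 3 and a recent pandas, and it uses only six rows copied from the earlier outputs (two per class), so its numbers differ from the table above. All variable names are illustrative.

```python
import pandas as pd

# Six rows copied from the outputs above (two per class). The full
# dataset has more rows, so these statistics will not match the table.
sample = pd.DataFrame({
    'Sepal Length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'Sepal Width':  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'Petal Length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'Petal Width':  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'Class': ['Iris-setosa'] * 2 + ['Iris-versicolor'] * 2 + ['Iris-virginica'] * 2,
})

# Overall statistics over the numeric columns only.
measurements = sample.drop(columns='Class')
overall_mean = measurements.mean()
overall_std = measurements.std()

# Per-class statistics: one row per class, one column per measurement.
grouped = sample.groupby('Class')
class_mean = grouped.mean()
class_std = grouped.std()

# Subtracting a Series from a DataFrame aligns on the column names,
# so every class row is compared with the matching overall value.
mean_diff = class_mean - overall_mean
std_diff = class_std - overall_std
print(mean_diff)
```

The result is in "wide" form (classes as rows, measurements as columns) instead of the long form of `plot_df`, which is a design choice of this sketch and not of the original code.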

In [145]:
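# The first four columns are the measurements; [:4] selects them
# whether or not extra columns have been added to iris_data.
for j in list(iris_data.columns)[:4]: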
    plot_df[plot_df['Type'] == j].plot(kind='bar', title=j,x='Class',figsize=(9, 4))
    plt.axhline(data_mean[j], color='black')
    plt.axhline(data_std[j], color='black')

From the analysis above we can tell, for each class, which measurements lie above the overall average and which lie below it. We can also compare the class-wise standard deviations with the overall ones to see how widely each class is spread relative to the dataset as a whole.

Example: the mean petal width of Iris-virginica (2.026) is well above the overall mean (1.199). Its standard deviation (0.275), however, is much lower than the overall value (0.763), which suggests tightly clustered values of high magnitude.
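
The comparison in the example can be sketched as a small bar chart with a reference line. The petal-width means and the overall mean are taken from the table above. The `Agg` backend, the dashed line style and the variable names are assumptions made so that the snippet runs on its own under Python 3, outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Class-wise Petal Width means, taken from the table above.
means = pd.Series(
    [0.244, 1.326, 2.026],
    index=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
    name='Mean',
)
overall_mean = 1.198667  # overall Petal Width mean from the same table

# One bar per class plus a horizontal line at the overall mean.
ax = means.plot(kind='bar', title='Petal Width', rot=0, figsize=(9, 4))
ax.axhline(overall_mean, color='black', linestyle='--', label='Overall mean')
ax.legend()

# The same comparison as plain data: classes whose mean exceeds the overall mean.
above_average = means[means > overall_mean].index.tolist()
print(above_average)
```

Inside a notebook with `%matplotlib inline`, the `matplotlib.use("Agg")` line can be dropped.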

In [154]:
iris_data['Total Length'] = iris_data['Sepal Length'] + iris_data['Petal Length']
iris_data['Total Width'] = iris_data['Sepal Width'] + iris_data['Petal Width']
iris_data['Total'] = iris_data['Total Length'] + iris_data['Total Width']
print iris_data[:5]
   Sepal Length  Sepal Width  Petal Length  Petal Width        Class  \
0           5.1          3.5           1.4          0.2  Iris-setosa   
1           4.9          3.0           1.4          0.2  Iris-setosa   
2           4.7          3.2           1.3          0.2  Iris-setosa   
3           4.6          3.1           1.5          0.2  Iris-setosa   
4           5.0          3.6           1.4          0.2  Iris-setosa   

   Total Length  Total Width  Total  
0           6.5          3.7   10.2  
1           6.3          3.2    9.5  
2           6.0          3.4    9.4  
3           6.1          3.3    9.4  
4           6.4          3.8   10.2  

The upper or lower percentile of a field can be obtained by sorting the data on that field and taking the required number of rows from the start of the sorted result.

In [188]:
print "Calculating Upper Percentile for any field that is required. \nE.g. the rows whose Total falls in the top 25%"
n = 25
values = int(iris_data.shape[0]*n/100.0)
print "Adding top", values,"values"
# [:values] keeps exactly `values` rows ([:values+1] would print one extra row)
print iris_data[:-1].sort(ascending=False, columns='Total')['Total'][:values]
print "\n\nCalculating Lower Percentile for any field that is required. \nE.g. the rows whose Total falls in the bottom 15%"
m = 15
values = int(iris_data.shape[0]*m/100.0)
print "Adding last", values,"values"
print iris_data[:-1].sort(ascending=True, columns='Total')['Total'][:values]
Calculating Upper Percentile for any field that is required. 
E.g. the rows whose Total falls in the top 25%
Adding top 37 values
117    20.4
131    20.1
118    19.5
109    19.4
105    19.3
122    19.2
135    19.1
107    18.3
125    18.2
130    18.2
144    18.2
143    18.2
100    18.1
120    18.1
102    18.1
124    17.8
140    17.8
136    17.7
129    17.6
139    17.5
104    17.5
141    17.4
112    17.4
148    17.3
145    17.2
115    17.2
132    17.0
128    16.9
116    16.8
110    16.8
108    16.8
137    16.8
147    16.7
103    16.6
77     16.4
52     16.4
50     16.3
Name: Total, dtype: float64


Calculating Lower Percentile for any field that is required. 
E.g. the rows whose Total falls in the bottom 15%
Adding last 22 values
41     8.4
13     8.5
38     8.9
8      8.9
42     9.1
12     9.3
22     9.4
3      9.4
2      9.4
47     9.4
45     9.5
1      9.5
9      9.6
34     9.6
35     9.6
37     9.6
6      9.7
29     9.7
30     9.7
25     9.8
49     9.9
11    10.0
Name: Total, dtype: float64
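
The sort-and-slice approach above can also be written with `nlargest` and `nsmallest`, and the cut-off value itself is available from `quantile`. This is a sketch under stated assumptions: it is written for Python 3 and a recent pandas, and `totals` holds only eight 'Total' values copied from the output above, not the whole column. The variable names are illustrative.

```python
import pandas as pd

# Eight 'Total' values copied from the output above, with their row labels.
totals = pd.Series(
    [20.4, 20.1, 19.5, 19.4, 8.4, 8.5, 8.9, 8.9],
    index=[117, 131, 118, 109, 41, 13, 38, 8],
    name='Total',
)

n = 25  # percentage of rows to keep
count = int(len(totals) * n / 100.0)

# Largest and smallest `count` values, already sorted.
top = totals.nlargest(count)
bottom = totals.nsmallest(count)

# quantile() returns the value that separates the top 25% from the rest;
# filtering on it is an alternative to counting rows.
cutoff = totals.quantile(0.75)
above_cutoff = totals[totals >= cutoff]
print(top)
print(bottom)
```

Counting rows and filtering on a quantile can disagree when several rows are tied at the cut-off, so which one to use depends on whether a fixed number of rows or a fixed threshold is wanted.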

Finally, we plot the total length, the total width and the sum of all dimensions for each class.

We also plot the whole dataset in a single graph to estimate which class has the largest and which the smallest dimensions on average.

In [199]:
for i in iris_data[:-1]['Class'].unique():
    (iris_data[iris_data['Class'] == i])[['Total Length', 'Total Width', 'Total']].plot(kind="area", stacked=True, title=i)
iris_data[['Total Length', 'Total Width', 'Total']].plot(kind="area", stacked=True, title="Overall")
Out[199]:
<matplotlib.axes._subplots.AxesSubplot at 0xea5cc10>
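
To put numbers on what the area plots show, the per-class averages of the three derived columns can be computed directly. The sketch below assumes Python 3 and a recent pandas, and uses six rows copied from the earlier outputs (two per class), so the averages only illustrate the method and are not the full-dataset values. The variable names are illustrative.

```python
import pandas as pd

# Six rows copied from the outputs earlier in this post (two per class).
sample = pd.DataFrame({
    'Sepal Length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'Sepal Width':  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'Petal Length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'Petal Width':  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'Class': ['Iris-setosa'] * 2 + ['Iris-versicolor'] * 2 + ['Iris-virginica'] * 2,
})

# Same derived columns as in the post.
sample['Total Length'] = sample['Sepal Length'] + sample['Petal Length']
sample['Total Width'] = sample['Sepal Width'] + sample['Petal Width']
sample['Total'] = sample['Total Length'] + sample['Total Width']

# Average of each derived column per class.
class_totals = sample.groupby('Class')[['Total Length', 'Total Width', 'Total']].mean()

# Classes with the largest and smallest average 'Total'.
largest = class_totals['Total'].idxmax()
smallest = class_totals['Total'].idxmin()
print(class_totals)
print(largest, smallest)
```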

Mridul
