Chapter 10 - Time Series

Basic examples of datetime operations and date parsing

In [4]:
from datetime import datetime
now = datetime.now()
print(now)
2014-09-24 08:41:11.787335

In [5]:
from dateutil.parser import parse
parse('December 23rd 1994')
Out[5]:
datetime.datetime(1994, 12, 23, 0, 0)
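
When the layout of the incoming string is known in advance, the standard library's `datetime.strptime` is a stricter alternative to the free-form `dateutil` parser. The following is a minimal sketch written for Python 3; the sample string and the format string are assumptions that mimic the day-first layout of the dataset used later in this chapter.

```python
from datetime import datetime, timedelta

# Explicit format: day/month/year followed by a 24-hour time.
stamp = datetime.strptime("16/12/2006 17:24:00", "%d/%m/%Y %H:%M:%S")

# datetime arithmetic is expressed with timedelta objects.
later = stamp + timedelta(minutes=30)
elapsed = later - stamp

print(stamp.isoformat())
print(later.isoformat())
print(elapsed.total_seconds())
```

A string that does not match the format raises `ValueError`, which makes malformed rows easy to spot.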

Using a sample dataset to check plotting of a Series indexed by datetime objects

Axis labelling is also checked and a sample graph is plotted.

In [6]:
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
# Six irregularly spaced dates; no backslash is needed inside brackets
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
         datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]
sample_series = Series(np.random.randn(len(dates)), index=dates)
In [12]:
%matplotlib inline
import matplotlib.pyplot as plt
print(sample_series)
sample_series.plot(kind="line")
2011-01-02    0.454966
2011-01-05    1.157900
2011-01-07    0.211562
2011-01-08    1.562106
2011-01-10   -0.738559
2011-01-12   -1.094254
dtype: float64

Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x4c2d210>

Using the household power consumption dataset from the UCI Machine Learning Repository for a range of time series operations.

Because the dataset is large (more than 2,000,000 rows), we slice it and keep only the first 1000 records.

In [29]:
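# names= overrides the header, so the file's header line is read as data row 0;
# the [1:1001] slice drops it and keeps the first 1000 records.
column_names = ['Date', 'Time', 'Global Active Power', 'Global Reactive Power', 'Voltage',
                'Global Intensity', 'Sub-metering 1', 'Sub-metering 2', 'Sub-metering 3']
house_data = pd.read_csv("/home/mridul/nilmtk/household_power_consumption.txt",
                         sep=';', names=column_names)[1:1001]
print(house_data[:5])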
         Date      Time Global Active Power Global Reactive Power  Voltage  \
1  16/12/2006  17:24:00               4.216                 0.418  234.840   
2  16/12/2006  17:25:00               5.360                 0.436  233.630   
3  16/12/2006  17:26:00               5.374                 0.498  233.290   
4  16/12/2006  17:27:00               5.388                 0.502  233.740   
5  16/12/2006  17:28:00               3.666                 0.528  235.680   

  Global Intensity Sub-metering 1 Sub-metering 2 Sub-metering 3  
1           18.400          0.000          1.000         17.000  
2           23.000          0.000          1.000         16.000  
3           23.000          0.000          2.000         17.000  
4           23.000          0.000          1.000         17.000  
5           15.800          0.000          1.000         17.000  
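
An alternative to reading the whole file and slicing afterwards is to let `read_csv` stop early. The sketch below, written for Python 3 and a recent pandas, uses a small in-memory stand-in for the file; the header names in the stand-in are assumptions, and the data values are copied from the rows printed above. Because the header line is consumed as a header rather than as data, the numeric columns arrive as floats instead of strings.

```python
import io
import pandas as pd

# Tiny stand-in for household_power_consumption.txt (semicolon separated,
# header row first). The header names here are illustrative assumptions.
raw = io.StringIO(
    "Date;Time;Global_active_power;Global_reactive_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840\n"
    "16/12/2006;17:25:00;5.360;0.436;233.630\n"
    "16/12/2006;17:26:00;5.374;0.498;233.290\n"
    "16/12/2006;17:27:00;5.388;0.502;233.740\n"
)

columns = ["Date", "Time", "Global Active Power", "Global Reactive Power", "Voltage"]

# header=0 marks the first line as a header to be replaced by `names`;
# nrows stops reading after that many data rows.
sample = pd.read_csv(raw, sep=";", header=0, names=columns, nrows=3)

print(sample.dtypes)
print(len(sample))
```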

In [54]:
# Dates in the file are day-first (DD/MM/YYYY), so say so explicitly
house_data['Timestamp'] = pd.to_datetime(house_data['Date'] + ' ' + house_data['Time'], dayfirst=True)
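
Since the dates are written day-first, an explicit format string removes any ambiguity when the strings are parsed. A minimal sketch for Python 3 and a recent pandas follows; the second row is a hypothetical value chosen because its day and month could be swapped, and it does not come from the dataset excerpt above.

```python
import pandas as pd

frame = pd.DataFrame({
    "Date": ["16/12/2006", "01/02/2007"],   # second row is a made-up example
    "Time": ["17:24:00", "08:00:00"],
})

# With an explicit format, "01/02/2007" is read as 1 February 2007,
# never as 2 January.
frame["Timestamp"] = pd.to_datetime(
    frame["Date"] + " " + frame["Time"], format="%d/%m/%Y %H:%M:%S"
)

# Using the timestamps as the index enables date-based selection later.
indexed = frame.set_index("Timestamp")
print(indexed.index)
```
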
In [73]:
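# global_ap is the Global Active Power Series built in cell In [103] below
# (the cells were executed out of order); a date string selects one whole day
len(global_ap['2006-12-16'])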
Out[73]:
396
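
The count of 396 can be reproduced without the data file. The sketch below (Python 3, recent pandas) assumes the first 1000 readings are contiguous one-minute samples starting at 17:24 on 16 December 2006, which matches the rows printed earlier; the values themselves are placeholders.

```python
import numpy as np
import pandas as pd

# 1000 one-minute readings starting at the first timestamp in the data.
index = pd.date_range("2006-12-16 17:24:00", periods=1000, freq="min")
readings = pd.Series(np.arange(1000, dtype="float64"), index=index)

# A date string selects every row that falls on that calendar day.
first_day = readings.loc["2006-12-16"]
second_day = readings.loc["2006-12-17"]

print(len(first_day))    # 17:24 through 23:59 inclusive
print(len(second_day))   # the remaining rows
```
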

Comparing overall active and reactive power graphically.

In [103]:
# Build float Series indexed by timestamp (the columns were read as strings)
global_ap = Series(house_data['Global Active Power'].astype('float64').tolist(), index=pd.DatetimeIndex(house_data['Timestamp']))
global_ap.name = 'Global Active Power'
ax1 = global_ap.plot(kind="area", color="red")
global_rp = Series(house_data['Global Reactive Power'].astype('float64').tolist(), index=pd.DatetimeIndex(house_data['Timestamp']))
global_rp.name = 'Global Reactive Power'
print('%s vs %s' % (global_ap.name, global_rp.name))
global_rp.plot(kind="area", color="green")

# Combining two Series area plots on one axes requires custom legend labels
handles, labels = ax1.get_legend_handles_labels()
ax1.legend(handles, [global_ap.name, global_rp.name])
Global Active Power vs Global Reactive Power

Out[103]:
<matplotlib.legend.Legend at 0x91eddd0>

Resampling at different frequencies (hour, half hour, day, etc.) to derive coarser series.

In [113]:
# Hourly resampling; this pandas version aggregates with the mean by default
temp_ap = global_ap.resample('H')
temp_ap.plot(kind="bar")
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x111d8550>
In [114]:
# Daily resampling, again using the default mean aggregation
temp2_ap = global_ap.resample('D')
temp2_ap.plot(kind="bar", color='green')
Out[114]:
<matplotlib.axes._subplots.AxesSubplot at 0x18af95d0>
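
The cells above rely on the pandas version of the time, in which `resample` returned the mean directly. In recent pandas versions `resample` returns a grouping object, and the aggregation has to be named explicitly. The sketch below (Python 3) uses synthetic readings rather than the dataset, and passes offset objects instead of frequency strings, since the accepted string aliases have changed between releases.

```python
import numpy as np
import pandas as pd

# Two hours of one-minute readings: the first hour is all 1.0, the second all 3.0.
index = pd.date_range("2006-12-16 17:00:00", periods=120, freq="min")
power = pd.Series(np.repeat([1.0, 3.0], 60), index=index)

# Each call names its aggregation explicitly.
hourly_mean = power.resample(pd.offsets.Hour()).mean()
half_hour_max = power.resample(pd.offsets.Minute(30)).max()
daily_sum = power.resample(pd.offsets.Day()).sum()

print(hourly_mean)
print(half_hour_max)
print(daily_sum)
```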

Some basic operations on the dataset to understand filtering and splitting based on various Timestamp and datetime inputs.

In [116]:
from pandas.tseries.offsets import Hour, Minute
minute = Minute(30)
minute
Out[116]:
<30 * Minutes>
In [129]:
# Half-hourly naive range and a daily range localized to Asia/Kolkata
time_temp = pd.date_range('2006-12-16', '2006-12-17', freq=minute)
time_local = pd.date_range('2006-12-16', '2006-12-17', freq='D', tz='Asia/Kolkata')
print(time_temp)
print(time_local)
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-12-16 00:00:00, ..., 2006-12-17 00:00:00]
Length: 49, Freq: 30T, Timezone: None
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-12-16 00:00:00+05:30, 2006-12-17 00:00:00+05:30]
Length: 2, Freq: D, Timezone: Asia/Kolkata

In [132]:
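# tz_convert needs a time-zone-aware index, so localize to Asia/Kolkata first
time_temp = time_temp.tz_localize('Asia/Kolkata').tz_convert('US/Eastern')
print(time_temp)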
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-12-15 13:30:00-05:00, ..., 2006-12-16 13:30:00-05:00]
Length: 49, Freq: 30T, Timezone: US/Eastern
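
The difference between localizing and converting is easy to miss: `tz_localize` attaches a zone to naive timestamps, while `tz_convert` re-expresses timestamps that already have one. A minimal sketch for Python 3 and a recent pandas, assuming the half-hourly range is meant to be read as Asia/Kolkata wall-clock time (which is what the converted output above implies):

```python
import pandas as pd

# A naive (time-zone-unaware) half-hourly range.
naive = pd.date_range("2006-12-16", "2006-12-17", freq=pd.offsets.Minute(30))

# tz_localize attaches a zone without shifting the wall-clock times.
kolkata = naive.tz_localize("Asia/Kolkata")

# tz_convert keeps the same instants but shows them in another zone.
eastern = kolkata.tz_convert("US/Eastern")

print(naive.tz)
print(kolkata[0])
print(eastern[0])
```

Calling `tz_convert` directly on the naive range raises a `TypeError` in recent pandas versions, so the localization step cannot be skipped there.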

Time zone operations: localize, convert, perform arithmetic, concatenate, etc.

In [138]:
sample_timestamp = time_local[0] + 3*minute
time_local[0], sample_timestamp
Out[138]:
(Timestamp('2006-12-16 00:00:00+0530', tz='Asia/Kolkata', offset='D'),
 Timestamp('2006-12-16 01:30:00+0530', tz='Asia/Kolkata', offset='D'))
In [143]:
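# Difference between the first converted timestamp and the first local timestamp
time_result = time_temp[0].tz_convert('Asia/Kolkata') - time_local[0]
# Convert the whole index back to Asia/Kolkata; this is the value displayed below
result = time_temp.tz_convert('Asia/Kolkata')
result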
Out[143]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2006-12-16 00:00:00+05:30, ..., 2006-12-17 00:00:00+05:30]
Length: 49, Freq: 30T, Timezone: Asia/Kolkata
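
Arithmetic on time-zone-aware timestamps can be checked on a single value. The sketch below (Python 3, recent pandas) uses the first timestamp of the localized range; converting both operands to the same zone before subtracting mirrors the cell above and is an assumption about the safest form rather than a requirement of every pandas version.

```python
import pandas as pd

start = pd.Timestamp("2006-12-16 00:00:00", tz="Asia/Kolkata")

# Adding a multiple of an offset moves the timestamp forward by 90 minutes.
shifted = start + 3 * pd.offsets.Minute(30)

# The same instant expressed in another zone compares equal.
start_eastern = start.tz_convert("US/Eastern")
same_instant = start_eastern == start

# Bring both operands into one zone, then subtract to get a Timedelta.
gap = start_eastern.tz_convert("Asia/Kolkata") - start

print(shifted)
print(same_instant)
print(gap)
```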

Calculating means over different resampling periods for the dataset and drawing conclusions

We compute the mean active power consumption for each hour and overlay it (thick red line) on the minute-level readings. The sudden surge at the end of the plot may indicate a heavy load, possibly from a faulty device.

The trace is also densest in the middle of the plot, which suggests that power consumption fluctuated more there than at other times.

In [186]:
column = 'Global Active Power'
# Minute-level readings as a float Series indexed by timestamp
temp_data = Series(house_data[column].astype('float64').tolist(), index=pd.DatetimeIndex(house_data['Timestamp']))
# Hourly resampling; this pandas version aggregates with the mean by default
temp_mean = temp_data.resample('H')
temp_data.plot(kind='line')
temp_mean.plot(kind='line', linewidth=4, color='red', title=column + ": Actual and mean values")
Out[186]:
<matplotlib.axes._subplots.AxesSubplot at 0x7007850>

A similar approach for the reactive power

In [187]:
column = 'Global Reactive Power'
temp_data = Series(house_data[column].astype('float64').tolist(), index=pd.DatetimeIndex(house_data['Timestamp']))
# Hourly mean (the default aggregation in this pandas version)
temp_mean = temp_data.resample('H')
temp_data.plot(kind='line', color='green')
temp_mean.plot(kind='line', linewidth=4, color='red', title=column + ": Actual and mean values")
Out[187]:
<matplotlib.axes._subplots.AxesSubplot at 0x6ff38d0>

The same approach applied to the voltage: the rise in the measured voltage is clearly visible in the graph.

In [189]:
column = 'Voltage'
temp_data = Series(house_data[column].astype('float64').tolist(), index=pd.DatetimeIndex(house_data['Timestamp']))
# Hourly mean (the default aggregation in this pandas version)
temp_mean = temp_data.resample('H')
temp_data.plot(kind='line', color='purple')
temp_mean.plot(kind='line', linewidth=4, color='red', title=column + ": Actual and mean values")
Out[189]:
<matplotlib.axes._subplots.AxesSubplot at 0x6ff3210>
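
The last three cells repeat the same steps for different columns, so they could be folded into one function. The sketch below (Python 3, recent pandas) is one possible shape: `hourly_mean` is a hypothetical helper whose name and signature are illustrative only, and `demo` is a synthetic stand-in for `house_data` with string-valued columns. Plotting is left out so that the example runs without a display.

```python
import pandas as pd


def hourly_mean(frame, column, timestamp_column="Timestamp"):
    """Return (minute-level series, hourly mean series) for one column.

    Hypothetical helper; the name and signature are illustrative only.
    """
    series = pd.Series(
        frame[column].astype("float64").to_numpy(),
        index=pd.DatetimeIndex(frame[timestamp_column]),
        name=column,
    )
    return series, series.resample(pd.offsets.Hour()).mean()


# Synthetic stand-in for house_data: string columns, one row per minute.
stamps = pd.date_range("2006-12-16 17:00:00", periods=120, freq="min")
demo = pd.DataFrame({
    "Timestamp": stamps,
    "Voltage": ["230.0"] * 60 + ["240.0"] * 60,
    "Global Active Power": ["1.5"] * 120,
})

for name in ["Voltage", "Global Active Power"]:
    raw, mean = hourly_mean(demo, name)
    print(name, mean.tolist())
```

Each returned pair can then be plotted as in the cells above, with the hourly mean drawn over the raw readings.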

Mridul
