
Hourly frequency count with Python

I have hourly CSV data like this, sorted day by day, covering hundreds of days:

2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649

I want to count how many times each hour of the day has held the daily maximum. For example, if the maximum value for 2011.05.16 occurred at 00:00, I add 1 to the 00:00 count, and so on. To do this I used a loop that treats a running hour counter as the index:

def graph():
    Date, Time, High = np.genfromtxt(myPath, delimiter=",",
                                     unpack=True, converters={0: date_converter})
    numList = [""] * 24
    index = 0
    hour = 0
    count = [0] * 24

    for eachHour in Time:
        numList[hour] += str(High[index])
        index += 1
        hour += 1

        if hour == 24:
            higher = numList.index(max(numList))
            count[higher] += 1
            hour = 0
            numList = [""] * 24

The problem is that my data often has gaps where some hours are missing, but the loop cannot detect them and keeps writing values into the next hour index. I've searched everywhere, but I'm new to programming and this is my first "complex" project, so I need an answer specific to my case to understand how it works. How do you make an hourly frequency count like the one described? The final result should look like:

00:00 n times max of the day
01:00 n times max of the day
02:00 n times max of the day
etc
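
For reference, the gap problem described above can also be avoided without pandas by reading the hour from each row's timestamp instead of from a running counter. The following is only a minimal sketch under that assumption: the function name `count_daily_max_hours` and the sample rows are hypothetical, values are compared as floats rather than strings, and when two hours tie for the daily maximum the earlier one is counted.

```python
import csv
from io import StringIO


def count_daily_max_hours(lines):
    """Return a 24-item list: how often each hour held its day's maximum."""
    best = {}  # date string -> (highest value seen so far, hour it occurred)
    for date, time, value in csv.reader(lines):
        hour = int(time.split(":")[0])  # hour taken from the row itself
        value = float(value)
        # Strict ">" keeps the earliest hour when values tie.
        if date not in best or value > best[date][0]:
            best[date] = (value, hour)

    count = [0] * 24
    for _, hour in best.values():
        count[hour] += 1
    return count


# Illustrative rows only; the second day starts at 02:00 (hours missing).
sample = StringIO(
    "2011.05.16,00:00,1.40893\n"
    "2011.05.16,01:00,1.40760\n"
    "2011.05.17,02:00,1.40750\n"
    "2011.05.17,03:00,1.40900\n"
)
count = count_daily_max_hours(sample)
for hour, n in enumerate(count):
    print("%02d:00 %d" % (hour, n))
```

Because each row carries its own hour, a missing row simply contributes nothing and cannot shift later values into the wrong slot.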
pietrovismara asked Dec 28 '13 21:12



2 Answers

First read in the csv:

In [11]: df = pd.read_csv('foo.csv', sep=',', header=None, parse_dates=[[0, 1]])

In [12]: df.columns = ['date', 'val']

In [13]: df.set_index('date', inplace=True)

In [14]: df
Out[14]: 
                         val
date                        
2011-05-16 00:00:00  1.40893
2011-05-16 01:00:00  1.40760
2011-05-16 02:00:00  1.40750
2011-05-16 03:00:00  1.40649

Use resample to get each day's maximum:

In [15]: day_max = df.resample('D', how='max')

Check whether each value is the day max:

In [16]: df['is_day_max'] = day_max.lookup(df.index.normalize(), len(df) * ['val']) == df.val

In [17]: df
Out[17]: 
                         val is_day_max
date                                   
2011-05-16 00:00:00  1.40893       True
2011-05-16 01:00:00  1.40760      False
2011-05-16 02:00:00  1.40750      False
2011-05-16 03:00:00  1.40649      False

And then sum these over each hour:

In [18]: df.groupby(df.index.time)['is_day_max'].sum()
Out[18]: 
00:00:00    1
01:00:00    0
02:00:00    0
03:00:00    0
Name: is_day_max, dtype: float64
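
On recent pandas releases the `how=` argument of `resample` and the `DataFrame.lookup` method are no longer available, so the session above may not run as written. A minimal sketch of the same three steps (daily maximum, per-row comparison, per-hour sum) using calls that exist in current pandas is shown here. The variable names `csv_text`, `stamps` and `counts` are illustrative, the sample rows are taken from this page, and `reindex` is used as a stand-in for `lookup`, not as the original author's method.

```python
from io import StringIO

import pandas as pd

csv_text = """2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
2011.05.17,02:00,1.40893
2011.05.17,03:00,1.40760
2011.05.17,04:00,1.40750
2011.05.17,05:00,1.40649"""

df = pd.read_csv(StringIO(csv_text), header=None,
                 names=["date", "time", "val"],
                 dtype={"date": str, "time": str})

# Build the timestamp index explicitly from the two text columns.
stamps = pd.to_datetime(df["date"] + " " + df["time"], format="%Y.%m.%d %H:%M")
df.index = pd.DatetimeIndex(stamps)
df = df[["val"]]

# Step 1: one maximum per calendar day.
day_max = df["val"].resample("D").max()

# Step 2: align each row with its own day's maximum and compare.
row_day_max = day_max.reindex(df.index.normalize()).to_numpy()
df["is_day_max"] = df["val"] == row_day_max

# Step 3: count the True flags per time of day.
counts = df.groupby(df.index.time)["is_day_max"].sum()
print(counts)
```

Missing hours are harmless here because every row is matched to its day through the index, not through its position in the file.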
Andy Hayden answered Oct 12 '22 02:10



A solution with pandas: supposing you have a dataframe with the dates as the index, you can first add a column that flags the maximum value per day, then group by hour and sum the occurrences:

In [32]: df['daily_max'] = df.groupby(df.index.date).transform(lambda x: x==x.max())
In [33]: df
Out[33]: 
                       value daily_max
date_time                             
2011-05-16 00:00:00  1.40893      True
2011-05-16 01:00:00  1.40760     False
2011-05-16 02:00:00  1.40750     False
2011-05-16 03:00:00  1.40649     False
2011-05-17 02:00:00  1.40893      True
2011-05-17 03:00:00  1.40760     False
2011-05-17 04:00:00  1.40750     False
2011-05-17 05:00:00  1.40649     False
2011-05-18 02:00:00  1.40893      True
2011-05-18 03:00:00  1.40760     False
2011-05-18 04:00:00  1.40750     False
2011-05-18 05:00:00  1.40649     False

In [34]: df.groupby(df.index.time)['daily_max'].sum()
Out[34]: 
00:00:00    1
01:00:00    0
02:00:00    2
03:00:00    0
04:00:00    0
05:00:00    0
Name: daily_max, dtype: float64

With older pandas versions, this gives the same result as above (assuming your df has a DatetimeIndex):

df['date'] = [t.date() for t in df.index.to_pydatetime()]
df['time'] = [t.time() for t in df.index.to_pydatetime()]
df['daily_max'] = df.groupby('date')['value'].transform(lambda x: x==x.max())
df.groupby('time')['daily_max'].sum()
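
Note that flagging rows with `x == x.max()` counts every hour that ties for the daily maximum. If exactly one hour per day should be counted, a possible variation (not the approach used above) is to take `idxmax` per day and tally the hours of the winning timestamps. This is a sketch under the assumption that the earliest hour should win a tie; the names `winners`, `hours` and `counts` are illustrative, and the rows are the same example data used in this answer.

```python
from io import StringIO

import pandas as pd

s = """2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
2011.05.17,02:00,1.40893
2011.05.17,03:00,1.40760
2011.05.17,04:00,1.40750
2011.05.17,05:00,1.40649
2011.05.18,02:00,1.40893
2011.05.18,03:00,1.40760
2011.05.18,04:00,1.40750
2011.05.18,05:00,1.40649"""

df = pd.read_csv(StringIO(s), header=None,
                 names=["date", "time", "value"],
                 dtype={"date": str, "time": str})
df.index = pd.DatetimeIndex(
    pd.to_datetime(df["date"] + " " + df["time"], format="%Y.%m.%d %H:%M"))

# Timestamp of the (first) maximum within each calendar day.
winners = df.groupby(df.index.normalize())["value"].idxmax()

# Hour of each winning timestamp, tallied over all 24 hours of the day.
hours = pd.to_datetime(winners).dt.hour
counts = hours.value_counts().reindex(range(24), fill_value=0)
print(counts[counts > 0])
```

Reindexing over `range(24)` also fills in hours that never held a daily maximum with 0, which matches the "00:00 n, 01:00 n, ..." layout asked for in the question.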

The dataframe I used in this example:

import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

s="""2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
2011.05.17,02:00,1.40893
2011.05.17,03:00,1.40760
2011.05.17,04:00,1.40750
2011.05.17,05:00,1.40649
2011.05.18,02:00,1.40893
2011.05.18,03:00,1.40760
2011.05.18,04:00,1.40750
2011.05.18,05:00,1.40649"""

df = pd.read_csv(StringIO(s), header=None, names=['date', 'time', 'value'], parse_dates=[['date', 'time']])
df = df.set_index('date_time')
joris answered Oct 12 '22 01:10
