I have hourly CSV data like this, sorted day by day for hundreds of days:
2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
I want to count how many times each hour has set the daily maximum value, so if 00:00 held the maximum value of the day 2011.05.16, I add 1 to 00:00, and so on. To do this I used a loop that treats the hours as indexes, like this:
def graph():
    Date, Time, High = np.genfromtxt(myPath, delimiter=",",
                                     unpack=True, converters={0: date_converter})
    numList = [""] * 24
    index = 0
    hour = 0
    count = [0] * 24
    for eachHour in Time:
        numList[hour] += str(High[index])
        index += 1
        hour += 1
        if hour == 24:
            higher = numList.index(max(numList))
            count[higher] += 1
            hour = 0
            numList = [""] * 24
The problem is that my data often has gaps with some hours missing, but the loop can't recognize this and keeps putting values into the next hour index. I've searched everywhere, but I'm new to programming and this is my first "complex" project, so I need answers specific to my case to understand how it works. So how do you make an hourly frequency count like the one described? The final result should look like:
00:00 n times max of the day
01:00 n times max of the day
02:00 n times max of the day
etc
First read in the csv:
In [11]: df = pd.read_csv('foo.csv', sep=',', header=None, parse_dates=[[0, 1]])
In [12]: df.columns = ['date', 'val']
In [13]: df.set_index('date', inplace=True)
In [14]: df
Out[14]:
val
date
2011-05-16 00:00:00 1.40893
2011-05-16 01:00:00 1.40760
2011-05-16 02:00:00 1.40750
2011-05-16 03:00:00 1.40649
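(A side note for newer pandas: the nested-list form of parse_dates used above has since been deprecated. One way to build the same DatetimeIndex without it, assuming the file layout shown in the question, is roughly:)
# sketch for newer pandas, where parse_dates=[[0, 1]] is deprecated:
# read date and time as plain strings, then combine them into the index
df = pd.read_csv('foo.csv', header=None, names=['date', 'time', 'val'])
df.index = pd.to_datetime(df['date'] + ' ' + df['time'], format='%Y.%m.%d %H:%M')
df = df[['val']]
df.index.name = 'date'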
Use resample to get each day's maximum:
In [15]: day_max = df.resample('D', how='max')
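(The how= keyword of resample has since been removed; in current pandas the equivalent would be:)
day_max = df.resample('D').max()   # same as resample('D', how='max') in older pandas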
Check whether each value is the day max:
In [16]: df['is_day_max'] = day_max.lookup(df.index.normalize(), len(df) * ['val']) == df.val
In [17]: df
Out[17]:
val is_day_max
date
2011-05-16 00:00:00 1.40893 True
2011-05-16 01:00:00 1.40760 False
2011-05-16 02:00:00 1.40750 False
2011-05-16 03:00:00 1.40649 False
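(DataFrame.lookup has since been removed from pandas; as a sketch, the same flag can be built with a groupby/transform instead:)
# alternative without lookup: compare each value against its own day's maximum
df['is_day_max'] = df['val'] == df.groupby(df.index.normalize())['val'].transform('max')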
And then sum these over each hour:
In [18]: df.groupby(df.index.time)['is_day_max'].sum()
Out[18]:
00:00:00 1
01:00:00 0
02:00:00 0
03:00:00 0
Name: is_day_max, dtype: float64
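(The output above only lists hours that actually occur in the data. If you want a line for every hour of the day, even hours that never held the maximum, you can reindex the result; a small sketch:)
import datetime
counts = df.groupby(df.index.time)['is_day_max'].sum()
counts = counts.reindex([datetime.time(h) for h in range(24)], fill_value=0).astype(int)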
A solution with pandas: supposing you have a dataframe with the date as index, you can first add a column indicating whether each value is the maximum of its day, and then group by hour and sum the occurrences:
In [32]: df['daily_max'] = df.groupby(df.index.date).transform(lambda x: x==x.max())
In [33]: df
Out[33]:
value daily_max
date_time
2011-05-16 00:00:00 1.40893 True
2011-05-16 01:00:00 1.40760 False
2011-05-16 02:00:00 1.40750 False
2011-05-16 03:00:00 1.40649 False
2011-05-17 02:00:00 1.40893 True
2011-05-17 03:00:00 1.40760 False
2011-05-17 04:00:00 1.40750 False
2011-05-17 05:00:00 1.40649 False
2011-05-18 02:00:00 1.40893 True
2011-05-18 03:00:00 1.40760 False
2011-05-18 04:00:00 1.40750 False
2011-05-18 05:00:00 1.40649 False
In [34]: df.groupby(df.index.time)['daily_max'].sum()
Out[34]:
00:00:00 1
01:00:00 0
02:00:00 2
03:00:00 0
04:00:00 0
05:00:00 0
Name: daily_max, dtype: float64
With an older pandas version, this will give the same result as above (assuming your df has a DatetimeIndex):
df['date'] = [t.date() for t in df.index.to_pydatetime()]
df['time'] = [t.time() for t in df.index.to_pydatetime()]
df['daily_max'] = df.groupby('date')['value'].transform(lambda x: x==x.max())
df.groupby('time')['daily_max'].sum()
The dataframe I used in this example:
from io import StringIO  # on Python 2: from StringIO import StringIO
s="""2011.05.16,00:00,1.40893
2011.05.16,01:00,1.40760
2011.05.16,02:00,1.40750
2011.05.16,03:00,1.40649
2011.05.17,02:00,1.40893
2011.05.17,03:00,1.40760
2011.05.17,04:00,1.40750
2011.05.17,05:00,1.40649
2011.05.18,02:00,1.40893
2011.05.18,03:00,1.40760
2011.05.18,04:00,1.40750
2011.05.18,05:00,1.40649"""
df = pd.read_csv(StringIO(s), header=None, names=['date', 'time', 'value'], parse_dates=[['date', 'time']])
df = df.set_index('date_time')
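For completeness, another way to get the same counts (a sketch, not from the original answer): take the timestamp at which each day's maximum occurs with idxmax, and count how often each hour shows up. Missing hours are no problem here either, because everything is driven by the timestamps themselves.
max_times = df.groupby(df.index.date)['value'].idxmax()      # one Timestamp per day
hour_counts = max_times.dt.hour.value_counts().sort_index()  # how often each hour held the daily max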