Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas/Python - Group data by same period in time

Tags:

python

pandas

I have some financial data and want to get only the last transaction from a specific period of time (hours, days, months...).

Example:

>>df
      time  price_BRL     qt              time_dt
1312001297      23.49   1.00  2011-07-30 04:48:17
1312049148      23.40   1.00  2011-07-30 18:05:48
1312121523      23.49   2.00  2011-07-31 14:12:03
1312121523      23.50   6.50  2011-07-31 14:12:03
1312177622      23.40   2.00  2011-08-01 05:47:02
1312206416      23.25   1.00  2011-08-01 13:46:56
1312637929      18.95   1.50  2011-08-06 13:38:49
1312637929      18.95   4.00  2011-08-06 13:38:49
1312817114       0.80   0.01  2011-08-08 15:25:14
1312818289       0.10   0.01  2011-08-08 15:44:49
1312819795       6.00   0.09  2011-08-08 16:09:55
1312847064      16.00   0.86  2011-08-08 23:44:24
1312849282      16.00   6.14  2011-08-09 00:21:22
1312898146      19.90   1.00  2011-08-09 13:55:46
1312915666       6.00   0.01  2011-08-09 18:47:46
1312934897      19.90   1.00  2011-08-10 00:08:17
>>filter_by_last_day(df)
      time  price_BRL     qt              time_dt
1312049148      23.40   1.00  2011-07-30 18:05:48
1312121523      23.50   6.50  2011-07-31 14:12:03
1312206416      23.25   1.00  2011-08-01 13:46:56
1312637929      18.95   4.00  2011-08-06 13:38:49
1312847064      16.00   0.86  2011-08-08 23:44:24
1312915666       6.00   0.01  2011-08-09 18:47:46
1312934897      19.90   1.00  2011-08-10 00:08:17

I was thinking in using groupby() and get the mean() for that day (this solution is also possible for my problem, but not exactly) but don't know how to select the day like df.groupby('time.day').last()

like image 502
Jader Martins Avatar asked Oct 18 '22 19:10

Jader Martins


1 Answers

You can use groupby by dt.date and aggregate by last:

#if necessery convert to datetime
df.time_dt = pd.to_datetime(df.time_dt)

df = df.groupby(df.time_dt.dt.date).last().reset_index(drop=True)
print (df)
         time  price_BRL    qt             time_dt
0  1312049148      23.40  1.00 2011-07-30 18:05:48
1  1312121523      23.50  6.50 2011-07-31 14:12:03
2  1312206416      23.25  1.00 2011-08-01 13:46:56
3  1312637929      18.95  4.00 2011-08-06 13:38:49
4  1312847064      16.00  0.86 2011-08-08 23:44:24
5  1312915666       6.00  0.01 2011-08-09 18:47:46
6  1312934897      19.90  1.00 2011-08-10 00:08:17

Thank you MaxU for another solution - add parameter as_index=False for return DataFrame:

df = df.groupby(df.time_dt.dt.date, as_index=False).last()
print (df)
         time  price_BRL    qt             time_dt
0  1312049148      23.40  1.00 2011-07-30 18:05:48
1  1312121523      23.50  6.50 2011-07-31 14:12:03
2  1312206416      23.25  1.00 2011-08-01 13:46:56
3  1312637929      18.95  4.00 2011-08-06 13:38:49
4  1312847064      16.00  0.86 2011-08-08 23:44:24
5  1312915666       6.00  0.01 2011-08-09 18:47:46
6  1312934897      19.90  1.00 2011-08-10 00:08:17

Solution with resample, but is necessery remove NaN rows by dropna:

df = df.resample('d', on='time_dt').last().dropna(how='all').reset_index(drop=True)
#cast column time to int
df.time = df.time.astype(int)
print (df)
         time  price_BRL    qt             time_dt
0  1312049148      23.40  1.00 2011-07-30 18:05:48
1  1312121523      23.50  6.50 2011-07-31 14:12:03
2  1312206416      23.25  1.00 2011-08-01 13:46:56
3  1312637929      18.95  4.00 2011-08-06 13:38:49
4  1312847064      16.00  0.86 2011-08-08 23:44:24
5  1312915666       6.00  0.01 2011-08-09 18:47:46
6  1312934897      19.90  1.00 2011-08-10 00:08:17

---

You can also use dt.month:

df = df.groupby(df.time_dt.dt.month).last().reset_index(drop=True)
print (df)
         time  price_BRL   qt             time_dt
0  1312121523       23.5  6.5 2011-07-31 14:12:03
1  1312934897       19.9  1.0 2011-08-10 00:08:17

With hours it is a bit complicated, if need groupby by date and hours together, solution is replace minutes and seconds to 0 by astype:

hours = df.time_dt.values.astype('<M8[h]')
print (hours)
['2011-07-30T04' '2011-07-30T18' '2011-07-31T14' '2011-07-31T14'
 '2011-08-01T05' '2011-08-01T13' '2011-08-06T13' '2011-08-06T13'
 '2011-08-08T15' '2011-08-08T15' '2011-08-08T16' '2011-08-08T23'
 '2011-08-09T00' '2011-08-09T13' '2011-08-09T18' '2011-08-10T00']

df = df.groupby(hours).last().reset_index(drop=True)
print (df)
          time  price_BRL    qt             time_dt
0   1312001297      23.49  1.00 2011-07-30 04:48:17
1   1312049148      23.40  1.00 2011-07-30 18:05:48
2   1312121523      23.50  6.50 2011-07-31 14:12:03
3   1312177622      23.40  2.00 2011-08-01 05:47:02
4   1312206416      23.25  1.00 2011-08-01 13:46:56
5   1312637929      18.95  4.00 2011-08-06 13:38:49
6   1312818289       0.10  0.01 2011-08-08 15:44:49
7   1312819795       6.00  0.09 2011-08-08 16:09:55
8   1312847064      16.00  0.86 2011-08-08 23:44:24
9   1312849282      16.00  6.14 2011-08-09 00:21:22
10  1312898146      19.90  1.00 2011-08-09 13:55:46
11  1312915666       6.00  0.01 2011-08-09 18:47:46
12  1312934897      19.90  1.00 2011-08-10 00:08:17
like image 88
jezrael Avatar answered Oct 20 '22 11:10

jezrael