I have some hierarchical data which bottoms out into time series data that looks something like this:
df = pandas.DataFrame({'value_a': values_a, 'value_b': values_b},
                      index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df

                               value_a  value_b
State    City        Date
Georgia  Atlanta     2012-01-01       0       10
                     2012-01-02       1       11
                     2012-01-03       2       12
                     2012-01-04       3       13
         Savanna     2012-01-01       4       14
                     2012-01-02       5       15
                     2012-01-03       6       16
                     2012-01-04       7       17
Alabama  Mobile      2012-01-01       8       18
                     2012-01-02       9       19
                     2012-01-03      10       20
                     2012-01-04      11       21
         Montgomery  2012-01-01      12       22
                     2012-01-02      13       23
                     2012-01-03      14       24
                     2012-01-04      15       25
I'd like to perform time resampling per city, so something like
df.resample("2D", how="sum")
would output
                               value_a  value_b
State    City        Date
Georgia  Atlanta     2012-01-01       1       21
                     2012-01-03       5       25
         Savanna     2012-01-01       9       29
                     2012-01-03      13       33
Alabama  Mobile      2012-01-01      17       37
                     2012-01-03      21       41
         Montgomery  2012-01-01      25       45
                     2012-01-03      29       49
As is, df.resample('2D', how='sum') gets me
TypeError: Only valid with DatetimeIndex or PeriodIndex
Fair enough, but I'd sort of expect this to work:
>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex
At this point I'm really running out of ideas... is there some way stack and unstack might be able to help me?
pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:
df.groupby(pd.Grouper(freq='2D', level=-1))
The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex. Moreover, you can use this in conjunction with other level values from the index:
level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0, 1]]
                     + [pd.Grouper(freq='2D', level=-1)]).sum())
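The same grouping can also be written with the index level names instead of positions. This is a sketch of what I believe is equivalent on reasonably recent pandas versions, where groupby accepts index level names alongside a Grouper:

# assumes the index levels are named 'State', 'City' and 'Date' as above
result = (df.groupby(['State', 'City',
                      pd.Grouper(freq='2D', level='Date')])
            .sum())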
Either way it looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:
import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0, 1]]
                       + [pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State', 'City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0, 1])
              .resample('2D').sum()
              .stack(level=[2, 1])
              .swaplevel(2, 0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia'] * 8 + ['Alabama'] * 8
    cities = ['Atlanta'] * 4 + ['Savanna'] * 4 + ['Mobile'] * 4 + ['Montgomery'] * 4
    dates = pd.DatetimeIndex([DT.date(2012, 1, 1) + DT.timedelta(days=i)
                              for i in range(4)] * 4)
    df = pd.DataFrame({'value_a': values_a, 'value_b': values_b},
                      index=[states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates],
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index), 2)),
                      index=index, columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))
yields
                               value_a  value_b
State    City        Date
Alabama  Mobile      2012-01-01      17       37
                     2012-01-03      21       41
         Montgomery  2012-01-01      25       45
                     2012-01-03      29       49
Georgia  Atlanta     2012-01-01       1       21
                     2012-01-03       5       25
         Savanna     2012-01-01       9       29
                     2012-01-03      13       33
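Note that the rows come back with Alabama before Georgia because groupby sorts its group keys by default. If the ordering from the question matters, one option (a sketch, not part of the original approach) is to reindex the State level afterwards:

# reorder the outermost level to match the question's expected output
result = using_Grouper(df)
result = result.reindex(['Georgia', 'Alabama'], level='State')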
Here is a benchmark comparing using_Grouper, using_reset_index and using_stack on a 5000-row DataFrame:
In [30]: df = make_df(10)

In [34]: len(df)
Out[34]: 5000

In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop

In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop

In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
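For completeness, using_stack is the stack/unstack route the question asks about: it unstacks the State and City levels into the columns so that only the DatetimeIndex remains, resamples that, and then stacks the levels back and swaps them into the original order. Note also that the how keyword from the question (resample('2D', how='sum')) has since been removed from pandas; the .resample('2D').sum() form used in the functions above is the current equivalent.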