
Resampling Within a Pandas MultiIndex

I have some hierarchical data that bottoms out in time series data and looks something like this:

df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df

                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        0       10
                   2012-01-02        1       11
                   2012-01-03        2       12
                   2012-01-04        3       13
        Savanna    2012-01-01        4       14
                   2012-01-02        5       15
                   2012-01-03        6       16
                   2012-01-04        7       17
Alabama Mobile     2012-01-01        8       18
                   2012-01-02        9       19
                   2012-01-03       10       20
                   2012-01-04       11       21
        Montgomery 2012-01-01       12       22
                   2012-01-02       13       23
                   2012-01-03       14       24
                   2012-01-04       15       25

I'd like to perform time resampling per city, so something like

df.resample("2D", how="sum") 

would output

                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49

As is, df.resample('2D', how='sum') gets me

TypeError: Only valid with DatetimeIndex or PeriodIndex 

Fair enough, but I'd sort of expect this to work:

>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex

at which point I'm really running out of ideas... is there some way stack and unstack might be able to help me?

Snakes McGee asked Apr 03 '13 22:04




1 Answer

pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:

df.groupby(pd.Grouper(freq='2D', level=-1)) 

The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex. Moreover, you can use this in conjunction with other level values from the index:

level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
                      +[pd.Grouper(freq='2D', level=-1)]).sum())
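As a self-contained sketch of the same idea, the example below rebuilds the question's data and groups by the two outer levels plus a 2-day pd.Grouper on the date level. It refers to the levels by name rather than by position, which is a variation on the snippet above and assumes a pandas version that accepts index level names as groupby keys.

```python
import pandas as pd

# Rebuild the question's frame: 2 states x 2 cities x 4 days.
states = ['Georgia'] * 8 + ['Alabama'] * 8
cities = (['Atlanta'] * 4 + ['Savanna'] * 4
          + ['Mobile'] * 4 + ['Montgomery'] * 4)
dates = list(pd.date_range('2012-01-01', periods=4)) * 4
index = pd.MultiIndex.from_arrays(
    [states, cities, dates], names=['State', 'City', 'Date'])
df = pd.DataFrame(
    {'value_a': list(range(16)), 'value_b': list(range(10, 26))},
    index=index)

# Group by State and City as they are, and bin the Date level
# into 2-day buckets, then sum within each group.
result = df.groupby(
    ['State', 'City', pd.Grouper(freq='2D', level='Date')]).sum()
print(result)
```

The result keeps all three index levels, with the outer levels sorted, so Alabama comes before Georgia.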

It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:

import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0,1]]
                       +[pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State','City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0,1])
              .resample('2D').sum()
              .stack(level=[2,1])
              .swaplevel(2,0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia']*8 + ['Alabama']*8
    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
    dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index = [states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates],
                                        names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))

yields

                               value_a  value_b
State   City       Date
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33

Here is a benchmark comparing using_Grouper, using_reset_index, using_stack on a 5000-row DataFrame:

In [30]: df = make_df(10)

In [34]: len(df)
Out[34]: 5000

In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop

In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop

In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
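To repeat the measurement outside IPython, a rough sketch using the standard timeit module could look like this. It times only using_Grouper, the seeded random generator and the repeat count are arbitrary choices made for the sketch, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def make_df(N):
    # 50 states x 10 cities x N consecutive days, as in the answer.
    dates = pd.date_range('2000-1-1', periods=N)
    index = pd.MultiIndex.from_product(
        [np.arange(50), np.arange(10), dates],
        names=['State', 'City', 'Date'])
    rng = np.random.default_rng(0)  # seeded only for repeatability
    return pd.DataFrame(rng.integers(10, size=(len(index), 2)),
                        index=index, columns=['value_a', 'value_b'])

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0, 1]]
                       + [pd.Grouper(freq='2D', level=-1)]).sum())

df = make_df(10)
result = using_Grouper(df)

# Average wall-clock time over a handful of calls.
runs = 5
seconds = timeit.timeit(lambda: using_Grouper(df), number=runs) / runs
print(f'{len(df)} rows -> {len(result)} groups, '
      f'{seconds * 1e3:.2f} ms per call')
```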
unutbu answered Oct 06 '22 05:10
