Resampling Within a Pandas MultiIndex

Tags: python, pandas, hierarchical-data, multi-index, time-series

I have some hierarchical data that bottoms out in time series data, which looks something like this:

df = pandas.DataFrame(
    {'value_a': values_a, 'value_b': values_b},
    index=[states, cities, dates])
df.index.names = ['State', 'City', 'Date']
df

                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        0       10
                   2012-01-02        1       11
                   2012-01-03        2       12
                   2012-01-04        3       13
        Savanna    2012-01-01        4       14
                   2012-01-02        5       15
                   2012-01-03        6       16
                   2012-01-04        7       17
Alabama Mobile     2012-01-01        8       18
                   2012-01-02        9       19
                   2012-01-03       10       20
                   2012-01-04       11       21
        Montgomery 2012-01-01       12       22
                   2012-01-02       13       23
                   2012-01-03       14       24
                   2012-01-04       15       25

I'd like to perform time resampling per city, so something like

df.resample("2D", how="sum")

would output

                               value_a  value_b
State   City       Date
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49

As is, df.resample('2D', how='sum') gets me

TypeError: Only valid with DatetimeIndex or PeriodIndex

Fair enough, but I'd sort of expect this to work:

>>> df.swaplevel('Date', 'State').resample('2D', how='sum')
TypeError: Only valid with DatetimeIndex or PeriodIndex

At this point I'm really running out of ideas. Is there some way stack and unstack might be able to help me?


asked Apr 03 '13 22:04

Snakes McGee

1 Answer

pd.Grouper allows you to specify a "groupby instruction for a target object". In particular, you can use it to group by dates even if df.index is not a DatetimeIndex:

df.groupby(pd.Grouper(freq='2D', level=-1))

The level=-1 tells pd.Grouper to look for the dates in the last level of the MultiIndex. Moreover, you can use this in conjunction with other level values from the index:

level_values = df.index.get_level_values
result = (df.groupby([level_values(i) for i in [0,1]]
                      +[pd.Grouper(freq='2D', level=-1)]).sum())
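As a variant of the snippet above, the same grouping can probably be written with one pd.Grouper per index level, referring to the levels by name instead of by position. This is only a sketch: the sample frame is rebuilt from the question's data so the block runs on its own, and resample_per_city is a hypothetical helper name, not something from the original answer.

```python
import pandas as pd

# Rebuild the question's sample frame: 4 consecutive days for each of 4 cities.
states = ["Georgia"] * 8 + ["Alabama"] * 8
cities = ["Atlanta"] * 4 + ["Savanna"] * 4 + ["Mobile"] * 4 + ["Montgomery"] * 4
dates = list(pd.date_range("2012-01-01", periods=4)) * 4
index = pd.MultiIndex.from_arrays([states, cities, dates],
                                  names=["State", "City", "Date"])
df = pd.DataFrame({"value_a": range(16), "value_b": range(10, 26)}, index=index)


def resample_per_city(frame, freq="2D"):
    """Hypothetical helper: sum within freq-sized bins for each (State, City)."""
    keys = [
        pd.Grouper(level="State"),            # group on the State level as-is
        pd.Grouper(level="City"),             # group on the City level as-is
        pd.Grouper(level="Date", freq=freq),  # bin the Date level by frequency
    ]
    return frame.groupby(keys).sum()


result = resample_per_city(df)
print(result)
```

Naming the levels keeps the call readable if the index order ever changes, at the cost of assuming the levels are actually called State, City and Date.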

It looks a bit awkward, but using_Grouper turns out to be much faster than my original suggestion, using_reset_index:

import numpy as np
import pandas as pd
import datetime as DT

def using_Grouper(df):
    level_values = df.index.get_level_values
    return (df.groupby([level_values(i) for i in [0,1]]
                       +[pd.Grouper(freq='2D', level=-1)]).sum())

def using_reset_index(df):
    df = df.reset_index(level=[0, 1])
    return df.groupby(['State','City']).resample('2D').sum()

def using_stack(df):
    # http://stackoverflow.com/a/15813787/190597
    return (df.unstack(level=[0,1])
              .resample('2D').sum()
              .stack(level=[2,1])
              .swaplevel(2,0))

def make_orig():
    values_a = range(16)
    values_b = range(10, 26)
    states = ['Georgia']*8 + ['Alabama']*8
    cities = ['Atlanta']*4 + ['Savanna']*4 + ['Mobile']*4 + ['Montgomery']*4
    dates = pd.DatetimeIndex([DT.date(2012,1,1)+DT.timedelta(days = i) for i in range(4)]*4)
    df = pd.DataFrame(
        {'value_a': values_a, 'value_b': values_b},
        index = [states, cities, dates])
    df.index.names = ['State', 'City', 'Date']
    return df

def make_df(N):
    dates = pd.date_range('2000-1-1', periods=N)
    states = np.arange(50)
    cities = np.arange(10)
    index = pd.MultiIndex.from_product([states, cities, dates],
                                       names=['State', 'City', 'Date'])
    df = pd.DataFrame(np.random.randint(10, size=(len(index),2)), index=index,
                      columns=['value_a', 'value_b'])
    return df

df = make_orig()
print(using_Grouper(df))

yields

                               value_a  value_b
State   City       Date
Alabama Mobile     2012-01-01       17       37
                   2012-01-03       21       41
        Montgomery 2012-01-01       25       45
                   2012-01-03       29       49
Georgia Atlanta    2012-01-01        1       21
                   2012-01-03        5       25
        Savanna    2012-01-01        9       29
                   2012-01-03       13       33
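One way to sanity-check the Atlanta rows in that output is to resample a single city on its own. The sketch below assumes the question's sample data (rebuilt here so the block is self-contained). It selects one city with xs, which drops the State and City levels and leaves a plain DatetimeIndex, so resample no longer raises the TypeError from the question.

```python
import pandas as pd

# Rebuild the question's sample frame so this block runs on its own.
states = ["Georgia"] * 8 + ["Alabama"] * 8
cities = ["Atlanta"] * 4 + ["Savanna"] * 4 + ["Mobile"] * 4 + ["Montgomery"] * 4
dates = list(pd.date_range("2012-01-01", periods=4)) * 4
index = pd.MultiIndex.from_arrays([states, cities, dates],
                                  names=["State", "City", "Date"])
df = pd.DataFrame({"value_a": range(16), "value_b": range(10, 26)}, index=index)

# Select one city; the two selected levels are dropped, so only Date remains.
atlanta = df.xs(("Georgia", "Atlanta"), level=["State", "City"])

# With a DatetimeIndex in place, resample works directly.
atlanta_2d = atlanta.resample("2D").sum()
print(atlanta_2d)
```

The two rows printed for Atlanta should match the Georgia/Atlanta rows of the grouped result, which suggests the Grouper-based groupby is doing a per-city resample.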

Here is a benchmark comparing using_Grouper, using_reset_index, using_stack on a 5000-row DataFrame:

In [30]: df = make_df(10)

In [34]: len(df)
Out[34]: 5000

In [32]: %timeit using_Grouper(df)
100 loops, best of 3: 6.03 ms per loop

In [33]: %timeit using_stack(df)
10 loops, best of 3: 22.3 ms per loop

In [31]: %timeit using_reset_index(df)
1 loop, best of 3: 659 ms per loop
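For anyone who wants to rerun a comparison like this outside IPython, here is a rough sketch using the standard timeit module. It times using_Grouper against using_named_groupers, a hypothetical variant that is not part of the benchmark above. The repeat and call counts are arbitrary assumptions, and the numbers will depend on the machine and the pandas version, so no expected timings are given.

```python
import timeit

import numpy as np
import pandas as pd


def make_df(n_dates):
    # Same shape as the answer's make_df: 50 states x 10 cities x n_dates days.
    dates = pd.date_range("2000-01-01", periods=n_dates)
    index = pd.MultiIndex.from_product(
        [np.arange(50), np.arange(10), dates], names=["State", "City", "Date"])
    data = np.random.randint(10, size=(len(index), 2))
    return pd.DataFrame(data, index=index, columns=["value_a", "value_b"])


def using_Grouper(df):
    # The answer's approach: level values for State/City plus a date Grouper.
    level_values = df.index.get_level_values
    return df.groupby([level_values(0), level_values(1),
                       pd.Grouper(freq="2D", level=-1)]).sum()


def using_named_groupers(df):
    # Hypothetical variant (not in the original answer): one Grouper per level.
    return df.groupby([pd.Grouper(level="State"), pd.Grouper(level="City"),
                       pd.Grouper(level="Date", freq="2D")]).sum()


df = make_df(10)
timings = {}
for func in (using_Grouper, using_named_groupers):
    # Best of 3 repeats, 5 calls per repeat, reported as seconds per call.
    best = min(timeit.repeat(lambda: func(df), repeat=3, number=5)) / 5
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```

Taking the minimum over repeats mirrors the "best of 3" figure that %timeit reported in the session above.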

answered Oct 06 '22 05:10

unutbu
