Doing a groupby and rolling window on a Pandas DataFrame with a multilevel index leads to a duplicated index level

Tags:

pandas

python-2.7

If I do a groupby() followed by a rolling() calculation on a DataFrame with a multi-level index, one of the levels in the index is repeated, which is most odd. I am using Pandas 0.18.1.

import pandas as pd
df = pd.DataFrame(data=[[1, 1, 10, 20], [1, 2, 30, 40], [1, 3, 50, 60],
                        [2, 1, 11, 21], [2, 2, 31, 41], [2, 3, 51, 61]],
                  columns=['id', 'date', 'd1', 'd2'])

df.set_index(['id', 'date'], inplace=True)
df = df.groupby(level='id').rolling(window=2)['d1'].sum()
print(df)
print(df.index)

The output is as follows

id  id  date
1   1   1        NaN
        2       40.0
        3       80.0
2   2   1        NaN
        2       42.0
        3       82.0
Name: d1, dtype: float64
MultiIndex(levels=[[1, 2], [1, 2], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=[u'id', u'id', u'date'])

What is odd is that the id level now shows up twice in the multi-index. Moving the ['d1'] column selection around doesn't make any difference.

Any help would be much appreciated.

Thanks Paul

asked Feb 08 '17 17:02

Paul H

1 Answer

This is a bug.

The version with apply works as expected, though. Here is that alternative (only the d1 selection was moved in front of apply):

df = df.groupby(level='id').d1.apply(lambda x: x.rolling(window=2).sum())
print(df)

id  date
1   1        NaN
    2       40.0
    3       80.0
2   1        NaN
    2       42.0
    3       82.0
Name: d1, dtype: float64
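
If you would rather keep the groupby().rolling() call, another option is to drop the extra level after the fact. The following is only a sketch under some assumptions: it assumes a pandas version where Series.droplevel exists (it is not available in 0.18.1), and it assumes the group key is prepended as the outermost level, as in the output shown in the question. The names rolled and via_transform are made up for illustration. The transform variant is included because transform is meant to return a result aligned to the original index, so it should not add a level at all.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 1, 10, 20], [1, 2, 30, 40], [1, 3, 50, 60],
          [2, 1, 11, 21], [2, 2, 31, 41], [2, 3, 51, 61]],
    columns=['id', 'date', 'd1', 'd2'],
).set_index(['id', 'date'])

# Rolling sum within each id group.
rolled = df.groupby(level='id')['d1'].rolling(window=2).sum()

# Assumption: if the result has more index levels than the input, the
# extra one is the group key prepended as the outermost level, so drop it.
if rolled.index.nlevels > df.index.nlevels:
    rolled = rolled.droplevel(0)

# Alternative: transform keeps the original (id, date) index.
via_transform = df.groupby(level='id')['d1'].transform(
    lambda x: x.rolling(window=2).sum()
)

print(rolled)
print(via_transform)
```

Either way the result should end up indexed by id and date only. The nlevels check is there so the snippet does not drop a real level on versions that do not duplicate the group key.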

answered Oct 23 '22 05:10

jezrael
