
Doing a groupby and rolling window on a Pandas Dataframe with a multilevel index leads to a duplicated index entry

If I do a groupby() followed by a rolling() calculation on a DataFrame with a multi-level index, one of the index levels is repeated in the result, which is odd. I am using pandas 0.18.1.

import pandas as pd
df = pd.DataFrame(data=[[1, 1, 10, 20], [1, 2, 30, 40], [1, 3, 50, 60],
                        [2, 1, 11, 21], [2, 2, 31, 41], [2, 3, 51, 61]],
                  columns=['id', 'date', 'd1', 'd2'])

df.set_index(['id', 'date'], inplace=True)
df = df.groupby(level='id').rolling(window=2)['d1'].sum()
print(df)
print(df.index)

The output is as follows:

id  id  date
1   1   1        NaN
        2       40.0
        3       80.0
2   2   1        NaN
        2       42.0
        3       82.0
Name: d1, dtype: float64
MultiIndex(levels=[[1, 2], [1, 2], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=[u'id', u'id', u'date'])

What is odd is that the id level now shows up twice in the MultiIndex. Moving the ['d1'] column selection around doesn't make any difference.

Any help would be much appreciated.

Thanks Paul

Paul H asked Feb 08 '17 17:02




1 Answer

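It is a bug.

But the version with apply works fine. Here is that alternative (only the d1 selection was moved in front of apply); run it on the original DataFrame from the question, before df is overwritten with the rolling result: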

df = df.groupby(level='id').d1.apply(lambda x: x.rolling(window=2).sum())
print(df)

id  date
1   1        NaN
    2       40.0
    3       80.0
2   1        NaN
    2       42.0
    3       82.0
Name: d1, dtype: float64
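
On more recent pandas releases, groupby(...).apply(...) may itself prepend the group key, so the apply version can end up with the same repeated id level. As a minimal sketch (not part of the original answer), assuming a pandas version that provides Series.droplevel (0.24 or later), either transform or dropping the extra leading level keeps the index at (id, date). The names result and rolled are illustrative only:

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 1, 10, 20], [1, 2, 30, 40], [1, 3, 50, 60],
          [2, 1, 11, 21], [2, 2, 31, 41], [2, 3, 51, 61]],
    columns=['id', 'date', 'd1', 'd2'],
).set_index(['id', 'date'])

# Option 1: transform returns a result aligned to the original index,
# so no group key is prepended.
result = df.groupby(level='id')['d1'].transform(
    lambda s: s.rolling(window=2).sum()
)

# Option 2: keep groupby().rolling() and drop the extra leading level,
# but only if this pandas version actually added one.
rolled = df.groupby(level='id')['d1'].rolling(window=2).sum()
if rolled.index.nlevels > df.index.nlevels:
    rolled = rolled.droplevel(0)

print(result)
print(rolled)
```

Both options should leave a two-level (id, date) index; transform is probably the less version-sensitive of the two because it never adds the group key in the first place.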
jezrael answered Oct 23 '22 05:10