
How to create groups with duplicate keys in pandas groupby? [duplicate]

Tags: python, pandas

I want to create groups based on the last column (Key), but only consecutive rows with the same key should end up together. For the data below, the first group should contain only the first row (index 0), the second group rows 1 and 2, and the third group the last two rows (3 and 4).

I tried the pandas groupby method, grouped = df1.groupby('Key'), but as expected it returns only 2 groups, one per distinct key.

import pandas as pd

# Sample frame: Key repeats non-consecutively (K0, K1, K1, K0, K0)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
                    'B': ['B0', 'B1', 'B2', 'B3', 'B4'],
                    'C': ['C0', 'C1', 'C2', 'C3', 'C4'],
                    'D': ['D0', 'D1', 'D2', 'D3', 'D4'],
                    'Key': ['K0', 'K1', 'K1', 'K0', 'K0']},
                   index=[0, 1, 2, 3, 4])


df1
Out[89]: 
    A   B   C   D Key
0  A0  B0  C0  D0  K0
1  A1  B1  C1  D1  K1
2  A2  B2  C2  D2  K1
3  A3  B3  C3  D3  K0
4  A4  B4  C4  D4  K0
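
For reference, here is a minimal sketch of the behaviour described above. It keeps only the Key column, and the name plain_groups is just illustrative. A plain groupby on Key pools every row that shares a key, whether or not the rows are adjacent, so K0 collects rows 0, 3 and 4.

```python
import pandas as pd

# Same Key column as in the question; the other columns are omitted for brevity.
df1 = pd.DataFrame({'Key': ['K0', 'K1', 'K1', 'K0', 'K0']})

# groupby('Key') yields one group per distinct key, ignoring row order.
plain_groups = {key: list(idx) for key, idx in df1.groupby('Key').groups.items()}
print(plain_groups)
```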
Asked by vampiretap, Jul 24 '18 17:07


1 Answer

You can combine shift and cumsum to label each run of consecutive equal keys, then group by that label:

grouped = df1.groupby((df1.Key != df1.Key.shift()).cumsum())
grouped.groups

{1: Int64Index([0], dtype='int64'),
 2: Int64Index([1, 2], dtype='int64'),
 3: Int64Index([3, 4], dtype='int64')}
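
If you need the rows themselves rather than the index labels, you can iterate over the grouped object. The sketch below assumes a trimmed copy of the question's frame (columns A and Key only). The names run_id and runs are illustrative and not part of pandas.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A4'],
                    'Key': ['K0', 'K1', 'K1', 'K0', 'K0']})

# Label each run of consecutive equal keys.
run_id = (df1['Key'] != df1['Key'].shift()).cumsum()

# Each iteration yields the run label and the sub-DataFrame for that run.
runs = [(int(rid), part['Key'].iloc[0], part.index.tolist())
        for rid, part in df1.groupby(run_id)]

for rid, key, rows in runs:
    print(rid, key, rows)
```

Because the run label replaces Key as the grouping value, the original key is read back from the first row of each run.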

To show why this works:

(df1.Key != df1.Key.shift()).cumsum()

0    1
1    2
2    2
3    3
4    3
Name: Key, dtype: int32

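Comparing Key with its shifted copy gives True on every row whose key differs from the row above it (the first row is always True, because it has no predecessor). The cumulative sum of those booleans increases by one at each change, even if the new value has been seen before. Grouping by this new series therefore groups only consecutive repeated keys.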
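
To make the intermediate steps visible, here is a small sketch that splits the one-liner into its parts. It works on a standalone Series with the question's key values. The names previous, changed, run_id and steps are only for illustration.

```python
import pandas as pd

key = pd.Series(['K0', 'K1', 'K1', 'K0', 'K0'], name='Key')

previous = key.shift()      # each key moved down one row; the first slot is missing
changed = key != previous   # True wherever a new run starts
run_id = changed.cumsum()   # running count of run starts, used as the group label

# Side-by-side view of every step.
steps = pd.DataFrame({'Key': key, 'previous': previous,
                      'changed': changed, 'run_id': run_id})
print(steps)
```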

Answered by user3483203, Oct 24 '22 08:10