Pandas: Enumerate duplicates in index

Let's say I have a list of events that happen on different keys.

import pandas

data = [
    {"key": "A", "event": "created"},
    {"key": "A", "event": "updated"},
    {"key": "A", "event": "updated"},
    {"key": "A", "event": "updated"},
    {"key": "B", "event": "created"},
    {"key": "B", "event": "updated"},
    {"key": "B", "event": "updated"},
    {"key": "C", "event": "created"},
    {"key": "C", "event": "updated"},
    {"key": "C", "event": "updated"},
    {"key": "C", "event": "updated"},
    {"key": "C", "event": "updated"},
    {"key": "C", "event": "updated"},
]

df = pandas.DataFrame(data)

I would like to index my DataFrame on the key first and then on an enumeration that restarts at 0 for each key. It looks like a simple unstack operation, but I'm unable to find how to do it properly.

The best I could do was

df.set_index("key", append=True).swaplevel(0, 1)

          event
key            
A   0   created
    1   updated
    2   updated
    3   updated
B   4   created
    5   updated
    6   updated
C   7   created
    8   updated
    9   updated
    10  updated
    11  updated
    12  updated

but what I'm expecting is

          event
key            
A   0   created
    1   updated
    2   updated
    3   updated
B   0   created
    1   updated
    2   updated
C   0   created
    1   updated
    2   updated
    3   updated
    4   updated
    5   updated

I also tried something like

df.groupby("key")["key"].count().apply(range).apply(pandas.Series).stack()

but the order is not preserved, so I can't apply the result as an index. Besides, it feels like overkill for an operation that looks quite standard...

Any ideas?

asked Nov 15 '18 21:11 by Cilyan


1 Answer

groupby + cumcount

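Here are two equivalent ways; use one or the other, starting from the original df:

# new version (thanks @ScottBoston): pass the per-key counter straight to set_index
df = df.set_index(['key', df.groupby('key').cumcount()])\
       .rename_axis(['key', 'count'])

# original version (an alternative, not a second step): add the counter as a column, then index on it
df = df.assign(count=df.groupby('key').cumcount())\
       .set_index(['key', 'count'])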

print(df)

             event
key count         
A   0      created
    1      updated
    2      updated
    3      updated
B   0      created
    1      updated
    2      updated
C   0      created
    1      updated
    2      updated
    3      updated
    4      updated
    5      updated
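
To see why this works, it may help to look at what cumcount() returns on its own. The following is a minimal sketch, not part of the original answer: the interleaved sample data and the names counter and indexed are made up for illustration. It suggests that the counter follows the original row order within each key, which is the property the count/range attempt in the question was missing.

```python
import pandas as pd

# Hypothetical data: same columns as in the question, but with the keys
# interleaved to show that the counter follows row order within each key.
df = pd.DataFrame(
    {
        "key": ["A", "B", "A", "C", "B", "A"],
        "event": ["created", "created", "updated", "created", "updated", "updated"],
    }
)

# cumcount() numbers the rows of each group 0, 1, 2, ... and returns a
# Series aligned with the original row index, so no reordering happens.
counter = df.groupby("key").cumcount()
print(counter.tolist())  # [0, 0, 1, 0, 1, 2]

# Because the Series is aligned with df, it can be passed to set_index
# alongside the "key" column label.
indexed = df.set_index(["key", counter]).rename_axis(["key", "count"])
print(indexed)
```

If the keys are not already grouped together, as in this sketch, calling sort_index() on the result should bring each key's rows next to each other.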

answered Oct 04 '22 14:10 by jpp