I keep running into the problem of having to assign a unique ID to each group in a data set. I've needed this when zero-padding sequences for RNNs, generating graphs, and on many other occasions.
This can usually be done by concatenating the values in each pd.groupby column. However, the number of columns that define a group, their dtypes, or the size of their values often make concatenation an impractical solution that needlessly uses up memory.
I was wondering if there was an easy way to assign a unique numeric ID to groups in pandas.
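For reference, the concatenation workaround described above might look roughly like the sketch below. The frame and the column names `city`, `year`, and `key` are made up for illustration and are not from the original question.

```python
import pandas as pd

# Hypothetical frame: "city" and "year" together define a group.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome", "Oslo"],
    "year": [2020, 2021, 2020, 2020, 2020],
})

# Build one string label per row by concatenating the key columns.
# Non-string columns must be cast to str first, and the resulting
# strings grow with the number and size of the key columns.
df["key"] = df["city"] + "_" + df["year"].astype(str)

print(df["key"].tolist())
```

The label identifies the group, but it is a string per row rather than a compact integer, which is the memory cost the question is trying to avoid.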
You just need ngroup (or pd.factorize). Using the data from seeiespi's answer:
df.groupby('C').ngroup()
Out[322]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int64
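Since the question mentions groups defined by several columns, here is a small sketch of the same call with a list of keys. The second key column `D` and its values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "C": ["a", "a", "c", "b", "b", "b", "b", "c", "c"],
    "D": [1, 1, 1, 2, 1, 1, 2, 1, 2],  # hypothetical second key column
})

# One integer per distinct (C, D) pair. With the default sort=True,
# ids follow the sorted order of the key tuples.
df["gid"] = df.groupby(["C", "D"]).ngroup()

print(df)
```

No concatenated key is built here; the id is a plain integer column regardless of how many columns define the group.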
More options:
pd.factorize(df.C)[0]
Out[323]: array([0, 0, 1, 2, 2, 2, 2, 1, 1], dtype=int64)
df.C.astype('category').cat.codes
Out[324]:
0 0
1 0
2 2
3 1
4 1
5 1
6 1
7 2
8 2
dtype: int8
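Note that the outputs above do not all number the groups the same way: pd.factorize labels values in order of first appearance, while ngroup and the category codes follow sorted order. A minimal sketch of how the two orderings can be made to agree, using the same column C (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"C": ["a", "a", "c", "b", "b", "b", "b", "c", "c"]})

# factorize: ids in order of first appearance (a=0, c=1, b=2)
by_appearance = pd.factorize(df["C"])[0]

# factorize with sort=True: ids in sorted key order (a=0, b=1, c=2)
by_sorted_key = pd.factorize(df["C"], sort=True)[0]

# ngroup follows the groupby's own ordering, which sort= controls
ngroup_sorted = df.groupby("C").ngroup()
ngroup_appearance = df.groupby("C", sort=False).ngroup()

print(by_appearance, by_sorted_key)
print(ngroup_sorted.tolist(), ngroup_appearance.tolist())
```

Which ordering is preferable depends on the use case; if the ids only need to be unique per group, either works.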
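I came up with a simple solution that I keep coming back to and wanted to share:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,6,3,7,3,2],'B':[4,3,8,2,6,3,9,1,0], 'C':['a','a','c','b','b','b','b','c','c']})
# sort so that each group's rows are contiguous (stable sort keeps the row order within a group)
df = df.sort_values('C', kind='stable')
# flag the first row of each group with 1, every other row with 0
df['gid'] = (df.groupby(['C']).cumcount()==0).astype(int)
# a running total of the flags gives each group its own ID, starting at 1
df['gid'] = df['gid'].cumsum()
In [17]: df
Out[17]:
   A  B  C  gid
0  1  4  a    1
1  2  3  a    1
3  4  2  b    2
4  6  6  b    2
5  3  3  b    2
6  7  9  b    2
2  3  8  c    3
7  3  1  c    3
8  2  0  c    3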
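As a sanity check, the sketch below applies the sort-and-cumsum idea to a sorted copy, writes the ids back to the frame in its original row order, and compares them with ngroup(). The names `sorted_df` and `matches` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 6, 3, 7, 3, 2],
    "B": [4, 3, 8, 2, 6, 3, 9, 1, 0],
    "C": ["a", "a", "c", "b", "b", "b", "b", "c", "c"],
})

# Work on a sorted copy so each group's rows are contiguous, flag the
# first row of each group, and take a running total (ids start at 1).
sorted_df = df.sort_values("C")
sorted_df["gid"] = (sorted_df.groupby("C").cumcount() == 0).cumsum()

# sort_values keeps the original index labels, so this assignment
# aligns on the index and leaves df in its original row order.
df["gid"] = sorted_df["gid"]

# The ids agree with ngroup(), shifted by one.
matches = (df["gid"] == df.groupby("C").ngroup() + 1).all()

print(df["gid"].tolist(), matches)
```

The two approaches give the same grouping here; ngroup() simply avoids the explicit sort and starts counting at 0.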