Let's say I have a pandas df like so:
Index A B
0 foo 3
1 foo 2
2 foo 5
3 bar 3
4 bar 4
5 baz 5
What's a good fast way to add a column like so:
Index A B Aidx
0 foo 3 0
1 foo 2 0
2 foo 5 0
3 bar 3 1
4 bar 4 1
5 baz 5 2
I.e. adding an increasing index for each unique value?
I know I could use df.A.unique(), then use a dict and enumerate to create a lookup, and then apply that dictionary lookup to create the column. But I feel like there should be a faster way, possibly involving groupby with some special function?
No need for groupby; any of the following works.
Method 1 factorize
pd.factorize(df.A)[0]
array([0, 0, 0, 1, 1, 2], dtype=int64)
#df['Aidx']=pd.factorize(df.A)[0]
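For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the example frame from the question and assigns the factorize codes as the new column. The variable names `codes` and `uniques` are just illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "baz"],
    "B": [3, 2, 5, 3, 4, 5],
})

# factorize returns (codes, uniques); codes are numbered in order of
# first appearance, which is the numbering the question asks for.
codes, uniques = pd.factorize(df["A"])
df["Aidx"] = codes

print(df)
```

The second return value, `uniques`, holds the distinct values in the order their codes were assigned, so `uniques[code]` maps a code back to its label.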
Method 2 sklearn
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.A)
LabelEncoder()
le.transform(df.A)
array([2, 2, 2, 0, 0, 1])
Method 3 cat.codes
df.A.astype('category').cat.codes
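Note that astype('category') sorts the categories, so the codes follow alphabetical order rather than order of appearance. If the numbering from the question is required, one possible workaround (a sketch, not the only option) is to pass the categories explicitly in order of appearance:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "baz"],
    "B": [3, 2, 5, 3, 4, 5],
})

# Default: categories are sorted (bar, baz, foo), so foo gets code 2.
sorted_codes = df["A"].astype("category").cat.codes

# Explicit categories in order of first appearance (foo, bar, baz).
ordered = pd.Categorical(df["A"], categories=df["A"].unique())
appearance_codes = ordered.codes

print(sorted_codes.tolist())
print(appearance_codes.tolist())
```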
Method 4 map + unique
l=df.A.unique()
df.A.map(dict(zip(l,range(len(l)))))
0 0
1 0
2 0
3 1
4 1
5 2
Name: A, dtype: int64
Method 5 np.unique
x,y=np.unique(df.A.values,return_inverse=True)
y
array([2, 2, 2, 0, 0, 1], dtype=int64)
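As the output shows, np.unique returns the unique values sorted, so the inverse indices do not match the order-of-appearance numbering from the question. A sketch of one way to remap them, using the return_index option of np.unique (the helper names `order` and `rank` are only illustrative):

```python
import numpy as np

values = np.array(["foo", "foo", "foo", "bar", "bar", "baz"])

# uniques is sorted; first_idx holds the position where each unique
# value first occurs; inverse holds the sorted-order code per element.
uniques, first_idx, inverse = np.unique(
    values, return_index=True, return_inverse=True
)

# Rank the unique values by where they first appear in the input.
order = np.argsort(first_idx)
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

# Translate sorted-order codes into order-of-appearance codes.
appearance_codes = rank[inverse]
print(appearance_codes)
```

The extra argsort adds some work on top of the plain np.unique call, so this variant may not keep the speed advantage of the original one-liner.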
EDIT: Some timings with OP's dataframe:
%timeit pd.factorize(view.Company)[0]
The slowest run took 6.68 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 155 µs per loop
%timeit view.Company.astype('category').cat.codes
The slowest run took 4.48 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 449 µs per loop
from itertools import izip  # Python 2 only; on Python 3 use the built-in zip and range instead of izip and xrange
%timeit l = view.Company.unique(); view.Company.map(dict(izip(l,xrange(len(l)))))
1000 loops, best of 3: 666 µs per loop
import numpy as np
%timeit np.unique(view.Company.values, return_inverse=True)
The slowest run took 8.08 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.7 µs per loop
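Seems like numpy wins on speed here. Keep in mind that np.unique (like cat.codes and LabelEncoder) numbers the values in sorted order, so only factorize and map + unique reproduce the order-of-appearance numbering asked for.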
One way is to use ngroup. Just remember that groupby sorts the group keys by default, so set sort=False to number the groups in order of appearance:
df['Aidx'] = df.groupby('A',sort=False).ngroup()
>>> df
Index A B Aidx
0 0 foo 3 0
1 1 foo 2 0
2 2 foo 5 0
3 3 bar 3 1
4 4 bar 4 1
5 5 baz 5 2
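To see the effect of sort=False, here is a small self-contained sketch comparing both settings on the example data from the question (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar", "bar", "baz"],
    "B": [3, 2, 5, 3, 4, 5],
})

# sort=False numbers the groups in order of first appearance.
unsorted_idx = df.groupby("A", sort=False).ngroup()

# The default sort=True numbers them by sorted key (bar, baz, foo).
sorted_idx = df.groupby("A").ngroup()

df["Aidx"] = unsorted_idx
print(df)
print(sorted_idx.tolist())
```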