Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A pythonic and uFunc-y way to turn pandas column into "increasing" index? [duplicate]

Tags:

python

pandas

Let's say I have a pandas df like so:

Index   A     B
0      foo    3
1      foo    2
2      foo    5
3      bar    3
4      bar    4
5      baz    5

What's a good fast way to add a column like so:

Index   A     B    Aidx
0      foo    3    0
1      foo    2    0
2      foo    5    0
3      bar    3    1
4      bar    4    1
5      baz    5    2

I.e. adding an increasing index for each unique value?

I know I could use df.unique(), then use a dict and enumerate to create a lookup, and then apply that dictionary lookup to create the column. But I feel like there should be faster way, possibly involving groupby with some special function?

like image 325
Lagerbaer Avatar asked Dec 14 '18 00:12

Lagerbaer


2 Answers

No need groupby using


Method 1factorize

pd.factorize(df.A)[0]
array([0, 0, 0, 1, 1, 2], dtype=int64)
#df['Aidx']=pd.factorize(df.A)[0]

Method 2 sklearn

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df.A)
LabelEncoder()
le.transform(df.A)
array([2, 2, 2, 0, 0, 1])

Method 3 cat.codes

df.A.astype('category').cat.codes

Method 4 map + unique

l=df.A.unique()
df.A.map(dict(zip(l,range(len(l)))))
0    0
1    0
2    0
3    1
4    1
5    2
Name: A, dtype: int64

Method 5 np.unique

x,y=np.unique(df.A.values,return_inverse=True)
y
array([2, 2, 2, 0, 0, 1], dtype=int64)

EDIT: Some timings with OP's dataframe

'''

%timeit pd.factorize(view.Company)[0]

The slowest run took 6.68 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 155 µs per loop

%timeit view.Company.astype('category').cat.codes

The slowest run took 4.48 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 449 µs per loop

from itertools import izip

%timeit l = view.Company.unique(); view.Company.map(dict(izip(l,xrange(len(l)))))

1000 loops, best of 3: 666 µs per loop

import numpy as np

%timeit np.unique(view.Company.values, return_inverse=True)

The slowest run took 8.08 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.7 µs per loop

Seems like numpy wins.

like image 90
BENY Avatar answered Sep 22 '22 02:09

BENY


One way is to use ngroup. Just remember you have to make sure your groupby isn't resorting the groups to get your desired output, so set sort=False:

df['Aidx'] = df.groupby('A',sort=False).ngroup()
>>> df
   Index    A  B  Aidx
0      0  foo  3     0
1      1  foo  2     0
2      2  foo  5     0
3      3  bar  3     1
4      4  bar  4     1
5      5  baz  5     2
like image 37
sacuL Avatar answered Sep 23 '22 02:09

sacuL