
Python - Speed up converting a categorical variable to its numerical index

I need to convert a column of categorical variables in a Pandas data frame into numerical values, where each value is the index of that category in the array of unique categories in the column (long story!). Here's a code snippet that accomplishes that:

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])  # sorted array of unique categories

# replace each label with its (float) index into the sorted unique array
for lab in uniq_lab:
    df['col'].replace(lab, np.where(uniq_lab == lab)[0][0].astype(float), inplace=True)

which converts the data frame:

    col
 0  baked
 1  beans
 2  baked
 3  baked
 4  beans

into the data frame:

    col
 0  0.0
 1  1.0
 2  0.0
 3  0.0
 4  1.0

as desired. But my problem is that my dumb little for loop (the only way I've thought of to do this) is slow as molasses when I try to run similar code on big data files. I was just curious as to whether anyone had any thoughts on whether there were any ways to do this more efficiently. Thanks in advance for any thoughts.

Karl Young asked Jun 07 '16


2 Answers

Use factorize:

df['col'] = pd.factorize(df.col)[0]
print (df)
   col
0    0
1    1
2    0
3    0
4    1

Docs
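If you need the codes to follow the alphabetical order of the unique labels, and as floats as in the question's output, factorize accepts a sort flag; a minimal sketch, assuming that ordering is what you're after:

codes, uniques = pd.factorize(df.col, sort=True)  # uniques comes back alphabetically sorted
df['col'] = codes.astype(float)                   # 0.0/1.0 instead of 0/1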

EDIT:

As Jeff mentioned in a comment, the best option is to convert the column to categorical, mainly because of lower memory usage:

df['col'] = df['col'].astype("category")
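If you still need the integer codes after the categorical conversion (equivalent to the factorize output), the categorical accessor exposes them; a minimal sketch, assuming the conversion above has already been applied:

codes = df['col'].cat.codes        # integer codes, same idea as pd.factorize
labels = df['col'].cat.categories  # the unique categories those codes index into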

Timings:

Interestingly, on a large df pandas is faster than numpy. I can't believe it.

len(df)=500k:

In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop  

len(df)=5k:

In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop

len(df)=5:

In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

Code for testing:

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print (df)
df = pd.concat([df]*100000).reset_index(drop=True)
#test for 5k
#df = pd.concat([df]*1000).reset_index(drop=True)

#four independent copies so each test works on unmodified data
df1, df2, df3, df4 = df.copy(), df.copy(), df.copy(), df.copy()

def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df

def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df

def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx    
    return df

print (a(df1))    
print (a1(df2))   
print (b(df3))   
print (b1(df4))  
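For completeness, the question's original replace loop could be wrapped the same way and timed against these; a minimal sketch (the helper name c is just for illustration):

def c(df):
    # the question's original approach, for timing comparison
    uniq_lab = np.unique(df['col'])
    for lab in uniq_lab:
        df['col'].replace(lab, np.where(uniq_lab == lab)[0][0].astype(float), inplace=True)
    return df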
jezrael answered Sep 28 '22


You can use np.unique's optional argument return_inverse to get an ID for each string based on the array of unique values, and assign those IDs back into the input dataframe, like so -

_,idx = np.unique(df['col'],return_inverse=True)
df['col'] = idx

Please note that the IDs correspond to the alphabetically sorted array of unique strings. If you also need that unique array, you can capture it in place of _, like so -

uniq_lab,idx = np.unique(df['col'],return_inverse=True)
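Since return_inverse guarantees that uniq_lab[idx] reconstructs the original values, you can map the IDs back to labels at any time; a minimal sketch, assuming df['col'] still holds the original strings at this point:

uniq_lab, idx = np.unique(df['col'], return_inverse=True)
original = uniq_lab[idx]   # round trip: integer IDs back to the original labels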

Sample run -

>>> d = {'col': ["baked","beans","baked","baked","beans"]}
>>> df = pd.DataFrame(data=d)
>>> df
     col
0  baked
1  beans
2  baked
3  baked
4  beans
>>> _,idx = np.unique(df['col'],return_inverse=True)
>>> df['col'] = idx
>>> df
   col
0    0
1    1
2    0
3    0
4    1
Divakar answered Sep 28 '22