
Pandas dataframe encode Categorical variable with thousands of unique values

I have a dataframe with data on schools in a few thousand cities. The school is the row identifier and the city is encoded as follows:

school city          category   capacity
1      azez6576sebd  45         23
2      dsqozbc765aj  12         236
3      sqdqsd12887s  8          63 
4      azez6576sebd  7          234 
...

How can I convert the city variable to numeric, given that I have a few thousand cities? I guess one-hot encoding is not appropriate, as it would produce too many columns. What is the general approach for converting a categorical variable with thousands of levels to numeric?

Thank you.

roqds asked Feb 03 '18 01:02


2 Answers

You can use the pandas category dtype and take its integer codes; the result is the same as what LabelEncoder in sklearn would give you.

df.city=df.city.astype('category').cat.codes
df
Out[385]: 
   school  city  category  capacity
0       1     0        45        23
1       2     1        12       236
2       3     2         8        63
3       4     0         7       234
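
If you need to get back from the integer codes to the original city strings, you have to hold on to the categories. Here is a minimal sketch of that, rebuilding the sample rows from the question; the `city_code` column and the `city_cat` and `decoded` names are just illustrative. Keep in mind the codes are arbitrary labels assigned in sorted order, so a model may treat them as if they were ordered.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
    "category": [45, 12, 8, 7],
    "capacity": [23, 236, 63, 234],
})

# Convert once and keep the categorical around so the mapping is not lost.
city_cat = df["city"].astype("category")

# Integer code per row; codes follow the sorted order of the categories.
df["city_code"] = city_cat.cat.codes

# Reverse lookup: index the categories with the codes to recover the labels.
decoded = city_cat.cat.categories[df["city_code"].to_numpy()]
```

Storing the codes in a separate column (instead of overwriting `city`) is only a choice made here so both forms can be compared side by side.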
BENY answered Oct 23 '22 14:10


A few thousand columns is still manageable in the context of ML classifiers, although you'd want to watch out for the curse of dimensionality.

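That aside, you wouldn't want a get_dummies call to result in a memory blowout, so you could generate a sparse result instead (a SparseDataFrame in pandas versions before 1.0; later versions return a regular DataFrame with sparse columns):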

v = pd.get_dummies(df.set_index('school').city, sparse=True)
v

        azez6576sebd  dsqozbc765aj  sqdqsd12887s
school                                          
1                  1             0             0
2                  0             1             0
3                  0             0             1
4                  1             0             0

type(v)
pandas.core.sparse.frame.SparseDataFrame

You can generate a sparse matrix using to_coo (in pandas 1.0 and later it is called through the accessor, as v.sparse.to_coo()):

v.to_coo()

<4x3 sparse matrix of type '<class 'numpy.uint8'>'
    with 4 stored elements in COOrdinate format>
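
To get a feel for the memory difference, here is a rough sketch on made-up data. The 5,000 schools, the 500 `city_XXX` labels and the variable names are all assumptions for illustration, not the asker's data. It assumes pandas 1.0 or later, where the sparse result is an ordinary DataFrame and the conversion goes through the `.sparse` accessor, and it needs scipy installed for `to_coo`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5,000 schools spread over 500 made-up city labels.
rng = np.random.default_rng(0)
cities = [f"city_{i:03d}" for i in range(500)]
df = pd.DataFrame({
    "school": np.arange(1, 5001),
    "city": rng.choice(cities, size=5000),
})

city = df.set_index("school")["city"]

# Same one-hot encoding twice: dense columns vs. sparse columns.
dense = pd.get_dummies(city, dtype="uint8")
sparse = pd.get_dummies(city, sparse=True, dtype="uint8")

# Bytes used by each frame (both figures include the same index).
dense_bytes = dense.memory_usage().sum()
sparse_bytes = sparse.memory_usage().sum()

# scipy COO matrix; only the non-zero entries (one per row) are stored.
coo = sparse.sparse.to_coo()
```

Comparing `dense_bytes` with `sparse_bytes` shows how much the sparse layout saves on your own data; the gap grows with the number of distinct cities.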
cs95 answered Oct 23 '22 14:10