I have a dataframe with data on schools for a few thousand cities. The school is the row identifier and the city is encoded as follows:
school city          category   capacity
1      azez6576sebd  45         23
2      dsqozbc765aj  12         236
3      sqdqsd12887s  8          63 
4      azez6576sebd  7          234 
...
How can I convert the city variable to numeric, given that I have a few thousand cities? I guess one-hot encoding is not appropriate, as I would end up with too many columns. What is the general approach for converting a categorical variable with thousands of levels to numeric?
Thank you.
You can use the category dtype in pandas; its integer codes are the same kind of encoding you would get from LabelEncoder in sklearn.
df['city'] = df['city'].astype('category').cat.codes
df
Out[385]: 
   school  city  category  capacity
0       1     0        45        23
1       2     1        12       236
2       3     2         8        63
3       4     0         7       234
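Overwriting the city column with its codes discards the original labels. A minimal sketch of one way to keep them recoverable, assuming the same four-row frame as in the question (the names city_code and code_to_city are made up for illustration):

```python
import pandas as pd

# Sample frame rebuilt from the question's first four rows
df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
    "category": [45, 12, 8, 7],
    "capacity": [23, 236, 63, 234],
})

# Keep the categorical around so the code -> label mapping is not lost
city_cat = df["city"].astype("category")

# Hypothetical helper mapping: integer code -> original city string
code_to_city = dict(enumerate(city_cat.cat.categories))

# Store the codes in a new column instead of overwriting "city"
df["city_code"] = city_cat.cat.codes

# Decode the integers back to the original labels when needed
decoded = df["city_code"].map(code_to_city)
```

Note that these codes are arbitrary integers (assigned in sorted label order), so a model may read an ordering into them that does not exist.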
A few thousand columns is still manageable in the context of ML classifiers, although you'd want to watch out for the curse of dimensionality.
That aside, you wouldn't want a get_dummies call to result in a memory blowout, so you could generate a sparse result instead by passing sparse=True (a SparseDataFrame in pandas versions before 1.0; a regular DataFrame with sparse columns in later versions):
v = pd.get_dummies(df.set_index('school').city, sparse=True)
v
        azez6576sebd  dsqozbc765aj  sqdqsd12887s
school                                          
1                  1             0             0
2                  0             1             0
3                  0             0             1
4                  1             0             0
type(v)
pandas.core.sparse.frame.SparseDataFrame
You can then generate a SciPy sparse matrix using to_coo (in pandas 1.0 and later, the call is v.sparse.to_coo()):
v.to_coo()
<4x3 sparse matrix of type '<class 'numpy.uint8'>'
    with 4 stored elements in COOrdinate format>
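SparseDataFrame was removed in pandas 1.0, so the session above will not reproduce exactly on current versions. A minimal sketch of the same idea on a recent pandas, assuming SciPy is installed and using the four sample rows from the question (dtype="uint8" is passed only to mirror the original output):

```python
import pandas as pd

# Sample frame rebuilt from the question's first four rows
df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
})

# One-hot encode with sparse columns; the result is a regular DataFrame
# whose columns use a sparse dtype
v = pd.get_dummies(df.set_index("school")["city"], sparse=True, dtype="uint8")

# Convert to a SciPy COO matrix through the .sparse accessor
coo = v.sparse.to_coo()

# Only the nonzero entries are stored: one per school
print(coo.shape, coo.nnz)
```

The resulting matrix can be passed to most sklearn estimators directly, usually after converting it with coo.tocsr().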