I have a dataframe with data on schools for a few thousand cities. The school is the row identifier and the city is encoded as follows:
school  city          category  capacity
1       azez6576sebd  45        23
2       dsqozbc765aj  12        236
3       sqdqsd12887s  8         63
4       azez6576sebd  7         234
...
How can I convert the city variable to numeric, knowing that I have a few thousand cities? I guess one-hot encoding is not appropriate as I will have too many columns. What is the general approach to converting a categorical variable with thousands of levels to numeric?
Thank you.
You can use the category dtype in pandas; in sklearn, the equivalent would be LabelEncoder.
df.city=df.city.astype('category').cat.codes
df
Out[385]:
   school  city  category  capacity
0       1     0        45        23
1       2     1        12       236
2       3     2         8        63
3       4     0         7       234
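If you need to map the integer codes back to the original city strings later, it helps to keep the category levels around before overwriting the column. The following is a minimal sketch, assuming the four sample rows from the question; the names city_cat, city_levels and decoded are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
    "category": [45, 12, 8, 7],
    "capacity": [23, 236, 63, 234],
})

# Convert once and keep the levels: position i holds the city for code i.
city_cat = df["city"].astype("category")
city_levels = list(city_cat.cat.categories)

# Replace the strings with their integer codes, as in the answer above.
df["city"] = city_cat.cat.codes

# Decode the codes back to city strings when needed.
decoded = [city_levels[code] for code in df["city"]]
```

Note that the codes follow the sorted order of the levels, so they are arbitrary labels: a model that treats them as ordered numbers will read an ordering into the cities that is not there.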
A few thousand columns is still manageable in the context of ML classifiers, although you'd want to watch out for the curse of dimensionality.
That aside, you wouldn't want a get_dummies call to result in a memory blowout, so you could generate a SparseDataFrame instead:
v = pd.get_dummies(df.set_index('school').city, sparse=True)
v
        azez6576sebd  dsqozbc765aj  sqdqsd12887s
school
1                  1             0             0
2                  0             1             0
3                  0             0             1
4                  1             0             0
type(v)
pandas.core.sparse.frame.SparseDataFrame
You can generate a sparse matrix using sdf.to_coo:
v.to_coo()
<4x3 sparse matrix of type '<class 'numpy.uint8'>'
with 4 stored elements in COOrdinate format>
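SparseDataFrame was removed in pandas 1.0, so the session above only runs as shown on older versions. On current pandas, get_dummies(..., sparse=True) returns a regular DataFrame with sparse columns, and the conversion goes through the .sparse accessor. Here is a rough equivalent, assuming a recent pandas with SciPy installed and the sample rows from the question; dtype="uint8" is passed explicitly to match the output above.

```python
import pandas as pd

# Sample rows copied from the question (only the columns used here).
df = pd.DataFrame({
    "school": [1, 2, 3, 4],
    "city": ["azez6576sebd", "dsqozbc765aj", "sqdqsd12887s", "azez6576sebd"],
})

# One-hot encode the city column into sparse columns, indexed by school.
v = pd.get_dummies(df.set_index("school")["city"], sparse=True, dtype="uint8")

# Convert to a SciPy COO matrix through the .sparse accessor (needs SciPy).
m = v.sparse.to_coo()
```

Only the non-zero entries are stored, so memory grows with the number of rows rather than rows times cities.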