I have a DataFrame where I want to change several columns from type 'object' to 'category'.
I can change several columns at the same time for float,
dftest[['col3', 'col4', 'col5', 'col6']] = \
dftest[['col3', 'col4', 'col5', 'col6']].astype(float)
For 'category' I can not do it the same, I need to do one by one (or in a loop like here).
for col in ['col1', 'col2']:
dftest[col] = dftest[col].astype('category')
Question: Is there any way of doing the change for all wanted columns at once like in the 'float' example?
If I try to do several columns at the same time I have:
dftest[['col1','col2']] = dftest[['col1','col2']].astype('category')
## NotImplementedError: > 1 ndim Categorical are not supported at this time
My current working test code:
import numpy as np
import pandas as pd
factors= np.array([
['a', 'xx'],
['a', 'xx'],
['ab', 'xx'],
['ab', 'xx'],
['ab', 'yy'],
['cc', 'yy'],
['cc', 'zz'],
['d', 'zz'],
['d', 'zz'],
['g', 'zz']
])
values = np.random.randn(10,4).round(2)
dftest = pd.DataFrame(np.hstack([factors,values]),
columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
#dftest[['col1','col2']] = dftest[['col1','col2']].astype('category')
## NotImplementedError: > 1 ndim Categorical are not supported at this time
## it works with individual astype
#dftest['col2'] = dftest['col2'].astype('category')
#dftest['col1'] = dftest['col1'].astype('category')
print(dftest)
## doing a loop
for col in ['col1', 'col2']:
dftest[col] = dftest[col].astype('category')
dftest[['col3', 'col4', 'col5', 'col6']] = \
dftest[['col3', 'col4', 'col5', 'col6']].astype(float)
dftest.dtypes
output:
col1 category
col2 category
col3 float64
col4 float64
col5 float64
col6 float64
dtype: object
== [update] ==
I don't have a problem using the loop now that I know the trick, but I asked the question because I wanted to learn/understand WHY I need to do a loop for 'category' and not for float, if there is no other way of doing it.
to_numeric() The best way to convert one or more columns of a DataFrame to numeric values is to use pandas. to_numeric() . This function will try to change non-numeric objects (such as strings) into integers or floating-point numbers as appropriate.
Pandas replace multiple values in column replace. By using DataFrame. replace() method we will replace multiple values with multiple new strings or text for an individual DataFrame column. This method searches the entire Pandas DataFrame and replaces every specified value.
To convert the data type of multiple columns to float, use Pandas' apply(~) method with to_numeric(~) .
It's not immediately clear what the result of dftest[['col1','col2']].astype('category')
should be, i.e. whether the resulting columns should share the same categories or not.
Looping over columns make each column have a separate set of categories. (I believe this is a desired result in your example.)
On the other hand, .astype(float)
works differently: it ravels the underlying values to a 1d array, casts it to floats, and then reshape it back to the original shape. This way it may be faster than just iterating over columns. You can emulate this behaviour for category
with higher level functions:
result = dftest[['col1', 'col2']].stack().astype('category').unstack()
But then you get a single set of categories shared by the both columns:
result['col1']
Out[36]:
0 a
1 a
2 ab
3 ab
4 ab
5 cc
6 cc
7 d
8 d
9 g
Name: col1, dtype: category
Categories (8, object): [a < ab < cc < d < g < xx < yy < zz]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With