I want an elegant function to cast all object columns in a pandas DataFrame to categories.
df[x] = df[x].astype("category")
performs the type cast
df.select_dtypes(include=['object'])
would sub-select all object columns. However, this drops the other columns, so a manual merge is required. Is there a solution that "just works in place" or does not require a manual cast?
I am looking for something similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html, but for a conversion to categorical data.
Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame's columns to column-specific types.
The astype() method returns a new DataFrame where the data types have been changed to the specified type. You can cast the entire DataFrame to one specific data type, or you can use a Python dictionary to specify a data type for each column, like this: { 'Duration': 'int64', 'Pulse' : 'float', 'Calories': 'int64' }
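As a minimal sketch of the dictionary form described above, applied to the question's goal: build the {col: dtype} mapping from the object columns and pass it to astype. The helper name objects_to_category and the sample data are illustrative assumptions, not part of pandas. The string columns are created with an explicit object dtype so that select_dtypes finds them.

```python
import pandas as pd


def objects_to_category(frame):
    """Return a copy of frame with every object column cast to category."""
    # Hypothetical helper: map each object column label to the 'category' dtype.
    object_cols = frame.select_dtypes(include=['object']).columns
    return frame.astype({col: 'category' for col in object_cols})


# Illustrative data; string columns are given an explicit object dtype.
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})

converted = objects_to_category(df)
print(converted.dtypes)
```

Because astype returns a new DataFrame, the original df is left unchanged and column order is preserved without any manual merge.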
tolist(): Return a list of the values. These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period). Returns a list.
Use apply and pd.Series.astype with dtype='category'.
Consider the pd.DataFrame df:
import pandas as pd
df = pd.DataFrame(dict(
A=[1, 2, 3, 4],
B=list('abcd'),
C=[2, 3, 4, 5],
D=list('defg')
))
df
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A 4 non-null int64
B 4 non-null object
C 4 non-null int64
D 4 non-null object
dtypes: int64(2), object(2)
memory usage: 200.0+ bytes
Let's use select_dtypes to include all 'object' columns for conversion, then recombine them with a second select_dtypes call that excludes them.
df = pd.concat([
df.select_dtypes([], ['object']),
df.select_dtypes(['object']).apply(pd.Series.astype, dtype='category')
], axis=1).reindex_axis(df.columns, axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A 4 non-null int64
B 4 non-null category
C 4 non-null int64
D 4 non-null category
dtypes: category(2), int64(2)
memory usage: 208.0 bytes
I think that this is a more elegant way:
df = pd.DataFrame(dict(
A=[1, 2, 3, 4],
B=list('abcd'),
C=[2, 3, 4, 5],
D=list('defg')
))
df.info()
df.loc[:, df.dtypes == 'object'] =\
df.select_dtypes(['object'])\
.apply(lambda x: x.astype('category'))
df.info()
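On recent pandas releases, assigning through .loc[:, mask] may try to write the new values into the existing object columns instead of replacing them, so the dtypes can remain object. A sketch of a variant that assigns by column label instead, assuming the string columns have object dtype (the sample data sets that dtype explicitly):

```python
import pandas as pd

# Illustrative data; string columns are given an explicit object dtype.
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})

# Labels of the object columns to convert.
object_cols = df.select_dtypes(include=['object']).columns

# Assigning by column label replaces those columns, so the category dtype is kept.
df[object_cols] = df[object_cols].astype('category')

df.info()
```

The other columns are untouched and the column order stays the same, so no concat or reindex step is needed.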
The accepted answer doesn't work for pandas version 0.25 and higher. Use .reindex instead of reindex_axis. See here for more information:
https://github.com/scikit-hep/root_pandas/issues/82
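A sketch of how the concat-based approach might look with .reindex, assuming a pandas version where reindex_axis is no longer available. It also calls astype directly on the object sub-frame rather than going through apply, which is a simplification on my part and not the original answer's exact code.

```python
import pandas as pd

# Illustrative data; string columns are given an explicit object dtype.
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})

# Remember the original column order before splitting the frame.
original_columns = df.columns

df = pd.concat([
    df.select_dtypes(exclude=['object']),                     # untouched columns
    df.select_dtypes(include=['object']).astype('category'),  # converted columns
], axis=1).reindex(columns=original_columns)                  # restore column order

df.info()
```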