Pandas cast all object columns to category

I want an elegant function to cast all object columns in a pandas DataFrame to categories.

df[x] = df[x].astype("category") performs the type cast for a single column, and df.select_dtypes(include=['object']) sub-selects all object columns. However, the selection drops the other columns, so a manual merge is required afterwards. Is there a solution that "just works in place" or does not require a manual cast?

Edit

I am looking for something similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html, but for a conversion to categorical data.

asked Oct 06 '16 20:10

Georg Heiler

3 Answers

Use apply and pd.Series.astype with dtype='category'.

Consider the pd.DataFrame df

import pandas as pd

# Two integer columns (A, C) and two string columns (B, D)
df = pd.DataFrame(dict(
        A=[1, 2, 3, 4],
        B=list('abcd'),
        C=[2, 3, 4, 5],
        D=list('defg')
    ))
df

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null int64
B    4 non-null object
C    4 non-null int64
D    4 non-null object
dtypes: int64(2), object(2)
memory usage: 200.0+ bytes

Let's use select_dtypes to pick out all 'object' columns and convert them, then recombine the result with a second select_dtypes call that excludes them.

df = pd.concat([
        df.select_dtypes(exclude=['object']),  # non-object columns, left unchanged
        df.select_dtypes(include=['object']).apply(pd.Series.astype, dtype='category')
        ], axis=1).reindex_axis(df.columns, axis=1)  # restore the original column order

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
A    4 non-null int64
B    4 non-null category
C    4 non-null int64
D    4 non-null category
dtypes: category(2), int64(2)
memory usage: 208.0 bytes
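Since the question asks for a function, here is a minimal sketch of one. It swaps the concat for a dtype mapping passed to DataFrame.astype, which is a different technique from the one above, and objects_to_category is a hypothetical name, not a pandas API. The example frame spells out dtype=object on purpose, on the assumption that the default dtype for string columns can differ between pandas versions.

```python
import pandas as pd

# dtype=object is given explicitly so the example does not rely on
# whatever dtype this pandas version infers for string data.
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})


def objects_to_category(frame):
    """Hypothetical helper: return a copy with object columns cast to category."""
    obj_cols = frame.select_dtypes(include=['object']).columns
    # astype accepts a {column: dtype} mapping; unlisted columns keep their dtype.
    return frame.astype({col: 'category' for col in obj_cols})


converted = objects_to_category(df)
print(converted.dtypes)
```

The column order is preserved and the input frame is left untouched, so no reindexing step is needed afterwards.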

answered Sep 28 '22 02:09

piRSquared

I think that this is a more elegant way:

df = pd.DataFrame(dict(
        A=[1, 2, 3, 4],
        B=list('abcd'),
        C=[2, 3, 4, 5],
        D=list('defg')
    ))

df.info()

df.loc[:, df.dtypes == 'object'] =\
    df.select_dtypes(['object'])\
    .apply(lambda x: x.astype('category'))

df.info()
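One caveat, as far as I can tell: in recent pandas releases an assignment through .loc tries to write the new values into the existing columns, so the object dtype may survive and the second df.info() can still report object. A sketch of a variant that assigns by column label instead, which replaces the columns outright (obj_cols is just an illustrative name, and dtype=object is spelled out as an assumption so the example does not depend on the version's default string dtype):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})

# Labels of the object columns
obj_cols = df.select_dtypes(include=['object']).columns

# Plain column assignment swaps in the converted columns,
# so the category dtype is kept.
df[obj_cols] = df[obj_cols].astype('category')

print(df.dtypes)
```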

answered Sep 28 '22 01:09

KG in Chicago

Wish I could add this as a comment, but can't.

The accepted answer doesn't work for pandas version 0.25 and higher, because reindex_axis is no longer available. Use .reindex(columns=df.columns) instead of .reindex_axis(df.columns, axis=1). See here for more information: https://github.com/scikit-hep/root_pandas/issues/82
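For reference, a sketch of the accepted answer's concat approach with that one substitution made. Everything else follows the original; dtype=object is written out only as an assumption, so the example behaves the same regardless of the default string dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': pd.Series(list('abcd'), dtype=object),
    'C': [2, 3, 4, 5],
    'D': pd.Series(list('defg'), dtype=object),
})

df = pd.concat([
    # non-object columns, left unchanged
    df.select_dtypes(exclude=['object']),
    # object columns, each cast to category
    df.select_dtypes(include=['object']).apply(pd.Series.astype, dtype='category'),
], axis=1).reindex(columns=df.columns)  # restore the original column order

df.info()
```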

answered Sep 28 '22 02:09

a Data Head

Tags: python, casting, pandas, categorical-data