I have two dataframes with identical column names and dtypes, similar to the following:
A object
B category
C category
The categories are not identical in each of the dataframes.
When concatenating normally, pandas outputs:
A object
B object
C object
Which is the expected behaviour as per the documentation.
However, I wish to keep the categorisation and union the categories, so I have tried applying union_categoricals across the columns that are categorical in both dataframes. cdf and df are my two dataframes.
for column in df:
    if df[column].dtype.name == "category" and cdf[column].dtype.name == "category":
        print(column)
        union_categoricals([cdf[column], df[column]], ignore_order=True)

cdf = pd.concat([cdf, df])
This is still not providing me with a categorical output.
The basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to that column. This has the benefit of not weighting any value improperly. There are many libraries that support one-hot encoding, but the simplest approach is pandas' .get_dummies() method.
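For illustration, here is a minimal sketch using pd.get_dummies(); the column name 'x' and the sample values are made up for this example:

import pandas as pd

df = pd.DataFrame({'x': ['dog', 'cat', 'rat']})
dummies = pd.get_dummies(df['x'], prefix='x')
# Each distinct value becomes its own 0/1 column: x_cat, x_dog, x_rat
print(dummies)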
The concat() function concatenates dataframes along rows or columns; we can think of it as stacking multiple dataframes. merge() combines dataframes based on values in shared columns. merge() offers more flexibility than concat() because it allows combining rows based on a condition.
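A brief sketch contrasting the two (the frames and the 'key' column are invented for this example):

import pandas as pd

left = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})
right = pd.DataFrame({'key': [2, 3], 'b': ['p', 'q']})

stacked = pd.concat([left, right])                  # stacks rows, aligning on column names
joined = left.merge(right, on='key', how='inner')   # combines rows whose 'key' values match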
The append() function adds the rows of the second dataframe to the first iteratively, one by one, whereas concat() does the whole job in a single operation, which makes it faster than append().
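As a sketch of that pattern, collect the pieces in a list and concatenate once rather than appending inside a loop (note that DataFrame.append has since been deprecated and removed in newer pandas, so a single concat is the recommended approach anyway):

import pandas as pd

pieces = [pd.DataFrame({'v': [i]}) for i in range(3)]   # e.g. frames produced in a loop
result = pd.concat(pieces, ignore_index=True)           # one operation instead of many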
Columns matching and sorting: the concat() function can concatenate DataFrames whose columns are in a different order. By default, the resulting DataFrame keeps the same column order as the first DataFrame.
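A quick sketch of that behaviour (sample frames invented for this example):

import pandas as pd

a = pd.DataFrame({'A': [1], 'B': [2]})
b = pd.DataFrame({'B': [4], 'A': [3]})
pd.concat([a, b])   # columns align by name and come out in the first frame's order: A, B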
I don't think this is completely obvious from the documentation, but you could do something like the following. Here's some sample data:
df1 = pd.DataFrame({'x': pd.Categorical(['dog', 'cat'])})
df2 = pd.DataFrame({'x': pd.Categorical(['cat', 'rat'])})
Use union_categoricals to get consistent categories across dataframes. Try df.x.cat.codes if you need to convince yourself that this works.
from pandas.api.types import union_categoricals
uc = union_categoricals([df1.x, df2.x])
df1.x = pd.Categorical(df1.x, categories=uc.categories)
df2.x = pd.Categorical(df2.x, categories=uc.categories)
Concatenate and verify the dtype is categorical.
df3 = pd.concat([df1,df2])
df3.x.dtypes
category
As @C8H10N4O2 suggests, you could also just coerce from objects back to categoricals after concatenating. Honestly, for smaller datasets I think that's the best way to do it just because it's simpler. But for larger dataframes, using union_categoricals
should be much more memory efficient.
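For completeness, here is a minimal sketch of that coercion approach, assuming df1 and df2 still hold the original sample data from above (i.e. before their categories were unified); the name df4 is just for illustration:

df4 = pd.concat([df1, df2])        # 'x' falls back to object dtype
df4.x = df4.x.astype('category')   # coerce back; the categories become the union of the values
df4.x.dtype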
To complement JohnE's answer, here's a function that does the job by applying union_categoricals to all the category columns present in every input dataframe:
def concatenate(dfs):
    """Concatenate while preserving categorical columns.

    NB: We change the categories in place for the input dataframes."""
    from pandas.api.types import union_categoricals
    import pandas as pd

    # Iterate on categorical columns common to all dfs
    for col in set.intersection(
        *[set(df.select_dtypes(include='category').columns) for df in dfs]
    ):
        # Generate the union category across dfs for this column
        uc = union_categoricals([df[col] for df in dfs])
        # Change to union category for all dataframes
        for df in dfs:
            df[col] = pd.Categorical(df[col].values, categories=uc.categories)
    return pd.concat(dfs)
Note the categories are changed in place in the input list:
df1 = pd.DataFrame({'a': [1, 2],
                    'x': pd.Categorical(['dog', 'cat']),
                    'y': pd.Categorical(['banana', 'bread'])})
df2 = pd.DataFrame({'x': pd.Categorical(['rat']),
                    'y': pd.Categorical(['apple'])})
concatenate([df1, df2]).dtypes
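As a quick, illustrative check of that in-place behaviour, inspect one of the inputs after calling concatenate(); its categorical columns now carry the union of categories:

df2.x.cat.categories   # now includes 'cat', 'dog' and 'rat'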